Backfills & deprecating legacy tables #10

Merged on Nov 1, 2024 (59 commits; showing changes from 54 commits)

Commits
99fc4b6 deprecated (max-ostapenko, Sep 19, 2024)
0dbb96c backfill draft (max-ostapenko, Sep 19, 2024)
1c30ac3 cleanup (max-ostapenko, Sep 19, 2024)
78e4a23 null placeholders (max-ostapenko, Sep 19, 2024)
6738785 sql fix (max-ostapenko, Sep 19, 2024)
4d69fc6 fix month range (max-ostapenko, Sep 19, 2024)
52a8eec literal table names (max-ostapenko, Sep 19, 2024)
a7b7a53 backfill tested (max-ostapenko, Sep 27, 2024)
6514d89 Merge branch 'main' into main (max-ostapenko, Sep 27, 2024)
96fee15 dates reset (max-ostapenko, Sep 27, 2024)
a82927c requests_summary (max-ostapenko, Sep 27, 2024)
c504878 requests backfill for mid month (max-ostapenko, Sep 29, 2024)
9dc4cf0 remove legacy pipelines (max-ostapenko, Sep 29, 2024)
c316e25 checked against new schema (max-ostapenko, Sep 29, 2024)
e3cf47b adjusted to a new schema (max-ostapenko, Sep 29, 2024)
8832ffe backfill_pages (max-ostapenko, Sep 29, 2024)
06d6cb4 legacy removed (max-ostapenko, Sep 30, 2024)
9ba236d remove legacy datasets (max-ostapenko, Sep 30, 2024)
a57df2d Merge branch 'main' into main (max-ostapenko, Sep 30, 2024)
57da6fb metrics sorted (max-ostapenko, Sep 30, 2024)
3683a89 parse features (max-ostapenko, Sep 30, 2024)
c55adb6 Merge branch 'main' into main (max-ostapenko, Sep 30, 2024)
6866120 Merge branch 'main' into fiscal-owl (max-ostapenko, Oct 14, 2024)
2870012 lint (max-ostapenko, Oct 14, 2024)
992802f jscpd off (max-ostapenko, Oct 14, 2024)
23c29b9 update js variable names (max-ostapenko, Oct 14, 2024)
14b9585 other cm format (max-ostapenko, Oct 14, 2024)
b176ee3 Merge branch 'fiscal-owl' into fiscal-owl (max-ostapenko, Oct 14, 2024)
4138af1 Merge branch 'main' into fiscal-owl (max-ostapenko, Oct 18, 2024)
4cafc6f pages completed (max-ostapenko, Oct 19, 2024)
d1dfd49 summary_pages completed (max-ostapenko, Oct 19, 2024)
23a522d Merge branch 'main' into main (max-ostapenko, Oct 19, 2024)
3940d6a Merge branch 'fiscal-owl' into fiscal-owl (max-ostapenko, Oct 19, 2024)
1244e95 without other headers (max-ostapenko, Oct 20, 2024)
4179197 Merge branch 'fiscal-owl' of https://github.com/HTTPArchive/dataform … (max-ostapenko, Oct 20, 2024)
e55d8b4 fix (max-ostapenko, Oct 20, 2024)
c7afc11 fix (max-ostapenko, Oct 20, 2024)
e03a353 fix (max-ostapenko, Oct 20, 2024)
4a6101a actual reprocessing queries (max-ostapenko, Oct 20, 2024)
4eb39ae fix (max-ostapenko, Oct 20, 2024)
8d54b1b requests complete (max-ostapenko, Oct 20, 2024)
a38efe0 fix casts (max-ostapenko, Oct 20, 2024)
86fff73 wptid from summary (max-ostapenko, Oct 20, 2024)
e030acc Update definitions/output/all/backfill_requests.js (max-ostapenko, Oct 20, 2024)
d9ce5fb Merge branch 'fiscal-owl' into fiscal-owl (max-ostapenko, Oct 20, 2024)
c8e2343 summary update (max-ostapenko, Oct 21, 2024)
6032434 only valid other headers (max-ostapenko, Oct 21, 2024)
bc0a104 Merge branch 'main' into main (max-ostapenko, Oct 21, 2024)
193027e Merge branch 'fiscal-owl' into fiscal-owl (max-ostapenko, Oct 21, 2024)
e61df0a move tables (max-ostapenko, Oct 21, 2024)
b2e7b7d fix json parsing (max-ostapenko, Oct 21, 2024)
14816f8 fix summary metrics (max-ostapenko, Oct 22, 2024)
2898c82 crawl pipeline updated (max-ostapenko, Oct 22, 2024)
d94ca11 update dependents (max-ostapenko, Oct 22, 2024)
01101db response_bodies adjustment (max-ostapenko, Nov 1, 2024)
47ebb36 Merge branch 'main' into main (max-ostapenko, Nov 1, 2024)
ff5b06e Merge branch 'fiscal-owl' into fiscal-owl (max-ostapenko, Nov 1, 2024)
530ecd4 Merge branch 'fiscal-owl' of https://github.com/HTTPArchive/dataform … (max-ostapenko, Nov 1, 2024)
5ac19db lint (max-ostapenko, Nov 1, 2024)

1 change: 1 addition & 0 deletions .github/workflows/linter.yaml
@@ -30,5 +30,6 @@ jobs:
 env:
   DEFAULT_BRANCH: main
   GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+  VALIDATE_JSCPD: false
   VALIDATE_JAVASCRIPT_PRETTIER: false
   VALIDATE_MARKDOWN_PRETTIER: false

13 changes: 8 additions & 5 deletions README.md
@@ -8,11 +8,11 @@ The pipelines are run in Dataform service in Google Cloud Platform (GCP) and are

 ### Crawl results

-Tag: `crawl_results_all`
+Tag: `crawl_complete`

-- httparchive.all.pages
-- httparchive.all.parsed_css
-- httparchive.all.requests
+- httparchive.crawl.pages
+- httparchive.crawl.parsed_css
+- httparchive.crawl.requests

 ### Core Web Vitals Technology Report

@@ -39,6 +39,9 @@ Consumers:

 Tag: `crawl_results_legacy`

+- httparchive.all.pages
+- httparchive.all.parsed_css
+- httparchive.all.requests
 - httparchive.lighthouse.YYYY_MM_DD_client
 - httparchive.pages.YYYY_MM_DD_client
 - httparchive.requests.YYYY_MM_DD_client
@@ -51,7 +54,7 @@ Tag: `crawl_results_legacy`

 1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataformTrigger?authuser=7&project=httparchive) PubSub subscription

-Tags: ["crawl_results_all", "blink_features_report", "crawl_results_legacy"]
+Tags: ["crawl_complete", "blink_features_report", "crawl_results_legacy"]

 2. [bq-poller-cwv-tech-report](https://console.cloud.google.com/cloudscheduler/jobs/edit/us-east4/bq-poller-cwv-tech-report?authuser=7&project=httparchive) Scheduler

3 changes: 2 additions & 1 deletion definitions/extra/test_env.js
@@ -1,4 +1,5 @@
 const date = constants.currentMonth
+operate('test')

 // List of resources to be copied to the test environment. Comment out the ones you don't need.
 const resourcesList = [
@@ -15,7 +16,7 @@ const resourcesList = [
 resourcesList.forEach(resource => {
   operate(
     `test_table ${resource.datasetId}_dev_dev_${resource.tableId}`
-  ).queries(`
+  ).dependencies(['test']).queries(`
 CREATE SCHEMA IF NOT EXISTS ${resource.datasetId}_dev;
 DROP TABLE IF EXISTS ${resource.datasetId}_dev.dev_${resource.tableId};

2 changes: 1 addition & 1 deletion definitions/output/all/pages.js
@@ -7,7 +7,7 @@ publish('pages', {
     clusterBy: ['client', 'is_root_page', 'rank'],
     requirePartitionFilter: true
   },
-  tags: ['crawl_results_all']
+  tags: ['crawl_results_legacy']
 }).preOps(ctx => `
 DELETE FROM ${ctx.self()}
 WHERE date = '${constants.currentMonth}';
21 changes: 5 additions & 16 deletions definitions/output/all/parsed_css.js
@@ -7,21 +7,10 @@ publish('parsed_css', {
     clusterBy: ['client', 'is_root_page', 'rank', 'page'],
     requirePartitionFilter: true
   },
-  tags: ['crawl_results_all']
+  tags: ['crawl_results_legacy']
 }).preOps(ctx => `
-DELETE FROM ${ctx.self()}
-WHERE date = '${constants.currentMonth}';
-`).query(ctx => `
-SELECT *
-FROM ${ctx.ref('crawl_staging', 'parsed_css')}
-WHERE date = '${constants.currentMonth}'
-AND client = 'desktop'
-${constants.devRankFilter}
-`).postOps(ctx => `
-INSERT INTO ${ctx.self()}
-SELECT *
-FROM ${ctx.ref('crawl_staging', 'parsed_css')}
-WHERE date = '${constants.currentMonth}'
-AND client = 'mobile'
-${constants.devRankFilter};
+DROP SNAPSHOT TABLE IF EXISTS ${ctx.self()};
+
+CREATE SNAPSHOT TABLE ${ctx.self()}
+CLONE ${ctx.ref('crawl', 'parsed_css')};
 `)
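
Note on the parsed_css.js change: the legacy table is no longer filled with per-client inserts; the SQL now drops the existing snapshot and recreates it as a clone of the `crawl` table. A minimal sketch of that statement pair as plain string building, runnable outside Dataform. `buildSnapshotSql` is a hypothetical helper, not something in this PR, and the fully qualified table names are assumed from the README change above.

```javascript
// Hypothetical helper for illustration only; it is not part of this PR.
// Builds the two statements seen in the new SQL block: drop the old snapshot,
// then recreate it as a clone of the source table.
function buildSnapshotSql (target, source) {
  return [
    `DROP SNAPSHOT TABLE IF EXISTS ${target};`,
    `CREATE SNAPSHOT TABLE ${target}\nCLONE ${source};`
  ].join('\n\n')
}

// Table names are assumptions based on the README diff, passed as plain strings
// in place of what ctx.self() and ctx.ref('crawl', 'parsed_css') would resolve to.
const sql = buildSnapshotSql(
  '`httparchive.all.parsed_css`',
  '`httparchive.crawl.parsed_css`'
)
console.log(sql)
```

Because the snapshot is dropped and recreated in full on each run, the statement pair can be rerun without first deleting a month's partition.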
274 changes: 0 additions & 274 deletions definitions/output/all/reprocess_pages.js

This file was deleted.
