Deprecated. The next version of the data pipeline: https://github.com/HTTPArchive/dataform
The new HTTP Archive data pipeline built entirely on GCP
This repo handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves them to the `httparchive` dataset in BigQuery.
A secondary pipeline is responsible for populating the Technology Report Firestore collections.
There are currently two main pipelines:
- The `all` pipeline, which saves data to the new `httparchive.all` dataset
- The `combined` pipeline, which saves data to the legacy tables. This processes both the `summary` tables (`summary_pages` and `summary_requests`) and the `non-summary` tables (`pages`, `requests`, `response_bodies`, etc.)
The secondary `tech_report` pipeline saves data to a Firestore database (e.g. `tech-report-apis-prod`) across various collections (see `TECHNOLOGY_QUERIES` in constants.py).
The pipelines are run in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion, based on the code in the `main` branch, which is deployed to GCP on each merge.
The `data-pipeline` workflow, as defined by the data-pipeline-workflows.yaml file, runs the whole process from start to finish. It generates the manifest file for each of the two runs (desktop and mobile) and then starts the four dataflow jobs (desktop all, mobile all, desktop combined, mobile combined) in sequence to upload the HAR files to the BigQuery tables. This can be rerun in case of failure by publishing a crawl-complete message, provided no data was saved to the final BigQuery tables.
The four dataflow jobs can also be rerun individually in case of failure, but the BigQuery tables need to be cleared down first (including any lingering temp tables).
The dataflow jobs can also be run locally, whereby the local code is uploaded to GCP for that particular run.
sequenceDiagram
participant PubSub
participant Workflows
participant Monitoring
participant Cloud Storage
participant Cloud Build
participant BigQuery
participant Dataflow
PubSub->>Workflows: crawl-complete event
loop until crawl queue is empty
Workflows->>Monitoring: check crawl queue
end
rect rgb(191, 223, 255)
Note right of Workflows: generate HAR manifest
break when manifest already exists
Workflows->>Cloud Storage: check if HAR manifest exists
end
Workflows->>Cloud Build: trigger job
Cloud Build->>Cloud Build: list HAR files and generate manifest file
Cloud Build->>Cloud Storage: upload HAR manifest to GCS
end
rect rgb(191, 223, 255)
Note right of Workflows: check BigQuery and run Dataflow jobs
break when BigQuery records exist for table and date
Workflows->>BigQuery: check all/combined tables for records in the given date
end
loop run jobs until retry limit is reached
Workflows->>Dataflow: run flex template
loop until job is complete
Workflows-->Dataflow: wait for job completion
end
end
end
sequenceDiagram
autonumber
actor developer
participant Local as Local Environment / IDE
participant Dataflow
participant Cloud Build
participant Workflows
developer->>Local: create/update Dataflow code
developer->>Local: run Dataflow job with DirectRunner via run_*.py
developer->>Dataflow: run Dataflow job with DataflowRunner via run_pipeline_*.sh
developer->>Cloud Build: run build_flex_template.sh
developer->>Workflows: update flexTemplateBuildTag
sequenceDiagram
actor developer
participant Local as Local Environment / IDE
participant Dataflow
participant PubSub
participant Workflows
alt run Dataflow job from local environment using the Dataflow runner
developer->>Local: clone repository and execute run_pipeline_*.sh
else run Dataflow job as a flex template
alt from local environment
developer->>Dataflow: clone repository and execute run_flex_template.sh
else from Google Cloud Console
developer->>Dataflow: use the Google Cloud Console to run a flex template as documented by GCP
end
else trigger a Google Workflows execution
alt
developer->>PubSub: create a new message containing a HAR manifest path from GCS
else
developer->>Workflows: rerun a previously failed Workflows execution
end
end
Dataflow jobs can be triggered several ways:
- Locally using bash scripts (this can be used to test uncommitted code, or code on a non-`main` branch)
- From the Google Cloud Console in Dataflow section by choosing to run a flex template (this can be used to run committed code for a particular dataflow pipeline only)
- From the Google Cloud Console in Workflow section by choosing to execute a failed `data-pipeline` workflow again (this can be used to rerun failed parts of the workflow after the reason for failure is fixed)
- By publishing a Pub/Sub message to run the whole workflow (this kicks off the whole workflow and not just the pipeline, so it is useful for the batch to kick off jobs when the crawl is done, or to rerun the whole process manually when the manifest file was not generated)
This method is best used when developing locally, as a convenience for running the pipeline's python scripts and GCP CLI commands.
# run the pipeline locally
./run_pipeline_combined.sh
./run_pipeline_all.sh
# run the pipeline using a flex template
./run_flex_template.sh all [...]
./run_flex_template.sh combined [...]
./run_flex_template.sh tech_report [...]
This method is useful for running individual dataflow jobs from the web console since it does not require a development environment.
Flex templates accept additional parameters as mentioned in the GCP documentation below, while custom parameters are defined in flex_template_metadata_*.json
https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#specify-options
Steps:
- Locate the desired build tag (e.g. see `flexTemplateBuildTag` in the data-pipeline.workflows.yaml)
- From the Google Cloud Console, navigate to the Dataflow > Jobs page
- Click "CREATE JOB FROM TEMPLATE" at the top of the page.
- Provide a "Job name"
- Change region to `us-west1` (as that's where we have most compute capacity)
- Choose "Custom Template" from the bottom of the "Dataflow template" drop down.
- Browse to the template directory by pasting `httparchive/dataflow/templates/` into the "Template path", ignoring the error saying this is not a file, and then clicking Browse to choose the actual file from that directory.
- Choose the pipeline type (e.g. all or combined) for the chosen build tag (e.g. `data-pipeline-combined-2023-02-10_03-55-04.json` - choose the latest one for `all` or `combined`)
- Expand "Optional Parameters" and provide an input for the "GCS input file" pointing to the manifests file (e.g. `gs://httparchive/crawls_manifest/chrome-Jul_1_2023.txt` for Desktop Jul 2023 or `gs://httparchive/crawls_manifest/android-Jul_1_2023.txt` for Mobile for July 2023).
- (Optional) provide values for any additional parameters
- Click "RUN JOB"
This method is useful for running the entire workflow from the web console since it does not require a development environment. It is useful when part of the workflow failed for known reasons that have since been resolved. Previous steps should be skipped, as the workflow checks whether they have already been run.
Steps:
- From the Google Cloud Console, navigate to the Workflow > Workflows page
- Select the `data-pipeline` workflow
- In the Actions column click the three dots and select "Execute again"
This method is best used for serverlessly running the entire workflow, including logic to
- block execution when the crawl is still running, by waiting for the crawl's Pub/Sub queue to drain
- skip jobs where BigQuery tables have already been populated
- automatically retry failed jobs
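The three behaviours listed above can be pictured with a small sketch. This is not the workflow's actual implementation (that lives in the workflows YAML); the function and its parameters (`run_workflow`, `queue_depth`, `table_populated`, `run_job`, `retry_limit`) are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Iterable, List


def run_workflow(
    jobs: Iterable[str],
    queue_depth: Callable[[], int],
    table_populated: Callable[[str], bool],
    run_job: Callable[[str], bool],
    retry_limit: int = 3,
) -> List[str]:
    """Return a log of what happened to each job (illustrative only)."""
    log = []
    # 1. Block while the crawl queue still holds messages.
    #    (A real implementation would sleep between checks.)
    while queue_depth() > 0:
        log.append("waiting for crawl queue to drain")
    for job in jobs:
        # 2. Skip jobs whose BigQuery tables already hold data.
        if table_populated(job):
            log.append(f"{job}: skipped")
            continue
        # 3. Retry a failed job until the retry limit is reached.
        for attempt in range(1, retry_limit + 1):
            if run_job(job):
                log.append(f"{job}: succeeded on attempt {attempt}")
                break
        else:
            log.append(f"{job}: failed after {retry_limit} attempts")
    return log
```

Passing the checks in as callables keeps the control flow testable without touching GCP.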
Publishing a message containing the crawl's GCS path(s) will trigger a GCP workflow, including generating the HAR zip file for that run.
# single path
gcloud pubsub topics publish projects/httparchive/topics/crawl-complete --message "gs://httparchive/crawls/android-Nov_1_2022"
# multiple paths must be comma separated, without spaces
gcloud pubsub topics publish projects/httparchive/topics/crawl-complete --message "gs://httparchive/crawls/chrome-Feb_1_2023,gs://httparchive/crawls/android-Feb_1_2023"
Note that this can be run for an individual crawl (first example), or for both crawls (second example).
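When scripting this, the message body can be assembled programmatically so that the "comma separated, without spaces" rule is enforced. The sketch below is a convenience under that assumption only; the helper names `build_crawl_complete_message` and `gcloud_publish_command` are hypothetical and not part of this repo. The command is only built, not executed.

```python
from typing import Iterable, List


def build_crawl_complete_message(paths: Iterable[str]) -> str:
    """Join GCS crawl paths into one message body: comma separated, no spaces."""
    cleaned = [path.strip() for path in paths]
    if not cleaned:
        raise ValueError("at least one crawl path is required")
    for path in cleaned:
        # Reject anything that would break the comma-separated format.
        if not path.startswith("gs://") or "," in path or " " in path:
            raise ValueError(f"not a usable GCS path: {path!r}")
    return ",".join(cleaned)


def gcloud_publish_command(paths: Iterable[str]) -> List[str]:
    """Argument list mirroring the gcloud command shown above."""
    return [
        "gcloud", "pubsub", "topics", "publish",
        "projects/httparchive/topics/crawl-complete",
        "--message", build_crawl_complete_message(paths),
    ]
```

The returned list can be handed to `subprocess.run` from an authenticated shell.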
Running the `combined` pipeline will produce summary and non-summary tables by default.
Summary and non-summary outputs can be controlled using the `--pipeline_type` argument.
# example
./run_pipeline_combined.sh --pipeline_type=summary
./run_flex_template.sh combined --parameters pipeline_type=summary
This pipeline can read individual HAR files, or a single file containing a list of HAR file paths.
# Run the `all` pipeline on both desktop and mobile using their pre-generated manifests.
./run_flex_template.sh all --parameters input_file=gs://httparchive/crawls_manifest/*-Nov_1_2022.txt
# Run the `combined` pipeline on mobile using its manifest.
./run_flex_template.sh combined --parameters input_file=gs://httparchive/crawls_manifest/android-Nov_1_2022.txt
# Run the `combined` pipeline on desktop using its individual HAR files (much slower, not encouraged).
./run_flex_template.sh combined --parameters input=gs://httparchive/crawls/chrome-Nov_1_2022
Note that the `run_pipeline_combined.sh` and `run_pipeline_all.sh` scripts use the parameters set in the scripts themselves, and these cannot be overridden with command line parameters. These scripts are often useful for local testing of changes (local testing still results in the processing happening in GCP, but using code copied from the local machine).
To save to different tables for testing, temporarily edit `modules/constants.py` to prefix all the tables with `experimental_` (note that `experimental_parsed_css` is the current production table, so use `experimental_gc_parsed_css` instead for now).
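The naming rule can be written down as a small helper. How `modules/constants.py` actually stores its names is not shown here, so the function below (`experimental_name`, a hypothetical name) only illustrates the rule on bare names and is not a drop-in patch.

```python
PREFIX = "experimental_"

# experimental_parsed_css is already used in production, so it needs a
# different test name (per the note above).
OVERRIDES = {"parsed_css": "experimental_gc_parsed_css"}


def experimental_name(name: str) -> str:
    """Return the test-safe variant of a bare table name."""
    if name in OVERRIDES:
        return OVERRIDES[name]
    if name.startswith(PREFIX):
        return name  # already prefixed, leave untouched
    return PREFIX + name
```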
The pipeline can read a manifest file (text file containing GCS file paths separated by new lines for each HAR file). Follow the example to generate a manifest file:
# generate manifest files
nohup gsutil ls gs://httparchive/crawls/chrome-Nov_1_2022 > chrome-Nov_1_2022.txt 2> chrome-Nov_1_2022.err &
nohup gsutil ls gs://httparchive/crawls/android-Nov_1_2022 > android-Nov_1_2022.txt 2> android-Nov_1_2022.err &
# watch for completion (i.e. file sizes will stop changing)
# if the err file increases in size, open and check for issues
watch ls -l ./*Nov*
# upload to GCS
gsutil -m cp ./*Nov*.txt gs://httparchive/crawls_manifest/
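The examples in this document follow a naming pattern: a crawl at `gs://httparchive/crawls/<name>` has its manifest at `gs://httparchive/crawls_manifest/<name>.txt`. A minimal sketch of that mapping, plus a reader for the newline-separated manifest format, is below. The helper names `manifest_path_for` and `parse_manifest` are hypothetical, and the pattern is inferred from the examples rather than taken from the pipeline code.

```python
import posixpath
from typing import List


def manifest_path_for(
    crawl_path: str,
    manifest_dir: str = "gs://httparchive/crawls_manifest",
) -> str:
    """Map a crawl directory to the manifest path used in the examples."""
    crawl_name = posixpath.basename(crawl_path.rstrip("/"))
    return f"{manifest_dir}/{crawl_name}.txt"


def parse_manifest(text: str) -> List[str]:
    """Split manifest text into HAR file paths, ignoring blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```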
- GCP DataFlow & Monitoring metrics - TODO: runtime metrics and dashboards
- Dataflow temporary and staging artifacts in GCS
- BigQuery (final landing zone)
GitHub Actions are used to automate the build and deployment of Google Cloud Workflows and Dataflow Flex Templates. Actions are triggered on merges to the `main` branch, for specific files, and when other related GitHub Actions have completed successfully.
- Deploy Dataflow Flex Template will trigger when files related to the data pipeline are updated (e.g. python, Dockerfile, flex template metadata). This will build and upload the new builds (where they can be used) and update the data-pipeline workflows YAML with the latest build tag (based on datetime) and open a PR to merge that (so the new builds will be used by the batch).
- Deploy Cloud Workflow action will trigger when the data-pipeline workflows YAML is updated, or when the Deploy Dataflow Flex Template action has completed successfully.
PRs with a title of `Bump dataflow flex template build tag` should be merged providing they are only updating the build datetime in the `flexTemplateBuildTag`. Check it has not zeroed the build datetime out (this can happen if the job errors in unusual ways).
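A reviewer could sanity-check the tag mechanically. The sketch below assumes the tag follows the datetime shape seen in template file names elsewhere in this document (e.g. `2023-02-10_03-55-04`); exactly what a "zeroed out" tag looks like is not documented here, so the check simply rejects anything that does not parse as a real datetime. The function name `is_plausible_build_tag` is hypothetical.

```python
from datetime import datetime

# Assumed tag layout, inferred from file names such as
# data-pipeline-combined-2023-02-10_03-55-04.json
BUILD_TAG_FORMAT = "%Y-%m-%d_%H-%M-%S"


def is_plausible_build_tag(tag: str) -> bool:
    """True if the tag parses as a real calendar datetime."""
    try:
        datetime.strptime(tag, BUILD_TAG_FORMAT)
    except ValueError:
        return False
    return True
```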
GCP's documentation for creating and building Flex Templates is linked here
The following files are used for building and deploying Dataflow Flex Templates:
- .gcloudignore excludes files from uploading to GCS for Cloud Build
- build_flex_template.sh a helper script to initiate the Cloud Build
- cloudbuild.yaml is the configuration file for Cloud Build to create containers and template files in GCS (artifacts listed further below)
- Dockerfile used to create the job graph and start the Dataflow job
- flex_template_metadata_all.json and flex_template_metadata_combined.json define custom parameters to be validated when the template is run
- run_flex_template.sh a helper script to run a Flex Template pipeline
Cloud Build is used to create Dataflow flex templates and upload them to Artifact Registry and Google Cloud Storage.
- Cloud Build linked here
- Artifact Registry images linked here
- Flex templates in GCS gs://httparchive/dataflow/templates
The GitHub Actions can be triggered manually from the repository by following the documentation here for Manually running a workflow.
flowchart LR
Start((Start))
End((End))
A{Updating Dataflow?}
B[Run 'Deploy Dataflow Flex Template']
DDFTA[['Deploy Dataflow Flex Template' executes]]
C{Updating Cloud Workflows?}
D[Run 'Deploy Cloud Workflow']
DCWA[['Deploy Cloud Workflow' executes]]
Start --> A
Start --> C
A --> B
B -->DDFTA
DDFTA -->|automatically triggers| DCWA
C --> D
D --> DCWA
DCWA --> End
Alternatively, a combination of bash scripts and the Google Cloud Console can be used to manually deploy Cloud Workflows and Dataflow Flex Templates.
flowchart LR
Start((Start))
End((End))
A{Updating Dataflow?}
B[Run build_flex_template.sh]
C{Updating Cloud Workflows?}
D[Note the latest build tag from the script output]
E[Update the 'data-pipeline' workflow via the Cloud Console]
Start --> A
Start --> C
A -->|Yes| B
A -->|No| C
B --> D
D --> E
C -->|Yes| E
E --> End
This can be started by making changes locally and then running the `run_pipeline_all.sh` or `run_pipeline_combined.sh` scripts (after changing the input parameters in those scripts). Local code is copied to GCP for each run, so your shell needs to be authenticated to GCP and have permission to run the jobs.
To save to different tables for testing, temporarily edit `modules/constants.py` to prefix all the tables with `experimental_` (note that `experimental_parsed_css` is the current production table, so use `experimental_gc_parsed_css` instead for now).
- Error logs can be seen in Error Reporting in GCP.
- Jobs can be seen in the Dataflow -> Jobs screen of GCP.
- Workflows can be seen in the Workflows -> Workflows screen of GCP.
Since this pipeline uses the `FILE_LOADS` BigQuery insert method, failures will leave behind temporary tables.
Use the saved query below and replace the dataset name as desired.
https://console.cloud.google.com/bigquery?sq=226352634162:82dad1cd1374428e8d6eaa961d286559
FOR field IN
(SELECT table_schema, table_name
FROM lighthouse.INFORMATION_SCHEMA.TABLES
WHERE table_name like 'beam_bq_job_LOAD_%')
DO
EXECUTE IMMEDIATE format("drop table %s.%s;", field.table_schema, field.table_name);
END FOR;
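For a dry run before dropping anything, the same selection can be previewed in Python from a list of table names. This sketch only builds the statements and does not call BigQuery; `drop_statements` is a hypothetical helper. It uses a literal prefix check, which is slightly stricter than the `LIKE` pattern above, where each `_` matches any single character.

```python
from typing import Iterable, List

# Prefix of the temporary load tables targeted by the query above.
TEMP_TABLE_PREFIX = "beam_bq_job_LOAD_"


def drop_statements(dataset: str, table_names: Iterable[str]) -> List[str]:
    """Build DROP TABLE statements for leftover temporary load tables."""
    return [
        f"DROP TABLE `{dataset}.{name}`;"
        for name in table_names
        if name.startswith(TEMP_TABLE_PREFIX)
    ]
```

Printing the result and reviewing it first helps avoid dropping a production table by mistake.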
Initially this pipeline was developed to stream data into tables as individual HAR files became available in GCS from a live/running crawl. This allowed for results to be viewed faster, but came with additional burdens. For example:
- Job failures and partial recovery/cleaning of tables.
- Partial table population mid-crawl led to consumer confusion since they were previously accustomed to full tables being available.
- The Dataflow API for streaming inserts buried some low-level configuration, leading to errors which were opaque and difficult to troubleshoot.
The work item requesting state read is no longer valid on the backend
This log message is benign and expected when using an auto-scaling pipeline https://cloud.google.com/dataflow/docs/guides/common-errors#work-item-not-valid
Various parsing issues due to unhandled cases
New file formats from responses will be noted in WARNING logs
Using ported custom logic from legacy PHP rather than standard libraries produces missing values and inconsistencies