[Do not merge] Upsert metric for execution #85
base: master
Conversation
Unfortunately, BQ doesn't support upserts directly. I found a couple of answers on the forums and Stack Overflow recommending MERGE to mimic an upsert. Would this be more consistent than check/delete/insert? https://www.googlecloudcommunity.com/gc/Data-Analytics/How-to-UPSERT-in-BigQuery/m-p/547777
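For reference, a minimal sketch of what a MERGE-based upsert statement could look like. The helper name `build_merge_sql`, the table name, and the column names are hypothetical and not taken from this PR; the function only builds the SQL text with named query parameters and does not call BigQuery. Whether MERGE succeeds against recently streamed rows is the open question in this thread.

```python
def build_merge_sql(table: str, key_columns: list[str], value_columns: list[str]) -> str:
    """Build a MERGE statement that updates the matching row or inserts a new one.

    Hypothetical helper: column values are expected as named query
    parameters (@column_name) supplied when the query is executed.
    """
    all_columns = key_columns + value_columns
    # The source is a single row built from the query parameters.
    source = ", ".join(f"@{c} AS {c}" for c in all_columns)
    # Rows match when every key column is equal.
    on_clause = " AND ".join(f"T.{c} = S.{c}" for c in key_columns)
    update_set = ", ".join(f"{c} = S.{c}" for c in value_columns)
    insert_cols = ", ".join(all_columns)
    insert_vals = ", ".join(f"S.{c}" for c in all_columns)
    return (
        f"MERGE `{table}` AS T\n"
        f"USING (SELECT {source}) AS S\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {update_set}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


# Assumed schema: one metric row per (dag_id, execution_date).
merge_sql = build_merge_sql(
    "my_project.my_dataset.metrics",
    key_columns=["dag_id", "execution_date"],
    value_columns=["metric_value"],
)
print(merge_sql)
```

The appeal of this form is that the match and the write happen in one statement, rather than in separate check, delete, and insert calls.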
I saw that thread too. However, according to the official BigQuery tutorial, the MERGE statement won't work (link), whereas the Storage Write API does (please see here). To adopt that method, we would need the following 3 steps: 1) open a stream; 2) send a protocol buffer; 3) close the stream (see here). That seems like a big change to our existing system, and I doubt we should adopt it at this moment. Thoughts?
Description
Context: at a high level, in the prod stage an execution should produce only 1 metric record (e.g. for a DAG scheduled on a daily basis, there should be exactly 1 record for day 2024-1-4, 1 record for day 2024-1-5, etc.). In the current implementation, if we clear an old execution, the re-run inserts a new metric record into the BigQuery table, so duplicate metrics occur for a single execution.
Upsert the metric for an execution by deleting the existing records (if any) and inserting the new metrics.
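The delete-then-insert pattern can be sketched as below. This is an illustration only: it uses an in-memory SQLite table as a stand-in for the BigQuery table so that it runs locally, and the function name `upsert_metric`, the table name, and the columns are assumptions rather than the code in this PR. In BigQuery the delete step can fail for recently inserted rows, which is the buffer-period case covered under Tests.

```python
import sqlite3


def upsert_metric(conn: sqlite3.Connection, dag_id: str, execution_date: str, value: float) -> None:
    """Replace any existing metric rows for one execution with a single new row."""
    # `with conn` commits if both statements succeed and rolls back on error,
    # so a failed insert does not leave the execution without a record.
    with conn:
        conn.execute(
            "DELETE FROM metrics WHERE dag_id = ? AND execution_date = ?",
            (dag_id, execution_date),
        )
        conn.execute(
            "INSERT INTO metrics (dag_id, execution_date, value) VALUES (?, ?, ?)",
            (dag_id, execution_date, value),
        )


# Stand-in table; the real target would be the BigQuery metrics table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (dag_id TEXT, execution_date TEXT, value REAL)")

upsert_metric(conn, "daily_dag", "2024-01-04", 1.0)
upsert_metric(conn, "daily_dag", "2024-01-04", 2.0)  # execution cleared and re-run
upsert_metric(conn, "daily_dag", "2024-01-05", 3.0)

rows = conn.execute(
    "SELECT execution_date, value FROM metrics ORDER BY execution_date"
).fetchall()
print(rows)
```

After the re-run, the table still holds one row per execution date, and the 2024-01-04 row carries the value from the latest run.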
Tests
Please describe the tests that you ran on Cloud VM to verify changes.
Instruction and/or command lines to reproduce your tests: ...
Upload to Airflow and test
List links for your tests (use go/shortn-gen for any internal link): ...
If the record is new, or the existing records can be deleted (tested both cases), we get these logs - link:
If the existing records are still within the 90min buffer period and cannot be deleted, the task fails with these logs - link:
Checklist
Before submitting this PR, please make sure (put X in square brackets):