
BigQuery: Snapshot dataset with Cloud Workflow

By guillaume blaquiere. Jul 25, 2022


Data is the gold mine of any company, and a data warehouse is the best place to store and query it. On Google Cloud, BigQuery is the most popular tool: serverless, scalable, pay-as-you-use; it can process exabytes of data in seconds.

BigQuery offers DML (Data Manipulation Language) statements to insert, update, or delete data. And because your golden source is precious, you want to be sure that a DML statement does not break the current data values.

You have to back up your data in BigQuery.

Table snapshot feature

The table snapshot feature is designed exactly for that. You can take a snapshot of your data as it is now, or as it was at any millisecond in the past 7 days (the maximum duration of the time travel feature).
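
For a single table, you can already do this with the bq CLI. Here is a minimal sketch, assuming hypothetical dataset and table names (mydataset.mytable snapshotted into mydataset_backup):

# Snapshot mydataset.mytable into mydataset_backup.mytable_snap;
# --no_clobber fails instead of overwriting an existing destination.
bq cp --snapshot --no_clobber \
  mydataset.mytable \
  mydataset_backup.mytable_snap

A time travel decorator on the source table (for example mydataset.mytable@-3600000, i.e. one hour ago in milliseconds) should let you snapshot a past state of the table.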

The feature takes a snapshot of your data and then only tracks the changes to updated or deleted data. That means:

  • If your snapshotted data never changes, the snapshot costs nothing.
  • If you change all your data, you will pay for the equivalent storage of your snapshotted tables.

The principle is the same as a Git repository and its commit history.

Table snapshot limitation

There are a few existing limitations, but they aren’t really blocking. The most frustrating one is the scope:

You can snapshot only a table!

Your tables belong to a homogeneous set of tables (raw ingested data, datamarts, staging transformations, …), and those tables are grouped in a dataset.

Therefore, snapshotting a single table alone makes no sense; the whole dataset must be backed up.

Cloud Workflow solution

I have no doubt that the BigQuery team is aware of that real-world limitation and that the feature will be released one day.
But until then, Cloud Workflows can help us with that requirement.

The architecture

This basic example takes a source dataset (the dataset to snapshot) as a parameter and, optionally, a target dataset (where to store the snapshots). Then, the steps are the following:

  • Check if the target dataset exists. If not, create it with the same name as the source dataset, suffixed with _backup.
    For that example, I set an expiration time of 7 days to automatically delete the snapshots after that duration.
  • List the tables in the source dataset.
  • Loop over the tables and create a snapshot of each of them in the target dataset.
    To improve performance, the parallel feature of Cloud Workflows is used.
  • If there is a nextPageToken in the list tables result, jump back to the list-tables step with the new pageToken value set.

The code

Here is a code sample.

main:
  params: [param]
  steps:
    - assignStep:
        assign:
          - sourceDataset: ${param.source_dataset}
          - targetDataset: ${default(map.get(param, "target_dataset"), sourceDataset + "_backup")}
          - projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
          - location: "US"
          - maxResult: 100
          - pageToken: ""
    - check-target-dataset:
        try:
          call: googleapis.bigquery.v2.datasets.get
          args:
            projectId: ${projectId}
            datasetId: ${targetDataset}
        except:
          as: e
          steps:
            - known_errors:
                switch:
                  - condition: ${e.code == 404} #Create dataset
                    steps:
                      - create-target-dataset:
                          call: googleapis.bigquery.v2.datasets.insert
                          args:
                            projectId: ${projectId}
                            body:
                              location: ${location}
                              datasetReference:
                                datasetId: ${targetDataset}
                                projectId: ${projectId}
                              defaultTableExpirationMs: 604800000 #Snapshot expire after 7 days
                          next: list-tables
            - unhandled_exception:
                raise: ${e}
    - list-tables:
        call: googleapis.bigquery.v2.tables.list
        args:
          datasetId: ${sourceDataset}
          projectId: ${projectId}
          maxResults: ${maxResult}
          pageToken: ${pageToken}
        result: listResult
    - perform-snapshots:
        parallel:
          for:
            value: table
            in: ${listResult.tables}
            steps:
              - snapshot:
                  call: googleapis.bigquery.v2.jobs.insert
                  args:
                    projectId: ${projectId}
                    body:
                      configuration:
                        copy:
                          destinationTable:
                            datasetId: ${targetDataset}
                            projectId: ${projectId}
                            tableId: ${table.tableReference.tableId}
                          operationType: "SNAPSHOT"
                          sourceTables:
                            - projectId: ${projectId}
                              datasetId: ${sourceDataset}
                              tableId: ${table.tableReference.tableId}
                          writeDisposition: "WRITE_EMPTY"
    - check-iterate-pages:
        switch:
          - condition: ${default(map.get(listResult, "nextPageToken"), "") != ""}
            steps:
              - loop-over-pages:
                  assign:
                    - pageToken: ${listResult.nextPageToken}
                  next: list-tables

Deploy your workflow with this command:

gcloud workflows deploy <workflowName> \
--source=<workflowFileName> \
--service-account=<runtimeServiceAccountEmail>
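
For instance, with hypothetical names (a workflow called snapshot-dataset-workflow, defined in snapshot_dataset.yaml, and a dedicated runtime service account):

gcloud workflows deploy snapshot-dataset-workflow \
  --source=snapshot_dataset.yaml \
  --location=us-central1 \
  --service-account=workflow-sa@my-project.iam.gserviceaccount.com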

The runtime service account must have the BigQuery Admin role on the target project to be able to create the target dataset and the snapshots; it only needs to be a data viewer on the source dataset.
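
As a sketch, and assuming the same hypothetical project and service account as above, the project-level grant could look like this:

# Grant BigQuery Admin so the workflow can create the target dataset and snapshot jobs
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:workflow-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.admin"

If the source dataset lives in another project, grant the service account BigQuery Data Viewer on that dataset (for example from the BigQuery console or a dataset access update).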

And run it:

gcloud workflows run <workflowName> \
--data='{"source_dataset":"<YourDataset>"}'

You can use Cloud Scheduler to invoke that workflow periodically.
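
Here is a minimal sketch of a daily trigger through the Workflows Executions API, assuming hypothetical names (my-project, us-central1, snapshot-dataset-workflow) and a caller service account that also has the Workflows Invoker role:

gcloud scheduler jobs create http snapshot-dataset-daily \
  --location=us-central1 \
  --schedule="0 3 * * *" \
  --uri="https://workflowexecutions.googleapis.com/v1/projects/my-project/locations/us-central1/workflows/snapshot-dataset-workflow/executions" \
  --message-body='{"argument": "{\"source_dataset\": \"<YourDataset>\"}"}' \
  --oauth-service-account-email="workflow-sa@my-project.iam.gserviceaccount.com"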

Protect your most precious assets

That workflow is pretty straightforward, but it also has limitations. For instance:

  • You can’t run the workflow twice in a row, because the snapshots will already exist and you will get an Already Exists error.
    You can, for example, add the date of the day to the target dataset name (see the sketch after this list).
  • The source project and the target project are the same.
    You might want a dedicated snapshot project, for instance.
  • The target dataset region is hardcoded (and MUST be the same as the source dataset’s region).
    You could get the source dataset, read its region, and create the target dataset in that same region.
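
For the first point, here is a minimal sketch of a date-suffixed default target dataset, assuming the Workflows standard library functions time.format, text.substring, and text.replace_all:

# In the assign step: build e.g. "<sourceDataset>_backup_2022_07_25".
# Dataset names only allow letters, digits and underscores, hence the replace of "-".
- today: ${text.replace_all(text.substring(time.format(sys.now()), 0, 10), "-", "_")}
- targetDataset: ${default(map.get(param, "target_dataset"), sourceDataset + "_backup_" + today)}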

Anyway, that’s a base example to protect what is most precious in your company: your data assets.


The original article was published on Medium.
