By Adrian Trzeciak. Oct 20, 2022
I had a chance to utilize Dataform (the open-source CLI version) about a year ago while creating an analytics solution for a retailer. At that point in time I ended up wrapping the most important functionality as a REST API in Go and used Cloud Scheduler to invoke the necessary transformations based on tags. Every part of that solution was deployed on GCP with Terraform. Well, Google has chosen almost the same approach, with Cloud Workflows as the wrapper around Cloud Scheduler.
It seems like Google quietly released a preview version of Dataform during Next ’22. Prior to that, Dataform was available either solely as the open-source CLI or, if you were early to the party, as a web application with beautiful dependency graphs and such.
Why Dataform?
Why should you consider using Dataform? The price of cloud storage has dropped significantly over the years, making raw data affordable to keep. This has driven a transition from the good old ETL (where transformation happens before the load) to the newer ELT (where transformations are applied after the data has been loaded). Data transformations play a critical role in many data-driven organizations. Dataform provides the following functionality, which I suggest you take a look at:
- Everything from table and view definitions to transformation SQL is stored in SCM
- Transformations can be validated with assertions
- Incremental table functionality makes sure that only the rows added since the last transformation are processed (this requires proper partitioning of the source and destination tables to minimize cost); see the sketch after this list
- JS snippets can be utilized in order to reuse pieces of code that you don’t want to copy-and-paste across transformations
- Environments can be separated by using variables (even though this doesn’t seem to be supported in the console in Preview).
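To illustrate the incremental pattern, here is a minimal sketch of an incremental sqlx file. The `events` table, its `event_timestamp` column, and the partitioning choice are assumptions for illustration; `when()`, `incremental()`, and `self()` are Dataform built-ins:

```sqlx
config {
  type: "incremental",
  bigquery: {
    partitionBy: "DATE(event_timestamp)" // partitioning keeps both reads and writes cheap
  }
}

SELECT
  event_timestamp,
  user_id,
  event_name
FROM ${ref("events")}
-- On incremental runs, only process rows newer than what's already in the destination table
${when(incremental(), `WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM ${self()})`)}
```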
Folder structure
Since a Dataform project can quickly grow significantly in size, my suggestion is to start off with a proper folder structure:
The `definitions` folder is where you store the `.sqlx` files. That folder should at least be split into `reporting`, where the tables and views created by Dataform are defined, and `sources`, where you reference the BigQuery datasets and tables to get the data from. The `includes` folder is where you store JavaScript files for reusable SQL blocks; in my experience, you might be doing similar transformations for different purposes.
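As a sketch, a layout along those lines could look like the following (the source and reporting file names are from the example later in this article; `common.js` is a hypothetical include):

```
.
├── dataform.json
├── package.json
├── definitions
│   ├── sources
│   │   ├── name_basics.sqlx
│   │   ├── title_basics.sqlx
│   │   └── ...
│   └── reporting
│       ├── directors.sqlx
│       └── movies.sqlx
└── includes
    └── common.js
```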
Google Cloud set-up
Once you have your git repo with a proper folder structure, some defined sources, and tables or views to be created, you’ll have to do the following:
- Enable `dataform.googleapis.com` for the project responsible for running Dataform compilations.
- Enabling that API will provide you with a service account in the format `service-{PROJECT_NUMBER}@gcp-sa-dataform.iam.gserviceaccount.com`, which will perform the operations. Make sure to grant the service account the necessary permissions to create and read the source tables inside the corresponding datasets.
- In order to enable git integration, a personal access token is required, stored in Secret Manager. Make sure that the above-mentioned service account has the necessary permissions to access the secret version. A rough gcloud sketch of these steps follows below.
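For reference, this is roughly what that setup could look like with gcloud. The project ID, the project number in the service account, the secret name, and the choice of project-level roles are all assumptions here; scope the BigQuery roles to specific datasets if you can:

```sh
# Enable the Dataform API
gcloud services enable dataform.googleapis.com --project=kubectl-blogposts

# Let the Dataform service agent read and write BigQuery data
gcloud projects add-iam-policy-binding kubectl-blogposts \
  --member="serviceAccount:service-123456789@gcp-sa-dataform.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding kubectl-blogposts \
  --member="serviceAccount:service-123456789@gcp-sa-dataform.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"

# Allow the service agent to read the personal access token stored in Secret Manager
gcloud secrets add-iam-policy-binding github-pat \
  --member="serviceAccount:service-123456789@gcp-sa-dataform.iam.gserviceaccount.com" \
  --role="roles/secretmanager.secretAccessor"
```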
The Dataform dashboard should look like the following once connected to a Git repository:
When you press the name of the repository, the option to create a development workspace appears. Here’s the definition of a development workspace:
A development workspace is an editable copy of the files within a Git repository. Development workspaces let you follow a Git-based development process inside Dataform. Edits you make to files in a development workspace start off as uncommitted changes that you can commit, and then push.
In our case, since all the code is already in GitHub, we’ll just pull the contents of the `main` branch into our newly created workspace.
Dataform set-up
Two files are required inside the repo in order to make Dataform work.
The first of them is the `dataform.json` file, which includes at least the following config.
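The original snippet isn’t reproduced here, but based on the description below it would look roughly like this:

```json
{
  "warehouse": "bigquery",
  "defaultDatabase": "kubectl-blogposts",
  "defaultSchema": "dataform",
  "defaultLocation": "US"
}
```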
This basically tells Dataform that we’re doing our transformations in BigQuery, in the project `kubectl-blogposts`. The default dataset is `dataform`, and we plan to create the resources in the `US` location.
Another file necessary for Dataform to work properly is the `package.json` generated by running `npm i @dataform/core`.
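A freshly generated one is minimal; the exact version below is an assumption and will depend on when you run the install:

```json
{
  "dependencies": {
    "@dataform/core": "^2.0.0"
  }
}
```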
Data data data
For the purpose of this article, I chose the publicly available IMDB dataset (you can find it inside the Analytics Hub of BigQuery and add it to your project). In this tutorial, we’ll create a view containing a user- and analytics-friendly table that exposes information about the movie, release year, directors, actors, average rating, and number of reviews.
We first start by defining the following sources:
- `definitions/sources/name_basics.sqlx`, containing a mapping from a person’s ID to the actual name
- `definitions/sources/title_basics.sqlx`, containing general information about the title
- `definitions/sources/title_crew.sqlx`, containing information about who directed the movie
- `definitions/sources/title_principals.sqlx`, containing actor information
- `definitions/sources/title_ratings.sqlx`, containing information about the rating
A typical source definition sqlx file contains:
- the `declaration` type
- `schema`, or dataset in BigQuery terms
- `name`, the name of the source table
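For example, `title_crew.sqlx` could look like the following; the `imdb` schema name is an assumption and depends on how you linked the Analytics Hub dataset into your project:

```sqlx
config {
  type: "declaration",
  schema: "imdb",
  name: "title_crew"
}
```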
The directors
The directors for each movie live in a comma-separated field inside the `title_crew` table. This gives us an opportunity to create a view, `definitions/reporting/directors.sqlx`, that we can later reference in the final product. When writing the SQL, it is now possible to reference the `title_crew` table by utilizing the `ref` syntax: `${ref("title_crew")}`. Dataform will then, under the hood, convert that reference to `project_id.dataset_name.table_name`. Our directors view will cross join and unnest the `directors` column from the `title_crew` table and provide us with a view that’s simpler to work with.
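A minimal sketch of what that view could look like; the column names (`tconst`, `directors`) come from the public IMDB schema:

```sqlx
config {
  type: "view"
}

SELECT
  tconst,
  TRIM(director_id) AS director_id
FROM ${ref("title_crew")}
-- Turn the comma-separated directors string into one row per director
CROSS JOIN UNNEST(SPLIT(directors, ",")) AS director_id
```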
Putting it all together
Since this article is not about explaining the SQL, the end result is that Dataform will create a `definitions/reporting/movies.sqlx` view based on the following code:
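The original snippet isn’t reproduced here, so treat the following as a sketch of how such a view might join the sources defined above. The column names come from the public IMDB schema, while the aggregation choices (`STRING_AGG`, filtering `title_principals` on `category = "actor"`) are my assumptions:

```sqlx
config {
  type: "view"
}

SELECT
  tb.primaryTitle AS movie,
  tb.startYear AS release_year,
  STRING_AGG(DISTINCT dn.primaryName, ", ") AS directors,
  STRING_AGG(DISTINCT an.primaryName, ", ") AS actors,
  ANY_VALUE(tr.averageRating) AS average_rating,
  ANY_VALUE(tr.numVotes) AS number_of_reviews
FROM ${ref("title_basics")} AS tb
-- Directors come from the unnested view created above, joined back to names
LEFT JOIN ${ref("directors")} AS d
  ON d.tconst = tb.tconst
LEFT JOIN ${ref("name_basics")} AS dn
  ON dn.nconst = d.director_id
-- Actors come from title_principals, joined back to names
LEFT JOIN ${ref("title_principals")} AS tp
  ON tp.tconst = tb.tconst AND tp.category = "actor"
LEFT JOIN ${ref("name_basics")} AS an
  ON an.nconst = tp.nconst
LEFT JOIN ${ref("title_ratings")} AS tr
  ON tr.tconst = tb.tconst
WHERE tb.titleType = "movie"
GROUP BY tb.tconst, tb.primaryTitle, tb.startYear
```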
Once the code from the `main` branch is pulled into the development workspace, feel free to execute all actions (which basically means that Dataform will create all resources, no matter which tags they have been assigned):
Inspecting the execution, you’ll notice that all `.sqlx` files have been processed to a destination. For each table/view/declaration it is possible to see the details and the generated SQL code.
Dataform was the tool that sparked my data engineering curiosity, and I have to admit that it’s great to see it become part of Google Cloud. While still a bit hidden and unpolished, Dataform, among other things, enables proper release pipelines for your data engineers.
Feel free to drop me a question if you have one! Link to the repo.
The original article was published on Medium.