
How to: Dataform in Google Cloud - Part 1

By Adrian Trzeciak. Oct 20, 2022


I had a chance to use Dataform (the open-source CLI version) about a year ago while building an analytics solution for a retailer. At that point in time I ended up wrapping the most important functionality as a REST API in Go and used Cloud Scheduler to invoke the necessary transformations based on tags. Every part of that solution was deployed on GCP with Terraform. As it turns out, Google has chosen almost the same approach, with Cloud Workflows as the wrapper around Cloud Scheduler.

It seems like Google quietly released a preview version of Dataform during Next '22. Prior to that, Dataform was available either as the open-source CLI or, if you were early to the party, as a web application with beautiful dependency graphs and such.


Why Dataform?

Why should you consider using Dataform? The price of cloud storage has dropped significantly over the years, making raw data affordable to keep. This has driven a transition from good old ETL (where transformation happens before the load) to brand-new ELT (where transformations are applied after the data has been loaded). Data transformations play a critical role in many data-driven organizations. Dataform provides the following functionality, which I suggest you take a look at:

  • Everything from table and view definitions to transformation SQL is stored in source control
  • Transformations can be validated with assertions
  • Incremental tables ensure that only the rows added since the last transformation are processed; this requires proper partitioning of the source and destination tables to minimize cost (see the sketch after this list)
  • JS snippets can be used to reuse pieces of code that you don't want to copy-and-paste across transformations
  • Environments can be separated using variables (even though this doesn't yet seem to be supported in the console Preview)
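To illustrate the incremental pattern, here is a minimal sketch of an incremental table definition. The source table events_raw and the ingested_at column are hypothetical; Dataform's when(), incremental(), and self() built-ins do the heavy lifting:

```sqlx
config {
  type: "incremental",
  bigquery: {
    partitionBy: "DATE(ingested_at)" // partition to keep incremental scans cheap
  }
}

SELECT *
FROM ${ref("events_raw")} -- hypothetical source declaration
-- On incremental runs, only process rows newer than what's already in the table
${when(incremental(), `WHERE ingested_at > (SELECT MAX(ingested_at) FROM ${self()})`)}
```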

Folder structure

Since a Dataform project can quickly grow in size, my suggestion is to start off with a proper folder structure:

  • The definitions folder is where you store the sqlx files. It should at least be split into reporting, where the tables and views created by Dataform are defined, and sources, where you reference the BigQuery datasets and tables you get the data from.
  • The includes folder is where you store JavaScript files with reusable SQL blocks. In my experience, you often end up doing similar transformations for different purposes.
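Such a layout might look like this (the file names are the ones used later in this article):

```
├── dataform.json
├── package.json
├── definitions
│   ├── reporting
│   │   ├── directors.sqlx
│   │   └── movies.sqlx
│   └── sources
│       ├── name_basics.sqlx
│       ├── title_basics.sqlx
│       ├── title_crew.sqlx
│       ├── title_principals.sqlx
│       └── title_ratings.sqlx
└── includes
```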

Google Cloud set-up

Once you have your Git repo with a proper folder structure, some defined sources, and tables or views to be created, you'll have to do the following (a sketch of the equivalent gcloud commands follows this list):

  • Enable dataform.googleapis.com in the project responsible for running Dataform compilations.
  • Enabling that API provides you with a service account in the format service-{PROJECT_NUMBER}@gcp-sa-dataform.iam.gserviceaccount.com, which performs the operations. Make sure to grant this service account the permissions necessary to create and read the source tables inside the corresponding datasets.
  • To enable Git integration, a personal access token is required, stored in Secret Manager. Make sure the above-mentioned service account has the permissions necessary to access the secret version.
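A minimal sketch of those steps with gcloud; PROJECT_ID, PROJECT_NUMBER, and the secret name github-token are placeholders, and the broad role used here is an assumption you should narrow to dataset-level grants in production:

```sh
# Enable the Dataform API
gcloud services enable dataform.googleapis.com --project=PROJECT_ID

# Let the Dataform service agent create and read BigQuery tables
# (roles/bigquery.dataEditor is an assumption; scope it down where possible)
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

# Allow the service agent to read the Git personal access token
gcloud secrets add-iam-policy-binding github-token \
  --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com" \
  --role="roles/secretmanager.secretVersionAccessor" \
  --project=PROJECT_ID
```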

The Dataform dashboard should look like the following once connected to a Git repository:

[Screenshot: Dataform repository overview]

Clicking the name of the repository brings up the option to create a development workspace. Here's the definition of a development workspace:

A development workspace is an editable copy of the files within a Git repository. Development workspaces let you follow a Git-based development process inside Dataform. Edits you make to files in a development workspace start off as uncommitted changes that you can commit, and then push.

In our case, since all the code is already in GitHub, we’ll just pull the contents of the main branch into our newly created workspace.

Dataform set-up

Two files need to be present in the repo in order to make Dataform work.

The first of them is the dataform.json file, including at least the following config:
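A minimal sketch, reconstructed from the values described below (the field names follow the Dataform core configuration schema):

```json
{
  "warehouse": "bigquery",
  "defaultDatabase": "kubectl-blogposts",
  "defaultSchema": "dataform",
  "defaultLocation": "US"
}
```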

This basically tells Dataform that we're doing our transformations in BigQuery, in the project kubectl-blogposts. The default dataset is dataform, and we plan to create the resources in the US location.

The other file necessary for Dataform to work properly is package.json, generated by running npm i @dataform/core.
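The generated file looks roughly like this (the exact version depends on when you run the command):

```json
{
  "dependencies": {
    "@dataform/core": "^2.0.0"
  }
}
```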

Data data data

For the purpose of this article, I chose the publicly available IMDB dataset (you can find it inside the Analytics Hub of BigQuery and add it to your project). In this tutorial, we'll create a view containing a user- and analytics-friendly table with information about each movie: release year, directors, actors, average rating, and number of reviews.

We start by defining the following sources:

  • definitions/sources/name_basics.sqlx containing a mapping from a person's ID to their actual name
  • definitions/sources/title_basics.sqlx containing general information about the title
  • definitions/sources/title_crew.sqlx containing information about who directed the movie
  • definitions/sources/title_principals.sqlx containing actor information
  • definitions/sources/title_ratings.sqlx containing information about the rating

A typical definition sqlx-file contains:

  • the declaration type
  • the schema (the dataset, in BigQuery terms)
  • the name of the source table
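For example, definitions/sources/title_crew.sqlx could look like this (the database and schema values are assumptions based on the public IMDB dataset; adjust them to wherever you linked the data via Analytics Hub):

```sqlx
config {
  type: "declaration",
  database: "bigquery-public-data", // assumption: the public IMDB dataset project
  schema: "imdb",                   // assumption: the linked dataset name
  name: "title_crew"
}
```
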
The directors

The directors for each movie live in a comma-separated field inside the title_crew table. This gives us an opportunity to create a view, definitions/reporting/directors.sqlx, that we can later reference in the final product. When writing the SQL, it's now possible to reference the title_crew table by using the ref syntax: ${ref('title_crew')}.

Under the hood, Dataform converts that reference to project_id.dataset_name.table_name. Our directors view will cross join and unnest the directors column from the title_crew table, providing us with a view that's simpler to work with.
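A sketch of what that view can look like; the column names tconst and directors follow IMDB's published schema, and the author's exact SQL is in the repo linked at the end:

```sqlx
config {
  type: "view"
}

SELECT
  tconst,   -- title identifier
  director  -- one row per (title, director) pair
FROM ${ref("title_crew")}
CROSS JOIN UNNEST(SPLIT(directors, ",")) AS director
```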

Putting it all together

Since this article is not about explaining the SQL: as the end result, Dataform will create a definitions/reporting/movies.sqlx view based on the following code:
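The original snippet isn't reproduced here, so below is a sketch of what such a view could look like, joining the sources defined above. Column names follow IMDB's published schema and may differ slightly in the linked dataset; see the repo at the end for the author's actual code:

```sqlx
config {
  type: "view"
}

SELECT
  tb.primaryTitle                           AS movie,
  tb.startYear                              AS release_year,
  STRING_AGG(DISTINCT nd.primaryName, ", ") AS directors,
  STRING_AGG(DISTINCT na.primaryName, ", ") AS actors,
  tr.averageRating                          AS average_rating,
  tr.numVotes                               AS num_reviews
FROM ${ref("title_basics")} tb
JOIN ${ref("directors")} d         ON d.tconst = tb.tconst
JOIN ${ref("name_basics")} nd      ON nd.nconst = d.director
JOIN ${ref("title_principals")} tp ON tp.tconst = tb.tconst
                                  AND tp.category IN ("actor", "actress")
JOIN ${ref("name_basics")} na      ON na.nconst = tp.nconst
JOIN ${ref("title_ratings")} tr    ON tr.tconst = tb.tconst
WHERE tb.titleType = "movie"
GROUP BY movie, release_year, average_rating, num_reviews
```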

Once the code from the main branch is pulled into the development workspace, feel free to execute all actions (which basically means that Dataform will create all resources, no matter which tags they have been assigned):

[Screenshot: executing all actions in the development workspace]

Inspecting the execution, you'll notice that all .sqlx files have been processed to a destination. For each table, view, or declaration, it is possible to see the details and the generated SQL code.

[Screenshot: successful workflow execution]

Dataform was the tool that sparked my data engineering curiosity, and I have to admit that it's great to see it become part of Google Cloud. While still a bit hidden and unpolished, Dataform enables, among other things, proper release pipelines for your data engineers.

Feel free to drop me a question if you have one! Link to the repo.


The original article was published on Medium.
