Solved: Dataform yaml for dataset location

aaron_harkins

Hi All,

I'm new to Dataform and GCP but used dbt at my previous company. Unless I'm completely forgetting something, we structured each folder to have its own yaml file that determined which dataset (in GCP terms) each table would go to. Right now in Dataform, there's one json file that I'm not sure how to override. Even when I try to set the dataset manually in the config of the sqlx file, it won't let me.

Does anyone have any understanding of this or any documentation? I've had a hard time finding resources for Dataform, especially training videos.

Thanks

ms4446

Setting up the project structure and configuring datasets for your SQLX files works a bit differently in Dataform than in dbt. Much of the configuration lives in the dataform.json file at the root of your project, which is probably the JSON file you're running into.

Here's how you can manage this in Dataform:

Project Configuration (dataform.json): This file contains global settings for your Dataform project. It includes the default dataset (schema in dbt terms) where your tables will be created unless specified otherwise in the SQLX files. This looks something like:

 

{
  "warehouse": "bigquery",
  "defaultSchema": "your_default_dataset",
  "assertionSchema": "your_assertions_dataset",
  "dataformCoreVersion": "1.x.x"
}

Overriding Default Settings in SQLX Files: If you want to specify a different dataset for a particular table or view, you can set this in the SQLX file itself using the config block. Here’s how you might configure it:

 

config {
  type: "table",
  schema: "specific_dataset",
  description: "Description of what this model represents"
}

SELECT ...

In this block, schema corresponds to the dataset in BigQuery where this table/view will be created.
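The precedence described above (a schema in the config block wins, otherwise the project default applies) can be sketched in plain JavaScript. This only illustrates the fallback rule and is not Dataform's actual implementation; resolveDataset and both sample objects are hypothetical names made up for the example.

```javascript
// Illustration only: mimics the fallback rule, not Dataform's internals.
// projectConfig stands in for dataform.json, actionConfig for a SQLX config block.
function resolveDataset(projectConfig, actionConfig) {
  // A schema set in the config block wins; otherwise use the project default.
  return actionConfig.schema || projectConfig.defaultSchema;
}

const projectConfig = {
  warehouse: "bigquery",
  defaultSchema: "your_default_dataset",
};

// Table with an explicit schema in its config block.
console.log(resolveDataset(projectConfig, { type: "table", schema: "specific_dataset" }));

// Table without a schema falls back to defaultSchema.
console.log(resolveDataset(projectConfig, { type: "table" }));
```

So a table only lands in the default dataset when its config block leaves schema out entirely.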

Tips for Larger or More Complex Projects:

Consistent Naming Conventions: Consider using a consistent naming convention for your datasets (e.g., prefixing them with team or functional area) to help organize your data and configuration.
Environment-Specific Configuration: In Dataform on Google Cloud, release configurations and workspace compilation overrides let you change settings such as the default project, a schema suffix, or a table prefix per environment (development, staging, production), so you don't have to maintain separate dataform.json files.
Troubleshooting Configuration Issues: If configuration overrides aren't being respected, double-check the dataset names for typos and ensure that the config block is correctly placed before any SQL statements within the SQLX file.
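For the dbt-style setup from the original question (one dataset per folder), one option is to keep the folder-to-dataset mapping in a single JavaScript file and reference it from each config block. The sketch below assumes a file such as includes/datasets.js; the folder names, dataset names, and the datasetFor helper are all hypothetical, so adjust them to your project.

```javascript
// Hypothetical includes/datasets.js: one place to map folders to datasets.
// Folder and dataset names here are made up for illustration.
const DATASET_BY_FOLDER = {
  staging: "staging_dataset",
  marts: "reporting_dataset",
};

// Fallback for folders without an entry; keep it in sync with
// defaultSchema in dataform.json.
const DEFAULT_DATASET = "your_default_dataset";

function datasetFor(folder) {
  // Check own keys only, so inherited names like "toString" never match.
  return Object.prototype.hasOwnProperty.call(DATASET_BY_FOLDER, folder)
    ? DATASET_BY_FOLDER[folder]
    : DEFAULT_DATASET;
}

module.exports = { datasetFor };
```

Assuming Dataform exposes the file under its file name (datasets), a SQLX file in definitions/staging could then use schema: datasets.datasetFor("staging") in its config block instead of hard-coding the dataset. Note that each file still names its folder explicitly; this does not infer the folder automatically the way a per-folder yaml file in dbt would.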

Dataform's official documentation is a valuable resource: Google Cloud Dataform Documentation. This includes guides on setting up your development environment, writing and running transformations, and more. While external resources might be less plentiful compared to dbt, the official documentation provides a comprehensive starting point.

