Quickstart for dbt Cloud and Amazon Athena
- 1 Introduction
- 2 Getting started
- 3 Configure Amazon Athena
- 4 Set up security access to Athena
- 5 Configure the connection in dbt Cloud
- 6 Set up a dbt Cloud managed repository
- 7 Initialize your dbt project and start developing
- 8 Build your first model
- 9 Change the way your model is materialized
- 10 Delete the example models
- 11 Build models on top of other models
- 12 Add tests to your models
- 13 Document your models
- 14 Commit your changes
- 15 Deploy dbt
Introduction
In this quickstart guide, you'll learn how to use dbt Cloud with Amazon Athena. It will show you how to:
- Create an S3 bucket for Athena query results.
- Create an Athena database.
- Access sample data in a public dataset.
- Connect dbt Cloud to Amazon Athena.
- Take a sample query and turn it into a model in your dbt project. A model in dbt is a select statement.
- Add tests to your models.
- Document your models.
- Schedule a job to run.
You can check out dbt Fundamentals for free if you're interested in course learning with videos.
Prerequisites
- You have a dbt Cloud account.
- You have an AWS account.
- You have set up Amazon Athena.
Related content
- Learn more with dbt Learn courses
- CI jobs
- Deploy jobs
- Job notifications
- Source freshness
Getting started
For the following guide you can use an existing S3 bucket or create a new one.
Download the following CSV files (the Jaffle Shop sample data) and upload them to your S3 bucket:
Configure Amazon Athena
- Log into your AWS account and navigate to the Athena console.
- If this is your first time in the Athena console (in your current AWS Region), click Explore the query editor to open the query editor. Otherwise, Athena opens automatically in the query editor.
- Open Settings and find the Location of query result box field.
  - Enter the path of the S3 bucket (prefix it with s3://).
  - Navigate to Browse S3, select the S3 bucket you created, and click Choose.
- Save these settings.
- In the query editor, create a database by running create database YOUR_DATABASE_NAME.
- To make the database you created the one you write into, select it from the Database list on the left side menu.
- Access the Jaffle Shop data in the S3 bucket using one of these options:
  - Manually create the tables.
  - Create a Glue crawler to recreate the data as external tables (recommended).
- Once the tables have been created, you will be able to SELECT from them.
Set up security access to Athena
To set up security access for Athena, determine which access method you want to use:
- Obtain aws_access_key_id and aws_secret_access_key (recommended).
- Obtain an AWS credentials file.
AWS access key (recommended)
To obtain your aws_access_key_id and aws_secret_access_key:
- Open the AWS Console.
- Click on your username near the top right and click Security Credentials.
- Click on Users in the sidebar.
- Click on your username (or the name of the user for whom to create the key).
- Click on the Security Credentials tab.
- Click Create Access Key.
- Click Show User Security Credentials and save the aws_access_key_id and aws_secret_access_key for a future step.
AWS credentials file
To obtain your AWS credentials file:
- Follow the instructions for configuring the credentials file using the AWS CLI.
- Locate the ~/.aws/credentials file on your computer:
  - Windows: %USERPROFILE%\.aws\credentials
  - Mac/Linux: ~/.aws/credentials
Retrieve the aws_access_key_id and aws_secret_access_key from the ~/.aws/credentials file for a future step.
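For reference, the credentials file is an INI-style file with one section per profile. The sketch below shows one way the two values could be read with Python's standard configparser module. It assumes a profile named default and parses an in-memory sample with placeholder values rather than your real file; the function name read_aws_keys is hypothetical.

```python
import configparser
from typing import Tuple

# Sample contents in the INI layout used by the AWS credentials file.
# The values are placeholders, not real credentials.
SAMPLE_CREDENTIALS = """\
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
"""


def read_aws_keys(text: str, profile: str = "default") -> Tuple[str, str]:
    """Return (aws_access_key_id, aws_secret_access_key) for one profile."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser[profile]  # raises KeyError if the profile is missing
    return section["aws_access_key_id"], section["aws_secret_access_key"]


key_id, secret = read_aws_keys(SAMPLE_CREDENTIALS)
print(key_id)
```

To read your own file instead, pass the text of ~/.aws/credentials to the function. Treat both values as secrets and avoid printing or committing them.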
Configure the connection in dbt Cloud
To configure the Athena connection in dbt Cloud:
- Click your account name on the left-side menu and click Account settings.
- Click Connections and click New connection.
- Click Athena and fill out the required fields (and any optional fields).
- AWS region name — The AWS region of your environment.
- Database (catalog) — Enter the database name created in earlier steps (lowercase only).
- AWS S3 staging directory — Enter the S3 bucket created in earlier steps.
- Click Save.
Configure your environment
To configure the Athena credentials in your environment:
- Click Deploy on the left-side menu and click Environments.
- Click Create environment and fill out the General settings.
  - Your dbt version must be set to Versionless to use the Athena connection.
- Select the Athena connection from the Connection dropdown.
- Fill out the aws_access_key and aws_access_id recorded in previous steps, as well as the Schema to write to.
- Click Test connection and once it succeeds, Save the environment.
Repeat the process to create a development environment.
Set up a dbt Cloud managed repository
When you develop in dbt Cloud, you can leverage Git to version control your code.
To connect to a repository, you can either set up a dbt Cloud-hosted managed repository or directly connect to a supported git provider. Managed repositories are a great way to trial dbt without needing to create a new repository. In the long run, it's better to connect to a supported git provider to use features like automation and continuous integration.
To set up a managed repository:
- Under "Setup a repository", select Managed.
- Type a name for your repo, such as bbaggins-dbt-quickstart.
- Click Create. It will take a few seconds for your repository to be created and imported.
- Once you see the "Successfully imported repository" message, click Continue.
Initialize your dbt project and start developing
Now that you have a repository configured, you can initialize your project and start development in dbt Cloud:
- Click Start developing in the IDE. It might take a few minutes for your project to spin up for the first time as it establishes your git connection, clones your repo, and tests the connection to the warehouse.
- Above the file tree to the left, click Initialize dbt project. This builds out your folder structure with example models.
- Make your initial commit by clicking Commit and sync. Use the commit message initial commit and click Commit. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code.
- You can now directly query data from your warehouse and execute dbt run. You can try this out now:
  - Click + Create new file, add this query to the new file, and click Save as to save the new file:
      select * from jaffle_shop.customers
  - In the command line bar at the bottom, enter dbt run and click Enter. You should see a dbt run succeeded message.
Build your first model
You have two options for working with files in the dbt Cloud IDE:
- Create a new branch (recommended): Create a new branch to edit and commit your changes. Navigate to Version Control on the left sidebar and click Create branch.
- Edit in the protected primary branch: Choose this option if you prefer to edit, format, or lint files and execute dbt commands directly in your primary git branch. The dbt Cloud IDE prevents commits to the protected branch, so you will be prompted to commit your changes to a new branch.
Name the new branch add-customers-model.
- Click the ... next to the models directory, then select Create file.
- Name the file customers.sql, then click Create.
- Copy the following query into the file and click Save.
    with customers as (
        select
            id as customer_id,
            first_name,
            last_name
        from jaffle_shop.customers
    ),
    orders as (
        select
            id as order_id,
            user_id as customer_id,
            order_date,
            status
        from jaffle_shop.orders
    ),
    customer_orders as (
        select
            customer_id,
            min(order_date) as first_order_date,
            max(order_date) as most_recent_order_date,
            count(order_id) as number_of_orders
        from orders
        group by 1
    ),
    final as (
        select
            customers.customer_id,
            customers.first_name,
            customers.last_name,
            customer_orders.first_order_date,
            customer_orders.most_recent_order_date,
            coalesce(customer_orders.number_of_orders, 0) as number_of_orders
        from customers
        left join customer_orders using (customer_id)
    )
    select * from final
- Enter dbt run in the command prompt at the bottom of the screen. You should get a successful run and see the three models.
Later, you can connect your business intelligence (BI) tools to these views and tables so they only read cleaned up data rather than raw data in your BI tool.
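If you want to see what the customers query returns before running it against Athena, the following is a minimal local sketch. It uses Python's built-in sqlite3 module as a stand-in for the warehouse, which is an assumption: SQLite is not Athena, and the three customers and three orders are made-up rows, not the Jaffle Shop data. An in-memory database is attached under the name jaffle_shop so that the query text can stay the same as in the model.

```python
import sqlite3

# In-memory SQLite database attached as "jaffle_shop" so that the
# schema-qualified table names in the model resolve unchanged.
conn = sqlite3.connect(":memory:")
conn.execute("attach database ':memory:' as jaffle_shop")
conn.executescript("""
create table jaffle_shop.customers (id integer, first_name text, last_name text);
create table jaffle_shop.orders (id integer, user_id integer, order_date text, status text);
-- Made-up sample rows; customer 3 has no orders.
insert into jaffle_shop.customers values (1, 'Ada', 'L.'), (2, 'Grace', 'H.'), (3, 'Alan', 'T.');
insert into jaffle_shop.orders values
    (1, 1, '2018-01-01', 'returned'),
    (2, 1, '2018-02-10', 'completed'),
    (3, 2, '2018-01-11', 'completed');
""")

# The query from models/customers.sql.
CUSTOMERS_SQL = """
with customers as (
    select id as customer_id, first_name, last_name
    from jaffle_shop.customers
),
orders as (
    select id as order_id, user_id as customer_id, order_date, status
    from jaffle_shop.orders
),
customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1
),
final as (
    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders
    from customers
    left join customer_orders using (customer_id)
)
select * from final
"""

# Sort in Python because the model itself does not define an ordering.
rows = sorted(conn.execute(CUSTOMERS_SQL).fetchall())
for row in rows:
    print(row)
```

The customer without orders still appears in the result because of the left join, with empty order dates and a number_of_orders of 0 from the coalesce.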
FAQs
To check out the SQL that dbt is running, you can look in:
- dbt Cloud:
  - Within the run output, click on a model name, and then select "Details"
- dbt Core:
  - The target/compiled/ directory for compiled select statements
  - The target/run/ directory for compiled create statements
  - The logs/dbt.log file for verbose logging.
By default, dbt builds models in your target schema. To change your target schema:
- If you're developing in dbt Cloud, these are set for each user when you first use a development environment.
- If you're developing with dbt Core, this is the schema: parameter in your profiles.yml file.
If you wish to split your models across multiple schemas, check out the docs on using custom schemas.
Note: on BigQuery, dataset is used interchangeably with schema.
Nope! dbt will check if the schema exists when it runs. If the schema does not exist, dbt will create it for you.
Nope! The SQL that dbt generates behind the scenes ensures that any relations are replaced atomically (i.e. your business users won't experience any downtime).
The implementation of this varies by warehouse; check out the logs to see the SQL dbt is executing.
If there's a mistake in your SQL, dbt will return the error that your database returns.
$ dbt run --select customers
Running with dbt=0.15.0
Found 3 models, 9 tests, 0 snapshots, 0 analyses, 133 macros, 0 operations, 0 seed files, 0 sources
14:04:12 | Concurrency: 1 threads (target='dev')
14:04:12 |
14:04:12 | 1 of 1 START view model dbt_alice.customers.......................... [RUN]
14:04:13 | 1 of 1 ERROR creating view model dbt_alice.customers................. [ERROR in 0.81s]
14:04:13 |
14:04:13 | Finished running 1 view model in 1.68s.
Completed with 1 error and 0 warnings:
Database Error in model customers (models/customers.sql)
Syntax error: Expected ")" but got identifier `your-info-12345` at [13:15]
compiled SQL at target/run/jaffle_shop/customers.sql
Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
Any models downstream of this model will also be skipped. Use the error message and the compiled SQL to debug any errors.
Change the way your model is materialized
One of the most powerful features of dbt is that you can change the way a model is materialized in your warehouse, simply by changing a configuration value. You can change things between tables and views by changing a keyword rather than writing the data definition language (DDL) to do this behind the scenes.
By default, everything gets created as a view. You can override that at the directory level so everything in that directory will materialize to a different materialization.
- Edit your dbt_project.yml file.
  - Update your project name to:
    File: dbt_project.yml
      name: 'jaffle_shop'
  - Configure jaffle_shop so everything in it will be materialized as a table; and configure example so everything in it will be materialized as a view. Update your models config block to:
    File: dbt_project.yml
      models:
        jaffle_shop:
          +materialized: table
          example:
            +materialized: view
  - Click Save.
- Enter the dbt run command. Your customers model should now be built as a table!
  Info: To do this, dbt had to first run a drop view statement (or API call on BigQuery), then a create table as statement.
- Edit models/customers.sql to override the dbt_project.yml for the customers model only by adding the following snippet to the top, and click Save:
  File: models/customers.sql
    {{
      config(
        materialized='view'
      )
    }}
    with customers as (
        select
            id as customer_id
            ...
    )
- Enter the dbt run command. Your model, customers, should now build as a view.
  - BigQuery users need to run dbt run --full-refresh instead of dbt run to fully apply materialization changes.
- Enter the dbt run --full-refresh command for this to take effect in your warehouse.
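The difference between the two materializations used in the steps above shows up once new source data arrives. Here is a minimal sketch, again using Python's sqlite3 module as a local stand-in for the warehouse (an assumption; the statements dbt generates for Athena differ in detail). The relation names raw_customers, customers_view, and customers_table are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table raw_customers (id integer, first_name text);
insert into raw_customers values (1, 'Ada'), (2, 'Grace');

-- Roughly what a view materialization amounts to: a stored query.
create view customers_view as
    select id as customer_id, first_name from raw_customers;

-- Roughly what a table materialization amounts to: stored rows.
create table customers_table as
    select id as customer_id, first_name from raw_customers;

-- New source data arrives after both relations were built.
insert into raw_customers values (3, 'Alan');
""")

view_count = conn.execute("select count(*) from customers_view").fetchone()[0]
table_count = conn.execute("select count(*) from customers_table").fetchone()[0]
print(view_count, table_count)  # 3 2
```

The view reflects the new row because its query runs each time it is read, while the table keeps the rows from when it was built until it is rebuilt.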
FAQs
dbt ships with five materializations: view, table, incremental, ephemeral and materialized_view.
Check out the documentation on materializations for more information on each of these options.
You can also create your own custom materializations if required; however, this is an advanced feature of dbt.
Start out with views, and then change models to tables when required for performance reasons (i.e. downstream queries have slowed).
Check out the docs on materializations for advice on when to use each materialization.
You can also configure:
- tags to support easy categorization and graph selection
- custom schemas to split your models across multiple schemas
- aliases if your view/table name should differ from the filename
- Snippets of SQL to run at the start or end of a model, known as hooks
- Warehouse-specific configurations for performance (e.g. sort and dist keys on Redshift, partitions on BigQuery)
Check out the docs on model configurations to learn more.
Delete the example models
You can now delete the files that dbt created when you initialized the project:
- Delete the models/example/ directory.
- Delete the example: key from your dbt_project.yml file, and any configurations that are listed under it.
  File: dbt_project.yml
    # before
    models:
      jaffle_shop:
        +materialized: table
        example:
          +materialized: view
  File: dbt_project.yml
    # after
    models:
      jaffle_shop:
        +materialized: table
- Save your changes.
FAQs
If you delete a model from your dbt project, dbt does not automatically drop the relation from your schema. This means that you can end up with extra objects in schemas that dbt creates, which can be confusing to other users.
(This can also happen when you switch a model from being a view or table to ephemeral.)
When you remove models from your dbt project, you should manually drop the related relations from your schema.
You might have forgotten to nest your configurations under your project name, or you might be trying to apply configurations to a directory that doesn't exist.
Check out this article to understand more.
Build models on top of other models
As a best practice in SQL, you should separate logic that cleans up your data from logic that transforms your data. You have already started doing this in the existing query by using common table expressions (CTEs).
Now you can experiment by separating the logic out into separate models and using the ref function to build models on top of other models:
- Create a new SQL file, models/stg_customers.sql, with the SQL from the customers CTE in our original query.
- Create a second new SQL file, models/stg_orders.sql, with the SQL from the orders CTE in our original query.
  File: models/stg_customers.sql
    select
        id as customer_id,
        first_name,
        last_name
    from jaffle_shop.customers
  File: models/stg_orders.sql
    select
        id as order_id,
        user_id as customer_id,
        order_date,
        status
    from jaffle_shop.orders
- Edit the SQL in your models/customers.sql file as follows:
  File: models/customers.sql
    with customers as (
        select * from {{ ref('stg_customers') }}
    ),
    orders as (
        select * from {{ ref('stg_orders') }}
    ),
    customer_orders as (
        select
            customer_id,
            min(order_date) as first_order_date,
            max(order_date) as most_recent_order_date,
            count(order_id) as number_of_orders
        from orders
        group by 1
    ),
    final as (
        select
            customers.customer_id,
            customers.first_name,
            customers.last_name,
            customer_orders.first_order_date,
            customer_orders.most_recent_order_date,
            coalesce(customer_orders.number_of_orders, 0) as number_of_orders
        from customers
        left join customer_orders using (customer_id)
    )
    select * from final
- Execute dbt run.
  This time, when you performed a dbt run, separate views/tables were created for stg_customers, stg_orders and customers. dbt inferred the order to run these models. Because customers depends on stg_customers and stg_orders, dbt builds customers last. You do not need to explicitly define these dependencies.
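The idea of inferring a build order from ref calls can be sketched as a topological sort over a dependency graph. The example below is only an illustration and not how dbt is implemented: it finds ref calls with a simple regular expression rather than rendering Jinja, the model SQL is abbreviated, and it relies on graphlib from the Python 3.9+ standard library. The names REF_PATTERN and build_order are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Model name -> SQL, abbreviated from the three files in this step.
models = {
    "customers": (
        "select * from {{ ref('stg_customers') }} "
        "left join {{ ref('stg_orders') }} using (customer_id)"
    ),
    "stg_customers": "select id as customer_id from jaffle_shop.customers",
    "stg_orders": "select id as order_id, user_id as customer_id from jaffle_shop.orders",
}

# Matches {{ ref('model_name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def build_order(models):
    """Return model names ordered so each model comes after the models it refs."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(models)
print(order)  # the two staging models come first, customers comes last
```

The relative order of the two staging models is not significant, since neither depends on the other.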
FAQs
To run one model, use the --select flag (or -s flag), followed by the name of the model:
$ dbt run --select customers
Check out the model selection syntax documentation for more operators and examples.
Within one project: yes! To build dependencies between resources (such as models, seeds, and snapshots), you need to use the ref function, and pass in the resource name as an argument. dbt uses that resource name to uniquely resolve the ref to a specific resource. As a result, these resource names need to be unique, even if they are in distinct folders.
A resource in one project can have the same name as a resource in another project (installed as a dependency). dbt uses the project name to uniquely identify each resource. We call this "namespacing." If you ref a resource with a duplicated name, it will resolve to the resource within the same namespace (package or project), or raise an error because of an ambiguous reference. Use two-argument ref to disambiguate references by specifying the namespace.
Those resources will still need to land in distinct locations in the data warehouse. Read the docs on custom aliases and custom schemas for details on how to achieve this.
There's no one best way to structure a project! Every organization is unique.
If you're just getting started, check out how we (dbt Labs) structure our dbt projects.
Add tests to your models
Adding tests to a project helps validate that your models are working correctly.
To add tests to your project:
- Create a new YAML file in the models directory, named models/schema.yml.
- Add the following contents to the file:
  File: models/schema.yml
    version: 2
    models:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
      - name: stg_customers
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
      - name: stg_orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
          - name: customer_id
            tests:
              - not_null
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id
- Run dbt test, and confirm that all your tests passed.
When you run dbt test, dbt iterates through your YAML files, and constructs a query for each test. Each query will return the number of records that fail the test. If this number is 0, then the test is successful.
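To make the "count the failing records" idea concrete, here is a minimal sketch using Python's sqlite3 module with made-up rows. The queries are simplified illustrations of what unique, not_null, and accepted_values checks look for; they are not the exact SQL that dbt compiles, and the check names in the dictionary are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table stg_orders (order_id integer, customer_id integer, status text);
-- Made-up rows that deliberately break some assumptions.
insert into stg_orders values
    (1, 10, 'placed'),
    (2, 11, 'shipped'),
    (2, 12, 'completed'),  -- duplicate order_id
    (3, null, 'lost');     -- null customer_id and unexpected status
""")

# Each query selects the records that violate one assumption.
FAILING_ROWS = {
    "unique_order_id": """
        select order_id from stg_orders
        where order_id is not null
        group by order_id
        having count(*) > 1
    """,
    "not_null_order_id": "select * from stg_orders where order_id is null",
    "not_null_customer_id": "select * from stg_orders where customer_id is null",
    "accepted_values_status": """
        select * from stg_orders
        where status not in
            ('placed', 'shipped', 'completed', 'return_pending', 'returned')
    """,
}

# A check passes when its query returns zero records.
failures = {name: len(conn.execute(sql).fetchall()) for name, sql in FAILING_ROWS.items()}
for name, count in failures.items():
    print(name, "PASS" if count == 0 else f"FAIL ({count})")
```

With this sample data only not_null_order_id passes; each of the other three checks finds one failing record.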
FAQs
Out of the box, dbt ships with the following tests:
- unique
- not_null
- accepted_values
- relationships (i.e. referential integrity)
You can also write your own custom schema data tests.
Some additional custom schema tests have been open-sourced in the dbt-utils package; check out the docs on packages to learn how to make these tests available in your project.
Note that although you can't document data tests as of yet, we recommend checking out this dbt Core discussion where the dbt community shares ideas.
Running tests on one model looks very similar to running a model: use the --select flag (or -s flag), followed by the name of the model:
dbt test --select customers
Check out the model selection syntax documentation for full syntax, and test selection examples in particular.
To debug a failing test, find the SQL that dbt ran by:
- dbt Cloud:
  - Within the test output, click on the failed test, and then select "Details"
- dbt Core:
  - Open the file path returned as part of the error message.
  - Navigate to the target/compiled/schema_tests directory for all compiled test queries
Copy the SQL into a query editor (in dbt Cloud, you can paste it into a new Statement), and run the query to find the records that failed.
No! You can name this file whatever you want (including whatever_you_want.yml), so long as:
- The file is in your models/ directory¹
- The file has .yml extension
Check out the docs for more information.
¹If you're declaring properties for seeds, snapshots, or macros, you can also place this file in the related directory: seeds/, snapshots/ and macros/ respectively.
Once upon a time, the structure of these .yml files was very different (s/o to anyone who was using dbt back then!). Adding version: 2 allowed us to make this structure more extensible.
Resource yml files do not currently require this config. We only support version: 2 if it's specified. Although we do not expect to update yml files to version: 3 soon, having this config will make it easier for us to introduce new structures in the future.
We recommend that every model has a test on a primary key, that is, a column that is unique and not_null.
We also recommend that you test any assumptions on your source data. For example, if you believe that your payments can only be one of three payment methods, you should test that assumption regularly — a new payment method may introduce logic errors in your SQL.
In advanced dbt projects, we recommend using sources and running these source data-integrity tests against the sources rather than models.
You should run your tests whenever you are writing new code (to ensure you haven't broken any existing models by changing SQL), and whenever you run your transformations in production (to ensure that your assumptions about your source data are still valid).
Document your models
Adding documentation to your project allows you to describe your models in rich detail, and share that information with your team. Here, we're going to add some basic documentation to our project.
- Update your models/schema.yml file to include some descriptions, such as those below.
  File: models/schema.yml
    version: 2
    models:
      - name: customers
        description: One record per customer
        columns:
          - name: customer_id
            description: Primary key
            tests:
              - unique
              - not_null
          - name: first_order_date
            description: NULL when a customer has not yet placed an order.
      - name: stg_customers
        description: This model cleans up customer data
        columns:
          - name: customer_id
            description: Primary key
            tests:
              - unique
              - not_null
      - name: stg_orders
        description: This model cleans up order data
        columns:
          - name: order_id
            description: Primary key
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
          - name: customer_id
            tests:
              - not_null
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id
- Run dbt docs generate to generate the documentation for your project. dbt introspects your project and your warehouse to generate a JSON file with rich documentation about your project.
- Click the book icon in the Develop interface to launch documentation in a new tab.
FAQs
If you need more than a sentence to explain a model, you can:
- Split your description over multiple lines using >. Interior line breaks are removed and Markdown can be used. This method is recommended for simple, single-paragraph descriptions:
    version: 2
    models:
      - name: customers
        description: >
          Lorem ipsum **dolor** sit amet, consectetur adipisicing elit, sed do eiusmod
          tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
          quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
          consequat.
- Split your description over multiple lines using |. Interior line breaks are maintained and Markdown can be used. This method is recommended for more complex descriptions:
    version: 2
    models:
      - name: customers
        description: |
          ### Lorem ipsum
          * dolor sit amet, consectetur adipisicing elit, sed do eiusmod
          * tempor incididunt ut labore et dolore magna aliqua.
- Use a docs block to write the description in a separate Markdown file.
If you're using dbt Cloud to deploy your project and have the Team or Enterprise plan, you can use dbt Explorer to view your project's resources (such as models, tests, and metrics) and their lineage to gain a better understanding of its latest production state.
Access dbt Explorer in dbt Cloud by clicking the Explore link in the navigation. You can have up to 5 read-only users access the documentation for your project.
dbt Cloud developer plan and dbt Core users can use dbt Docs, which generates basic documentation but doesn't offer the same speed, metadata, or visibility as dbt Explorer.
Commit your changes
Now that you've built your customer model, you need to commit the changes you made to the project so that the repository has your latest code.
If you edited directly in the protected primary branch:
- Click the Commit and sync git button. This action prepares your changes for commit.
- A modal titled Commit to a new branch will appear.
- In the modal window, name your new branch add-customers-model. This branches off from your primary branch with your new changes.
- Add a commit message, such as "Add customers model, tests, docs" and commit your changes.
- Click Merge this branch to main to add these changes to the main branch on your repo.
If you created a new branch before editing:
- Since you already branched out of the primary protected branch, go to Version Control on the left.
- Click Commit and sync to add a message.
- Add a commit message, such as "Add customers model, tests, docs."
- Click Merge this branch to main to add these changes to the main branch on your repo.
Deploy dbt
Use dbt Cloud's Scheduler to deploy your production jobs confidently and build observability into your processes. You'll learn to create a deployment environment and run a job in the following steps.
Create a deployment environment
- In the upper left, select Deploy, then click Environments.
- Click Create Environment.
- In the Name field, write the name of your deployment environment. For example, "Production."
- In the dbt Version field, select the latest version from the dropdown.
- Under Deployment connection, enter the name of the dataset you want to use as the target, such as "Analytics". This will allow dbt to build and work with that dataset. For some data warehouses, the target dataset may be referred to as a "schema".
- Click Save.
Create and run a job
Jobs are a set of dbt commands that you want to run on a schedule. For example, dbt build.
As the jaffle_shop business gains more customers, and those customers create more orders, you will see more records added to your source data. Because you materialized the customers model as a table, you'll need to periodically rebuild your table to ensure that the data stays up-to-date. This update will happen when you run a job.
- After creating your deployment environment, you should be directed to the page for a new environment. If not, select Deploy in the upper left, then click Jobs.
- Click Create one and provide a name, for example, "Production run", and link to the Environment you just created.
- Scroll down to the Execution Settings section.
- Under Commands, add this command as part of your job if you don't see it: dbt build
- Select the Generate docs on run checkbox to automatically generate updated project docs each time your job runs.
- For this exercise, do not set a schedule for your project to run — while your organization's project should run regularly, there's no need to run this example project on a schedule. Scheduling a job is sometimes referred to as deploying a project.
- Select Save, then click Run now to run your job.
- Click the run and watch its progress under "Run history."
- Once the run is complete, click View Documentation to see the docs for your project.
Congratulations 🎉! You've just deployed your first dbt project!
FAQs
If you're using dbt Cloud, we recommend setting up email and Slack notifications (Account Settings > Notifications) for any failed runs. Then, debug these runs the same way you would debug any runs in development.