Introduction to dbt: Installing and Running It on an EC2 Machine

What is dbt?

dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views. dbt does the T in ELT (Extract, Load, Transform) processes – it doesn't extract or load data, but it's extremely good at transforming data that's already loaded into your warehouse.

There are two main ways of working with dbt:

dbt Core is the software that takes a dbt project (.sql and .yml files) and a command and then creates tables/views in your warehouse. dbt Core includes a command line interface (CLI) so that users can execute dbt commands from a terminal program. dbt Core is open source and free to use.

dbt Cloud is an application that helps teams use dbt. dbt Cloud provides a web-based IDE to develop dbt projects, a purpose-built scheduler, and a way to share dbt documentation with your team. dbt Cloud offers a number of features for free, as well as additional features in paid tiers (see dbt's pricing page).

Database Connections

dbt connects to your data warehouse to run data transformation queries. As such, you'll need a data warehouse with source data loaded in it to use dbt. dbt natively supports connections to Snowflake, BigQuery, Redshift, and Postgres data warehouses.

Project:

A dbt project is a directory of .sql and .yml files, which dbt uses to transform your data. At a minimum, a dbt project must contain:

  • A project file: A dbt_project.yml file tells dbt that a particular directory is a dbt project, and also contains configurations for your project.
  • Models: A model is a single .sql file. Each model contains a single select statement that either transforms raw data into a dataset that is ready for analytics, or, more often, is an intermediate step in such a transformation. A project may also contain a number of other resources, such as snapshots, seeds, tests, macros, documentation, and sources.
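To make the project file concrete, here is a minimal dbt_project.yml sketch; the project and profile names below are illustrative, not from this demo:

```yaml
# dbt_project.yml — minimal example (names are illustrative)
name: my_dbt_project
version: '1.0.0'
profile: my_dbt_project  # must match a profile name in profiles.yml

model-paths: ["models"]

models:
  my_dbt_project:
    +materialized: view  # default materialization for all models in the project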

Developing locally with the Command Line Interface (CLI)

To use the CLI:

Prerequisite - Create the EC2 Machine

Steps to set up dbt Core:

(i) Install dbt on the Amazon EC2 machine. For this demo we install dbt-snowflake. To install dbt we need pip, so first install pip:

sudo apt update
sudo apt install python3-pip
Install dbt along with the adapter for your warehouse:

pip install \
  dbt-core \
  dbt-postgres \
  dbt-redshift \
  dbt-snowflake \
  dbt-bigquery

For this demo, only the Snowflake adapter is required:

pip install dbt-snowflake

Verify the dbt version:

dbt --version

Result:

[Image: DBTVersion]

(ii) Initialise the project with dbt init (dbt init [project-name]) and provide the connection details when prompted, or update them manually later in /root/.dbt/profiles.yml.

dbt init

Result:

[Image: DBTInit]

(iii) After the dbt init command, the project folder and its module directories are created; list them with:

ls -ltr

Result:

(iv) To verify the credential details, open the profiles.yml file:

cd /root/.dbt
cat profiles.yml

Result:

[Image: Credentials]
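For reference, a Snowflake profile in profiles.yml typically has the following shape. Every value below is a placeholder; the database and schema names simply match the ones used later in this demo:

```yaml
# profiles.yml — Snowflake target (all values are placeholders)
my_dbt_project:          # must match the `profile:` key in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <your_account_identifier>
      user: <your_user>
      password: <your_password>
      role: <your_role>
      warehouse: <your_warehouse>
      database: ANALYTICS
      schema: DBT
      threads: 4
```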

(v) Run dbt debug to verify the connectivity.

Command:

dbt debug

Result:

[Image: DBTDebug]

(vi) Verify that the dbt_project.yml file contains all the details of the project.

Command:

cat dbt_project.yml

Result:

[Image: dbt-project]

Model:

A model is a select statement. Models are defined in .sql files (typically in your models directory):

  • Each .sql file contains one model / select statement.
  • The name of the file is used as the model name.
  • Models can be nested in subdirectories within the models directory.

When you execute the dbt run command, dbt builds the model in your data warehouse by wrapping it in a create view as or create table as statement.
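To illustrate the wrapping, a model file named my_model.sql materialized as a view compiles to roughly the following statement (the database and schema names here are illustrative):

```sql
-- Roughly what dbt executes for a view materialization of my_model.sql
create or replace view ANALYTICS.DBT.MY_MODEL as (
    select 1 as id   -- the body is the model's select statement
);
```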

Materializations:

Materializations are strategies for persisting dbt models in a warehouse. There are four types of materializations built into dbt. They are:

  • table
  • view
  • incremental
  • ephemeral
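A materialization can be set per model with a config block at the top of the .sql file (or project-wide in dbt_project.yml). A minimal per-model sketch:

```sql
-- At the top of a model file; overrides the project-level default
{{ config(materialized='table') }}

select 1 as id
```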

(vii) Under the models directory, write the select statement and declare the materialization as view or table:

cat my_first_dbt_model.sql

{{ config(materialized='table') }}

with source_data as (

    select 1 as id, 'NJ' as state, '2020-02-01 00:01:00.000'::timestamp as updated_at
    union all
    select null as id, 'CT' as state, '2020-01-01 00:00:00.000'::timestamp as updated_at
    union all
    select 4 as id, 'VT' as state, '2020-01-01 00:00:00.000'::timestamp as updated_at

)

select *
from source_data
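The starter project also includes a second model that builds on the first through dbt's ref function, which is how dbt chains models together; a minimal version looks like this:

```sql
-- my_second_dbt_model.sql: selects from the first model via ref(),
-- so dbt runs it after my_first_dbt_model and resolves the table name
select *
from {{ ref('my_first_dbt_model') }}
where id = 1
```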

(viii) Run the dbt run command to execute the models.

Command:

dbt run

Result:

[Image: dbt-run]

(ix) Verify that the first and second models were created in Snowflake.

select * from "ANALYTICS"."DBT"."MY_FIRST_DBT_MODEL";

[Image: dbt-qryresult]

Summary

In this post, we discussed how to set up dbt on an EC2 instance, connect it to Snowflake, and implement a simple transformation by creating a new table in the database.

Use Case:

dbt is used to create tables or views from select queries. With dbt, data teams work directly within the warehouse to produce trusted datasets for reporting, ML modeling, and operational workflows.

If you enjoyed the article, please subscribe.

If you loved the article, please consider supporting me by buying me a coffee for $1.


This post is licensed under CC BY 4.0 by the author.