Raw data should be consolidated from multiple sources and transformed for better organization and insights. Transformed data is easier to consume for humans and machines alike. However, data pipelines built using ETL processes have been known to present businesses with unique challenges. In fact, a small defect in one step of the ETL process can result in hours of manual intervention, degrade data quality, shake your customers' confidence, and make maintenance difficult.
Airbyte was developed to address the problem of moving data from applications, databases, and APIs to data warehouses and data lakes.
Behind the scenes, Data Build Tool (or dbt) is in charge of transforming your data.
What is Airbyte?
Airbyte is an open-source data pipeline platform that performs data integration. It syncs data from different applications, APIs, and databases into data warehouses and data lakes. We can build and maintain custom connectors with Airbyte to support additional applications, refresh them, and schedule updates. An increasing number of companies are turning to Airbyte because existing ETL platforms are becoming difficult to maintain, and their volume-based pricing models drive costs up. Other concerns with existing ETL platforms include security and a lack of visibility into the ETL systems. The need of the hour is an open-source model that increases efficiency across the board.
What is dbt?
dbt (data build tool) helps data analytics engineers transform data in their warehouses by simply writing select statements. dbt turns these select statements into tables and views. However, dbt doesn’t perform any extraction or loading (the E and L in ELT); it is only responsible for transformation.
A dbt project is a directory of .sql and .yml files. The directory must contain at a minimum:
- Models: A model is a single .sql file. Each model contains a single select statement that either transforms raw data into a dataset that is ready for analytics or, more often, is an intermediate step in such a transformation (see the example after this list).
- A project file: a dbt_project.yml file that configures and defines your dbt project.
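To illustrate what a model looks like, here is a minimal, hypothetical example; the table and column names are placeholders and are not taken from the project built later in this post:

-- models/total_cases_by_country.sql (hypothetical example)
-- a model is just a select statement; dbt materializes it as a table or view
select
    country_code,
    sum(new_confirmed) as total_confirmed
from raw_data.covid_epidemiology   -- placeholder for a raw table loaded by Airbyte
group by country_code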
Why use dbt with Airbyte for data transformation?
While Airbyte handles the data ingestion, dbt is used for transforming and validating the data loaded into an SQL-speaking data warehouse. Airbyte sets up a dbt Docker instance and automatically generates a dbt project for us.
Together they form a dream team ready to solve your varied data integration needs and help you build a single source of truth from data spread across multiple sources.
Database connections
dbt connects to your data warehouse to run data transformation queries. As such, you’ll need a data warehouse with source data loaded in it to use dbt. dbt natively supports connections to Snowflake, BigQuery, Redshift, and Postgres data warehouses, and there are a number of community-supported adapters for other warehouses.
In this blog, we will see how we can use dbt in conjunction with Airbyte and make use of their capabilities. The steps are organized as follows:
- dbt project setup/installation
- Initialize a dbt project (sample files) using dbt CLI
- Set up a Git repository
- Set up the source and destination, trigger the sync, and study the logs
- Profile configuration
- Create or update models
- Check the connection
- Add the project to GitHub
- Add the created transformation to the Airbyte connector
- Verify models in the destination database
Step 1: dbt project setup/installation
As dbt Core is written in Python, it can be installed with pip:
>pip install dbt-<adapter>
The adapters can be installed as follows:
>pip install \
    dbt-core \
    dbt-postgres \
    dbt-redshift \
    dbt-snowflake \
    dbt-bigquery
We can use the following command if we use Postgres:
>pip install dbt-postgres
The above command will install dbt-core and dbt-postgres only.
To check the version:
>dbt --version
A successful installation will lead to the following screen:
Step 2: Initialize a dbt project (sample files) using dbt CLI
You can use dbt init to generate sample files/folders. In particular, dbt init project_name will create the following:
- a ~/.dbt/profiles.yml file, if one does not already exist
- a new folder called [project_name]
- the directories and sample files necessary to get started with dbt
Caution: Since dbt init generates a directory named project_name, make sure you do not already have a folder with an identical name, in order to avoid conflicts.
>dbt init <project_name>
The result is a directory with the following sample files.
For now, we are going to work with the files mentioned below.
Also, it will create (or update) the file ~/.dbt/profiles.yml, where database connection settings are stored.
As we can see, most of the project is just empty directories, which makes it easy to see what the most important parts are:
- dbt_project.yml – describes the project’s global settings (a minimal sketch follows this list).
- models – the directory containing the models we will work with.
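As a rough sketch of what dbt_project.yml typically contains (the project and profile names below are placeholders that must match your own setup):

name: 'covid_project'            # placeholder project name
version: '1.0.0'
config-version: 2
profile: 'covid_project'         # must match a profile defined in ~/.dbt/profiles.yml
model-paths: ["models"]
models:
  covid_project:
    +materialized: view          # default materialization for models in this project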
Step 3: Set up a Git repository
You can use any existing repo or create a new repository from within the newly created directory.
You can do that as below:
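A typical sequence, assuming you have already created an empty repository on GitHub (the repository URL below is a placeholder), looks like this:

# initialize a Git repository inside the dbt project folder and commit the generated files
>git init
>git add .
>git commit -m "Initial dbt project"
# point the repository at GitHub and push; replace the URL with your own repository
>git branch -M main
>git remote add origin https://github.com/<your-username>/<your-repo>.git
>git push -u origin main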
This will commit your code and push it to GitHub.
Step 4: Set up the source and destination, trigger the sync, and study the logs
For the purpose of this blog, we will be using a sample COVID-19 data source available in the form of a CSV file, as given below.
Here is an example of a publicly available CSV:
https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv
For the destination, we are using our own Snowflake database.
Add the Snowflake database as the Airbyte destination as shown below, making sure that the “Basic normalization” option is enabled:
After setting up the connectors, we can trigger the sync and study the logs:
Notice that the process ran in the /tmp/workspace/110/0 folder.
Note this workspace number (110/0) for future reference.
You can extract these SQL files, modify them, and run them yourself manually outside Airbyte.
To do so, run the following commands in an Ubuntu/Linux terminal:
>docker cp airbyte-server:/tmp/workspace/${NORMALIZE_WORKSPACE}/build/run/airbyte_utils/models/generated/ models/
>find models
Note: Replace ${NORMALIZE_WORKSPACE} with 110/0, the workspace number obtained in the previous step.
Running the find command lists the created SQL files.
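Put together, and assuming the workspace number noted above, the commands look like this (the container name and paths come from the log output shown earlier):

# set the workspace number noted from the sync logs
>NORMALIZE_WORKSPACE="110/0"
# copy the generated dbt models out of the airbyte-server container
>docker cp airbyte-server:/tmp/workspace/${NORMALIZE_WORKSPACE}/build/run/airbyte_utils/models/generated/ models/
# list the extracted SQL files
>find models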
Example Output:
models/airbyte_tables/quarantine/covid_epidemiology_f11.sql
Let’s inspect the generated SQL file by running:
>cat models/airbyte_tables/quarantine/covid_epidemiology_f11.sql
Example Output:
Remove the ‘create’ statement from the generated SQL file and add the following statement instead:
{{ config(
    materialized = "table"
) }}
The materialization strategy determines how dbt persists models in the warehouse. When using the table materialization, the model is rebuilt as a table on each run via a create table as statement. You can also use other options, like incremental, which allows dbt to insert or update only the records that have arrived since the last time it was run.
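As a hedged illustration (not part of the SQL that Airbyte generates), an incremental model could look like the following; the source table, date column, and unique key are placeholders you would replace with your own:

{{ config(
    materialized = "incremental",
    unique_key = "key"
) }}

-- placeholder for the raw table loaded by Airbyte; replace schema and table names with your own
select *
from airbyte_schema.covid_epidemiology

{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what is already in the target table
  where date > (select max(date) from {{ this }})
{% endif %}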
Step 5: Profile configuration
Open profiles.yml in the ~/.dbt folder. We need to update it with our current database choice, which is Snowflake.
Update the profile configuration details using the following link:
https://docs.getdbt.com/reference/warehouse-profiles/snowflake-profile
Open the link, copy the sample configuration, paste it into the profiles.yml file, and update it with your Snowflake credentials.
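Based on the linked documentation, a Snowflake profile generally takes the following shape; the profile name and every value below are placeholders for your own account details:

covid_project:                      # must match the 'profile' entry in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <account_identifier>   # e.g. abc12345.us-east-1
      user: <username>
      password: <password>
      role: <role>
      database: <database>
      warehouse: <warehouse>
      schema: <schema>
      threads: 4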
Step 6: Create or update models
Now open the created project folder in an IDE (here, PyCharm).
Update the names as shown in the above image.
Next, open the models folder, then the examples folder within it, and create a new .sql file.
Paste the contents of the previously generated SQL file into this new .sql file and save it.
Step 7: Check the connection
Now, check the connection using the dbt debug command.
If everything goes well, you will see the output below:
Step 8: Add the project to GitHub
Commit and push your changes, then open the project repository on GitHub and refresh it. You can now see the created project on GitHub. Copy the repository link.
Step 9: Add the created transformation to the Airbyte connector
Open the Airbyte connector which we previously used for extracting the log files and add the transformation as shown below:
After entering the details, including the Git repo URL we copied earlier, click ‘Save transformations’. Also verify the dbt version being used. Next, click ‘Sync now’.
If it is successful, then check the destination database for the newly added dataset/tables.
If it is not successful, check the log files. If the error is about the Docker image URL, as shown below, update the Docker image used for custom transformations to one with dbt installed:
(fishtownanalytics/dbt:1.0.4) → (fishtownanalytics/dbt:1.0.0)
Step 10: Verify models in the destination database
You can see the data created in the destination database.
Airbyte provides the most straightforward way to ingest data from different sources. The biggest benefit is that you can skip writing custom ETL code for each data source.
The last few years have witnessed unprecedented digital transformation in all industries, and all companies are accumulating more data than ever before. However, only those companies that are able to use this data optimally will have the competitive advantage and win.
That said, the more the data, the more difficult it is to manage data portability, data accuracy, and data security. But, with Airbyte (used along with dbt), you will be able to collect and move data seamlessly irrespective of the tools you use.