December 14, 2021
Today we’re pleased to announce Continual Integration for dbt. We believe this is a radical simplification of the machine learning (ML) process for users of dbt and presents a well-defined path that bridges the gap between data analytics and data science. Read on to learn more about this integration. Also, you can find us in the #tools-continual channel in the dbt Community on Slack.
Continual is an automated operational AI platform built for the Modern Data Stack. It offers a streamlined process for deploying AI use cases into production for users of cloud data warehouses. Continual is easily accessible to all data users, enables quick prototyping and versioning of ML work, and provides a simple production workflow that follows software engineering best practices. Whether you are an ML expert or new to AI, Continual allows you to iterate quickly on your problems and operationalize your results without rigging up complicated data pipelines, getting lost in notebook diffs, or becoming a Kubernetes expert.
The modern data stack is rapidly democratizing data and analytics, but deploying AI at scale into business operations, products, or services remains a challenge for most companies. Continual provides a fresh take on this problem. Powered by a declarative approach to operational AI and end-to-end automation, Continual enables modern data and analytics teams to build continually improving machine learning models directly on their cloud data warehouse without complex engineering.
We believe that AI should be pervasive in every organization, and that with the right tooling, such as Continual, it can be.
From the dbt website:
dbt is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation. Now anyone who knows SQL can build production-grade data pipelines.
dbt has emerged as a linchpin of the modern data stack, and it is easy to see why. Simply put:
The field of data science is still relatively new but it's been rapidly and continuously growing, largely independently from the analytics crowd. This creates several issues in a company’s data organization:
In conversations with organizations where data science teams do in fact use dbt, we’ve still noticed that there is often an awkward transition between dbt workflows and ML workflows. This creates unnecessary friction that we believe can be resolved with the right tooling. Continual integration for dbt bridges the gap between analytics and ML for data teams and establishes a common workflow across roles. Not only does this provide tight integration between data engineering/analytics and machine learning workflows, but it also means that users of all types can begin harnessing the power of ML in their own work.
Furthermore, Continual is co-designed with the modern data stack and is aligned with the core principles of dbt. We believe we synergize in several key areas:
With dbt and Continual, anyone who knows SQL can now build production-grade machine learning pipelines!
"dbt was built on the idea that the unlock for data teams is a collaborative workflow that brings more people into the knowledge creation process. Continual brings this same viewpoint to machine learning, adding new capabilities to the analytics engineers' tool belt. We’re excited to partner with Continual to help bring operational AI to the dbt community." -- Nikhil Kothari, Head of Technology Partnerships at dbt Labs
Let’s get to the nuts and bolts of the integration. If you’re new to Continual you’ll need to complete a few simple steps to get started:
We’re now ready to use Continual with dbt! Using Continual on a dbt project is easy. Just follow these steps:
Configuration for the Continual integration must be defined in meta fields in dbt. These meta fields can be defined in three places:
For example, the configuration can be included in a schema.yml file:
However, we could similarly define this in the customer_churn.sql file:
In the above example, we’re telling Continual that the table created by dbt represents a predictive model. In this model, we’ll be using the column ‘churn’ as the target and the column ‘ID’ as the index. Given this information, Continual can fetch the data from the data warehouse, run experiments, select a winning ML model, and write predictions back to the data warehouse. Users can also provide more advanced configurations in the meta fields, controlling everything from how frequently models and predictions are refreshed to how the AutoML engine operates. Refer to our documentation for full details.
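To make the information carried by the meta fields concrete, here is a minimal Python sketch. It is not Continual's schema or code: the key names ("continual", "type", "target", "index"), the value "Model", and the helper check_meta are assumptions made for illustration, and the column list is made up. It only shows the idea that the target and index named in the configuration must be columns of the table dbt builds.

```python
# Hypothetical representation of the meta block described above.
# The key names ("continual", "type", "target", "index") are assumptions
# made for illustration; see the Continual documentation for the real schema.
customer_churn_meta = {
    "continual": {
        "type": "Model",
        "target": "churn",
        "index": "ID",
    }
}


def check_meta(meta, table_columns):
    """Return a list of problems found in a Continual-style meta block."""
    config = meta.get("continual", {})
    problems = []
    for key in ("target", "index"):
        column = config.get(key)
        if column is None:
            problems.append(f"missing '{key}' setting")
        elif column not in table_columns:
            problems.append(f"{key} column '{column}' is not in the table")
    return problems


# Columns of the table dbt would build from customer_churn.sql (made up here).
columns = ["ID", "tenure_months", "monthly_charges", "churn"]
print(check_meta(customer_churn_meta, columns))  # prints []
```

A check like this could run before handing a project to Continual, so that a misspelled column name is caught locally rather than during training.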
Users can execute Continual directly on top of a configured dbt project by using the Continual CLI command ‘continual run’. Make sure you execute ‘dbt run’ before executing ‘continual run’, so that the tables Continual reads already exist in the warehouse.
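The ordering requirement can be sketched as a small Python wrapper. This is only an illustration under stated assumptions: the two commands are the ones named in the text and are invoked without any flags, and run_pipeline and PIPELINE are hypothetical names, not part of either CLI.

```python
import subprocess

# `dbt run` is the standard dbt command; `continual run` is the command named
# in the text. No extra flags are assumed here.
PIPELINE = [["dbt", "run"], ["continual", "run"]]


def run_pipeline(commands=PIPELINE, runner=subprocess.run):
    """Run each command in order, stopping at the first failure.

    Returns the list of commands that completed successfully.
    """
    completed = []
    for command in commands:
        result = runner(command)
        if result.returncode != 0:
            # Do not run Continual on top of a failed dbt build.
            break
        completed.append(command)
    return completed
```

Calling run_pipeline() from the root of a dbt project (with both CLIs installed) runs the two steps in order. The runner parameter exists so the ordering logic can be exercised without either CLI installed.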
This is meant to mirror dbt run, and indeed many of the same parameters are supported. In particular, users can use common options such as:
There are, of course, Continual-specific commands as well. The following are the most important:
Continual supports the use of isolated environments. These are conceptually similar to branches in git. For dbt users, Continual will automatically create an environment tied to your dbt profile configuration and execute the workflow in it. In particular, the environment name in Continual will be the same as the profile name in dbt, and Continual will use the schema defined in the profile to build out your Continual resources (like predictions). If you don’t want to mix your dbt schemas with the predictions created by Continual, we recommend setting up a new profile specifically for Continual. The use of environments is crucial to building out a coherent production process and makes it very simple to integrate Continual with your CI/CD workflows.
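The mapping from a dbt profile to a Continual environment, as this paragraph describes it, can be sketched in a few lines of Python. This is not Continual's implementation: continual_environment is a hypothetical helper, the profile values are made up, and using the schema of the profile's default target is an assumption. Only the profile layout (a default target plus per-output connection settings) follows dbt's profiles.yml format.

```python
def continual_environment(profiles, profile_name):
    """Map a dbt profile to the (environment name, schema) pair described above."""
    profile = profiles[profile_name]
    # dbt profiles name a default target and list connection settings per output.
    output = profile["outputs"][profile["target"]]
    return profile_name, output["schema"]


# A parsed profiles.yml, written inline as a dict (all values are made up).
profiles = {
    "continual_dev": {
        "target": "dev",
        "outputs": {
            "dev": {"type": "snowflake", "schema": "continual_predictions"},
        },
    }
}

environment, schema = continual_environment(profiles, "continual_dev")
print(environment, schema)  # prints: continual_dev continual_predictions
```

A dedicated profile such as the hypothetical continual_dev above is one way to keep predictions in their own schema, separate from the schemas your dbt models write to.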
Something that we believe deserves special attention is that Continual has some advanced integration with dbt via exposures and sources. These can be enabled either project-wide or on a model-by-model basis with the configuration options ‘create_exposures’ and ‘create_sources’. The former will create an exposure file in your dbt project that contains all the dependencies for your ML model. When building your documentation, dbt will now include this as a resource in your project and you will be able to easily review the lineage of your models, as seen below (Note: you can also review lineage in the Continual UI at any time):
It’s not uncommon for users to want to run some analysis on predictions after they are created. We’ve made this easy by allowing you to tell Continual to build a source file in your dbt project. This lets you quickly reference the tables created by Continual and start including these resources in your dbt documentation and lineage. In addition, you can use ‘create_stubs’ to create stub .sql files in your dbt project. These files act as skeletons that refer to the new source files, and you can use them to begin running analyses on your predictions.
To see a worked example of the dbt integration, please refer to our documentation.
With Continual and dbt, users can start tackling AI use cases and scaling their workloads. dbt is a fabulous tool that has done wonders for the data analytics field and is often credited with turning data analysts into analytics engineers. Similarly, we believe Continual is another step in the process, one that turns analytics engineers into machine learning engineers. You can find us in the #tools-continual channel in the dbt Community on Slack.