Modern Data Stack
October 28, 2021
In our previous article, The Future of the Modern Data Stack, we examined the motivations behind the modern data stack, its current state, and where it is headed. If you’re new to the modern data stack, we highly recommend giving that article a read. A question we often get from new adopters is: “What tech should we be looking into?” It’s a great question: the modern data stack has many different components, and as its popularity grows, many companies aspire to re-brand and jump on the bandwagon. We thought a roadmap to the modern data stack would be a great resource for anyone just getting acquainted with the ecosystem.
The modern data stack is a collection of cloud-native tools, centered around a cloud data warehouse, that together comprise a data platform. The benefits of adopting a modern data stack are many.
The modern data stack is really the resurgence of the data warehouse as the primary store for data workloads. After several decades of dominance, the data warehouse fell out of fashion in the era of “big data” as data lakes briefly rose to prominence. Data lakes ultimately proved too complex and costly for most organizations, and the rapid adoption of cloud infrastructure in the 2010s gave the data warehouse a great opportunity to make a comeback, this time built for the cloud and incorporating many technical lessons from the big data movement. From there, an ecosystem of tools inevitably assembled to reenvision data workflows for the cloud era.
What makes a product part of the modern data stack? In our previous article we laid down some guiding principles for what qualifies as part of the “modern data stack”, and we’ll use the same criteria here.
As the modern data stack continues to grow and evolve, many new technologies and vendors are entering the conversation. Below is our take on the current main functional areas of the modern data stack and the leading vendors in each category; we’ll dive into each one in a little more detail.
This is where it all starts! You can’t get started on the modern data stack without a data warehouse to store your data. Snowflake is currently the leader in this area, but every cloud vendor has its own offering, and BigQuery and Redshift are commonly used as the foundation for the modern data stack. Databricks could be a disruptor here, as its SQL offering has the potential to lure in larger enterprises looking to simplify their Hadoop-era data pipelines without abandoning Apache Spark entirely. One thing is certain: the future lives in the cloud and speaks SQL.
A data warehouse is only as good as the data it holds, and it’s only valuable if you can actually get useful data into it. Every modern data stack needs a data integration tool, and there are several to choose from. Fivetran and Stitch have been around the longest and have the most traction in helping customers get data into their cloud data warehouse, but Airbyte is a newer open-source entrant that is quickly picking up a dedicated fanbase. One advantage of these offerings over legacy ETL tooling is that they put a lot of engineering effort into understanding the underlying source systems’ APIs, making importing data as easy as a few clicks. Given the complexity of some sources, like Salesforce, it’s impressive that you can go from zero to production in under a day with these tools. Nobody should be writing these integration pipelines themselves anymore.
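To see what these vendors save you from, here’s a minimal sketch of the pagination loop at the heart of any hand-rolled extract-and-load job. The page shape and callbacks are invented for illustration, not any vendor’s API:

```python
def run_sync(extract_page, load_rows):
    """Drive a paginated extract-and-load loop.

    extract_page(cursor) -> {"rows": [...], "next_cursor": str | None}
    load_rows(rows)      -> writes one batch into the warehouse

    This is only the happy path: production connectors also need auth
    refresh, rate limiting, retries, and schema-drift handling, which
    is exactly the work Fivetran, Stitch, and Airbyte absorb for you.
    """
    cursor = None
    while True:
        page = extract_page(cursor)
        load_rows(page["rows"])
        cursor = page.get("next_cursor")
        if cursor is None:
            break
```

Multiply that boilerplate by every source object and API quirk, and the case for a managed integration tool makes itself.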
Another aspect of data integration is event tracking, or the “customer data platform”. These tools focus primarily on ingesting events about customer behavior and additionally offer functionality for transforming your data and loading it into your cloud data warehouse or directly into destinations like Salesforce, Hubspot, or Marketo. Although there is some crossover with the pure data integration tools above, certain use cases are better solved with an event tracker, and it is not uncommon to see customers happily leveraging both. Segment is the most established vendor in this field, but Snowplow is an open-source alternative with its fair share of supporters, and Rudderstack is a newer entrant that has been gaining a lot of steam since Segment was acquired by Twilio.
Main Tools to Consider: dbt.
When it comes to data transformation on the modern data stack, there’s really only one tool in town: dbt. dbt has a huge, thriving community, is used by thousands of companies, and is conveniently open-sourced. There was a short blip when Dataform looked like it might challenge dbt’s reign, but following its acquisition by Google, it’s pretty hard to find companies selecting Dataform over dbt. We have yet to talk to a BigQuery customer who is not using dbt. What about your ETL vendor of yore? Let’s be honest: only dbt has the modern developer workflow and data-warehouse-centric design that meet the criteria for being part of the modern data stack.
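The core idea behind dbt is that each SQL model can reference other models, and dbt runs the whole graph in dependency order. That scheduling idea can be sketched in a few lines of Python (the model names are invented; real dbt adds parallelism, selection syntax, and cycle detection):

```python
def run_order(deps):
    """Return models in an order where every dependency runs first.

    deps maps each model to the models it references (dbt's ref()).
    A simple depth-first topological sort over the model graph.
    """
    order, seen = [], set()

    def visit(model):
        if model in seen:
            return
        seen.add(model)
        for parent in deps.get(model, []):
            visit(parent)
        order.append(model)

    for model in deps:
        visit(model)
    return order

# Hypothetical project: two staging models feed one mart.
deps = {
    "stg_orders": [],
    "stg_customers": [],
    "fct_orders": ["stg_orders", "stg_customers"],
}
```

Everything else in dbt — testing, documentation, lineage — hangs off this same model graph, which is a big part of why the workflow feels so coherent.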
Main Tools to Consider: Continual.
AI is a new entry in the modern data stack. We think it is the next logical step for companies taking the journey down the modern data stack: they already have well-curated datasets, great processes for ingesting new data, and easy ways to connect insights back to the business. The next piece of the puzzle is a tool that enables your team to act as machine learning engineers and start tackling AI use cases. Continual is the first AI/ML platform co-designed with the modern data stack. It has tight dbt integration and allows users of all profiles to come into the data warehouse and start operationalizing AI in days, not months. We believe we’re the perfect complement for any company looking to get extra value out of the data it is already collecting in its data warehouse. To date, we believe we are the only AI tool that actually lives up to the tenets of the modern data stack, although we’d love some company! Complex MLOps platforms for experts only, or point-and-click AI tools without an operational focus, need not apply.
The data analytics and BI market has always been one of the most hotly contested categories in the data ecosystem, and it’s no different in the modern data stack. Although Tableau has the largest market share overall, Looker and Mode were positioned as cloud-native early on and have entrenched themselves deeply in the modern data stack. Tableau’s close relationship with Salesforce, its parent company, is actually a bonus for many customers, so it is still widely used. Preset represents the open-source tool of choice in the community – now available as a cloud managed service – and ThoughtSpot has an interesting viewpoint around search-enabled BI that shouldn’t be ignored.
Reverse ETL is the flip side of the data integration category: tools that make it easy to get data out of your data warehouse and back into the applications your business uses. Census and Hightouch both have a lot of momentum and strong offerings; competition is pushing them to move fast, and more companies are experiencing the benefits by the day. For event tracking use cases, you may be tempted to keep the entire workflow contained within whichever event tracking vendor you use above, but those point-to-point integrations can miss out on many of the benefits of a data-warehouse-centric design.
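Conceptually, reverse ETL is a small loop: query rows out of a warehouse model, map columns onto the destination app’s fields, and upsert them in batches. A hedged sketch of that shape — the field names and the `upsert` callback are illustrative, not Census’s or Hightouch’s actual API:

```python
def sync_to_app(rows, key, field_map, upsert, batch_size=500):
    """Push warehouse rows into a SaaS app via an upsert callback.

    rows      -- dicts returned by a warehouse query
    key       -- column used to match existing records (e.g. email)
    field_map -- warehouse column -> destination field name
    upsert    -- function(batch) that writes to the app's API; real
                 tools add retries, rate limiting, and change diffing
                 so only modified rows are sent.
    """
    batch = []
    for row in rows:
        record = {dest: row[src] for src, dest in field_map.items()}
        record["match_key"] = row[key]
        batch.append(record)
        if len(batch) >= batch_size:
            upsert(batch)
            batch = []
    if batch:  # flush any remaining partial batch
        upsert(batch)
```

The warehouse-centric version of this wins over point-to-point syncs because every destination reads from the same modeled, governed tables.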
Data governance is key to any data organization, and it is an essential evolution the modern data stack needs to undergo to fully mature and make itself attractive to large enterprises. We’re breaking this down into two main categories: data cataloging, i.e. understanding what data exists in the data warehouse and the relations therein, and data observability, which allows you to actively monitor the data in the warehouse. Both are crucial technologies to deploy as your data practice grows and becomes more complex.

In the cataloging category, Alation is an older catalog with a lot of market share that remains relevant for the modern data stack crowd, as it has always had a strong focus on data warehousing. Many newer startups also offer excellent options for modern data stack practitioners: Atlan is an impressive catalog tool that also includes lineage and data quality functionality, while Stemma and Acryl Data are built on top of the open-source projects Amundsen (Lyft) and DataHub (LinkedIn), respectively. The data observability category is perhaps more cluttered than data cataloging, with even less signal, but our early evaluation of the field has us excited about Monte Carlo, BigEye, Datafold, and Metaplane. It’s a difficult decision; we’d evaluate all of them before choosing.
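At its simplest, data observability amounts to automated assertions over warehouse tables: freshness, volume, and distribution monitors that raise alerts on anomalies. A toy sketch of two such monitors — the thresholds and table metadata are invented, and tools like Monte Carlo or Metaplane learn these baselines automatically rather than taking them as hand-set parameters:

```python
from datetime import datetime, timedelta

def check_table(name, row_count, last_loaded_at, *,
                min_rows, max_staleness, now=None):
    """Return a list of alert strings for one warehouse table.

    Two basic monitors: a volume check (did a load silently shrink?)
    and a freshness check (has the table stopped updating?).
    """
    now = now or datetime.utcnow()
    alerts = []
    if row_count < min_rows:
        alerts.append(f"{name}: volume anomaly ({row_count} < {min_rows} rows)")
    if now - last_loaded_at > max_staleness:
        alerts.append(f"{name}: stale (last load {last_loaded_at:%Y-%m-%d %H:%M})")
    return alerts
```

Running checks like these on a schedule, and routing the alerts to the owning team, is the basic loop every observability vendor builds on.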
The modern data stack is still growing and evolving rapidly. We’ll plan to update this ecosystem periodically as we notice new trends that have matured enough for inclusion as well as to update vendors who are breaking through as having a significant share of the market. As a teaser, here are some areas that we are keeping a close eye on:
Product Analytics: This fills a similar space as the customer data platforms. Built for product teams, product analytics can supercharge your understanding of your business's products, who uses them, and how they are used. It’s not yet mainstream in the modern data stack, but it’s easy to see how this could become a staple in a lot of stacks.
Notebooks: Although notebooks are extremely popular with data scientists, they haven’t really broken into the modern data stack in a convincing way. In a SQL-centered world, do we need notebooks? Several companies are working on this premise, and it’s not hard to envision the modern data stack opening itself up to additional languages while still staying centered on the data warehouse.
Real-time/streaming: To date, the modern data stack has centered on batch workloads. We think this will look entirely different in a few years, with real-time/streaming use cases on the modern data stack becoming commonplace. Several companies are working to pave the way for that future now.
Application Serving & Data Sharing: As we covered in our original article, we think both of these areas are ripe for innovation, whether from existing vendors or as new offerings.