Modern Data Stack
April 12, 2022
Welcome to the Spring 2022 Edition of the Modern Data Stack Ecosystem. In this article, we’ll provide an in-depth look at the Modern Data Stack (MDS) ecosystem, updated from our Fall 2021 edition. We also highly recommend our article, The Future of the Modern Data Stack, to anyone who is new to the MDS and wants to learn about its history. And if you want to sign up for our future Modern Data Stack updates, be sure to share your email address with us here or at the end of this blog.
In our Fall article, we laid down criteria for inclusion on this list. Here’s a quick recap for those who don’t wish to spend the extra click:
Technology must be:
We’re happy to report that nothing in the past six months has changed our inclusion criteria. Adoption of and investments in MDS technologies continue to grow, with recent fundraising rounds from dbt Labs, Airbyte, Rudderstack, Hightouch, Census, Atlan, Hex, and others, showing that interest is not waning in this ecosystem. This excitement even seemed to be palpable at the recent Data Council Austin conference.
We put together this article to highlight the most crucial components of the MDS and the main tools and vendors in each component. This list is not meant to be exhaustive; instead, we’re focused on compiling feedback and observations from our work with MDS customers to provide a CliffsNotes-style guide that should help people exploring the MDS to understand where to start. In many ways, we think compiling a list of everyone who claims to be part of MDS is perhaps more hurtful than helpful, so here is an openly opinionated take on where the MDS is at and is headed. If we missed your favorite tech, let us know.
In this edition, we’ll also incorporate our thoughts from the field about each category. This stems from our experience in helping customers navigate through the modern data stack, actually using the technology, and hearing from customers what they are excited about.
While compiling this article, we also noticed three main threads that we think are extremely topical for the current state of the MDS and deserve some additional commentary:
1. Expanding Beyond SQL:
To date, the MDS has put the cloud data warehouse at its core, and it is natural to place a large emphasis on SQL as a result. SQL has become a rallying point for younger data teams that are looking for cloud-native technology and hoping to skip the complexity that more mature teams are constantly mired in. However, as these teams grow and expand, their needs evolve and they may get to a point where using SQL as the only data language is limiting progress. The appeal of SQL is that it is the lowest common denominator in the data world – most data roles should be able to operate in it and thrive – but as business needs shift, data teams may need more tools to adequately address all requirements.
Tools in the MDS are noticing this and adapting. Whether it’s dbt considering non-SQL support in the DAG or Snowflake beginning to leverage Python in-DB via Snowpark, this is a noticeable trend that we expect many more vendors will begin to adopt moving forward.
2. Implementing Real-Time Use Cases:
As a whole, the MDS is entering its teenage years. It has a good foundation, many early adopters have several years of success under their belt, and technical innovations continue to flourish. If we were to travel three or four years into the past, we’d marvel at the capabilities that we have in the stack right now around reverse ETL, machine learning, data governance, etc. The products that have emerged and matured in these categories provide many capabilities that enable modern data teams to extract great value and tackle new use cases with a minimal amount of effort and added complexity.
Real-time use cases are another area where the MDS is poised to move in the future. Many new vendors focused on solving real-time use cases have appeared in the past couple of years (see below), and we’re starting to see established companies going down that path – for example, Fivetran’s latest acquisition of HVR signals that it is thinking a lot more about real-time problems these days.
3. Competing Against Legacy Data Stacks:
In my original article on the Future of the Modern Data Stack, I raised the question of whether or not companies with larger and older data teams would hop onto the MDS bandwagon:
[F]or large enterprises with a variety of data, complex data teams, and decades of old tech in the closet, they may find the current state of the modern data stack too narrow in its use cases and lacking enticing capabilities, like AI, data governance, and streaming. The modern data stack needs to appeal to large enterprises in order for it to survive past being just the latest data platform trend.
Certainly, vendors have been rising to the call and building out many new capabilities for the MDS, but for an established company, the question is not just “Can I do this on the MDS?” but rather “Do the benefits of moving this to the MDS outweigh the cost of migrating our workload(s)?” If a data team is new and currently doing data integration by writing scripts that dump data into a cloud data warehouse (as many teams do at the beginning), it’s a no-brainer to adopt a tool like Fivetran or Airbyte to provide robustness in that process, but the conclusion is less clear if the team already has baked data integration practices and tools that it has been using for many years. To their credit, legacy vendors are often doing at least the bare minimum to make integrations into the MDS possible and stave off competition.
The stickiness of legacy systems continues to be a thorn in the side of the MDS and likely presents an obstacle for adoption within larger companies who are looking for more than just the existence of functionality and need a compelling reason to “rip and replace” existing tools and architecture. We believe this will continue to force MDS vendors to innovate at a breakneck speed and provide integrations and capabilities that far exceed what is currently present in legacy systems. It is not often that we find movements in technology where the future of a single company is dependent upon how the ecosystem does as a whole, but this seems to very much be the case with the MDS and we don’t think that any individual component is immune from this effect.
Without further ado, we present…
Although the above doesn’t reflect much change from the fall – we’ve added Firebolt on the bubble as they do seem to be picking up some momentum as a disruptor in the cloud data warehouse (CDW) space – the data platform category is one where the trend of “expanding beyond SQL” gets highlighted best. It appears as if the market continues to further entrench into two camps: the SQL-centric CDW and the more flexible (and complex) data lake/lakehouse. Although the MDS is more closely associated with the former, there is no reason why we have to stay contained entirely within a SQL mindset, and if innovations like dbt possibly supporting non-SQL workflows come to pass, this may make it easier to expand the MDS into greenfield territory in the future.
Let it also be said that this entrenchment, at least in my eyes, appears to come mainly from customers themselves. That is, companies tend to operate with a philosophy that puts them into one camp or the other. Unsurprisingly, younger companies with small data teams tend to want to stick to SQL as much as possible – this lowers the barrier of entry into the data workflow and makes it easier to hire people who can actually contribute – and more established companies are already dealing with multiple layers of complexity and desire flexibility to allow people to just get stuff done – it’s too late to get the cat back in the bag. As you read the last few sentences, you likely also had a knee-jerk reaction about which methodology is the “correct” one to use. Case in point!
Vendor-wise, Snowflake is the overwhelming favorite for those in the CDW camp, and Databricks for those in the data lake/lakehouse camp. Both continue to ramp up efforts to be seen as a more legitimate option in the other’s camp: Snowflake wants to be seen as more than “just SQL” and Databricks wants to be seen as having an offering on par with the rest of the CDW market. As such, we’ll probably be able to enjoy continued innovations and investments from Databricks into data warehousing workflows and likewise from Snowflake into downstream applications (application building, real-time use cases, machine learning/data science, arbitrary Python execution, etc.).
Data Integration On the Bubble: Hevo Data
The data integration piece of the MDS is one of the oldest in the stack, but it continues to be hotly contested. In my conversations with customers, it sounds like a lot of decisions here are based on speed and price. This can create a precarious situation for vendors that quickly becomes a race to the bottom. Airbyte has seen a lot of good traction lately as an open-source option in the field, and they raised a lot of money to try to quickly build a community and take over the market. Meanwhile, Fivetran has upheld its leader status with a huge round of funding and the acquisition of HVR. HVR adds a nice set of capabilities around change data capture (CDC), which can be seen as a critical component for some real-time use cases (consider that Fivetran already supports streaming platforms like Apache Kafka and has hinted that streaming Fivetran is in the works). From a product standpoint, these seem to be wise additions from Fivetran to stay ahead of the pack and continue solving its clients’ headaches.
In the event-tracking space, Rudderstack’s latest round of funding gives it a large war chest to go after the market Segment left behind. Twilio’s acquisition of Segment in 2020 tied the technology very closely to a marketing tool, whereas Rudderstack is placing a bet on the bulk of this work continuing to be the domain of the data team, necessitating a more flexible experience that better feeds downstream processes like machine learning and customer engagement. Courting the larger data team has proven to be a good strategy in the past and we think this is a smart move.
Main Tools: dbt
After we published our Fall edition of the MDS ecosystem, we received some feedback asking “How is it possible that dbt has no competition?” We heard this often enough that I thought I'd spend some time addressing it here. There’s no doubt in my mind that dbt is the transformation tool of choice, and the company’s community following is so large that even trying to challenge its status in the MDS seems like a fairly large waste of time and effort for anyone who is starry-eyed at the thought of being a disruptor. However, this is not to say it’s all smooth sailing for dbt, which does face considerable competition in the form of legacy systems. Any company that has a data team that stretches back more than five years likely adopted data transformation tools and processes that were not dbt. A small number of these may have already converted to dbt, but I’d venture to guess that most of these are still on legacy systems.
How successful dbt is at converting larger and older data teams to the Modern Data Stack likely has ripple effects for all vendors in the space. David Jayatillake recently shared his opinion that dbt is basically the center of mass for the MDS, and I think in many ways, he’s right. Although Snowflake and Fivetran have larger valuations than dbt Labs, dbt is where the community is at and this is definitely one of the main forces propelling the MDS forward. It’s not uncommon for us to talk to F500 companies here at Continual and discover that dbt does not appear to be in use. These companies have decades of experience with data, tooling, and workflows to support, and they are generally averse to bringing in new technology if it is not obviously providing quick benefits. If I’m a data leader at a large company, I probably like dbt, but I also have several ETL and data prep tools, plus a sizable team who knows how to use them and I don’t want to force them to retool. In other words, it’s not an obviously bad decision to stay with what we’re currently doing. I suspect this is at least partially why we’re seeing things like dbt explore non-SQL workloads (easier to port over legacy workflows), and innovations like providing a metric store (a value add over legacy systems).
So, there still remains this lingering question of whether the MDS will break into the large enterprise space. Investors appear to be heavily betting on this. I think there are two main drivers that will enable this:
Snowflake’s increasing penetration in the largest enterprises suggests the first is well underway. Once these dominos have fallen into place, it will likely be a mad dash to build out the ideal data workflow using the existing MDS pieces.
Main Tools: Continual
Similarly to dbt, I'd like to provide some justification for why Continual is the only ML tool on the list, as this was an often asked question in response to our Fall edition as well. And, as we saw with dbt, the crux of the argument is not that there are not any other options out there, but rather that Continual is the main option within the modern data stack. There is certainly no shortage of ML tools out there, as this has been one of the most heavily invested in and overhyped fields in recent years, but Continual’s approach to ML on the MDS continues to be rather unique.
We spent an entire blog discussing how Continual is different from legacy systems. However, we rarely find ourselves competing with these types of ML platforms amongst early adopters who are evaluating and using Continual. Rather, we’re more often competing with cloud-native ML technology (Google Vertex AI, AWS SageMaker, etc.), and we’re often winning because we have focused on providing a compelling workflow. We often bill ourselves as “ML for the Modern Data Stack”, but we always see ourselves first and foremost as “operational AI”, and a key part of that is the declarative workflow that just feels so satisfying once you get the hang of it. Legacy ML platforms are often mired in technical debt that predates the great cloud explosion of the late 2010s, and we seem to be a favorite for those looking to streamline ML in the cloud.
We’re also in an interesting spot where our take on ML is differentiated enough that we talk to a non-trivial number of clients who have adopted no other tools in the modern data stack. We find this interesting, but the most common complaint we hear is that the data team is not happy with its ML workflow or how much time/effort it takes to build and maintain production ML. In fact, in a recent discussion with a colleague from a large consulting firm, he mentioned that “the ML workflow problem is still an unsolved problem for legacy solutions”. How about that!
This raises another point: there are actually many different types of companies who are trying to leverage ML, and most of them are not well served by existing solutions. There are at least four distinct types of companies using ML out there:
The need for a strong operational ML system in all of these categories is rather apparent in the current state.
At Continual, we’ve often said that we are part of a new generation of “data-centric” ML tools. This was recently highlighted in a good summary by Andreessen Horowitz. Continual is uniquely serving the MDS community currently. Other data-centric tools, like Scale, Labelbox, and Snorkel, are currently focused more on non-tabular use cases and offer a lot of specialized functionality. Many ML tools do not provide a feature store, which we believe immediately disqualifies them as operational platforms. And, although there is an increasing market for standalone feature stores, we think the complexity involved with integrating this into your stack and workflow far surpasses the technical abilities of a company using the MDS. Continual offers an end-to-end platform that lets users utilize a feature store without having to worry about building, maintaining, or integrating it into the rest of the ML workflow. Combine that with a simple declarative workflow, and we think it’s a short learning curve to get non-data scientists to build their own ML workflows.
The buzz for the last year in the data analytics space has largely been around headless BI. Although headless BI is not just a metric store, for many in the MDS, this is the main attraction and there is a decent amount of conflation between the two. Headless BI, in general, fits in conceptually with the MDS as tools like dbt and Continual are already turning data engineering and machine learning workflows into a declarative experience, and headless BI is the next iteration (It’s always interesting how a good idea can permeate across many different layers of a data stack). A metric store can be viewed as akin to a feature store – instead of each team maintaining their own analytical transforms directly in the BI tool, a metric store is an abstract layer that allows greater collaboration in the data team, operates as a source of truth, and is able to serve up metrics quickly to a visualization tool.
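To make the metric store idea a little more concrete, here is a minimal sketch in plain Python. It is not the API of dbt, Transform, Metriql, or any other product mentioned here; the `Metric` class, the `METRICS` registry, the `compile_metric` helper, and the `orders` table with its columns are all hypothetical names invented for illustration. The only point it tries to show is that a metric is defined once, in a shared layer, and every consumer asks that layer for the query instead of re-implementing the logic in its own BI tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A hypothetical, tool-agnostic metric definition."""
    name: str
    table: str        # source table in the warehouse (assumed to exist)
    expression: str   # SQL aggregate that defines the metric


# Single shared registry: every consumer reads its definitions from here,
# so "total_revenue" means the same thing in every dashboard.
METRICS = {
    "total_revenue": Metric("total_revenue", "orders", "SUM(amount)"),
    "order_count": Metric("order_count", "orders", "COUNT(*)"),
}


def compile_metric(name: str, group_by: str) -> str:
    """Render the SQL a visualization tool could run for one metric and one dimension."""
    metric = METRICS[name]
    return (
        f"SELECT {group_by}, {metric.expression} AS {metric.name} "
        f"FROM {metric.table} GROUP BY {group_by}"
    )


print(compile_metric("total_revenue", "order_month"))
```

A real metric layer would also handle things like time grains, filters, and joins, but the shape is the same: definitions live in one place and queries are generated from them.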
The big news here is that dbt has launched experimental support of a metric store, centered on the cloud data warehouse. In our last article, we were already impressed with the offerings of Transform (who also recently open-sourced their metrics layer) and Metriql, and dbt is well positioned to become a large player for metric stores on the modern data stack. We’re now at a point where we believe we have a glut of options for metric stores and the technology is far enough along and being adopted widely enough that we can include this as a constant member of the MDS. Some emerging metric stores like Supergrain have even pivoted away as dbt entered the market.
On the BI front, we’ve added Lightdash, Glean, Sigma, and Superset on the bubble. Lightdash and Superset are good options for those looking to stay strictly within the open-source community, and Lightdash additionally touts a nice dbt integration. Google’s acquisition of Looker and the exodus of the founding Looker team have opened the door for BI solutions that appeal to modern data teams. This category is likely to see a significant transformation over the next few years as large-scale incumbents like Tableau, Power BI, and ThoughtSpot see increasing competition from innovative modern data stack native startups.
On the Bubble: Hevo Data
Reverse ETL/Data Operationalization has been booming over the last couple of years in the MDS, and it continues to be a hot topic for investors and practitioners alike. Census, Hightouch, and Rudderstack all closed large funding rounds recently, and enough companies are adopting this technology that we’ve long considered it part of the “core” of the MDS. Two big questions exist for technology in this category: 1) Will vendors in this space get disrupted by data integrators? and 2) Is reverse ETL really the “last mile”?
The first question is already being answered in the affirmative. Rudderstack recently launched a reverse ETL product, Airbyte acquired Grouparoo, and smaller players like Hevo Data already provide the capability for ingest and egress into the CDW. For customers looking to keep the stack as concise as possible, there’s definitely an attraction to using one tool instead of two, but it’ll be on the shoulders of pure reverse ETL tools to stay enough ahead of the pack that using a separate tool is justified. Or, who knows, maybe the reverse ETL vendors will start disrupting data integration as well.
Is reverse ETL really the last mile? For a lot of use cases, that may be the case, but some may require special tooling to properly complete the data workflow. My worry here is that as these tools begin to work natively off data in the CDW, the need to move data out of the warehouse vanishes and the value of this set of tooling may be lessened. Supergrain, a data warehouse native personalized marketing startup, is a recent example of this trend. We’ll discuss this more below.
On the Bubble: Flyte
In the trend of shifting the MDS into non-SQL workloads and providing better capabilities for supporting legacy systems, we think adding Data Orchestration tools as a category makes a lot of sense. Even though the adoption of data orchestration tools generally comes later on in the lifespan of a data team, this is something that we’ve been noticing more and more MDS customers adopting as their workloads increase in complexity and the team reaches a higher maturity level.
Although there has been lots of debate on the bundling vs unbundling of Airflow, we think that the reality is that large enterprises need a centralized orchestration layer to impose some order amongst all the chaos. Although certain aspects of the data pipeline may regulate themselves, relying on a completely unbundled approach across dozens of tools is a recipe for disaster and is likely a non-starter for any sufficiently complex organization. Small teams with modest pipelines can likely survive without an orchestrator, but many enterprise customers leverage these as the glue between all the different tools used across the organization and I don’t foresee that changing. Furthermore, these are the types of tools that are perfectly suited to help stitch together legacy tools into an MDS-centric workflow and should reduce total overall friction as new MDS tools are adopted. In this sense, they are critical tools to help larger organizations begin leveraging MDS technology without completely disrupting their existing processes.
In terms of options, there are lots of viable solutions for data orchestration on the modern data stack. Airflow, Dagster, and Prefect are all great tools, each with its own pros and cons. Dagster has been putting the most effort into meshing with the MDS, and the recently released software-defined assets bring a declarative interface to the tool. As more tools in the MDS begin adopting a declarative approach, we’re beginning to reach a critical mass where it’s almost possible to define an entire end-to-end data workflow declaratively.
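As a rough illustration of what “declarative” means in an orchestration context, the sketch below uses only Python’s standard library (`graphlib`, available from Python 3.9). It is not Dagster, Airflow, or Prefect code, and the asset names and the `execution_order` helper are made up for the example. The user only declares which assets exist and what each one depends on; the run order is derived from those declarations rather than written out step by step.

```python
from graphlib import TopologicalSorter

# Hypothetical assets: each key is an asset, each value is the set of
# assets it depends on. Nothing here says *when* anything should run.
ASSETS = {
    "raw_orders": set(),                          # loaded by an ingestion tool
    "stg_orders": {"raw_orders"},                 # SQL transformation
    "revenue_metric": {"stg_orders"},             # metric definition
    "churn_model": {"stg_orders"},                # ML model
    "crm_sync": {"churn_model", "revenue_metric"},  # reverse ETL step
}


def execution_order(assets):
    """Derive a valid run order from the declared dependencies."""
    return list(TopologicalSorter(assets).static_order())


order = execution_order(ASSETS)
print(order)
```

The relative order of `revenue_metric` and `churn_model` is left to the sorter, since neither depends on the other; an orchestrator could run them in parallel.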
Last year Prukalpa Sankar wrote an excellent article explaining why we are in the third generation of data catalogs. The Gen 3 data catalog is built for the cloud, designed for a wide variety of users, and is meant to be active. Active metadata seeks to break data catalogs from their siloed design and allow metadata to be embedded back into data workflows. This essentially turns the data catalog into a two-way street where it not only collects information about data, but can also serve that information up for any tools that request that information.
In the past, we’ve talked about how data governance needs to be native to the MDS in order to get broader adoption. This is one of those things that is a hard requirement for large organizations, and there is currently a plethora of options to choose from. Our initial picks from the Fall have been doing well and are all worth checking out when the need arises.
An interesting development in the data governance space is Astronomer’s acquisition of Datakin. Astronomer is the main caretaker for Airflow, and Datakin is a data lineage and observability platform. Governance tools are complicated and they are often bundled with tools in the wrong part of the data workflow. Hopefully, by now we understand that data governance needs to have a single source of truth and can’t be broken up between different tools. Trying to bundle it with a data platform (which is common) or ETL/ELT tools (which is also fairly common) is difficult because they usually don’t have visibility into all aspects of the data workflow and a company literally cannot afford to have gaps in its governance process. However, bundling it with a data orchestration tool is rather brilliant: the orchestration tool theoretically could be used to control all aspects of the data workflow, and if it’s also providing governance capabilities, users may be forced to use it in order to comply with governance requirements.
It’s still too early to tell what happens with this acquisition/integration, but it has the potential to be a disruptor in the governance space and could set Airflow/Astronomer to remain a critical component of the MDS in the future and fend off competition from new tools and startups.
As more types of users are drawn into the MDS and the demand for non-SQL workloads increases, MDS-friendly notebooks will become much more important for companies looking for a collaborative environment that supports different user profiles. The crux for many of these tools will be solving the development to production gap which has plagued many notebooks in the past, but the declarative nature of dbt and Continual provide a good foundation with which to accomplish this. Hex and Deepnote have both completed fundraising rounds in the past six months and look well poised to start converting new users to the MDS.
Real-time use cases are likely still out of reach for many customers of the MDS, but we suspect this will change as data teams mature and grow, and as more enterprise companies come into the MDS. This is the most important category not yet mainstream in the MDS, but we believe it’s an inevitable inclusion. The companies above are doing some great things for real time use cases and are well worth investigating for any companies looking to venture into the far reaches of the ecosystem.
There’s a line of thought around the recent CDW movement that every SaaS software category that exists will be disrupted to be entirely redesigned with a CDW at its core. Historically, SaaS tools were often constructed in a way so that data was siloed and not easily accessible. Customers must then spend a lot of time getting data and moving it to a central location to process, analyze, and tackle more advanced tasks like machine learning. But, what if all tools just worked natively with the CDW? This would mean that data these tools created would already be in the CDW and they would also know how to read data from other tools that also used the CDW. Given how much time and effort is spent inside an organization just moving data around, this approach does seem revolutionary at times.
Customer Engagement Platforms are a good first investment for companies looking to get more value out of their CDW. Virtually every company has customers and needs to understand how to better segment, target, and engage with them, and these tools make it as easy as can be. MessageGears and Braze are both well-trusted tools in this category and Supergrain is a newcomer that is also worth a look.
Snowflake’s acquisition of Streamlit this winter garnered a lot of attention, and not only because many were perplexed by its $800M price tag. The company is placing a large bet that easing the complexity of building data applications on top of the CDW will allow customers to harvest value from their investments faster and grow more quickly. This is surely a smart strategy and Streamlit comes with an enthusiastic community of users, which is nothing to sneeze at. It’s worth noting that Streamlit is not the only player in town, and other tools like Columns.ai and Anvil are worth checking out as well.
Companies have long been looking for easy ways to monetize their data, and just about every data platform out there has some ability to do this today. What doesn’t seem to be well explored is a cross-platform tool that allows companies to proliferate their data out to many platforms at once, as well as maintain and monitor those data shares. This seems like a really obvious application for a startup in the cloud space, but sadly we’ve yet to encounter anyone who’s actually undertaking this effort. Contact us if you know of anyone working towards this!
We hope you enjoyed this update to our overview of the modern data stack ecosystem. One thing is for certain: a lot will change over the next few months. If you’d like to stay posted on the latest in the modern data stack, you can subscribe to our blog below to get updates delivered to you.