Why You Need a Feature Store

Machine Learning

September 28, 2021

Feature stores have arrived in 2021 as an essential piece of technology for operationalizing AI. Despite the enthusiasm for feature stores in high-tech companies, they are still absent from most legacy ML platforms and can be relatively unknown in many enterprise companies. We discussed how feature stores are critical to the data-first approach of next-gen ML platforms in our previous blog, but they are important enough to get their own treatment in a full article. Here we’ll cover the common features of a feature store, as well as the pros and cons of adopting one in your own work. You can also check out the replay of my presentation Building a SQL-Centric Feature Store at a recent Feature Stores for ML Global Meetup.

What is a Feature Store? 

The term “feature store” is often used ambiguously. We’ll first provide a concrete definition and then discuss its common features and benefits.

Simply put, a feature store is a data management system that manages and serves features to machine learning models. What is a feature, you ask? In layperson’s terms, a feature is a descriptive attribute that’s relevant to predicting how something will behave in the world. For example, a customer’s purchase history is relevant to predicting what that customer will buy next, while the first letter of the customer’s name is not. For the more technically inclined, a feature is one input into a machine learning model. In the common use case of tabular data, a feature corresponds to a column in a table. For a customer churn model, for instance, a good feature might be the number of customer page views over the last 30 days. Or for a sales forecasting use case, a feature might be whether a particular day is a holiday.

Features are often collected into feature sets. Using our example of tabular data again, a feature set would be the table itself. In this context, a feature store understands the data (i.e. features) needed by a model, either for model training or to serve a prediction, as well as how that data connects to other data and the transformations required to get it into the right format. Hopefully, this sounds simple! But, the truth is that designing and building a feature store is actually quite a complex undertaking. 

The feature store concept was most prominently publicized by Uber’s Michelangelo ML platform, where it played an important part in helping Uber operationalize ML. In their own words:

We found great value in building a centralized Feature Store in which teams around Uber can create and manage canonical features to be used by their teams and shared with others… At the moment, we have approximately 10,000 features in Feature Store that are used to accelerate machine learning projects, and teams across the company are adding new ones all the time. Features in the Feature Store are automatically calculated and updated daily.

At the time, companies like Uber who were rapidly innovating to make ML an essential part of their business were discovering several obstacles with traditional approaches to ML: 

  1. Data Scientists spend an inordinate amount of time transforming data before they can start building models. 
  2. What’s worse, when they start a new use case, they often find there is no clean data to use. 
  3. Notebook-based data science can make it difficult to track and manage the data that is being used. And, it’s less clear how to take work done in a notebook and transform that into a production job. 
  4. People on the data team who are not data scientists have close to a zero chance of accomplishing anything with the ML tools at their disposal. 
  5. Online and offline requirements for data lack a unifying data layer.

Feature stores help with these issues, and more. We’ll discuss this in more detail in the following section.  

Not long after Uber’s Michelangelo blog post, Google and Gojek released Feast, an open-source feature store. In the years that followed, many other companies began discussing their internal ML platforms, and it wasn’t uncommon to see a feature store or a similar component at the center of an operational ML platform. Given that so many companies are struggling to find value in their ML investments, it stands to reason that feature stores may be a key component in simplifying operational ML.

What are Common Features of a Feature Store? 

In the last few years, feature stores have begun to creep into commercial offerings, whether as a standalone product, like Tecton, an add-on to existing platforms, like SageMaker and Databricks, or as part of an operational platform, like Continual. There is now a lot of variety in the feature store space, and the products come in all shapes and sizes. In this section, we will break down what we believe are essential features for a feature store.

Feature Transformations

One of the most important aspects of a feature store is tracking which transformations happen to which data sources. This can be rather mundane stuff like simple joins and aggregations, or more complex stuff like window operations. Additionally, advanced capabilities may be offered in automated feature engineering, such as feature encodings, date extraction, or even deep feature synthesis. It’s important to note here that a feature store need not A) actually perform these operations or B) store the resulting data. Other parts of the ML platform could be responsible for this work.  But whatever the physical mechanism, it should be easy for data teams to add and use features from a shared repository.   
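As a minimal sketch of the kind of window operation described above, the following computes the “page views over the last 30 days” feature for one customer. The event layout and the function name page_views_last_30d are hypothetical, chosen only for illustration; a real feature store would typically express this as a registered transformation over a data source.

```python
from datetime import date, timedelta

# Hypothetical raw events: (customer_id, day, page_views)
events = [
    ("c1", date(2021, 8, 1), 5),
    ("c1", date(2021, 9, 10), 3),
    ("c1", date(2021, 9, 20), 4),
    ("c2", date(2021, 9, 15), 7),
]

def page_views_last_30d(events, customer_id, as_of):
    """Sum page views for one customer over the 30 days ending on as_of (inclusive)."""
    window_start = as_of - timedelta(days=30)
    return sum(
        views
        for cid, day, views in events
        if cid == customer_id and window_start < day <= as_of
    )

feature_value = page_views_last_30d(events, "c1", date(2021, 9, 28))
```

Passing as_of explicitly, rather than reading the current date inside the function, means the same transformation can be evaluated for any historical date.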

Support for Online and Offline Mode 

Apart from managing features, the other main task of a feature store is to serve those features to models. This either happens in real-time during an online prediction or offline during batch predictions or batch trainings. There’s an important distinction between these two use cases. Offline mode requires generating a (possibly large) dataset of many records across many different features and providing it to the model for training/prediction. The speed at which it does this is generally not super pressing, but it needs to be able to handle large data sets. The online mode requires generating features for one or more records to make a real-time prediction. Speed is very important, and there are use cases where this entire process needs to happen in under a second. A key observation here is that these two use cases require much different architecture, but it’s the job of the feature store to abstract away this complexity and provide a simple serving layer. 
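To make the distinction concrete, here is a toy serving layer that exposes both access patterns behind one interface. The class name TinyFeatureStore and its methods are assumptions made up for this sketch; in practice the offline side would be a warehouse or data lake and the online side a low-latency key-value store.

```python
class TinyFeatureStore:
    """Toy serving layer: one interface, two access patterns."""

    def __init__(self, rows):
        # Offline store: the full set of feature rows (stands in for a warehouse table).
        self._offline = list(rows)
        # Online store: rows keyed by entity ID for fast single-record lookup.
        # This toy example assumes one row per customer.
        self._online = {row["customer_id"]: row for row in self._offline}

    def get_offline(self, feature_names):
        """Batch mode: return every row, restricted to the requested features."""
        return [{name: row[name] for name in feature_names} for row in self._offline]

    def get_online(self, customer_id, feature_names):
        """Real-time mode: return the requested features for a single entity."""
        row = self._online[customer_id]
        return {name: row[name] for name in feature_names}


store = TinyFeatureStore([
    {"customer_id": "c1", "page_views_30d": 7, "plan": "pro"},
    {"customer_id": "c2", "page_views_30d": 2, "plan": "free"},
])
training_data = store.get_offline(["page_views_30d", "plan"])
live_features = store.get_online("c1", ["page_views_30d", "plan"])
```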

Entity Resolution 

If designed properly, a feature store allows data professionals to stop solving problems in terms of tables and columns and to think instead in terms of actual business objects like customers and products. This turns a customer churn problem from an exercise in fetching and joining tables like customer_info, customer_transactions, customer_interactions, product_info, product_usage, etc. into one of linking to the customer and product entities. A feature store can understand all the relevant data for these business objects, or entities, in an organization and knows how to correctly connect everything together.
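One way to picture this is a registry in which every feature set declares the entity it describes, so that a caller asks for “everything about customer c1” rather than naming tables. The sketch below is illustrative only; the registry layout and the function features_for are hypothetical, and the table names are borrowed from the examples above.

```python
# Hypothetical feature sets, each keyed by the ID of the entity it describes.
feature_sets = {
    "customer_info": {"entity": "customer", "rows": {"c1": {"plan": "pro"}}},
    "customer_transactions": {"entity": "customer", "rows": {"c1": {"orders_90d": 4}}},
    "product_usage": {"entity": "product", "rows": {"p9": {"weekly_sessions": 12}}},
}

def features_for(entity, entity_id):
    """Collect every registered feature for one entity, across all of its feature sets."""
    merged = {}
    for feature_set in feature_sets.values():
        if feature_set["entity"] == entity:
            merged.update(feature_set["rows"].get(entity_id, {}))
    return merged

customer_features = features_for("customer", "c1")
```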

Feature Lineage 

As the feature store begins to grow and capture more of the features utilized for ML work, it’s natural to begin asking questions like “Which models are affected if I update this feature?” It’s not surprising that feature stores may surface much of this type of information in feature lineage. This allows users an easy way to visualize how features/feature sets/entities are connected to each other, the column/tables/transformations involved, and any associated models. As feature discoverability and re-use is an important selling point of the feature store, feature lineage is crucial to allow users to understand how different features are used and the relationships that exist in the feature store. 
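The question “Which models are affected if I update this feature?” amounts to a graph traversal over lineage metadata. The following is a minimal sketch under the assumption that lineage is stored as a mapping from each node to the nodes it feeds; the node names and the function affected_models are invented for illustration.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the nodes listed as its value.
downstream = {
    "table:customer_transactions": ["feature:orders_90d"],
    "feature:orders_90d": ["feature_set:customer_activity"],
    "feature_set:customer_activity": ["model:churn", "model:upsell"],
    "feature:is_holiday": ["model:sales_forecast"],
}

def affected_models(node):
    """Breadth-first walk of the lineage graph, returning every model reachable from node."""
    seen, queue, models = {node}, deque([node]), set()
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
                if child.startswith("model:"):
                    models.add(child)
    return sorted(models)

impacted = affected_models("feature:orders_90d")
```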

Point-in-Time Correctness or Feature Time Travel 

Although it’s easy to get accustomed to the Kaggle style of data science, which presents ML problems as static flat files, it’s crucial to remember that real-life problems not only involve many different data sources but also often have a time component. Reasoning about time can quickly become complex and stump even the most tenured of data scientists. The risk of mishandling temporal data in an ML problem can be grave, as you can begin to leak data from the future and invalidate any results.

Feature stores are not only helpful with managing complex transformations and understanding the relationship between feature sets and entities; they also provide a solution for temporal data. When training or serving data is constructed in the feature store, it is constructed in a time-consistent manner. This is generally known as point-in-time correctness or “feature time travel.” The idea is that every record in the data set has an associated timestamp. The feature store can then reconstruct all the supporting features connected to that record as they existed at the point in time specified by the timestamp. The end result is that the data is always time-consistent and users are saved from having to track timestamps across multiple feature sets or entities.

To understand why this is so important, consider a standard customer churn prediction.  While a user’s current features – such as their activity on your website over the last 30 days – are what matters to predict future churn, their historical behavior is what matters for model training.  The feature store is able to construct the correct data sets for both use cases without intervention from the user. 
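The core of point-in-time correctness is an “as of” lookup: for each training record, take the most recent feature value recorded at or before that record’s timestamp, and never a later one. Below is a minimal sketch of that idea; the data, the integer timestamps, and the helper value_as_of are assumptions for illustration rather than any specific feature store’s API.

```python
from bisect import bisect_right

# Hypothetical feature history for one customer: (timestamp, value), sorted by timestamp.
# Timestamps are plain integers (e.g. day numbers) to keep the example short.
page_views_history = [(1, 10), (5, 25), (9, 40)]

def value_as_of(history, timestamp):
    """Return the latest value recorded at or before timestamp, or None if there is none."""
    times = [t for t, _ in history]
    idx = bisect_right(times, timestamp)
    return history[idx - 1][1] if idx > 0 else None

# Training labels carry their own timestamps; each row only sees features known at that time.
labels = [(4, "stayed"), (6, "stayed"), (10, "churned")]
training_rows = [(value_as_of(page_views_history, ts), label) for ts, label in labels]
```

The row labeled at timestamp 4 sees the value 10, not the value 25 that was only recorded at timestamp 5, which is exactly the leak from the future that this technique prevents.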

Model Encapsulation

As we saw above, feature stores provide a means to serve predictions for online and offline use cases. In this day and age, it’s not uncommon for an ML platform to have some way to deploy a model as an API endpoint. This often demos well: simply train and deploy a model, then construct a JSON object with all the features needed, and the endpoint will pass it into the model for a quick evaluation. This looks good, but in real production use cases you’ll often find that the downstream application developer is not too impressed that it’s their responsibility to keep track of which features are needed. What’s even more annoying is that any time the model is updated with new features, the application also has to be updated to accommodate the new API.

Feature stores provide a much more elegant solution to this problem. Instead of expecting applications to provide all the base features needed for a prediction, they should instead simply provide the IDs for the entities (e.g. customers, drivers, products) involved in the prediction. The feature store can then look up any features it needs to make the prediction and return the result to the application. Not only does this simplify the workflow significantly for the application developer by hiding the model’s exact feature dependencies, but it also can ensure that there is no training/serving skew.

Although the feature store is dynamic and is always getting new data, it’s expected that there will likely be some lag in how up-to-date the data is in the feature store, depending on the architecture. Some use cases will necessitate that applications can provide feature overrides when they know they have the most recent data and wish to leverage those features instead of what is in the feature store (for example, a mobile application will always know a user’s current location better than the feature store will).  A feature store still allows these client features to be selectively added without exposing the full data requirements to the application developer.
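The pattern described in this section can be sketched as a prediction function that takes only an entity ID plus optional overrides. Everything here is hypothetical: stored_features stands in for the online store, toy_model for a trained model, and the score values are made up.

```python
# Hypothetical stored features, keyed by customer ID.
stored_features = {"c1": {"page_views_30d": 7, "last_city": "Austin"}}

def toy_model(features):
    """Stand-in for a trained model; returns a made-up churn score."""
    return 0.9 if features["page_views_30d"] < 5 else 0.1

def predict(customer_id, overrides=None):
    """Look up features by entity ID, apply any client overrides, then score."""
    features = dict(stored_features[customer_id])  # copy so overrides never mutate the store
    features.update(overrides or {})
    return toy_model(features)

score = predict("c1")
fresh_score = predict("c1", overrides={"page_views_30d": 2})
```

The application never learns which features the model consumes; if the model later needs a new feature, only the lookup behind predict changes.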

Data and Model Drift 

Data and model drift are increasingly important concepts in MLOps and provide a way for users to understand when the state of their data is undergoing change. For operational use cases, the feature store is a natural place to begin tracking additional information about features, such as: What is the distribution of the data? How many nulls are there? What are the boundaries of the features? Feature stores can periodically profile the latest data and an operational system can then make use of this information as it begins to train models and make predictions. How does the data at prediction time compare to when the model was trained? If it has changed by a substantial amount, that may be a sign that it is time to retrain. 
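A simple version of this profiling and comparison might look like the sketch below. The drift rule used here (flag the feature when the live mean moves more than a chosen number of training standard deviations) is one basic heuristic picked for illustration, not a standard prescribed by any feature store; the function names and the threshold are assumptions.

```python
from statistics import mean, pstdev

def profile(values):
    """Summarize one feature column: null rate, range, mean, and standard deviation."""
    present = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(present) / len(values),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "std": pstdev(present),
    }

def has_drifted(train_profile, live_profile, threshold=2.0):
    """Flag drift when the live mean is more than `threshold` training std devs away."""
    if train_profile["std"] == 0:
        return live_profile["mean"] != train_profile["mean"]
    shift = abs(live_profile["mean"] - train_profile["mean"]) / train_profile["std"]
    return shift > threshold

train_profile = profile([10, 12, 11, None, 9, 13])
live_profile = profile([20, 22, 21, 19])
needs_retraining = has_drifted(train_profile, live_profile)
```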

The Pros and Cons of Using a Feature Store 

Now that we have a better understanding of the capabilities of a feature store, we can talk about the benefits of using them. 

  • Enforces isolation between data science and data engineering and MLOps:  ML workflows tend to be complex and comprise many different steps. A feature store naturally decouples data engineering from data science, as it provides a destination for the output of data engineering workflows and a natural starting point for ML use cases. Similarly, feature stores separate model training from feature serving which provides a natural barrier between data science and MLOps. This imposes structure around the ML workflow which yields efficiencies and avoids complex “pipeline jungles.” 
  • Enables feature sharing: By enabling feature re-use and discoverability, feature stores help organizations better collaborate on ML use cases. Domain experts can quickly register features in the feature store, which can then be readily used in a variety of use cases across teams. The end result is that robust feature sharing eliminates duplicate feature engineering efforts and makes everyone more efficient. 
  • Prevents data leakage and training/serving skew: Data leakage and training/serving skew are two of the most common obstacles when trying to get ML use cases into production. Feature stores provide elegant solutions for both of these issues, as we previously discussed.  
  • Accelerates use case adoption: When a feature store is part of an operational ML system, updating a use case is typically as simple as modifying a feature set. Users can quickly prototype and iterate on use cases and push the best models into production in a snap. This wouldn’t be possible without a feature store. 
  • Democratizes ML: Many ML tools make bold claims to help democratize the ML workflow, but at the end of the day users are often stuck writing code or building complex pipelines, not to mention trying to understand complicated ML concepts. Because a feature store can speak the language of the business via entities, many more users can begin to contribute to ML use cases, either by registering new features for use or by explicitly building models via a data-first operational ML platform.

Of course, not everything is perfect. Here are some things to consider before jumping into the feature store world:

  • Difficult to build and maintain: Feature stores are fantastic for the ML workflow, but they can be very challenging to build and maintain. Although many high-tech companies are leveraging them, years of development work went into creating them and this is not something that many organizations have an appetite for. I strongly encourage you to check out “buy” options before building your own. 
  • Requires expertise to use: Not all feature stores are created equal. A quick scan through the documentation of some can reveal that these are pretty complex systems that will take time for your organization to master. Ideally, try to select a feature store that is designed for ease of use and plays well with your existing data stack, as it will allow your team to iterate faster and bring more data professionals in to collaborate.
  • Can be “yet another” integration point: If you’ve been keeping up with my blog posts, you’ll know I’m not an advocate of ML systems that require complex integrations that accrue significant technical debt. Standalone feature stores can be tricky to effectively integrate into your ML workflow and don’t fully realize the potential of a truly data-first approach to operational AI/ML.
  • Avoid if research-focused: If your ML team is completely focused on research topics and doesn’t worry about operationalizing its work, a feature store is very likely overkill and you probably can avoid the complexity without missing much. 
  • Beware of imposters! Since feature stores are very trendy, many products claim to “have the same functionality” even though they aren’t called feature stores. Don’t fall for these traps and accept no imitations! Here’s a quick list of things that are not feature stores: data catalogs, static data collections, ETL/ELT tools, and data warehouses.

Additionally, although a feature store is critical to an operational ML platform, having a feature store does not necessarily mean you also have an operational platform (it’s a necessary, but not sufficient, condition).

Bringing Feature Stores to the Modern Data Stack 

There are many different form factors feature stores take today.  As part of Continual, we deliver a feature store that builds directly on your cloud data warehouses and enables a fully declarative workflow that radically simplifies operational AI.  By putting data first and considering the end-to-end ML workflow, Continual enables data teams to deploy continually improving predictive models – from customer churn to inventory forecasts  – in hours or days, not weeks or months.  This is the true potential of feature stores when fully integrated into a declarative AI workflow.

To see the difference that having a feature store can make on your ML workflows, get started with a trial of Continual today. You can also check out the replay of my presentation on Building a SQL-Centric Feature Store at a recent Feature Stores for ML Global Meetup.
