September 28, 2021
Feature stores have arrived in 2021 as an essential piece of technology for operationalizing AI. Despite the enthusiasm for feature stores in high-tech companies, they are still absent from most legacy ML platforms and can be relatively unknown in many enterprise companies. We discussed how feature stores are critical to the data-first approach of next-gen ML platforms in our previous blog, but they are important enough to get their own treatment in a full article. Here we’ll cover the common features of a feature store, as well as the pros and cons of adopting one in your own work. You can also check out the replay of my presentation Building a SQL-Centric Feature Store at a recent Feature Stores for ML Global Meetup.
The term “feature store” is often defined ambiguously. We’ll first provide a concrete definition and then discuss its common features and benefits.
Simply put, a feature store is a data management system that manages and serves features to machine learning models. What is a feature, you ask? In layperson’s terms, a feature is a descriptive attribute that’s relevant to predicting how something will behave in the world. For example, a customer’s purchase history is relevant to predicting what that customer will buy next, while the first letter of the customer’s name is not. For the more technically inclined, a feature is one input into a machine learning model. In the common use case of tabular data, a feature corresponds to a column in a table. For a customer churn model, for instance, a good feature might be the number of customer page views over the last 30 days. Or for a sales forecasting use case, a feature might be whether a particular day is a holiday.
Features are often collected into feature sets. Using our example of tabular data again, a feature set would be the table itself. In this context, a feature store understands the data (i.e. features) needed by a model, either for model training or to serve a prediction, as well as how that data connects to other data and the transformations required to get it into the right format. Hopefully, this sounds simple! But, the truth is that designing and building a feature store is actually quite a complex undertaking.
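To make this concrete, here’s a minimal sketch (using pandas and entirely made-up data) of computing the 30-day page-view feature mentioned above from a raw events table:

```python
import pandas as pd

# Hypothetical raw event data: one row per page view.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 1, 2],
    "viewed_at": pd.to_datetime([
        "2021-09-01", "2021-09-10", "2021-09-12",
        "2021-09-20", "2021-08-01",
    ]),
})

# Feature: number of page views per customer over the last 30 days.
as_of = pd.Timestamp("2021-09-28")
recent = events[events["viewed_at"] >= as_of - pd.Timedelta(days=30)]
page_views_30d = recent.groupby("customer_id").size().rename("page_views_30d")
print(page_views_30d.to_dict())  # {1: 3, 2: 1}
```

In a real feature store, this same logic would be defined once, shared, and computed over the warehouse rather than an in-memory frame.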
The feature store concept was most prominently publicized by Uber’s Michelangelo ML platform, where it played an important part in helping Uber operationalize ML. In their own words:
We found great value in building a centralized Feature Store in which teams around Uber can create and manage canonical features to be used by their teams and shared with others… At the moment, we have approximately 10,000 features in Feature Store that are used to accelerate machine learning projects, and teams across the company are adding new ones all the time. Features in the Feature Store are automatically calculated and updated daily.
At the time, companies like Uber who were rapidly innovating to make ML an essential part of their business were discovering several obstacles with traditional approaches to ML.
Feature stores help with these issues, and more. We’ll discuss this in more detail in the following section.
Not long after Uber’s Michelangelo blog post, Google and Gojek released Feast, an open-source feature store. In the years that followed, many other companies began discussing their internal ML platforms, and it wasn’t uncommon to see a feature store, or a similar component, at the center of an operational ML platform. Given that so many companies are struggling to extract value from their ML investments, it stands to reason that feature stores may be a key component in simplifying operational ML.
In the last few years, feature stores have begun to creep into commercial offerings, whether as a standalone product, like Tecton, an add-on to existing platforms, like Sagemaker and Databricks, or as part of an operational platform, like Continual. There is now a lot of variety in the feature store space, and they come in all shapes and sizes. In this section, we will break down what we believe are essential features for a feature store.
One of the most important aspects of a feature store is tracking which transformations happen to which data sources. This can be rather mundane stuff like simple joins and aggregations, or more complex stuff like window operations. Additionally, a feature store may offer advanced capabilities for automated feature engineering, such as feature encodings, date extraction, or even deep feature synthesis. It’s important to note here that a feature store need not A) actually perform these operations or B) store the resulting data. Other parts of the ML platform could be responsible for this work. But whatever the physical mechanism, it should be easy for data teams to add and use features from a shared repository.
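As a toy illustration of the idea that the store tracks transformations rather than necessarily executing or storing their results, here is a hypothetical in-memory registry. The `FeatureRegistry` class and its API are invented for this sketch and don’t correspond to any particular product:

```python
from dataclasses import dataclass, field
from typing import Callable
import pandas as pd

@dataclass
class FeatureRegistry:
    """Toy feature registry: records named transformations over source
    tables. The store tracks *what* to compute; execution can happen
    elsewhere (e.g. pushed down to a warehouse)."""
    transformations: dict = field(default_factory=dict)

    def register(self, name: str, source: str,
                 fn: Callable[[pd.DataFrame], pd.Series]) -> None:
        self.transformations[name] = {"source": source, "fn": fn}

    def materialize(self, name: str, df: pd.DataFrame) -> pd.Series:
        # Apply the stored transformation to a concrete source table.
        return self.transformations[name]["fn"](df)

registry = FeatureRegistry()
registry.register(
    "total_spend",
    source="customer_transactions",
    fn=lambda df: df.groupby("customer_id")["amount"].sum(),
)

txns = pd.DataFrame({"customer_id": [1, 2, 1], "amount": [10.0, 5.0, 7.5]})
print(registry.materialize("total_spend", txns).to_dict())  # {1: 17.5, 2: 5.0}
```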
Apart from managing features, the other main task of a feature store is to serve those features to models. This happens either in real time during an online prediction or offline during batch predictions or batch training. There’s an important distinction between these two use cases. Offline mode requires generating a (possibly large) dataset of many records across many different features and providing it to the model for training or prediction. Speed is generally less pressing here, but the system needs to be able to handle large data sets. Online mode requires generating features for one or more records to make a real-time prediction. Speed is very important, and there are use cases where this entire process needs to happen in under a second. A key observation is that these two use cases require very different architectures, but it’s the job of the feature store to abstract away this complexity and provide a simple serving layer.
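One way to picture this dual architecture is a toy serving layer that keeps an offline table for batch reads and a precomputed key-value cache for low-latency online lookups. Everything here (the `FeatureServer` class and its data) is a hypothetical sketch, not a real system:

```python
import pandas as pd

class FeatureServer:
    """Sketch of a dual serving layer: an offline table for large batch
    reads and an online key-value cache for low-latency lookups."""

    def __init__(self, offline_table: pd.DataFrame, key: str):
        self.offline_table = offline_table
        self.key = key
        # Online store: latest features precomputed and keyed by entity id.
        self.online_store = {
            row[key]: row.drop(key).to_dict()
            for _, row in offline_table.iterrows()
        }

    def get_training_data(self, feature_names):
        # Offline mode: return a full dataset for training/batch scoring.
        return self.offline_table[[self.key] + feature_names]

    def get_online_features(self, entity_id, feature_names):
        # Online mode: a single dictionary lookup, fast enough for
        # real-time serving.
        row = self.online_store[entity_id]
        return {f: row[f] for f in feature_names}

features = pd.DataFrame({
    "customer_id": [1, 2],
    "page_views_30d": [12, 3],
    "total_spend": [17.5, 5.0],
})
server = FeatureServer(features, key="customer_id")
print(server.get_online_features(2, ["page_views_30d"]))  # {'page_views_30d': 3}
```

In production the two paths would be backed by genuinely different systems (a warehouse vs. a low-latency store), but the point is that callers see one interface.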
If designed properly, a feature store allows data professionals to stop solving problems in terms of tables and columns and start solving them in terms of actual business objects, like customers and products. It turns a customer churn problem from an exercise in fetching and joining tables like customer_info, customer_transactions, customer_interactions, product_info, product_usage, etc. into one of linking to the customer and product entities. A feature store can understand all the relevant data for these business objects, or entities, in an organization and knows how to correctly connect everything together.
As the feature store grows to capture more of the features used in ML work, it’s natural to begin asking questions like “Which models are affected if I update this feature?” Feature stores can surface this type of information through feature lineage, which gives users an easy way to visualize how features, feature sets, and entities are connected to each other, the columns, tables, and transformations involved, and any associated models. Since feature discoverability and reuse are important selling points of the feature store, feature lineage is crucial for letting users understand how different features are used and the relationships that exist in the feature store.
Although it’s easy to get accustomed to the Kaggle style of data science, which presents ML problems as static flat files, it’s crucial to remember that real-life problems not only involve many different data sources but also often have a time component. Reasoning about time can quickly become complex and stump even the most tenured of data scientists. The risk of mishandling temporal data in an ML problem is grave: you can leak data from the future and invalidate any results.
Feature stores are not only helpful with managing complex transformations and understanding the relationships between feature sets and entities, but they also provide a solution for temporal data. When the feature store constructs training or serving data, it does so in a time-consistent manner -- this is generally known as point-in-time correctness or “feature time travel.” The idea is that every record in the data set is associated with a timestamp. The feature store can then reconstruct all the supporting features connected to that record as they existed at the point in time specified by the timestamp. The end result is that the data is always time-consistent, and users are saved from having to track timestamps across multiple feature sets or entities.
To understand why this is so important, consider a standard customer churn prediction. While a user’s current features – such as their activity on your website over the last 30 days – are what matters to predict future churn, their historical behavior is what matters for model training. The feature store is able to construct the correct data sets for both use cases without intervention from the user.
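Point-in-time joins like this can be sketched with pandas’ `merge_asof`, which picks, for each label, the latest feature snapshot at or before the label’s timestamp. The data below is invented for illustration:

```python
import pandas as pd

# Label events: whether the customer churned, and when the label was observed.
labels = pd.DataFrame({
    "customer_id": [1, 1],
    "label_time": pd.to_datetime(["2021-06-01", "2021-09-01"]),
    "churned": [0, 1],
})

# Feature snapshots computed over time for the same customer.
features = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "feature_time": pd.to_datetime(["2021-05-15", "2021-07-15", "2021-08-20"]),
    "page_views_30d": [40, 22, 3],
})

# merge_asof joins each label to the latest feature snapshot at or
# before the label timestamp -- no future data leaks into training.
training = pd.merge_asof(
    labels.sort_values("label_time"),
    features.sort_values("feature_time"),
    left_on="label_time",
    right_on="feature_time",
    by="customer_id",
)
print(training[["label_time", "page_views_30d", "churned"]])
```

Here the June label is paired with the May snapshot (40 views) and the September label with the August snapshot (3 views); a naive join on `customer_id` alone would have leaked later activity into the earlier training example.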
As we saw above, feature stores provide a means to serve predictions for online and offline use cases. In this day and age, it’s not uncommon for an ML platform to have some way to deploy a model as an API endpoint. This often demos well: simply train and deploy a model, then construct a JSON object with all the features needed, and the endpoint will pass it into the model for a quick evaluation. This looks good, but in real production use cases you’ll often find that the downstream application developer is not too impressed that it’s their responsibility to keep track of which features are needed. What’s even more annoying is that any time the model is updated with new features, the application also has to be updated to accommodate the new API.
Feature stores provide a much more elegant solution to this problem. Instead of expecting applications to provide all the base features needed for a prediction, they should instead simply provide the ids for the entities (e.g. customers, drivers, products) involved in the prediction. The feature store can then look up any features it needs to make the prediction and return the result to the application. Not only does this simplify the workflow significantly for the application developer by hiding the model’s exact feature dependencies, but it also can ensure that there is no training/serving skew.
Although the feature store is dynamic and is always getting new data, it’s expected that there will likely be some lag in how up-to-date the data is in the feature store, depending on the architecture. Some use cases will necessitate that applications can provide feature overrides when they know they have the most recent data and wish to leverage those features instead of what is in the feature store (for example, a mobile application will always know a user’s current location better than the feature store will). A feature store still allows these client features to be selectively added without exposing the full data requirements to the application developer.
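A rough sketch of both ideas, serving by entity id and client-side feature overrides, might look like the following. The `PredictionService` class and the toy model are hypothetical, invented only for illustration:

```python
class PredictionService:
    """Sketch: the application sends only entity ids (plus optional
    overrides); the feature store resolves everything else."""

    def __init__(self, feature_store: dict, model):
        self.feature_store = feature_store  # entity_id -> feature dict
        self.model = model

    def predict(self, customer_id, overrides=None):
        features = dict(self.feature_store[customer_id])
        # Client-supplied features (e.g. current location from a mobile
        # app) take precedence over possibly stale stored values.
        if overrides:
            features.update(overrides)
        return self.model(features)

# Toy "model": flags customers with low recent activity as churn risks.
model = lambda f: f["page_views_30d"] < 5

store = {42: {"page_views_30d": 3, "total_spend": 17.5}}
service = PredictionService(store, model)
print(service.predict(42))                                    # True
print(service.predict(42, overrides={"page_views_30d": 20}))  # False
```

Note that if the model is retrained with new features, only the store’s lookup changes; the application keeps sending the same entity id.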
Data and model drift are increasingly important concepts in MLOps and provide a way for users to understand when the state of their data is undergoing change. For operational use cases, the feature store is a natural place to begin tracking additional information about features, such as: What is the distribution of the data? How many nulls are there? What are the boundaries of the features? Feature stores can periodically profile the latest data and an operational system can then make use of this information as it begins to train models and make predictions. How does the data at prediction time compare to when the model was trained? If it has changed by a substantial amount, that may be a sign that it is time to retrain.
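A bare-bones version of this kind of drift check, with a made-up threshold rule (flag drift when the current mean moves more than two training standard deviations), might look like:

```python
import math

def profile(values):
    """Record simple summary statistics for a feature at training time."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"mean": mean, "std": math.sqrt(var)}

def drifted(training_profile, current_values, threshold=2.0):
    """Flag drift when the current mean moves more than `threshold`
    training standard deviations away from the training mean."""
    current_mean = sum(current_values) / len(current_values)
    shift = abs(current_mean - training_profile["mean"])
    return shift > threshold * training_profile["std"]

train = profile([10, 12, 11, 9, 13])   # mean 11, std ~1.41
print(drifted(train, [11, 10, 12]))    # False: distribution unchanged
print(drifted(train, [30, 28, 31]))    # True: large shift -> maybe retrain
```

A real system would track richer profiles (null rates, quantiles, distribution distances) per feature and per snapshot, but the mechanism is the same: compare prediction-time data to its training-time profile.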
Now that we have a better understanding of the capabilities of a feature store, we can talk about the benefits of using them.
Of course, not everything is perfect. There are some things to consider before jumping into the feature store world.
Additionally, although a feature store is critical to an operational ML platform, having a feature store does not necessarily mean you also have an operational platform (it’s a necessary, but not sufficient, condition).
There are many different form factors feature stores take today. As part of Continual, we deliver a feature store that builds directly on your cloud data warehouses and enables a fully declarative workflow that radically simplifies operational AI. By putting data first and considering the end-to-end ML workflow, Continual enables data teams to deploy continually improving predictive models – from customer churn to inventory forecasts – in hours or days, not weeks or months. This is the true potential of feature stores when fully integrated into a declarative AI workflow.
To see the difference that having a feature store can make on your ML workflows, get started with a trial of Continual today. You can also check out the replay of my presentation on Building a SQL-Centric Feature Store at a recent Feature Stores for ML Global Meetup.