Putting a two-layered recommendation system into production. Bonus: we reveal the dataset!

FunCorp

May 20, 2022

Recommendation systems will always stay relevant: users want to see personalized content, the best of the catalog (in the case of our iFunny app, trending memes and jokes). Our team is testing dozens of hypotheses on how a smart feed can improve user experience. This article describes how we implemented a second, ranking level of the model on top of the collaborative filtering one: what difficulties we encountered, and how they affected the metrics.

Usually, a matrix factorization, such as implicit.ALS, is used to help improve the feed. This method produces an embedding for each user and each item, and the items whose embeddings are closest (by cosine similarity) to the user’s embedding end up in the top recommendations.

The method works quickly but does not consider all the information (for example, it will be challenging to add the sex and age of the user in the matrix factorization).
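The embedding lookup described above can be sketched in a few lines. This is a minimal illustration in plain Python, not our production code: the two-dimensional vectors, the item ids, and the helper names `cosine` and `top_k_items` are all made up for the example, and a real system would use a trained factorization and a vectorized or approximate nearest-neighbor search.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_u * norm_v)


def top_k_items(user_vec, item_vecs, k):
    """Return the ids of the k items closest to the user, best first."""
    ranked = sorted(
        item_vecs,
        key=lambda item_id: cosine(user_vec, item_vecs[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings, for illustration only.
user_vec = [1.0, 0.0]
item_vecs = {
    "meme_a": [0.9, 0.1],
    "meme_b": [0.0, 1.0],
    "meme_c": [0.5, 0.5],
}
top2 = top_k_items(user_vec, item_vecs, k=2)
```

Here `meme_a` points almost in the same direction as the user vector, so it ranks first, and `meme_b` is orthogonal to it, so it is dropped.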

Two-level recommendation systems attempt to use more complex algorithms without abandoning the “timeless classics.”

Implementation of a two-layered recommendation system

We need to “assemble” a top solution from various algorithms. What we have:

A user-item matrix of user’s interactions with content
Tabular user features (gender, age, etc.)
Tabular item features (tags, meme text, etc.)

We can feed the user and item features to a boosting algorithm. Let’s take LightGBM, which works very well with tabular data (often better than neural networks). We can also pass in our matrix if we treat the pair [user_id, item_id] as a categorical feature with cardinality Users * Items and apply one-hot encoding. But two problems arise here:

If the user has not interacted with the item before, this feature will be null
With too high cardinality, boosting will not handle that many splits

Let’s recall that we have ALS (Alternating Least Squares) matrix factorization, so we can compute the scalar product of the user’s embedding with the embedding of each item. This effectively replaces the high-cardinality feature [user_id, item_id] and solves both problems above. The scalar product can now be used as a powerful feature in boosting alongside the “tabular” features. This way, we use all the available data.
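As a rough sketch of this idea, the scalar product becomes one more column in the row passed to the second-level model. The feature names (`als_score`, `user_age`, `item_is_video`) and the helper `build_row` are hypothetical and only illustrate the shape of the data.

```python
def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_row(user_vec, item_vec, user_features, item_features):
    """Assemble one row for the second-level model.

    The first-level score replaces the raw [user_id, item_id] pair,
    and the tabular features are appended as ordinary columns.
    """
    row = {"als_score": dot(user_vec, item_vec)}
    row.update(user_features)
    row.update(item_features)
    return row


# Hypothetical embeddings and features, for illustration only.
row = build_row(
    user_vec=[0.5, 2.0],
    item_vec=[1.0, 0.25],
    user_features={"user_age": 27},
    item_features={"item_is_video": 0},
)
```

The score is defined for any user-item pair, including pairs with no previous interaction, and it is a single numeric column, so the boosting model does not have to split on a huge categorical feature.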

How do you apply these mechanics in production? We want to show the user the top 100 memes out of hundreds of thousands in the catalog. Boosting is powerful, but its inference is fairly slow, so scoring hundreds of thousands of items per request is not feasible. That’s why we make recommendations in two stages:

From hundreds of thousands of items, we select the “candidates”: the top 100 items whose embeddings are closest to the user’s embedding.
We rank these 100 items with the boosting model.
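The two stages can be sketched as follows. This is a simplified illustration: the function `recommend`, its parameters, and the toy data are assumptions, the first stage uses a plain scalar product instead of a real nearest-neighbor index, and `ranker` is a stand-in for the trained boosting model.

```python
import heapq


def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend(user_vec, item_vecs, ranker, n_candidates, n_final):
    """Two-stage recommendation: cheap candidate selection, then re-ranking."""
    # Stage 1: score every item with the cheap embedding product
    # and keep only the best n_candidates.
    candidates = heapq.nlargest(
        n_candidates,
        item_vecs,
        key=lambda item_id: dot(user_vec, item_vecs[item_id]),
    )
    # Stage 2: apply the expensive model to the candidates only.
    reranked = sorted(candidates, key=ranker, reverse=True)
    return reranked[:n_final]


# Hypothetical data: four items, and a lookup table standing in
# for the second-level model's predicted scores.
item_vecs = {"a": [1.0, 0.0], "b": [0.8, 0.0], "c": [0.1, 0.0], "d": [0.0, 1.0]}
second_level_score = {"a": 0.2, "b": 0.7, "c": 0.9, "d": 0.1}

result = recommend(
    user_vec=[1.0, 0.0],
    item_vecs=item_vecs,
    ranker=second_level_score.get,
    n_candidates=2,
    n_final=2,
)
```

Note that item `c` has the highest second-level score but never reaches the ranker, because it did not pass candidate selection. This is the trade-off of the scheme: the second level can only reorder what the first level lets through.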

The method is relatively fast and proven: this is how many large production systems work. For example, read the article by the creators of Milvus, a company that makes a product for candidate selection. If there is a lot of data, there can be several levels of candidate selection.

But, of course, not without problems. We’ll describe them below.

Model validation

Choosing a suitable validation scheme (i.e., the way to split a dataset into train and validation) is critical for ML pipelines. Ideally, a change in metrics on validation should coincide with the change in online metrics during model operation (i.e., improved metrics on validation = improved metrics online).

We considered the following options:

Random split — training on 90% of the dataset, using the remaining 10% for validation
Time split — we cut off several hours from the dataset
This article had an interesting idea: break the interactions into sessions and validate on the last item in each session

Due to the specifics of the application, new content appears every hour, so we constantly have to solve the “cold start” problem of recommending content with few interactions. That is why the second option looked preferable, and we chose it. Besides, this scheme was already used in production (we log this metric in Grafana), which makes it easier to compare offline and online metrics.
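A time split of this kind might look like the sketch below. The tuple layout `(user_id, item_id, timestamp)`, the function name `time_split`, and the two-hour holdout are assumptions made for the example, not the exact values we use.

```python
def time_split(interactions, holdout_seconds):
    """Hold out the most recent `holdout_seconds` of data for validation.

    `interactions` is a list of (user_id, item_id, timestamp) tuples,
    with the timestamp in seconds.
    """
    latest = max(ts for _, _, ts in interactions)
    cutoff = latest - holdout_seconds
    train = [row for row in interactions if row[2] <= cutoff]
    valid = [row for row in interactions if row[2] > cutoff]
    return train, valid


# Hypothetical interactions, one per hour.
interactions = [
    ("u1", "m1", 0),
    ("u2", "m1", 3600),
    ("u1", "m2", 7200),
    ("u2", "m3", 10800),
]
train, valid = time_split(interactions, holdout_seconds=2 * 3600)
```

In this toy data, item `m3` appears only in the validation part, which is exactly the cold-start situation the split is meant to expose.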

One of the difficulties of comparing matrix factorization with a hybrid recommender is that they have different targets. At the first level, we try to predict the interaction itself: the matrix factorization loss approximates the user-item matrix as closely as possible. In contrast, the second level often uses ranking loss functions.

The ranking loss did not show a significant increase in quality, so we decided to use LogLoss. On the second level, we solve the classification problem and learn to determine the content with which there will be a positive interaction: likes, shares, and reposts.
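To make the classification setup concrete, here is a small sketch of how a binary target and LogLoss could be computed. The event names in `POSITIVE_EVENTS` and the helper names are hypothetical, and the loss is written out by hand only for illustration; in practice the boosting library computes it internally.

```python
import math

# Assumed event names; the real event schema may differ.
POSITIVE_EVENTS = {"like", "share", "repost"}


def make_label(event):
    """1 for a positive interaction, 0 for anything else (e.g. a plain view)."""
    return 1 if event in POSITIVE_EVENTS else 0


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


events = ["like", "view", "share", "view"]
labels = [make_label(e) for e in events]
# A model that always answers 0.5 gets a loss of ln(2), a useful baseline.
loss = log_loss(labels, [0.5, 0.5, 0.5, 0.5])
```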

For an easier and more precise comparison of the output, we built a visualization of the recommended content in Streamlit.

Feature Engineering

The primary motivation to switch to the two-level model was the possibility of enriching the model with new features. In addition to the matrix factorization from the first level, we can distinguish two other types of features:

“Tabular” — the constants that do not change over time: gender and age of the user, the type of content (picture/video), and others;
“Statistical” — various counters that change when you interact with the service: the average length of the session, the number of visits to the application, the number of likes in different content categories, etc.

Everything was straightforward with the “tabular” features: during inference, the model takes content attributes from an online store (in our case, Mongo), and offline, during training, we use a copy of the same data transferred to Clickhouse.

Statistical features were initially computed in Clickhouse using window functions and other means. It quickly turned out that the production solution was “tuned” to pull data from Kafka, while the prototype was supposed to use data from Clickhouse. Different sources are a potential source of discrepancies, so we decided to use Kafka as the data source for data preparation in the MVP.

Initially, we planned to use many different statistical features. Still, to speed up the deployment of the MVP to production, we decided to stop at the simplest ones, for example:

For a meme, we compute smile_rate: the ratio of the number of smiles to the total number of content views.
For the user, we compute img_smile_rate (for pictures) and video_smile_rate (for videos) to understand which type of content the user prefers.
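The two counters above could be computed roughly like this. The sketch assumes a simple log of views as `(user_id, item_id, content_type, smiled)` tuples, which is an illustrative format and not our actual Kafka schema; the function name `smile_rates` is also hypothetical.

```python
from collections import defaultdict


def smile_rates(views):
    """Compute smile rates per item and per (user, content type).

    `views` is an iterable of (user_id, item_id, content_type, smiled)
    tuples, one per content view, where `smiled` is 1 or 0.
    """
    item_views, item_smiles = defaultdict(int), defaultdict(int)
    user_views, user_smiles = defaultdict(int), defaultdict(int)
    for user_id, item_id, content_type, smiled in views:
        item_views[item_id] += 1
        item_smiles[item_id] += smiled
        user_views[(user_id, content_type)] += 1
        user_smiles[(user_id, content_type)] += smiled
    item_rate = {i: item_smiles[i] / n for i, n in item_views.items()}
    user_rate = {k: user_smiles[k] / n for k, n in user_views.items()}
    return item_rate, user_rate


# Hypothetical view log.
views = [
    ("u1", "m1", "img", 1),
    ("u1", "m2", "img", 0),
    ("u1", "m3", "video", 1),
    ("u2", "m1", "img", 0),
    ("u2", "m3", "video", 1),
    ("u1", "m4", "video", 0),
]
item_rate, user_rate = smile_rates(views)
img_smile_rate_u1 = user_rate[("u1", "img")]
video_smile_rate_u2 = user_rate[("u2", "video")]
```

Rates based on very few views are noisy, so some smoothing or a minimum-views threshold may be worth adding; that is left out here to keep the sketch short.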

For the future, we have agreed with the ML-engineering team to develop a service from which we can get features for inference and training. There will be flexible pipelines for adding new features and delivering these features to the model.

To wrap up the topic of features, I’ll note an important point: before deploying to production, it is essential to discuss feature transformations with the ML engineers. For example, device type (iPhone, iPad, Android) is a categorical feature with one-hot encoding, and if you use different encodings during training and inference, the model will not work correctly. To mitigate the risks, we created a page in Confluence where we described the input data transformations, and we checked the recommendations for several test users (the ranking of the model trained in Jupyter should not differ from the ranking of the model in production).
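One simple way to reduce this risk is to keep the category order in a single shared definition that both the training and the inference code import. The sketch below assumes the three device types mentioned above; the names `DEVICE_TYPES` and `one_hot_device`, and the choice to encode unknown devices as all zeros, are illustrative assumptions.

```python
# Single source of truth for the column order, imported by both
# the training pipeline and the inference service.
DEVICE_TYPES = ("iPhone", "iPad", "Android")


def one_hot_device(device_type):
    """Encode a device type as a one-hot vector in the fixed DEVICE_TYPES order.

    Unknown device types map to all zeros (an assumption made here so that
    inference does not fail on values that were unseen during training).
    """
    return [1 if device_type == known else 0 for known in DEVICE_TYPES]


encoded_ipad = one_hot_device("iPad")
encoded_unknown = one_hot_device("SmartFridge")
```

If training sorted the categories alphabetically while production used the order above, the same device would light up a different column, and the model would silently produce wrong scores. That is the kind of mismatch the test-user comparison is meant to catch.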

CatBoost vs. LightGBM

When the features are ready, we need to choose a specific boosting implementation. There are many of them, but the most popular now are CatBoost and LightGBM. Each framework has its own specifics in preparing features, so I recommend a little lifehack: after studying the documentation, take a look at the actual code.

We rejected CatBoost, as the splits in its trees behaved strangely on our data (see the last level in the picture: the split uses the same value in different decision tree nodes).

LightGBM, in contrast, worked right away, so we decided to use it. The only confusing thing was the distribution of the predicted score. In the matrix factorization from the first level, we saw a distribution with a “heavy tail” from 0 to 1.

LightGBM had a much lower variance in the score distribution:

The chart shows that the model rarely predicts values close to one, which is strange. However, LightGBM’s ranking metrics were better, so we decided to ignore the oddities in the scoring distribution.

How to avoid data leakage

To understand which leakage we are talking about, let us recall how the first and second levels are connected in our hybrid recommender system:

The first level learns to pick out the top content with the maximum score
The content score and additional features are fed to the input of the second-level model

An important point: if you train the first level on the whole dataset and then train the second level on the same data, you will get data leakage. The matrix factorization score fed to the second level will already contain information about the target, so we use the following procedure:

We divide the training data into factorization_train (90%) and boosting_train_set (10%)
Train the first level (matrix factorization) on factorization_train
Predict matrix factorization scores for boosting_train_set
Train boosting on boosting_train_set with the addition of tabular and statistical features
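The steps above can be sketched as follows. Several things here are assumptions made to keep the example self-contained: the split is done by random shuffling (the procedure above only fixes the 90/10 proportion), and `fit_first_level` is a trivial popularity counter standing in for the real matrix factorization.

```python
import random
from collections import Counter


def split_for_two_levels(interactions, boosting_share=0.1, seed=0):
    """Split interactions into disjoint first-level and second-level training parts."""
    rows = list(interactions)
    random.Random(seed).shuffle(rows)
    n_boost = round(len(rows) * boosting_share)
    return rows[n_boost:], rows[:n_boost]


def fit_first_level(rows):
    """Stand-in for matrix factorization: score = positives per item seen in `rows`."""
    positives = Counter(item_id for _, item_id, label in rows if label == 1)
    return lambda user_id, item_id: positives[item_id]


# Hypothetical (user_id, item_id, label) interactions.
interactions = [(f"u{i % 10}", f"m{i % 7}", i % 2) for i in range(100)]

factorization_train, boosting_train_set = split_for_two_levels(interactions)

# The first level is fitted on factorization_train only ...
first_level_score = fit_first_level(factorization_train)

# ... and its scores are computed for rows it has never seen.
boosting_rows = [
    {"first_level_score": first_level_score(user_id, item_id), "label": label}
    for user_id, item_id, label in boosting_train_set
]
```

Because no row of `boosting_train_set` took part in fitting the first level, the `first_level_score` column cannot encode the label of the row it is attached to. The tabular and statistical features would be added to `boosting_rows` as extra keys before training the boosting model.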

Here we encounter some difficulties: we need to train on 100% of the interactions in production. Otherwise, we will lose important information about the user’s preferences. There are two options here:

Train in parts (first on factorization_train, then a separate batch on boosting_train_set)
Train first the entire cycle of the two-level model, then separately the first level for production

Both ways have disadvantages: in the first, we cannot be sure that the model weights will converge to the correct values; in the second, we spend twice as many hardware resources on training. We chose the second way to minimize interference with the existing model training pipeline.

Experimental results

The results of the test run were quite discouraging: we did not see any growth in engagement metrics in the test group (the chart shows smile_rate over several days). Why did this happen?

The main advantage of gradient boosting is that you can use many additional features for ranking. Because of architectural constraints, we added only a few features, so we agreed to roll back the experiment and make improvements that will let us add and test new features in the model quickly.

Where to move next

Now we are developing the recommendation system, moving in two directions: improving the architecture of the system and the ranking algorithms.

Architectural improvements:

Create a Feature Store service to ensure that the same features that were used in training are used in inference
Bring the LightGBM model into a separate microservice (we have ideas to try ONNX)

Improving algorithms:

Add more features, both statistical and tabular (e.g., picture embeddings from a convolutional network, which we call content embeddings)
Train on other targets (e.g., with a ranking loss)
Combine multiple sources for candidates (in addition to selection by matrix factorization scoring, you can also gather content by popularity, use other factorization models, such as ALS, or similarity by content embedding)
Change the offline validation scheme (maybe it is possible to find the best model parameters).
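As an illustration of combining several candidate sources, the sketch below interleaves ranked lists from different sources and removes duplicates. The round-robin strategy, the function name `merge_candidates`, and the source names are assumptions for the example; we have not settled on a merging strategy yet.

```python
from itertools import zip_longest


def merge_candidates(sources, limit):
    """Interleave ranked candidate lists round-robin, skipping duplicates."""
    seen = set()
    merged = []
    # zip_longest pads shorter lists with None so that longer ones are not cut off.
    for group in zip_longest(*sources):
        for item_id in group:
            if item_id is None or item_id in seen:
                continue
            seen.add(item_id)
            merged.append(item_id)
            if len(merged) == limit:
                return merged
    return merged


# Hypothetical ranked candidates from three sources.
by_factorization = ["a", "b", "c"]
by_popularity = ["b", "d"]
by_content_embedding = ["e", "a", "f"]

candidates = merge_candidates(
    [by_factorization, by_popularity, by_content_embedding],
    limit=5,
)
```

Round-robin gives every source a chance to contribute its best items before any source contributes its second-best, which keeps one dominant source from crowding out the others. The merged list would then go to the second-level model for ranking.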

The next articles in this series will describe these improvements in detail and their impact on business metrics.