Most articles about multi-armed bandit algorithms are too academic: they are full of formulas and seem to assume a fixed set of arms to pull and n→∞ attempts. I will try to describe these algorithms from the point of view of an ordinary developer, taking into account the real conditions under which our meme-recommendation application iFunny operates.
Choosing a suitable algorithm involves long and painstaking work by analysts: a huge number of hypotheses, A/B tests, business metrics and graphs, and a search for compromises. The developer then has to implement it all so that it works efficiently and quickly, ideally without having to stand up a separate data center. Hopefully, this article will help developers with the last two problems.
We use bandits to select the best push notifications and solve other problems in the field of content recommendation. Since this is an article about a specific use case, I will also use terms from our area of expertise:
Roughly, the task of the multi-armed bandit is as follows: not knowing in advance how good the content that users upload to our application is, we need to maximize the final smiley rate. However, this formulation is good only from the point of view of the average content consumer. In real life, everything is somewhat more complicated.
There are many ins and outs. For example, there are content creators who care about attention (i.e., views). Personalization tools need to collect enough data on all content, not just the top items (we can't afford to miss good content just because it got a bit unlucky). And so on.
There's a little more context. In our iFunny app, tens of thousands of pieces of content are in rotation simultaneously. New content constantly arrives at an average rate of one meme every couple of seconds, and old content is excluded from rotation at the same rate. In other words, our set of content is not static; if it were, any algorithm could sooner or later converge and begin solving the problem as efficiently as possible. Instead, it is a rather fluid structure that constantly receives new content we know nothing about, while memes with well-established smiley statistics are just as constantly removed.
Namely:
I will talk about the experience of using each of them. For clarity, I added animation and the final state of the full-scale experiment for each algorithm. The test itself was conducted according to the following methodology:
This is a deterministic algorithm: given the same input data, it will return the same response. Each piece of content is scored as follows:

score_i = x̄_i + C · sqrt(2 · ln(N) / n_i)

Where:

- x̄_i is the smiley rate of content i (smileys divided by views);
- n_i is the number of views of content i;
- N is the total number of views of all content in rotation;
- C is a coefficient that controls the amount of exploration.

As you can see, the left part determines the "profitability" of specific content, and the right part is responsible for exploration (it gives content with a small number of views a chance to show itself well). By changing the coefficient C, you can influence the balance between earning and exploration to some extent.
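A minimal sketch of this scoring in Python (the names and the zero-view guard are illustrative, not our production code):

```python
import math

def ucb1_score(smiles: int, views: int, total_views: int, c: float = 1.0) -> float:
    """UCB1: mean reward plus an exploration bonus that shrinks as views grow."""
    if views == 0:
        return float("inf")  # unseen content always gets tried first
    smiley_rate = smiles / views  # the "profitability" part
    exploration = c * math.sqrt(2 * math.log(total_views) / views)
    return smiley_rate + exploration
```

Content is then ranked by descending score, and the top is shown.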
Result:
Animation:
Pros:
Cons:
Thus, given the constant rotation, old high-quality content loses badly to new content with a small number of views.
Let’s take N=10,000 and C=1 to see how two different groups of content behave.
Content group 1:
Content group 2:
It would seem that maintaining such a smiley rate over 1,000 impressions is a sign of quality, yet this content has practically no chance of being shown to the user.
Let’s calculate the minimum smiley rate that content with 500 views must possess in order to defeat an honored “veteran”:
Only 25 smileys per 500 views are needed. That is objectively twice as bad, but the content is fresh, so it wins. Then again, the essence of the multi-armed bandit task is precisely that a reasonable balance must be maintained between "earning" on good content and exploring new content.
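To make this concrete, here is a quick check using the ucb1_score sketch above (the veteran's exact statistics of 100 smileys per 1,000 views are my assumption, consistent with the "twice as bad" comparison):

```python
total_views = 10_000  # N from the example above

# A hypothetical "veteran": a solid 10% smiley rate over many views.
veteran = ucb1_score(smiles=100, views=1_000, total_views=total_views, c=1.0)

# A newcomer with half the smiley rate but far fewer views.
newcomer = ucb1_score(smiles=25, views=500, total_views=total_views, c=1.0)

print(f"{veteran:.4f} vs {newcomer:.4f}")  # ~0.2357 vs ~0.2419: the newcomer wins
```

The exploration bonus more than compensates for the halved smiley rate. Let's move on!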
The essence of the algorithm is extremely simple: there is a certain threshold value (epsilon). We roll the dice and, if the result is greater than this value, we display the top of the content sorted by smiley rate; if it is less, we switch to exploration and display random content.
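A sketch of the selection step in Python (the epsilon value and the (smiles, views) tuple representation are illustrative assumptions):

```python
import random

def pick_content(contents: list[tuple[int, int]], epsilon: float = 0.1):
    """contents: (smiles, views) pairs; epsilon: the share of exploration."""
    if random.random() < epsilon:
        return random.choice(contents)  # explore: show random content
    # exploit: show the item with the best raw smiley rate
    return max(contents, key=lambda c: c[0] / c[1] if c[1] else 0.0)
```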
Result:
Animation:
Pros:
Cons:
The algorithm is very good at bringing really high-quality content to the top, but it makes no distinction between "good" and "bad" within the much larger remaining group of memes. The result is a very uneven user feed, where top-end content is mixed with outright garbage in fairly large quantities, which does the viewing depth no good.
**Since the weight of the content is calculated in an extremely primitive way**, it does not pessimize content with a large number of views and gives equal chances to both the verified 100/1,000 and the upstart newcomer 1/10. The situation turns out to be the opposite of UCB1: here it is the potentially good content with fewer views that always loses.
Now let's give each one an additional (smiley-less) view. It seems like a trifle, but the results are as follows: the veteran drops to 100/1,001 ≈ 0.0999, while the newcomer drops to 1/11 ≈ 0.0909. Oops. The newcomer falls out of the top, and all its hope now rests on random display. Since we use constant rotation of content, this is a rather unpleasant property for us.
Like UCB1, it is a deterministic algorithm.
Where:
For the multi-armed bandit problem, the upper bound of the confidence interval is used. As a nice bonus, the lower bound can be used to obtain the most reliable, "best of the best" top.
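As an illustration, here is a sketch using the Wilson score interval, a common choice of confidence interval for a binomial proportion such as a smiley rate (the choice of interval and the z value are my assumptions, not necessarily the exact formula used in production):

```python
import math

def wilson_bounds(smiles: int, views: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 ~ 95% confidence)."""
    if views == 0:
        return 0.0, 1.0  # total uncertainty about unseen content
    p = smiles / views
    denom = 1 + z * z / views
    center = (p + z * z / (2 * views)) / denom
    margin = z * math.sqrt(p * (1 - p) / views + z * z / (4 * views * views)) / denom
    return center - margin, center + margin
```

Ranking by the upper bound drives the bandit; ranking by the lower bound yields the conservative top.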
Result:
Animation:
Pros:
Cons:
I think you noticed a rather strange final distribution that looks like a fence. This happens because the algorithm is strictly deterministic and depends solely on the statistics of the content itself: after a large number of iterations, content with an equal number of smileys will also end up with an equal number of views.
In our particular circumstances, this is just a feature.
It is the most mathematically complex algorithm. In simple terms, for each piece of content we build a beta distribution and draw a random sample from it.
I will not reproduce the formulas here, since the Internet is full of articles about this algorithm of varying degrees of scientific rigor. But graphically it looks like this:
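A minimal sketch of the sampling step in Python (the +1 terms correspond to a uniform Beta(1, 1) prior; treat that prior as an assumption rather than our exact production choice):

```python
import random

def thompson_score(smiles: int, views: int) -> float:
    """One random draw from Beta(successes + 1, failures + 1)."""
    return random.betavariate(smiles + 1, views - smiles + 1)
```

Content is ranked by the sampled value: barely-seen items occasionally draw high values and get explored, while well-proven items usually win on merit.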
Result:
Pros:
Cons:
Life hack: the computational cost can be partially reduced by pre-computing the beta distributions for all reasonable smiley/view pairs.
Another life hack: you can cache sample values for all these pairs on a quantile grid with a step of, say, 0.01 (100 pre-computed values per pair). Then it starts to work quite quickly, at the cost of reduced accuracy and increased memory consumption. Suppose we believe that a reasonable number of views lies in the range 0…5,000 and a reasonable number of smileys in the range 0…500.
As a result, we get: 5,000 × 500 × 100 (samples) × 4 (bytes per float) ≈ 1 GB of memory.
Ultimately, it’s not so scary, but it’s up to you.
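A sketch of the pre-computation idea for a single smiley/view pair (the 0.01 quantile grid follows the estimate above; the scipy dependency and helper names are mine):

```python
import numpy as np
from scipy.stats import beta

SAMPLES = 100
# mid-point quantiles on a 0.01 grid: 0.005, 0.015, ..., 0.995
quantiles = (np.arange(SAMPLES) + 0.5) / SAMPLES

def precompute(smiles: int, views: int) -> np.ndarray:
    """100 pre-computed values of Beta(smiles + 1, views - smiles + 1)."""
    return beta.ppf(quantiles, smiles + 1, views - smiles + 1).astype(np.float32)

def cached_score(table: np.ndarray) -> float:
    """At serving time, Thompson sampling degrades to one random table lookup."""
    return float(table[np.random.randint(SAMPLES)])
```

Filling the full 5,001 × 501 table up front gives the ~1 GB figure above; a lazy cache keyed by (smiles, views) would trade memory for occasional recomputation.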