In DEV’s relevance feed, we utilize the Articles::Feeds::VariantQuery to perform the correct query for the current DEV feed experiment variants. For those curious, you can look to at the ./config/field_test.yml at github.com/forem/forem.
For each variant, we choose a order by lever. we had two order by levers:
The one I want to write about is the
final_order_by_random_weighted_to_score. It had the following SQL fragment:
RANDOM() ^ (1.0 / greatest(articles.score, 0.1)) DESC. Let’s ignore everything except the
I wrote the following Github issue to enumerate the problem: As a person who wants to explore different sort orders I want a bit of control over fully random numbers.
The consequence of having this
RANDOM() is time you run the query, you get a new random number. The impact is that depending on the use of
RANDOM() when you refresh the page you could see significantly different results.
Yet there is utility in randomization. What to do?
A BASIC Sidebar
In junior high, my dad got a new computer for the family. I started writing some BASIC programs to simulate a simple siege game I had created. The castle defenders attempted to ward off the assailants. It involved lots of randomization. Every time I ran the simulation, I got the exact same result.
What I didn’t realize was that I needed to first seed the randomizer with a random seed. Without that randomized seed, my version of BASIC used the same seed. This meant that when I called the
RANDOM function in BASIC, the sequence was always the same.
My dice were always going to return the sequence of results. But this was well before Google and household internet. So I remained stumped until a year later when I stumbled upon a BASIC manual and had an aha moment.
Back to the Query at Hand
I had two task at hand:
- Create a result set that had a repeatable sequence of random numbers.
- “Join” that result set into the existing
When working with SQL, I always strive to build from the inside out.
Create a Result Set That Had a Repeatable Sequence of Random Numbers
To explore this, I set about writing the simplest query I could make. After some fits and starts, I settled on the following query.
WITH seeder AS (SELECT setseed(0.5)) SELECT *, RANDOM() AS randomized_value FROM generate_series(1,10), seeder
The above query is logically equivalent to the query below.
SELECT *, RANDOM() AS randomized_value FROM generate_series(1,10), (SELECT setseed(0.5)) as seeder
Both have the result set of:
The above queries may warrant a bit of discussion. First, it is rare for me to
SELECT FROM two “tables” without a join condition. The
setseed column come from the
seeder “tables.” The
randomized_value is the result of the
Now, each time I run it with
setseed(0.5) I get the same randomized_value. If I change the
0.75 I get different values, but again each time I run the query with the seed value, I get the same results.
I had my proof of concept. Next came the hard part. Integrating it into the complicated query.
“Join” That Result Set into the Existing
This next step took a bit. The primary challenge was in thinking about the result sets.
Conceptually there are two queries happening for the feed:
- Inner Query
- Get a subset of relevant articles for the given user.
- Outer Query
- Order that subset of relevant articles for the given user.
To achieve the desired goal, I needed to add the randomized_value to the list of “columns” for the inner query of relevant articles. With that, we could then do the sorting.
My implementation constraint is ensuring that the the outer query would be an
ActiveRecord::Relation; In sticking to that constraint, I could then use ActiveRecord’s query interface to chain
where statements. Rather powerful.
My solution was to insert a middle query. Again, the inner query is responsible for retrieving the relevant subset. Then the middle query is responsible for appending the
randomized_value column to the relevant subset. And finally the outer query could then sort using this new
randomized_value instead of
When I started writing this post, I thought I might dump a bunch of SQL. But I did some grooming to highlight the difference.
The original query without the seeded randomizer
SELECT "articles".* FROM "articles" INNER JOIN ( SELECT articles.id, 1 AS relevancy_score FROM articles LEFT OUTER JOIN user_blocks ON user_blocks.blocked_id = articles.user_id AND user_blocks.blocked_id IS NULL AND user_blocks.blocker_id = 322 WHERE articles.published = true AND articles.published_at > '2022-05-19 20:12:37.777414' AND articles.published_at < '2022-06-03 20:12:37.777493' GROUP BY articles.id, articles.published_at ORDER BY relevancy_score DESC, articles.published_at DESC LIMIT 50 ) AS article_relevancies ON articles.id = article_relevancies.id ORDER BY RANDOM() ^ (1.0 / greatest(articles.score, 0.1)) DESC
The updated query using the seeded randomizer
SELECT "articles".* FROM "articles" INNER JOIN ( WITH seeder AS (SELECT setseed(0.5)) SELECT inner_article_relevancies.id, inner_article_relevancies.relevancy_score, RANDOM() AS randomized_value FROM seeder, (SELECT articles.id, 1 AS relevancy_score FROM articles LEFT OUTER JOIN user_blocks ON user_blocks.blocked_id = articles.user_id AND user_blocks.blocked_id IS NULL AND user_blocks.blocker_id = 322 WHERE articles.published = true AND articles.published_at > '2022-05-19 20:12:37.777414' AND articles.published_at < '2022-06-03 20:12:37.777493' GROUP BY articles.id, articles.published_at ORDER BY relevancy_score DESC, articles.published_at DESC LIMIT 50) AS inner_article_relevancies ) AS article_relevancies ON articles.id = article_relevancies.id ORDER BY article_relevancies.randomized_value ^ (1.0 / greatest(articles.score, 0.1)) DESC
The main difference between the two? The randomization moves from the outer query to the inner query.