In DEV’s relevance feed, we utilize the Articles::Feeds::VariantQuery to perform the correct query for the current DEV feed experiment variants. For those curious, you can look to at the ./config/field_test.yml at github.com/forem/forem.
For each variant, we choose a order by lever. we had two order by levers:
relevancy_score_and_publication_date
final_order_by_random_weighted_to_score
The one I want to write about is the final_order_by_random_weighted_to_score
. It had the following SQL fragment: RANDOM() ^ (1.0 / greatest(articles.score, 0.1)) DESC
. Let’s ignore everything except the RANDOM()
.
I wrote the following Github issue to enumerate the problem: As a person who wants to explore different sort orders I want a bit of control over fully random numbers.
The consequence of having this RANDOM()
is time you run the query, you get a new random number. The impact is that depending on the use of RANDOM()
when you refresh the page you could see significantly different results.
Yet there is utility in randomization. What to do?
A BASIC Sidebar
In junior high, my dad got a new computer for the family. I started writing some BASIC programs to simulate a simple siege game I had created. The castle defenders attempted to ward off the assailants. It involved lots of randomization. Every time I ran the simulation, I got the exact same result.
What I didn’t realize was that I needed to first seed the randomizer with a random seed. Without that randomized seed, my version of BASIC used the same seed. This meant that when I called the RANDOM
function in BASIC, the sequence was always the same.
My dice were always going to return the sequence of results. But this was well before Google and household internet. So I remained stumped until a year later when I stumbled upon a BASIC manual and had an aha moment.
Back to the Query at Hand
I had two task at hand:
- Create a result set that had a repeatable sequence of random numbers.
- “Join” that result set into the existing
Articles::Feeds::VariantQuery
implementation.
When working with SQL, I always strive to build from the inside out.
Create a Result Set That Had a Repeatable Sequence of Random Numbers
To explore this, I set about writing the simplest query I could make. After some fits and starts, I settled on the following query.
WITH seeder AS (SELECT setseed(0.5))
SELECT *, RANDOM() AS randomized_value
FROM generate_series(1,10), seeder
The above query is logically equivalent to the query below.
SELECT *, RANDOM() AS randomized_value
FROM generate_series(1,10),
(SELECT setseed(0.5)) as seeder
Both have the result set of:
generate_series | setseed | randomized_value |
---|---|---|
1 | empty string | 0.2499104186659835 |
2 | empty string | 0.520017612227381 |
3 | empty string | 0.4611753978720401 |
4 | empty string | 0.22773722382948947 |
5 | empty string | 0.18489966987045392 |
6 | empty string | 0.06851542705898694 |
7 | empty string | 0.12457768622157062 |
8 | empty string | 0.2790778552363484 |
9 | empty string | 0.14696090303700515 |
10 | empty string | 0.6336384228449994 |
The above queries may warrant a bit of discussion. First, it is rare for me to SELECT FROM
two “tables” without a join condition. The generate_series
and setseed
column come from the generate_series(1,10)
and seeder
“tables.” The randomized_value
is the result of the RANDOM()
function.
Now, each time I run it with setseed(0.5)
I get the same randomized_value. If I change the 0.5
to 0.75
I get different values, but again each time I run the query with the seed value, I get the same results.
I had my proof of concept. Next came the hard part. Integrating it into the complicated query.
“Join” That Result Set into the Existing Articles::Feeds::Variantquery
Implementation
This next step took a bit. The primary challenge was in thinking about the result sets.
Conceptually there are two queries happening for the feed:
- Inner Query
- Get a subset of relevant articles for the given user.
- Outer Query
- Order that subset of relevant articles for the given user.
To achieve the desired goal, I needed to add the randomized_value to the list of “columns” for the inner query of relevant articles. With that, we could then do the sorting.
My implementation constraint is ensuring that the the outer query would be an ActiveRecord::Relation
; In sticking to that constraint, I could then use ActiveRecord’s query interface to chain includes
, order
, where
statements. Rather powerful.
My solution was to insert a middle query. Again, the inner query is responsible for retrieving the relevant subset. Then the middle query is responsible for appending the randomized_value
column to the relevant subset. And finally the outer query could then sort using this new randomized_value
instead of RANDOM()
.
Conclusion
When I started writing this post, I thought I might dump a bunch of SQL. But I did some grooming to highlight the difference.
The original query without the seeded randomizer
SELECT "articles".*
FROM "articles"
INNER JOIN (
SELECT articles.id, 1 AS relevancy_score
FROM articles
LEFT OUTER JOIN user_blocks
ON user_blocks.blocked_id = articles.user_id
AND user_blocks.blocked_id IS NULL
AND user_blocks.blocker_id = 322
WHERE articles.published = true
AND articles.published_at > '2022-05-19 20:12:37.777414'
AND articles.published_at < '2022-06-03 20:12:37.777493'
GROUP BY articles.id, articles.published_at
ORDER BY relevancy_score DESC,
articles.published_at DESC
LIMIT 50
) AS article_relevancies
ON articles.id = article_relevancies.id
ORDER BY RANDOM() ^ (1.0 / greatest(articles.score, 0.1)) DESC
The updated query using the seeded randomizer
SELECT "articles".*
FROM "articles"
INNER JOIN (
WITH seeder AS (SELECT setseed(0.5))
SELECT inner_article_relevancies.id,
inner_article_relevancies.relevancy_score,
RANDOM() AS randomized_value
FROM seeder,
(SELECT articles.id, 1 AS relevancy_score
FROM articles
LEFT OUTER JOIN user_blocks
ON user_blocks.blocked_id = articles.user_id
AND user_blocks.blocked_id IS NULL
AND user_blocks.blocker_id = 322
WHERE articles.published = true
AND articles.published_at > '2022-05-19 20:12:37.777414'
AND articles.published_at < '2022-06-03 20:12:37.777493'
GROUP BY articles.id, articles.published_at
ORDER BY relevancy_score DESC,
articles.published_at DESC
LIMIT 50) AS inner_article_relevancies
) AS article_relevancies
ON articles.id = article_relevancies.id
ORDER BY article_relevancies.randomized_value ^
(1.0 / greatest(articles.score, 0.1)) DESC
The main difference between the two? The randomization moves from the outer query to the inner query.