Adding Reproducible Randomization to SQL Queries

Or How I Spent an Afternoon Building Out More Creases for Extensibility in Forem’s Feed Query

In DEV’s relevance feed, we utilize the Articles::Feeds::VariantQuery to perform the correct query for the current DEV feed experiment variants. For those curious, you can look to at the ./config/field_test.yml at github.com/forem/forem.

For each variant, we choose a order by lever. we had two order by levers:

  • relevancy_score_and_publication_date
  • final_order_by_random_weighted_to_score

The one I want to write about is the final_order_by_random_weighted_to_score. It had the following SQL fragment: RANDOM() ^ (1.0 / greatest(articles.score, 0.1)) DESC. Let’s ignore everything except the RANDOM(). I wrote the following Github issue to enumerate the problem: As a person who wants to explore different sort orders I want a bit of control over fully random numbers. The consequence of having this RANDOM() is time you run the query, you get a new random number. The impact is that depending on the use of RANDOM() when you refresh the page you could see significantly different results.

Yet there is utility in randomization. What to do?

A BASIC Sidebar

In junior high, my dad got a new computer for the family. I started writing some BASIC programs to simulate a simple siege game I had created. The castle defenders attempted to ward off the assailants. It involved lots of randomization. Every time I ran the simulation, I got the exact same result.

What I didn’t realize was that I needed to first seed the randomizer with a random seed. Without that randomized seed, my version of BASIC used the same seed. This meant that when I called the RANDOM function in BASIC, the sequence was always the same.

My dice were always going to return the sequence of results. But this was well before Google and household internet. So I remained stumped until a year later when I stumbled upon a BASIC manual and had an aha moment.

Back to the Query at Hand

I had two task at hand:

  • Create a result set that had a repeatable sequence of random numbers.
  • “Join” that result set into the existing Articles::Feeds::VariantQuery implementation.

When working with SQL, I always strive to build from the inside out.

Create a Result Set That Had a Repeatable Sequence of Random Numbers

To explore this, I set about writing the simplest query I could make. After some fits and starts, I settled on the following query.

WITH seeder AS (SELECT setseed(0.5))
SELECT *, RANDOM() AS randomized_value
FROM generate_series(1,10), seeder

The above query is logically equivalent to the query below.

SELECT *, RANDOM() AS randomized_value
FROM generate_series(1,10),
  (SELECT setseed(0.5)) as seeder

Both have the result set of:

Table 238: Result set of above SQL queries
generate_seriessetseedrandomized_value
1empty string0.2499104186659835
2empty string0.520017612227381
3empty string0.4611753978720401
4empty string0.22773722382948947
5empty string0.18489966987045392
6empty string0.06851542705898694
7empty string0.12457768622157062
8empty string0.2790778552363484
9empty string0.14696090303700515
10empty string0.6336384228449994

The above queries may warrant a bit of discussion. First, it is rare for me to SELECT FROM two “tables” without a join condition. The generate_series and setseed column come from the generate_series(1,10) and seeder “tables.” The randomized_value is the result of the RANDOM() function.

Now, each time I run it with setseed(0.5) I get the same randomized_value. If I change the 0.5 to 0.75 I get different values, but again each time I run the query with the seed value, I get the same results.

I had my proof of concept. Next came the hard part. Integrating it into the complicated query.

“Join” That Result Set into the Existing Articles::Feeds::Variantquery Implementation

This next step took a bit. The primary challenge was in thinking about the result sets.

Conceptually there are two queries happening for the feed:

Inner Query
Get a subset of relevant articles for the given user.
Outer Query
Order that subset of relevant articles for the given user.

To achieve the desired goal, I needed to add the randomized_value to the list of “columns” for the inner query of relevant articles. With that, we could then do the sorting. My implementation constraint is ensuring that the the outer query would be an ActiveRecord::Relation; In sticking to that constraint, I could then use ActiveRecord’s query interface to chain includes, order, where statements. Rather powerful. My solution was to insert a middle query. Again, the inner query is responsible for retrieving the relevant subset. Then the middle query is responsible for appending the randomized_value column to the relevant subset. And finally the outer query could then sort using this new randomized_value instead of RANDOM().

Conclusion

When I started writing this post, I thought I might dump a bunch of SQL. But I did some grooming to highlight the difference.

The original query without the seeded randomizer
SELECT "articles".*
FROM "articles"
INNER JOIN (
      SELECT articles.id, 1 AS relevancy_score
      FROM articles
      LEFT OUTER JOIN user_blocks
	   ON user_blocks.blocked_id = articles.user_id
	   AND user_blocks.blocked_id IS NULL
	   AND user_blocks.blocker_id = 322
      WHERE articles.published = true
	    AND articles.published_at > '2022-05-19 20:12:37.777414'
	    AND articles.published_at < '2022-06-03 20:12:37.777493'
      GROUP BY articles.id, articles.published_at
      ORDER BY relevancy_score DESC,
	       articles.published_at DESC
      LIMIT 50
) AS article_relevancies
  ON articles.id = article_relevancies.id
ORDER BY RANDOM() ^ (1.0 / greatest(articles.score, 0.1)) DESC
The updated query using the seeded randomizer
SELECT "articles".*
FROM "articles"
INNER JOIN (
      WITH seeder AS (SELECT setseed(0.5))
      SELECT inner_article_relevancies.id,
	     inner_article_relevancies.relevancy_score,
	     RANDOM() AS randomized_value
      FROM seeder,
	   (SELECT articles.id, 1 AS relevancy_score
	    FROM articles
	    LEFT OUTER JOIN user_blocks
	    ON user_blocks.blocked_id = articles.user_id
	       AND user_blocks.blocked_id IS NULL
	       AND user_blocks.blocker_id = 322
	     WHERE articles.published = true
	     AND articles.published_at > '2022-05-19 20:12:37.777414'
	     AND articles.published_at < '2022-06-03 20:12:37.777493'
	     GROUP BY articles.id, articles.published_at
	     ORDER BY relevancy_score DESC,
	     articles.published_at DESC
	     LIMIT 50) AS inner_article_relevancies
) AS article_relevancies
  ON articles.id = article_relevancies.id
ORDER BY article_relevancies.randomized_value ^
           (1.0 / greatest(articles.score, 0.1)) DESC

The main difference between the two? The randomization moves from the outer query to the inner query.