Earlier I wrote Practicing Postgresql and Postulating (Im)Provements hinting at some possible changes regarding Forem’s feed algorithm.
This past week, I’ve been iterating on a pull request, which at the time of writing is still a Work in Progress (WIP).
The core concept I introduced was the
Articles::Feeds::WeightedQueryStrategy class: a documented and configurable query strategy. Because of the WIP nature, I’m including a link to a GitHub Gist that is the state of the code as of .
Class-level documentation for
This is an experimental object that we’re refining to be a competitor to the existing feed strategies.
It works to implement conceptual parity with two methods of Articles::Feeds::LargeForemExperimental:
What do we mean by “conceptual parity”? Those two methods are used in the two feeds controllers: StoriesController and Stories::FeedsController. And while they use some of the internal tooling, there are some notable, subtle differences.
Where this class differs is that it aims to build the feed from the given user’s perspective, whereas the other feed algorithm starts with a list of candidates that are global to the given Forem (e.g., starting the base query from the
articles.score, a volatile and swingy value that favors global reactions over user-desired content).
This is not quite a chronological only feed but could be easily modified to favor that.
@note One possible shortcoming is that the query does not account for the Forem’s administrators.
@note For those considering extending this, be very mindful of Structured Query Language (SQL) injection.
As part of the development process, I extracted the configurable scoring methods.
Top-level documentation for scoring methods
This constant defines the allowable relevance scoring methods.
A scoring method should be a SQL fragment that produces a value between 0 and 1. The closer the value is to 1, the more relevant the article is for the given user. Note: the values are multiplicative. Make sure to consider if you want a 0 multiplier for your score. Aspirationally, you may want to think of the relevance_score as the range (0,1]. That is greater than 0 and less than or equal to 1.
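Because the factors multiply, a single zero wipes out every other signal. Here is a minimal sketch in plain Ruby; the factor names and numbers are made up for illustration and are not taken from the pull request.

```ruby
# Hypothetical per-article factors, each in the range (0, 1].
# The names and numbers are illustrative only.
factors = {
  freshness: 0.9,
  following_author: 1.0,
  matching_tags: 0.75
}

# The relevance score is the product of every factor.
def relevance_score(factors)
  factors.values.reduce(1.0) { |score, factor| score * factor }
end

score = relevance_score(factors)

# A single zero factor removes the article from contention,
# regardless of how well it scores elsewhere.
zeroed = relevance_score(factors.merge(spam: 0.0))
```

Keeping every factor strictly above zero means a poor score on one method demotes an article without erasing what the other methods say about it.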
In addition, as part of initialization, the caller can configure each of the scoring methods cases and fallback.
Each scoring method has the following keys:
- The SQL clause statement; note: there exists a coupling between the clause and the SQL fragments that join the various tables. Also, under no circumstances should you allow any user value for this, as it is not something we can sanitize.
- An Array of Arrays, the first value is what matches the clause, the second value is the multiplicative factor.
- When no case is matched use this factor.
- Does this scoring method require a given user? If it does, don't use it when we don't have a user (that is, when the user is nil).
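To make the shape of a scoring method concrete, here is one plausible way such a configuration could become a SQL CASE fragment. This is a sketch and not the code from the pull request: the key names (:clause, :cases, :fallback, :requires_user), the helper names, and the sample values are all assumptions.

```ruby
# A hypothetical scoring method. The key names (:clause, :cases,
# :fallback, :requires_user) are assumptions for illustration.
SCORING_METHOD = {
  clause: "COUNT(comments.id)",
  cases: [[0, 0.8], [1, 0.9]],
  fallback: 1.0,
  requires_user: false
}.freeze

# Build a SQL CASE fragment from one scoring method.
# Integer() and Float() raise on anything non-numeric, so only
# the developer-authored clause reaches the SQL as raw text.
def case_fragment(clause:, cases:, fallback:, **)
  whens = cases.map do |match, factor|
    "WHEN #{Integer(match)} THEN #{Float(factor)}"
  end
  "(CASE #{clause} #{whens.join(' ')} ELSE #{Float(fallback)} END)"
end

# Skip user-dependent methods when there is no user.
def applicable?(scoring_method, user)
  !(scoring_method[:requires_user] && user.nil?)
end

fragment = case_fragment(**SCORING_METHOD)
```

Coercing the case values and factors to numbers is one way to keep the only free-form SQL in the clause, which must come from the developer and never from a user.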
The configurable options as of are:
- Weight to give based on the age of the article.
- Weight to give for the number of comments on the article from other users that the given user follows.
- Weight to give to the number of comments on the article.
- Weight to give based on the difference between experience level of the article and given user.
- Weight to give when the given user follows the article's author.
- Weight to give when the given user follows the article's organization.
- Weight to give an article based on its most recent comment.
- Weight to give for the number of intersecting tags the given user follows and the article has.
- Weight to give for the number of reactions on the article.
- Weight to give based on spamminess of the article.
I’ve structured the code to allow for the initializer of the
Articles::Feeds::WeightedQueryStrategy to configure which methods to use as well as the factors. The idea being that we can easily iterate on feed refinement and even open the door for site-wide configuration of these scoring methods.
But that’s a future exercise.
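As a rough illustration of the initializer idea, here is a toy class that merges caller overrides into default scoring methods and rejects unknown method names. The class name, constant, method names, and factors are all hypothetical and do not mirror the real implementation.

```ruby
# Hypothetical defaults; the method names and factors are made up.
DEFAULT_SCORING_METHODS = {
  comments_count: { cases: [[0, 0.8]], fallback: 1.0 },
  reactions_count: { cases: [[0, 0.9]], fallback: 1.0 }
}.freeze

# A toy stand-in for the strategy's initializer, not the real class.
class ToyStrategy
  attr_reader :scoring_methods

  # overrides: a Hash of method name => replacement settings.
  # Unknown names are rejected so only allow-listed methods run.
  def initialize(overrides = {})
    unknown = overrides.keys - DEFAULT_SCORING_METHODS.keys
    raise ArgumentError, "unknown: #{unknown.inspect}" if unknown.any?

    @scoring_methods = DEFAULT_SCORING_METHODS.merge(overrides) do |_name, default, custom|
      default.merge(custom)
    end
  end
end

strategy = ToyStrategy.new(comments_count: { fallback: 0.5 })
```

Only the settings a caller names are replaced; everything else keeps its default, which keeps experiments with individual factors small and reviewable.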
Again, the goal of these scoring methods is to rank articles against the user’s apparent preferences. Astute readers and those following the code will notice that I have taken no account of the weights a user has given to tags.
When I was iterating on the implementation, I was writing and refining lots of SQL. I wrote a handful of RSpec specs that verified I had valid queries. And as I drew closer to testing this in the user interface (UI), I knew that I had one significant problem to address.
I wanted to return an
ActiveRecord::Relation object; that’s an object from which you can chain
ActiveRecord::Base.scope calls and other
The reason being that I really wanted to re-use two
In the case of
.limited_column_select, I wanted to ensure that I wasn’t returning all of the columns from the
articles table. I wanted to avoid duplicating the knowledge of what fields should be included in the result set.
A lot of things have been written about Don’t Repeat Yourself (DRY) principles. But I believe it’s more important to focus on “Don’t Repeat Knowledge”.
More important was wanting to re-use
.includes(top_comments: :user). Without those eager includes, when it came time to render the feed, each result would query the top comment and its associated user. So the naive implementation would result in 2N+1 queries, where N was the number of articles in the result set.
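The 2N+1 arithmetic can be seen with a toy query counter. This is plain Ruby standing in for a database, not ActiveRecord; the class, its methods, and the sample data are invented for the illustration.

```ruby
# A toy in-memory "database" that counts queries.
# This is plain Ruby, not ActiveRecord.
class ToyDb
  attr_reader :query_count

  # article id => top comment; user id => name (sample data)
  COMMENTS = { 1 => { id: 10, user_id: 100 }, 2 => { id: 20, user_id: 200 } }.freeze
  USERS = { 100 => "ann", 200 => "bob" }.freeze

  def initialize
    @query_count = 0
  end

  def articles
    @query_count += 1
    COMMENTS.keys
  end

  def top_comments_for(article_ids)
    @query_count += 1
    COMMENTS.slice(*article_ids)
  end

  def users_for(user_ids)
    @query_count += 1
    USERS.slice(*user_ids)
  end
end

# Naive: one query for articles, then two per article.
naive = ToyDb.new
naive.articles.each do |id|
  comment = naive.top_comments_for([id]).fetch(id)
  naive.users_for([comment[:user_id]])
end

# Eager: three queries total, regardless of article count.
eager = ToyDb.new
ids = eager.articles
comments = eager.top_comments_for(ids)
eager.users_for(comments.values.map { |c| c[:user_id] })
```

With two articles the naive loop issues 2 × 2 + 1 queries while the batched version issues three, and the gap widens with every article added to the feed.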
My solution was to perform some
ActiveRecord antics. I had previously done extensive antics in my query implementations of Sipity. Fortunately, for the Forem implementation I didn’t need to dive deep into Arel.
Below is the part of the implementation on which I want to focus; the preceding numbers are the line numbers I’ll use in the example. Remember you can see the whole class at this Gist.
1 Article.where(
2   Article.arel_table[:id].in(
3     Arel.sql(
4       Article.sanitize_sql(unsanitized_sub_sql)
5     )
6   )
7 ).limited_column_select.includes(top_comments: :user).order(published_at: :desc)
Let’s work from the inside out.
Starting with line 4:
unsanitized_sub_sql is the SQL that is built from the scoring method configuration and the necessary
INNER JOIN and
LEFT OUTER JOIN to build the query.
It is a non-trivial query, and writing it mostly by hand made this easier to implement.
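The Article.sanitize_sql call is there to sanitize the output. Note that sanitize_sql only escapes bind values supplied in an Array; a plain String is returned unchanged, so the SQL fragments themselves must never contain user input.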
I believe my implementation has avoided any SQL injection vectors.
Moving out to line 3:
Arel.sql marks the resulting string as safe SQL.
Without this step, the
sanitized_sql result will be treated as
Line 3 returns a valid sub-query with a SQL select clause of
Moving out to line 2: the
Article.arel_table[:id].in resolves to the SQL fragment:
articles.id IN (sub-query).
Line 2 is the inflection point that moves us from hand-written SQL into the
ActiveRecord::Querying module space.
Moving out to line 1: this is where we now use the
ActiveRecord::Querying.where method, which is very familiar to Ruby on Rails (Rails) developers.
Moving on to line 7: because we now have an ActiveRecord::Relation object, we can chain model scopes and get all of the ActiveRecord goodness.
There’s quite a lot going on with my proposed change, but I wanted to share two relevant bits that might make you curious to take a look at the implementation details.
I’m uncertain when we’ll be deploying this change, as I need to wire in some performance instrumentation to compare against the original strategy.
All of this is in service of trying to improve the baseline feed experience and bring some further insight and clarity into how things make it into a given user’s feed.