Earlier I wrote Practicing Postgresql and Postulating (Im)Provements hinting at some possible changes regarding Forem’s feed algorithm.
This past week, I’ve been iterating on a pull request, which at the time of writing is still a Work in Progress (WIP).
Introducing Articles::Feeds::WeightedQueryStrategy
The core concept I introduced was the Articles::Feeds::WeightedQueryStrategy class: a documented and configurable query strategy. Because of the WIP nature, I’m including a link to a GitHub Gist that is the state of the code as of .
Class-level documentation for Articles::Feeds::WeightedQueryStrategy
@api private
This is an experimental object that we’re refining to be a competitor to the existing feed strategies.
It works to implement conceptual parity with two methods of Articles::Feeds::LargeForemExperimental:
#default_home_feed
#more_comments_minimal_weight_randomized
What do we mean by “conceptual parity”? Those two methods are used in the two feeds controllers: StoriesController and Stories::FeedsController. And while they use some of the internal tooling, there are some notable, subtle differences.
Where this class differs is that it aims to build the feed from the given user’s perspective, whereas the other feed algorithm starts with a list of candidates that are global to the given Forem (e.g., starting the base query from the articles.score, a volatile and swingy value that favors global reactions over user-desired content).
This is not quite a chronological only feed but could be easily modified to favor that.
@note One possible shortcoming is that the query does not account for the Forem’s administrators.
@note For those considering extending this, be very mindful of Structured Query Language (SQL) injection.
Configurable Options
As part of the development process, I extracted the configurable scoring methods.
Top-level documentation for scoring methods
This constant defines the allowable relevance scoring methods.
A scoring method should be a SQL fragment that produces a value between 0 and 1. The closer the value is to 1, the more relevant the article is for the given user. Note: the values are multiplicative. Make sure to consider if you want a 0 multiplier for your score. Aspirationally, you may want to think of the relevance_score as the range (0,1]. That is greater than 0 and less than or equal to 1.
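To make the multiplicative behavior concrete, here is a minimal Ruby sketch. The factor names mirror the options discussed in this post, but the numeric values and the relevance_score helper are made up for illustration; the real strategy computes the product in SQL.

```ruby
# Multiply every factor together; each factor is expected to be in (0, 1].
def relevance_score(factors)
  factors.values.reduce(1.0) { |product, factor| product * factor }
end

# Hypothetical factor values for a single article (not Forem's defaults).
EXAMPLE_FACTORS = {
  daily_decay_factor: 0.9,
  following_author_factor: 1.0,
  matching_tags_factor: 0.5
}.freeze

puts relevance_score(EXAMPLE_FACTORS)

# One 0 factor zeroes the whole score, no matter how strong the others are.
puts relevance_score(EXAMPLE_FACTORS.merge(spaminess_factor: 0.0))
```

This is why a factor of 0 deserves a second thought: it acts as a hard filter rather than a penalty.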
In addition, as part of initialization, the caller can configure each scoring method’s cases and fallback.
Each scoring method has the following keys:
- clause
- The SQL clause statement; note: there exists a coupling between the clause and the SQL fragments that join the various tables. Also, under no circumstances should you allow any user value for this, as it is not something we can sanitize.
- cases
- An Array of Arrays; in each inner Array, the first value is what matches the clause and the second value is the multiplicative factor.
- fallback
- When no case is matched use this factor.
- requires_user
- Does this scoring method require a given user? If so, don't use it when the user is nil.
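The four keys above can be sketched in Ruby as follows. This is an illustration under assumptions, not the code from the pull request: the constant, the clause, the numbers, and both helper methods are hypothetical, and the sketch assumes that case matches are integers.

```ruby
# A hypothetical scoring method using the four keys described above.
# The clause and the numbers are illustrative, not copied from Forem.
EXAMPLE_SCORING_METHOD = {
  clause: "(current_date - published_at::date)",
  cases: [[0, 1.0], [1, 0.95], [2, 0.9]],
  fallback: 0.8,
  requires_user: false
}.freeze

# Build a SQL CASE expression from a scoring method. Matches and factors are
# coerced to numbers so that only the developer-written clause is raw SQL.
def case_fragment_for(scoring_method)
  whens = scoring_method.fetch(:cases).map do |(match, factor)|
    "WHEN #{Integer(match)} THEN #{Float(factor)}"
  end
  fallback = Float(scoring_method.fetch(:fallback))
  "(CASE #{scoring_method.fetch(:clause)} #{whens.join(' ')} ELSE #{fallback} END)"
end

# Skip user-dependent scoring methods when there is no user.
def applicable?(scoring_method, user)
  !scoring_method.fetch(:requires_user) || !user.nil?
end

puts case_fragment_for(EXAMPLE_SCORING_METHOD)
```

Coercing the cases and fallback with Integer() and Float() means a malformed value raises an error instead of ending up in the query.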
The configurable options as of are:
- daily_decay_factor
- Weight to give based on the age of the article.
- comment_count_by_those_followed_factor
- Weight to give for the number of comments on the article from other users that the given user follows.
- comments_count_factor
- Weight to give to the number of comments on the article.
- experience_factor
- Weight to give based on the difference between experience level of the article and given user.
- following_author_factor
- Weight to give when the given user follows the article's author.
- following_org_factor
- Weight to give when the given user follows the article's organization.
- latest_comment_factor
- Weight to give an article based on its most recent comment.
- matching_tags_factor
- Weight to give for the number of intersecting tags the given user follows and the article has.
- reactions_factor
- Weight to give for the number of reactions on the article.
- spaminess_factor
- Weight to give based on spaminess of the article.
I’ve structured the code to allow the initializer of the Articles::Feeds::WeightedQueryStrategy to configure which methods to use as well as the factors. The idea is that we can easily iterate on feed refinement and even open the door for site-wide configuration of these scoring methods.
But that’s a future exercise.
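As a rough sketch of that idea, an initializer could accept overrides for an allowlist of scoring methods. Everything here is an assumption for illustration: the ExampleWeightedStrategy class, its constant, the clauses, and the numbers are hypothetical and do not reflect the real class's interface.

```ruby
# Hypothetical stand-in for the strategy class; the real class lives in Forem.
class ExampleWeightedStrategy
  # The allowable scoring methods, with illustrative clauses and factors.
  DEFAULT_SCORING_METHODS = {
    daily_decay_factor: {
      clause: "(current_date - published_at::date)",
      cases: [[0, 1.0], [1, 0.95]],
      fallback: 0.8
    },
    reactions_factor: {
      clause: "articles.reactions_count",
      cases: [[0, 0.9], [1, 0.95]],
      fallback: 1.0
    }
  }.freeze

  attr_reader :scoring_methods

  # overrides: a Hash keyed by scoring method name. Naming a method selects
  # it; with no overrides, every default scoring method is used.
  def initialize(overrides = {})
    unknown = overrides.keys - DEFAULT_SCORING_METHODS.keys
    raise ArgumentError, "Unknown scoring methods: #{unknown.inspect}" if unknown.any?

    selected = overrides.empty? ? DEFAULT_SCORING_METHODS.keys : overrides.keys
    @scoring_methods = selected.to_h do |name|
      # Only :cases and :fallback may be overridden; the SQL clause never is.
      allowed = overrides.fetch(name, {}).slice(:cases, :fallback)
      [name, DEFAULT_SCORING_METHODS.fetch(name).merge(allowed)]
    end
  end
end

strategy = ExampleWeightedStrategy.new(daily_decay_factor: { fallback: 0.5 })
puts strategy.scoring_methods.inspect
```

Restricting overrides to the cases and fallback keeps the hand-written SQL clause out of reach of any caller-supplied value.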
Again, the goal of these scoring methods is to rank articles against the user’s apparent preferences. For astute readers and those following the code, I’ve taken no consideration for the weights a user has given to tags.
ActiveRecord Antics
When I was iterating on the implementation, I was writing and refining lots of SQL. I wrote a handful of RSpec specs that verified I had valid queries. And as I drew closer to testing this in the User Interface (UI), I knew that I had one significant problem to address.
I wanted to return an ActiveRecord::Relation object; that’s an object from which you can chain ActiveRecord::Base.scope calls and other ActiveRecord::QueryMethods methods.
The reason is that I really wanted to re-use two Article methods:
- .limited_column_select
- .includes(top_comments: :user)
In the case of .limited_column_select, I wanted to ensure that I wasn’t returning all of the columns from the articles table. I wanted to avoid duplicating the knowledge of what fields should be included in the result set.
A lot of things have been written about Don’t Repeat Yourself (DRY) principles. But I believe it’s more important to focus on “Don’t Repeat Knowledge”.
More important was wanting to re-use .includes(top_comments: :user). Without those eager includes, when it came time to render the feed, each result would query the top comment and its associated user. So the naive implementation would result in 2N+1 queries, where N was the number of articles in the result set.
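The 2N+1 arithmetic can be checked with a small plain-Ruby simulation. This is not ActiveRecord: FakeStore is a hypothetical in-memory stand-in that only counts lookups, and it assumes one top comment per article. The batched version is in the spirit of eager loading, not a description of the exact queries Rails issues.

```ruby
# A tiny in-memory stand-in for a database that counts every "query".
class FakeStore
  attr_reader :query_count

  def initialize(article_count)
    @query_count = 0
    @articles = (1..article_count).map { |id| { id: id } }
    @top_comments = @articles.to_h { |a| [a[:id], { article_id: a[:id], user_id: a[:id] + 100 }] }
    @users = @top_comments.values.to_h { |c| [c[:user_id], { id: c[:user_id] }] }
  end

  def articles
    @query_count += 1
    @articles
  end

  # Batched lookups: one query for many rows.
  def top_comments_for(article_ids)
    @query_count += 1
    @top_comments.values_at(*article_ids)
  end

  def users_for(user_ids)
    @query_count += 1
    @users.values_at(*user_ids)
  end
end

# Naive rendering: one query for the articles, then two more per article.
def naive_query_count(article_count)
  store = FakeStore.new(article_count)
  store.articles.each do |article|
    comment = store.top_comments_for([article[:id]]).first
    store.users_for([comment[:user_id]])
  end
  store.query_count
end

# Batched rendering: one query each for articles, top comments, and users.
def eager_query_count(article_count)
  store = FakeStore.new(article_count)
  articles = store.articles
  comments = store.top_comments_for(articles.map { |a| a[:id] })
  store.users_for(comments.map { |c| c[:user_id] })
  store.query_count
end

puts naive_query_count(10)
puts eager_query_count(10)
```

With 10 articles the naive loop issues 21 lookups while the batched version issues 3, and the gap widens linearly with the size of the feed.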
My solution was to perform some ActiveRecord
antics. I had previously done extensive antics in my query implementations of Sipity. Fortunately, for the Forem implementation I didn’t need to dive deep into Arel.
Below is the part of the implementation on which I want to focus; the preceding numbers are the line numbers I’ll use in the example. Remember you can see the whole class at this Gist.
1 Article.where(
2   Article.arel_table[:id].in(
3     Arel.sql(
4       Article.sanitize_sql(unsanitized_sub_sql)
5     )
6   )
7 ).limited_column_select.includes(top_comments: :user).order(published_at: :desc)
Let’s work from the inside out.
Starting with line 4: Article.sanitize_sql(unsanitized_sub_sql). The unsanitized_sub_sql is the SQL that is built from the scoring method configuration and the necessary INNER JOIN and LEFT OUTER JOIN clauses to build the query.
It is a non-trivial query, and writing it mostly by hand made this easier to implement.
The Article.sanitize_sql call ensures that we have sanitized the output.
I believe my implementation has avoided any SQL injection vectors.
Moving out to line 3: Arel.sql marks the resulting string as safe SQL.
Without this step, the sanitized_sql result will be treated as NULL.
Line 3 returns a valid sub-query with a SQL select clause of SELECT articles.id.
Moving out to line 2: the Article.arel_table[:id].in resolves to the SQL fragment: articles.id IN (sub-query).
Line 2 is the inflection point that moves us from hand-written SQL into the ActiveRecord::Querying module space.
Moving out to line 1: this is where we now use the ActiveRecord::Querying.where method, which is very familiar to Ruby on Rails (Rails) developers.
Moving on to line 7: because we now have an ActiveRecord::Relation object, we can chain model scopes and get all of the ActiveRecord goodness.
Conclusion
There’s quite a lot going on with my proposed change, but I wanted to share two relevant bits that might make you curious to take a look at the implementation details.
I’m uncertain when we’ll be deploying this change, as I need to wire in some performance instrumentation to compare against the original strategy.
All of this is in service of trying to improve the baseline feed experience and bring some further insight and clarity into how things make it into a given user’s feed.