Keeping the Stakes Low while Breaking Production

Using Existing Tooling to Make Changes Reversible

I deployed a major change to the production servers of DEV. You can read more about that change at Feature update: Feed - DEV Community.

Adjusting the feed is a big deal. It’s the front door of DEV.

Preparing for Change

We wanted to gather information about the impact of changing the front door. At Forem we use the Field Test gem. This allows us to run different experiments. I configured Field Test to assign 20% of the folks to the new strategy and 80% of the folks to the prior strategy.
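To make the 20/80 split concrete, here is a minimal sketch of weighted, deterministic variant assignment in plain Ruby. It is not Field Test's implementation. The variant names, the experiment name, and the hashing approach are all assumptions made for illustration.

```ruby
require "digest"

# Hypothetical weights mirroring the 20/80 split described above.
VARIANT_WEIGHTS = { "new_strategy" => 20, "prior_strategy" => 80 }.freeze

# Deterministically map a participant to a variant so the same person
# lands in the same bucket on every request.
def assign_variant(experiment, participant_id, weights = VARIANT_WEIGHTS)
  total = weights.values.sum
  digest = Digest::SHA256.hexdigest("#{experiment}:#{participant_id}")
  # Use the first 32 bits of the digest as a number in 0...total.
  bucket = digest[0, 8].to_i(16) % total

  # Walk the variants, consuming each one's share of the range.
  weights.each do |variant, weight|
    return variant if bucket < weight
    bucket -= weight
  end
end

puts assign_variant("feed_strategy", 42)
```

Because the assignment is derived from a hash rather than a random draw, a person who sees the new strategy once keeps seeing it, which matters when you are trying to reproduce a report from a specific segment.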

To insulate ourselves from upstream libraries, we often wrap those libraries in our own implementations. This helps with the principle of least exposure: rely on the narrowest interface possible. I wrote the AbExperiment class to be the wrapper around Field Test. With the wrapper, I allowed the system to set an ENV variable and thus force all requests to use the same experiment. This helped with running local tests.
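The wrapper idea can be sketched roughly as follows. This is not the actual AbExperiment class. The class name, the ENV key, and the backend interface below are hypothetical stand-ins, and the backend is injected as a callable so the sketch runs without Field Test.

```ruby
# Hypothetical wrapper: the class name, ENV key, and backend interface
# are illustrative assumptions, not Forem's actual code.
class ExperimentWrapper
  ENV_KEY = "FORCED_FEED_VARIANT"

  # backend: anything that responds to call(experiment, user) and
  #          returns a variant name (Field Test would sit behind this).
  # env:     ENV by default; a plain Hash works in tests.
  def initialize(backend:, env: ENV)
    @backend = backend
    @env = env
  end

  def variant_for(experiment:, user:)
    forced = @env[ENV_KEY]
    # A non-empty override wins for every request.
    return forced unless forced.nil? || forced.empty?

    @backend.call(experiment, user)
  end
end

backend = ->(_experiment, _user) { "prior_strategy" }
wrapper = ExperimentWrapper.new(backend: backend, env: { "FORCED_FEED_VARIANT" => "new_strategy" })
puts wrapper.variant_for(experiment: :feed_strategy, user: nil)
```

Callers only ever see `variant_for`, so swapping the experiment library, or bypassing it entirely with the override, does not ripple through the rest of the application.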

In addition, with that change, I created a means for our Site Reliability Engineers (SREs) to disable, with an application restart, the changes that I made. They would just need to add an ENV variable.

The next step came about when I learned more about our use of Flipper, a Ruby gem for dynamically toggling features on and off. I didn’t know when the feature would roll out, but I wanted control over the feature. I also wanted admins of other Forems to have control as well. This was trivial with Flipper. Once I deployed the code, Forems got the original behavior unless they “flipped” on the feature.
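The shape of that guard is roughly the following. To keep the sketch runnable without any gems, it uses a small in-memory toggle store as a stand-in. In an application that uses Flipper, calls such as `Flipper.enabled?(:some_feature)` and `Flipper.enable(:some_feature)` would play that role. The feature name and strategy names here are hypothetical.

```ruby
# In-memory stand-in for a feature-toggle store, used only so this
# sketch runs on its own. It is not Flipper.
class InMemoryToggles
  def initialize
    @enabled = {}
  end

  def enable(name)
    @enabled[name.to_sym] = true
  end

  def disable(name)
    @enabled[name.to_sym] = false
  end

  # Unknown features are treated as off, so deployed code keeps the
  # original behavior until someone turns the feature on.
  def enabled?(name)
    @enabled.fetch(name.to_sym, false)
  end
end

# Hypothetical feature and strategy names.
def feed_strategy(toggles, feature: :new_feed_strategy)
  toggles.enabled?(feature) ? :new_strategy : :original_strategy
end

toggles = InMemoryToggles.new
puts feed_strategy(toggles)
toggles.enable(:new_feed_strategy)
puts feed_strategy(toggles)
```

Defaulting to off is the important design choice: shipping the code and releasing the feature become two separate decisions, and the second one can be reversed without a deploy.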

Enter the Bug

I flipped the feature on in the mid-morning, and things looked fine until someone reported a broken feed. They had no articles in their Relevant feed. I checked the SQL query and things appeared to work. I also checked the database, and saw that I was in the same experiment segment. Things worked for me…until they didn’t. As in things worked, and then after poking around a bit they failed.

Now with confirmation, I pinged our SREs to toggle off the feature. It takes a minute or so for the feature toggle to propagate. This meant we didn’t have to rush out a hot fix. If you’ve ever done one of those, you know the odds are about even that you first make things worse.

With the immediate problem contained, we moved into figuring out the problem. A few SREs and I started debugging. They have access to the Rails console, and we reconstructed the situation.

And in a few minutes, we found the problem. I spent the next hour writing a new pull request to fix the problem. As of this writing, I haven’t merged that pull request.

Once we merge the pull request, we’ll need to toggle the feature back on and watch to see if we’ve fixed things.

Prepare for Problems

This was my first major feature change release, and the collaboration during the pull request really helped prepare all of us for what we would do if things went sideways.

I was a bit disheartened when an SRE and I decided to turn off the new feature. I wanted it to work, to be bug free. But that was not the case.

Turning off the feature freed my mind from worrying about all the folks on DEV. And we could all turn and calmly explore the root cause of the bug.

That calmness helped create the space for finding the problem. And it took all of us.

The SREs had powers I didn’t have and I had context that they didn’t.

For the Curious

We ran the following in the console:

  user: User.find_by(id: 1),
  page: nil).call.to_sql

We then pasted that into Blazer and started looking at the SQL. As we moved around the massive SQL statement, we saw the culprit: a very narrow range of allowed publication dates for articles.

Alas, temporality strikes again!

Lessons Learned

I deployed a logic bug. My change broke the feed for some folks. What could’ve been a super stressful afternoon was instead a calm exploration of the problem.

This calmness was made possible because I added two key mechanisms for testing and rollback. This bit of effort meant that we could quickly and confidently roll back the change.

And with the fire put out, we could approach the problem with clearer minds. We didn’t have the stress of a broken system screaming on the sidelines saying “Fix it! Fix it! Fix it!”