Take on Recursion Redux

Thanking a Reader and Adding Some Recursion Testing Strategies

But what if we proceed in a zigzag fashion, as in this maze, or within any labyrinthine structure? We wrinkle time, you see. Divagations! If we linger in these loopholes of history, these timeless tortuous stretches, perhaps we shall never come to the end, or if we do, we will find an opening there and start again.

Thank you Eshel Yaron for your Take on Recursion. A response and improvement on my Using Built-in Emacs 29 Tree Sitter Package to Get Qualified Ruby Function Name and Expanding on using Tree Sitter in Emacs for Getting Qualified Ruby Method Name posts. I went ahead and cleaned up my function to incorporate the accumulator object.

This past week, I’ve been bumping up against recursion and wanted to write some testing strategies. These examples will be written in Ruby 📖 and use RSpec 📖 as the testing framework.

Glossary

For this post I’m going to use the following glossary of terms:

Original File
A file that is integral to the description of the Work. Examples include: a pre-print Portable Document Format (PDF 📖) of the journal article; a power-point presentation file; an audio recording of a lecture.
Derivative File
A file that was “derived” from the Original File.
File Set
A container for an Original File and associated Derivative Files.
Work
Descriptive metadata that describes a concept, which might have one or more Original Files. Each Original File will be written to a File Set.

The above is a quick summary of some of the concepts of Portland Common Data Model (PCDM 📖).

Current Implementation

In Samvera 📖’s Hyrax’s framework, the default assumption is that for each provided Original File, Hyrax will generate one or more Derivative Files. And, , if you already have existing an Derivative File you want to associate, instead of having Hyrax generate it, you will need to dig into the internal Application Programming Interfaces (APIs 📖) and make adjustments.

In other words, if you already have derivatives you want to re-use there is a technical barrier that adopters must overcome in Hyrax. This is a common practice, especially when you are migrating content from one system to another. After all, we’d prefer to not again spend the computation cycles to re-generate these derivatives. Also, in some cases, an automatically generated derivative is not ideal; I want a very specific thumbnail to represent a still frame from a video.

Part of our recent work at Software Services by Scientist.com 📖 has been re-architecting the ingest process along three vectors:

  • Leveraging Amazon Web Services (AWS 📖) Lambda functions to generate derivatives.
  • Refactoring the Work derivative creation to look first for a named local Derivative File, failing that a named remote Derivative File (e.g. from AWS), and failing that generate the Derivative File locally.
  • Splitting PDFs into constituent pages for Optical Character Recognition (OCR 📖) and rendering in International Image Interoperability Framework (IIIF 📖) image viewers.

The change I want to make is to have explicit declarations of the expected derivatives and to have explicit declarations of their dependencies.

For example, we have found that the best results for generating a hocr file from Tesseract is to provide a monochrome image. Thus a pre-requisite of hocr is monochrome. And a pre-requisite of monochrome is that we have an image. To create an alto file, we need a hocr.

Given the above three vectors, I felt that it would be best to take the above and determine the order in which we should generate derivatives so that dependencies are met…enter recursion.

My Strategy for Engaging with Recursion

I sometimes find recursion to be complicated. I need to juggle the call stack state, tail conditions, and what I’m doing at each function call.

In the above case of derivative dependencies, I found that I could represent those as a hash of symbols. Here is an example:

{
  hocr: [:monochrome],
  monochrome: [:image],
  alto: [:hocr],
  image: []
}

The keys (e.g. hocr:, monochrome:, alto:, and image:) are the derivatives, and the array values (e.g. [:monochrome], [:image], [:hocr], []) are the dependencies. The sequence would be [:image, :monochrome, :hocr, :alto].

A key strategy I have is to simplify what I’m doing in each function and then bombard it with tests.

In the above case each symbol (e.g. hocr:, image:, etc.) represents a named derivative concept that represents generator functions as well as files on the file system. But the symbols make it easy; I can bombard my tests with primitives and then later translate into the complex.

Enter the Testing

The first test I like to write is a simple case.

RSpec.describe Chain do
  it "works with a simple case" do
    chain = Chain.new(image: [])
    expect(chain.to_a).to eq([:image])
  end
end

The code for making that work is rather straight forward:

class Chain
  def initialize(hash:)
    @hash = hash
  end
  attr_reader :hash

  def to_a
    hash.keys
  end
end

The above is most definitely a cheat; and I know it. So let’s introduce the second test:

RSpec.describe Chain do
  it "works with a simple case" do
    chain = Chain.new(image: [])
    expect(chain.to_a).to eq([:image])
  end

  it "works with a slightly more complex case" do
    chain = Chain.new(monochrome: [:image], image: [])
    expect(chain.to_a).to eq([:image, :monochrome])
  end
end
require 'set'
class Chain
  def initialize(hash)
    @hash = hash
  end
  attr_reader :hash

  def to_a
    accumulator = Set.new
    hash.each do |derivative, dependencies|
      accumulate(derivative, dependencies, accumulator)
    end
    accumulator.to_a
  end

  def accumulate(derivative, dependencies, accumulator)
    # Before we add the derivative to the accumulator, ensure that we
    # handle dependencies
    Array(dependencies).each do |child, child_deps|
      accumulate(child, child_deps, accumulator)
    end
    # Now that we've handled the dependencies, we can add the
    # derivative.
    accumulator << derivative
  end
end

That was a big change in logic, but the tests pass. Now I want to look at introducing a test table, instead of those individual spec statements.

The purpose of the test table is to provide a quick means of bombarding a function with many different scenarios. Ideally, I can represent those scenarios with primitives (instead of domain objects).

RSpec.describe Chain do
  # We have an array of arrays.  For the inner array:
  #
  # - the first element is the derivatives dependency hash.
  # - the second element is the expected order.
  [
    [{ image: [] }, [:image]],
    [{ monochrome: [:image], image: [] }, [:image, :monochrome]],
  ].each do |derivatives, expected|
    context "with #{derivatives.inspect}" do
      subject { Chain.new(derivatives).to_a }
      it { is_expected.to eq(expected) }
    end
  end
end

The above code is a refactor of our the initial tests. With this structure, I find it easier to add additional tests to the test table. I use the above method when testing regular expressions. Isolate the function for the regular expression and then bombard with test cases to provide examples of expected behavior.

Finding a Bug as We Bombard

Let’s add a few more:

RSpec.describe Chain do
  # We have an array of arrays.  For the inner array:
  #
  # - the first element is the derivatives dependency hash.
  # - the second element is the expected order.
  [
    [{ image: [] }, [:image]],
    [{ monochrome: [:image], image: [] }, [:image, :monochrome]],
    # Test that the order of the hash is not important
    [{ image: [], monochrome: [:image] }, [:image, :monochrome]],
    # Test nested dependencies
    [{ a: [:b, :c], b: [:c], c: [] }, [:c, :b, :a]],
    # What if we forget to declare :b but it’s a dependency
    [{ a: [:b] }, [:b, :a]]
  ].each do |derivatives, expected|
    context "with #{derivatives.inspect}" do
      subject { Chain.new(derivatives).to_a }
      it { is_expected.to eq(expected) }
    end
  end
end

With the above, I encountered the following failure:

Failures:

 1) Chain with {:a=>[:b, :c], :b=>[:c], :c=>[]} is expected to
    eq [:c, :b, :a]
    Failure/Error: it { is_expected.to eq(expected) }

       expected: [:c, :b, :a]
	    got: [:b, :c, :a]

       (compared using ==)

       Diff:
       @@ -1 +1 @@
       -[:c, :b, :a]
       +[:b, :c, :a]

I commented out all but the failing scenario, and began debugging. I find that debugging recursive functions requires a lot of concentration and thinking. When I’m at my best, if I encounter this problem, I will first stand-up and walk around just a bit. This helps dislodge my current thinking and allows me to return with a different perspective.

Ah right, I forgot that the dependencies are an Array and to find the dependency’s dependencies, I need to check with the original @hash.

I make the following change to the Chain#accumulate method and the above tests pass.

def accumulate(derivative, dependencies, accumulator)
  # Before we add the derivative to the accumulator, ensure that we
  # handle dependencies
  Array(dependencies).each do |child|
    # Need to get child dependency’s dependencies by consulting the
    # original hash.  Fallback to “there are no dependencies”
    child_deps = hash.fetch(child, [])
    accumulate(child, child_deps, accumulator)
  end
  # Now that we've handled the dependencies, we can add the
  # derivative.
  accumulator << derivative
end

I’ve tested the happy paths.

Guarding Against Infinity

If there is a cyclical dependency (e.g. :a requires :b which requires :a), I want to raise an exception.

For this case I’m going to introduce a time to live variable. When any single recursion branch exceeds the time to live variable, I’ll raise an exception. Note: Ruby will raise an exception on this infinity; but I find it a matter of good practice to name the exception and be explicit about it; that way future me has a better chance of narrowing down the problem.

Let’s wire up the spec:

context 'when I have a cyclical dependency' do
  it "will raise a CyclicalDependencyError" do
    expect do
      Chain.new(a: [:b], b: [:a]).to_a
    end.to raise_exception(CyclicalDependencyError)
  end
end

The above has a different structure than the test table; hence it gets its own test. To make the spec pass, I again adjust the Chain#accumulate function:

def accumulate(derivative, dependencies, accumulator, ttl = 10)
  raise CyclicalDependencyError if ttl < 0
  Array(dependencies).each do |child|
    child_deps = hash.fetch(child, [])
    accumulate(child, child_deps, accumulator, ttl - 1)
  end
  accumulator << derivative
end

What I like about the time to live variable is that I can also choose to pass that in at initialization. Which can help improve test performance if the default time to live is large and the objects in the recursion are “expensive” to process/instantiate.

The Final Example Form

Below is the “final” Chain class and its associated specs.

The Chain and Specs in One Code Block
require 'rspec'
require 'set'
class CyclicalDependencyError < StandardError; end
class Chain
  def initialize(hash)
    @hash = hash
  end
  attr_reader :hash

  def to_a
    accumulator = Set.new
    hash.each do |derivative, dependencies|
      accumulate(derivative, dependencies, accumulator)
    end
    accumulator.to_a
  end

  def accumulate(derivative, dependencies, accumulator, ttl = 10)
    raise CyclicalDependencyError if ttl < 0
    Array(dependencies).each do |child|
      child_deps = hash.fetch(child, [])
      accumulate(child, child_deps, accumulator, ttl - 1)
    end
    accumulator << derivative
  end
end

RSpec.describe Chain do
  # We have an array of arrays.  For the inner array:
  #
  # - the first element is the derivatives dependency hash.
  # - the second element is the expected order.
  [
    [{ image: [] }, [:image]],
    [{ monochrome: [:image], image: [] }, [:image, :monochrome]],
    # Test that the order of the hash is not important
    [{ image: [], monochrome: [:image] }, [:image, :monochrome]],
    # Test nested dependencies
    [{ a: [:b, :c], b: [:c], c: [] }, [:c, :b, :a]],
    [{ a: [:b] }, [:b, :a]],
  ].each do |derivatives, expected|
    context "with #{derivatives.inspect}" do
      subject { Chain.new(derivatives).to_a }
      it { is_expected.to eq(expected) }
    end
  end

  context 'when I have a cyclical dependency' do
    it "will raise a CyclicalDependencyError" do
      expect do
        Chain.new(a: [:b], b: [:a]).to_a
      end.to raise_exception(CyclicalDependencyError)
    end
  end
end

Back to the Current Implementation

As I mentioned at the beginning, this came about as I was working to refactor/re-architect along three vectors:

  • Leveraging AWS Lambda
  • Refactoring the Work derivative creation sequence
  • Splitting PDFs

Because of the mental complexity of the above, and knowing that I would need to determine and validate the dependency chain, I chose to write a simple function that would flatten the dependency graph.

And from that simple function, I could proceed with the more complicated refactor/re-architect knowing that I wasn’t going to introduce cyclical dependencies, and could proceed in a linear fashion through those dependencies.

For those curious, my work is happening in the Derivative::Rodeo 📖 gem; an in-progress work. And you can read up on the SpaceStone::Derivatives::Chain to see a more refined and how I incorporate that into the larger ecosystem.