
Practical use of Ruby PStore


Arkency blog has undergone several improvements over recent weeks. One of those changes was opening up the source of blog articles. We concluded that having the posts in the open would shorten the feedback loop and let our readers collaborate and make the articles better for everyone.

Nanoc + Github

For years the blog has been driven by nanoc, a static-site generator. You put a bunch of markdown files in, drop in a layout, and out of the other side comes the HTML. Let's call this magic "compilation". One of nanoc's prominent features is data sources. With them, content can be rendered not only from the local filesystem. Given an appropriate adapter, posts, pages and other data items can be fetched from a 3rd party API. Like an SQL database. Or Github!

Choosing Github as a backend for our posts was a no-brainer. Developers are familiar with it. It has quite a nice integrated web editor with Markdown preview, which gives in-place editing. Pull requests create the space for discussion. Last but not least, there is the octokit gem for API interaction, taking much of the implementation burden off our shoulders.

An initial data adapter to fetch articles looked like this:

class Source < Nanoc::DataSource
  identifier :github

  def items
    client = Octokit::Client.new(access_token: ENV['GITHUB_TOKEN'])
    client
      .contents(ENV['GITHUB_REPO'])
      .select { |item| item[:name].end_with?(".md") }
      .map    { |item| client.contents(ENV['GITHUB_REPO'], path: item[:path]) }
      .map    { |item| new_item(item[:content], item, Nanoc::Identifier.new(item[:path])) }
  end
end

This code was good enough for a quick spike and an exploration of the problem. It becomes problematic as soon as you start using it for real. Can you spot the problems?

Source data improved

For a repository with 100 markdown files we have to make 100 + 1 HTTP requests to retrieve the content: one to list the directory and one for each file.

Making those requests in parallel will only make us hit the request quota faster. Something has to be done to limit the number of requests needed.
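Octokit conveniently exposes Github's rate limit endpoint, so we can watch the quota drain. A quick check looks like this (using the same GITHUB_TOKEN as in the adapter above):

require "octokit"

client = Octokit::Client.new(access_token: ENV['GITHUB_TOKEN'])
limit  = client.rate_limit

# Prints how much of the hourly quota remains and when it resets.
puts "#{limit.remaining}/#{limit.limit} requests left, reset at #{limit.resets_at}"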

Luckily, the octokit gem uses the faraday library for HTTP interaction, and some kind souls have documented how to leverage the faraday-http-cache middleware.

class Source < Nanoc::DataSource
  identifier :github

  def up
    stack = Faraday::RackBuilder.new do |builder|
      builder.use Faraday::HttpCache,
                  serializer: Marshal,
                  shared_cache: false
      builder.use Faraday::Request::Retry,
                  exceptions: [Octokit::ServerError]
      builder.use Octokit::Middleware::FollowRedirects
      builder.use Octokit::Response::RaiseError
      builder.use Octokit::Response::FeedParser
      builder.adapter Faraday.default_adapter
    end
    Octokit.middleware = stack
  end

  def items
    repository_items.map do |item|
      identifier     = Nanoc::Identifier.new("/#{item[:name]}")
      metadata, data = decode(item[:content])

      new_item(data, metadata, identifier, checksum_data: item[:sha])
    end
  end

  private

  def repository_items
    # One request lists the directory; file contents are then
    # fetched concurrently on a small thread pool.
    pool  = Concurrent::FixedThreadPool.new(10)
    items = Concurrent::Array.new
    client
      .contents(repository, path: path)
      .select { |item| item[:type] == "file" }
      .each   { |item| pool.post { items << client.contents(repository, path: item[:path]) } }
    pool.shutdown
    pool.wait_for_termination
    items
  rescue Octokit::NotFound
    []
  end

  def client
    Octokit::Client.new(access_token: access_token)
  end

  def repository
    # ...
  end

  def path
    # ...
  end

  def access_token
    # ...
  end

  def decode(content)
    # ...
  end
end

Notice the two main additions here: the Faraday::HttpCache middleware installed in up, and the thread pool in repository_items that fetches file contents concurrently.

If only that cache worked... faraday-http-cache ships with an in-memory store by default, which is useless for the flow of work one has with nanoc. We'd very much like to persist the cache across runs of the compile process. The documentation indeed shows how to switch the cache backend to one from Rails, but that is not helpful advice in the nanoc context either. You probably wouldn't like to start a Redis or Memcached instance just to compile a bunch of HTML!

Time to roll up our sleeves again. Knowing what API is expected, we can build a file-based cache backend. And there is a little-known standard library gem we can use to free ourselves from reimplementing the basics. Standing on the shoulders of giants once again.

Enter PStore

PStore is a file-based persistence mechanism based on a Hash. It can store arbitrary Ruby objects: they're serialized with Marshal before being dumped to disk. It supports transactional behaviour and can be made thread-safe. Sounds perfect for the job!
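A minimal round trip with PStore itself (the file name is arbitrary):

require "pstore"

store = PStore.new("example.store")

# Writes must happen inside a transaction, which commits atomically.
store.transaction { store[:greeting] = "hello" }

# Passing true opens a read-only transaction that raises on write attempts.
store.transaction(true) { store[:greeting] } # => "hello"

Wrapping it in the interface faraday-http-cache can call into gives a complete cache store: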

class Cache
  def initialize(cache_dir)
    # The second argument enables PStore's thread-safe mode.
    @store = PStore.new(File.join(cache_dir, "nanoc-github.store"), true)
  end

  # The options argument is unused; it is only there to mirror
  # the Rails-style cache store interface.
  def write(name, value, options = nil)
    store.transaction { store[name] = value }
  end

  def read(name, options = nil)
    # A read-only transaction (true) suffices for lookups.
    store.transaction(true) { store[name] }
  end

  def delete(name, options = nil)
    store.transaction { store.delete(name) }
  end

  private
  attr_reader :store
end

In the end, that cache store turned out to be merely a wrapper around PStore. How convenient! Thread safety is achieved by using a Mutex internally around each transaction block.
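A quick round trip shows it in action (assuming the tmp directory exists):

cache = Cache.new("tmp")
cache.write("last-response", { status: 200, body: "<html>…</html>" })
cache.read("last-response")   # => {:status=>200, :body=>"<html>…</html>"}
cache.delete("last-response")
cache.read("last-response")   # => nil

With that, plugging the store into the middleware stack is a one-line change: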

class Source < Nanoc::DataSource
  identifier :github

  def up
    stack = Faraday::RackBuilder.new do |builder|
      builder.use Faraday::HttpCache,
                  serializer: Marshal,
                  shared_cache: false,
                  store: Cache.new(tmp_dir)
      # ...            
    end
    Octokit.middleware = stack
  end

  # ...
end 

With a persistent cache store plugged into Faraday, we can now reap the benefits of cached responses. Subsequent requests to the Github API are skipped and responses are served directly from local files. That is, as long as the cache stays fresh…

Cache validity can be controlled by several HTTP headers. In the case of the Github API it is Cache-Control: private, max-age=60, s-maxage=60 that matters. Together with the Date header, this roughly means that the content stays valid for 60 seconds after the response was received. Is that enough? For frequently changing content, probably. For blog articles I'd prefer something more long-lasting…
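The freshness check boils down to roughly this (a simplified sketch of the idea, not faraday-http-cache's actual code; the headers are made up for illustration):

require "time"

# Hypothetical headers of a cached Github API response:
response_headers = {
  "Date"          => "Thu, 01 Jan 2026 12:00:00 GMT",
  "Cache-Control" => "private, max-age=60, s-maxage=60"
}

received_at = Time.httpdate(response_headers["Date"])
max_age     = response_headers["Cache-Control"][/max-age=(\d+)/, 1].to_i

# The cached response may be served only while its age is below max-age.
fresh = (Time.now - received_at) < max_age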

And that is how we arrive at the last piece of nanoc-github: a faraday middleware that allows extending the cache time. It is quite a primitive piece of code that substitutes the max-age value with the desired one. For my particular needs I set this value to 3600 seconds. The general idea is that we modify HTTP responses from the API before they hit the cache. The cache middleware then examines cache validity based on the modified age rather than the original one. Simple and good enough. Just be careful to add this to the middleware stack in the correct order 😅

class ModifyMaxAge < Faraday::Middleware
  def initialize(app, time:)
    @app  = app
    @time = Integer(time)
  end

  def call(request_env)
    @app.call(request_env).on_complete do |response_env|
      response_env[:response_headers][:cache_control] = "public, max-age=#{@time}, s-maxage=#{@time}"
    end
  end
end
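As for the ordering: responses travel back up the middleware stack, so ModifyMaxAge has to sit below Faraday::HttpCache in the builder, between the cache and the adapter. One way to arrange the final stack (a sketch based on the up method above, with my time: 3600 setting):

stack = Faraday::RackBuilder.new do |builder|
  builder.use Faraday::HttpCache,
              serializer: Marshal,
              shared_cache: false,
              store: Cache.new(tmp_dir)
  # Below HttpCache in the builder: responses pass through here first
  # on their way back up, so the cache stores the rewritten headers.
  builder.use ModifyMaxAge, time: 3600
  builder.use Faraday::Request::Retry,
              exceptions: [Octokit::ServerError]
  builder.use Octokit::Middleware::FollowRedirects
  builder.use Octokit::Response::RaiseError
  builder.use Octokit::Response::FeedParser
  builder.adapter Faraday.default_adapter
end
Octokit.middleware = stack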

And that's it! I hope you found this article useful and learned a thing or two. Drop me a line on my twitter or leave a star on the nanoc-github project.

Happy hacking!