I’m a Pivotal Labs developer at our NYC offices working on the Casebook development team. Casebook is a child-welfare-focused web application used by governments and non-profit organizations. Our users are social workers, caseworkers, and their leadership who work with children, families, and the broader community to provide services that ensure children are safe and healthy.
Search worries
Our users need to quickly find accurate information about the people on their workload to respond appropriately in crises and keep a high quality written record of their work with the children and families.
Solr powered Casebook’s initial search engine. Solr is built in Java, so we set up our application servers to run Java alongside our Ruby on Rails web application. We maintained a real-time copy of our important searchable data, such as people’s names, in our Solr index.
Our Solr-based approach ran into a few problems. Sometimes users would see outdated search results or, even worse, errors. This was annoying and also potentially damaging to our users’ ability to keep up with emergency situations.
Keeping our data synched in multiple locations caused most of our problems with Solr. Some of our more complex code paths would update the database but not propagate those changes to the search index. Users saw search-related error messages when there were communication problems with our Solr instances.
We had some fail-safes in place.
We wrote code that automatically restarted the Solr instances when they crashed. When we found the search data diverged from our application data, we manually rebuilt the search index to get the two data stores back in sync. These solutions just managed our problems rather than solving them.
These problems aren't unique to Solr. Other tools like Lucene, Ferret, and Sphinx have the same shortcomings when combined with Ruby on Rails.
Using the database itself as the search index
So the thought occurred to our team that we ought to try to make the database itself be the search index. We use a PostgreSQL database, and PostgreSQL 8.3 and later have built-in support for full-text search. PostgreSQL is a popular, mature SQL database solution that works great with Active Record. If you use Heroku, then you are already using a PostgreSQL 8.3 database that supports full-text search.
Since full-text search in PostgreSQL uses fairly complex SQL queries, we decided that the best approach would be to take advantage of Active Record's scopes. The idea is to make it easy to write code that looks like this:
Book.search_title("Ruby").include(:author).where("created_at > ?", 1.year.ago).limit(10)
So, I am proud to introduce pg_search, a Ruby gem that makes it easy to build search scopes that work just like this.
Now that I'm starting to use DelayedJob to perform jobs in the future in my Heroku Sinatra app, its important that they happen at the scheduled time. But unless you pay attention, you'll find that times get mysteriously changed -- in my case, since I'm in San Francisco in the wintertime, by +/-8 hours -- which means that some conversion to or from UTC is being attempted, but it's only working halfway.
Trying to keep a handle on which libraries are attempting, and which are failing, to convert times is a losing battle, so I'm trying to do the right thing and save all my times in the database in UTC, and convert them to and from the user's local time as close to the UI as possible. Unfortunately, a variety of gotchas in Ruby and ActiveRecord and PostgreSQL makes this trickier than it should be. Here's a little catalog of my workarounds.
One of our projects had a pending chore in Tracker to move its backend to PostgreSQL from MySQL. This project has about a quarter of a million rows of production data and around a hundred tables in its schema which needed to be exactly migrated into PostgreSQL.
Forklifting the data proved more complicated than expected due to incompatibilities in the two DBMS' syntax such as in the way string escaping worked, how booleans were represented and a bunch of other small but painful differences. Despite MySQL's mysqldump utility including a command-line option to write statements in PostgeSQL format, it became clear that it wasn't going to be simple to create a repeatable procedure to do this work across our environments.
There's a bunch of information out there about how to approach this problem but none felt right. Most are multi-step manual procedures that require altering a dump file using sed or perl and others require the data to be loaded into an intermediary database and massaged prior to import. After testing some of these approaches, Todd and I decided to timebox ourselves to an hour to test the viability of a Ruby script using the DBI gem to move the data. We came up with:
require 'dbi'
require 'dbd/mysql'
require 'dbd/pg'
begin
mysql = DBI.connect("DBI:Mysql:source:localhost", "username", "password")
postgres = DBI.connect("DBI:Pg:destination:localhost", "username", "password")
mysql.select_all("SHOW TABLES") do |table|
next if ['schema_migrations', 'sessions'].include?(table.to_s)
select = mysql.execute("SELECT * FROM #{table}")
columns = select.column_names.map { |key| "\"#{key}\"" }.join(', ')
placeholders = (['?'] * select.column_names.size).join(', ')
insert = postgres.prepare("INSERT INTO #{table} (#{columns}) VALUES(#{placeholders})")
select.each { |row| insert.execute(*row) }
insert.finish
end
rescue DBI::DatabaseError => e
puts "Error #{e.err}: #{e.errstr}"
ensure
mysql.disconnect if mysql
postgres.disconnect if postgres
end
Our antiquely Perl-like script worked better than we expected — our application started right up with all of its data intact.
Has anybody out there encountered this need before? What kinds of solutions did you come up with?
