Pivotal Labs

Main menu

Skip to primary content
Skip to secondary content
  • About
  • Case Studies
  • Team
    • Executives
    • Locations
      • San Francisco (HQ)
      • Boston
      • Boulder
      • Denver
      • London
      • Los Angeles
      • New York
  • Community
    • Blogs
    • Tech Talks
    • Events
  • Careers
    • Lifestyle
    • Principles & Practices
    • Benefits
    • FAQ
    • Apply
  • Contact
    • Press Room
    • Press Releases
    • In The News
    • Press Kit
  • All
  • Labs
  • Standup
  • Tracker
Grant Hutchins

pg_search: How I Learned to Stop Worrying and Love PostgreSQL full-text search

Grant Hutchins
Friday, February 11, 2011

I’m a Pivotal Labs developer at our NYC offices working on the Casebook development team. Casebook is a child-welfare-focused web application used by governments and non-profit organizations. Our users are social workers, caseworkers, and their leadership who work with children, families, and the broader community to provide services that ensure children are safe and healthy.

Search worries

Our users need to quickly find accurate information about the people on their workload to respond appropriately in crises and keep a high quality written record of their work with the children and families.

Solr powered Casebook’s initial search engine. Solr is built in Java, so we set up our application servers to run Java alongside our Ruby on Rails web application. We maintained a real-time copy of our important searchable data, such as people’s names, in our Solr index.

Our Solr-based approach ran into a few problems. Sometimes users would see outdated search results or, even worse, errors. This was annoying and also potentially damaging to our users’ ability to keep up with emergency situations.

Keeping our data synched in multiple locations caused most of our problems with Solr. Some of our more complex code paths would update the database but not propagate those changes to the search index. Users saw search-related error messages when there were communication problems with our Solr instances.

We had some fail-safes in place.

We wrote code that automatically restarted the Solr instances when they crashed. When we found the search data diverged from our application data, we manually rebuilt the search index to get the two data stores back in sync. These solutions just managed our problems rather than solving them.

These problems aren’t unique to Solr. Other tools like Lucene, Ferret, and Sphinx have the same shortcomings when combined with Ruby on Rails.

Using the database itself as the search index

So the thought occurred to our team that we ought to try to make the database itself be the search index. We use a PostgreSQL database, and PostgreSQL 8.3 and later have built-in support for full-text search. PostgreSQL is a popular, mature SQL database solution that works great with Active Record. If you use Heroku, then you are already using a PostgreSQL 8.3 database that supports full-text search.

Since full-text search in PostgreSQL uses fairly complex SQL queries, we decided that the best approach would be to take advantage of Active Record’s scopes. The idea is to make it easy to write code that looks like this:

Book.search_title("Ruby").include(:author).where("created_at > ?", 1.year.ago).limit(10)

So, I am proud to introduce pg_search, a Ruby gem that makes it easy to build search scopes that work just like this.

Installing pg_search is easy. If you’re using Bundler, just add

gem "pg_search"

to your Gemfile and you’re good to go.

To use pg_search to build the Book.search_title scope above, you would write:

class Book < ActiveRecord::Base
  include PgSearch
  pg_search_scope :search_title, :against => [:title]
end

It’s as simple as that!

Adding more features

We took cues from the texticle gem to figure out how to generate our SQL code. Thanks to Aaron Patterson for this wonderful gem! However, our Solr solution had several features that texticle and the basic PostgreSQL full-text search alone don’t currently provide, like ignoring diacritical marks (accents like ü), searching for soundalikes, and searching for words that are misspelled.

We spent a day or two trying to hack texticle into something we could use, but realized that if we started from scratch we could more easily build a gem that could combine more than one PostgreSQL feature into a single search scope. That way, we could improve our Book.search_title scope by using unaccent to ignore accent marks, Double Metaphone to match soundalikes, and trigrams to match misspellings.

So with all of these features turned on, we get the following code:

class Book < ActiveRecord::Base
  include PgSearch
  pg_search_scope :search_title,
    :against => [:title],
    :using => [:tsearch, :dmetaphone, :trigrams],
    :ignoring => :accents
end

Except for :tsearch, the default-in full-text search implementation, the other features require you to install certain contrib packages into your database. For now, this is an exercise for the reader, but we hope to help automate this process soon.

Our gem development approach

We started by taking our application with the existing Solr-based search intact and boosting our test coverage as we could to cover all of the different cases (misspellings, soundalikes, etc.) for some of our most complicated searchable models. Once we were satisfied with our test coverage, we completely removed the Solr search code and were left with dozens of failing tests.

We then created a blank gem and starting adding features to it one-by-one to get each of our application’s tests to pass. First we made sure that simple situations were solid, such as when the search query string exactly matches the searchable text.
Then we moved on to the complicated parts.

Our existing application uses Ruby 1.8 and Rails 2.3, and at the same time we have a new second project that uses Ruby 1.9 and Rails 3. So we made sure that all of our code worked in both environments. I will write another blog post soon about how we used two instances of autotest to make this easy to do.

The great thing about this approach is that we were able to start by defining a set of behaviors based on what our real-world application needed. This kept our code lean. Also, we were able to define our own syntax for the pg_search_scope method. By mimicking the Active Record scope syntax, hopefully we have created something that is easy to pick up. We would just add a new option to one of our calls to pg_search_scope and code until it worked as desired.

User impact

Our users have noticed the difference after we deployed our updated search implementation. We had been rebuilding the search index or troubleshooting a search related bug a few times a week. We haven’t seen a search related help request from our users since we made the changes. In addition, our developers are happier because code deployments are much more reliable and easy to understand.

Overall, the project has been a resounding success!

Getting involved

pg_search isn’t complete yet (will it ever be?). There are many more features we’d like to have to improve performance, search quality, and overall user experience.

For example, right now our developers have to hand-build SQL indexes to improve query speed. pg_search should automatically generate those indexes for us based on which PostgreSQL features are in use.

That’s just one example. We’d love to hear more ideas from you about how pg_search can improve to meet users needs.

To learn more, read our documentation. We also have a public Pivotal Tracker project for requesting features and bugfixes, and a Google Group for discussing pg_search and other Case Commons open source projects.

Also, the Casebook team is currently hiring for an Agile Developer.

  • 0 Shares
  • Share on Facebook
  • Share on Twitter

UTC vs Ruby, ActiveRecord, Sinatra, Heroku and Postgres

Alex Chaffee
Friday, January 22, 2010

Now that I’m starting to use DelayedJob to perform jobs in the future in my Heroku Sinatra app, its important that they happen at the scheduled time. But unless you pay attention, you’ll find that times get mysteriously changed — in my case, since I’m in San Francisco in the wintertime, by +/-8 hours — which means that some conversion to or from UTC is being attempted, but it’s only working halfway.

Trying to keep a handle on which libraries are attempting, and which are failing, to convert times is a losing battle, so I’m trying to do the right thing and save all my times in the database in UTC, and convert them to and from the user’s local time as close to the UI as possible. Unfortunately, a variety of gotchas in Ruby and ActiveRecord and PostgreSQL makes this trickier than it should be. Here’s a little catalog of my workarounds.


You must set both Time.zone = "UTC" and ActiveRecord::Base.default_timezone = :utc. Since I’m using Sinatra, not Rails, this stuff goes either in main (i.e. not inside any class) right after require 'active_record', or in a configure block in your app, depending on your preference.


When ActiveRecord creates queries — which are used for both reading and writing, mind you — it will only convert to UTC times that are instances of ActiveSupport’s proprietary TimeWithZone class. It will not convert regular Ruby Time objects, even though Time objects are perfectly aware of their time zones, and AR is perfectly aware that you’d prefer they be written as UTC (due to the default_timezone setting). This is clearly a bug IMHO, but the Rails core marked the bug as “will not fix”, so w/e. Here’s a monkey patch, courtesy of Peter Marklund:

  module ActiveRecord
    module ConnectionAdapters # :nodoc:
      module Quoting
        # Convert dates and times to UTC so that the following two will be equivalent:
        # Event.all(:conditions => ["start_time > ?", Time.zone.now])
        # Event.all(:conditions => ["start_time > ?", Time.now])
        def quoted_date(value)
          value.respond_to?(:utc) ? value.utc.to_s(:db) : value.to_s(:db)
        end
      end
    end
  end

When outputting timestamps to a UI — either inside HTML or in a JSON API — you’ll probably want to use Time#strftime. Beware: on Mac OS X under Ruby 1.8, the %z (lowercase Z) selector will emit the local time zone, not the zone of the Time object you’ve called strftime on. The solution is to either use %Z (capital Z) or just a plain Z which stands for Zulu Time. The latter is OK if you know you’re using UTC, which, if you’ve followed my advice, you probably do. This is a pretty annoying issue, since it’s much safer to use %z‘s hour offsets than %Z‘s three-letter codes, since the three-letter codes can be ambiguous, and in any case require an extra conversion to time offset, so you may as well just emit the offset.

Here are some methods on Time you may want to use that work around this %z issue:

  # Note: do NOT call this file 'time.rb' :-D

  require 'time'

  class Time
    def full_date_and_time
      strftime('%Y-%m-%d %H:%M:%S %Z')
    end

    def iso8601
      strftime('%Y-%m-%dT%H:%M:%SZ') # the final "Z" means "Zulu time" which is ok since we're now doing all times in UTC
    end
  end

That iso8601 method comes in really handy when you’re using the excellent timeago jQuery plugin by Ryan McGeary (@rmm5t).


By default PostgreSQL saves timestamps sans time zone, which means that ActiveRecord interprets them as being in the default_timezone. If you want to be extra clear and save them with time zone, you’ll have to change the Postgres adapter’s type mapping. ActiveRecord doesn’t let you configure this but here’s a monkey patch, courtesy of
Chirag Patel (with a couple of mods):

    require 'active_record/connection_adapters/postgresql_adapter'
    class ActiveRecord::ConnectionAdapters::PostgreSQLAdapter < ActiveRecord::ConnectionAdapters::AbstractAdapter
      def native_database_types
        {
          :primary_key => "serial primary key".freeze,
          :string      => { :name => "character varying", :limit => 255 },
          :text        => { :name => "text" },
          :integer     => { :name => "integer" },
          :float       => { :name => "float" },
          :decimal     => { :name => "decimal" },
          :datetime    => { :name => "timestamp with time zone" },
          :timestamp   => { :name => "timestamp with time zone" },
          :time        => { :name => "time" },
          :date        => { :name => "date" },
          :binary      => { :name => "bytea" },
          :boolean     => { :name => "boolean" }
        }
      end
    end

It turned out that I didn’t need this, so I ended up commenting it out. It may be that storing timestamps with time zones will cause a hiccup with some other random DB code, so watch out. If you do use it, and you’ve already got some data, make sure to write a migration that changes the types of all extant datetime and timestamp fields, and maybe a migration that shifts the times too.


That’s all I’ve got for right now. I’m sure some more problems will come up on March 14, 2010…

  • 0 Shares
  • Share on Facebook
  • Share on Twitter
Pivotal Labs

Converting Rails application data from MySQL to PostgreSQL

Pivotal Labs
Monday, November 23, 2009

One of our projects had a pending chore in Tracker to move its backend to PostgreSQL from MySQL. This project has about a quarter of a million rows of production data and around a hundred tables in its schema which needed to be exactly migrated into PostgreSQL.

Forklifting the data proved more complicated than expected due to incompatibilities in the two DBMS’ syntax such as in the way string escaping worked, how booleans were represented and a bunch of other small but painful differences. Despite MySQL’s mysqldump utility including a command-line option to write statements in PostgeSQL format, it became clear that it wasn’t going to be simple to create a repeatable procedure to do this work across our environments.

There’s a bunch of information out there about how to approach this problem but none felt right. Most are multi-step manual procedures that require altering a dump file using sed or perl and others require the data to be loaded into an intermediary database and massaged prior to import. After testing some of these approaches, Todd and I decided to timebox ourselves to an hour to test the viability of a Ruby script using the DBI gem to move the data. We came up with:

require 'dbi'
require 'dbd/mysql'
require 'dbd/pg'

begin
  mysql = DBI.connect("DBI:Mysql:source:localhost", "username", "password")
  postgres = DBI.connect("DBI:Pg:destination:localhost", "username", "password")

  mysql.select_all("SHOW TABLES") do |table|
    next if ['schema_migrations', 'sessions'].include?(table.to_s)
    select = mysql.execute("SELECT * FROM #{table}")
    columns = select.column_names.map { |key| ""#{key}"" }.join(', ')
    placeholders = (['?'] * select.column_names.size).join(', ')
    insert = postgres.prepare("INSERT INTO #{table} (#{columns}) VALUES(#{placeholders})")
    select.each { |row| insert.execute(*row) }
    insert.finish
  end
rescue DBI::DatabaseError => e
  puts "Error #{e.err}: #{e.errstr}"
ensure
  mysql.disconnect if mysql
  postgres.disconnect if postgres
end

Our antiquely Perl-like script worked better than we expected — our application started right up with all of its data intact.

Has anybody out there encountered this need before? What kinds of solutions did you come up with?

  • 0 Shares
  • Share on Facebook
  • Share on Twitter

Topics

  • agile (778)
  • rails (113)
  • testing (87)
  • ruby (83)
  • ruby on rails (70)
  • jobs (62)
  • javascript (54)
  • techtalk (44)
  • rspec (38)
  • activerecord (29)
  • productivity (29)
  • gogaruco (29)
  • ironblogger (29)
  • git (28)
  • nyc (27)
  • rubymine (25)
  • mobile (22)
  • bloggerdome (21)
  • cucumber (20)
  • process (19)
  • pivotal tracker (19)
  • jasmine (19)
  • design (18)
  • ios (18)
  • webos (17)
  • objective-c (17)
  • android (16)
  • palm (16)
  • "soft" ware (16)
  • fun (15)
  • tracker ecosystem (15)
  • ci (15)
  • cedar (15)
  • rails3 (14)
  • performance (14)
  • bdd (14)
  • gem (13)
  • tdd (13)
  • selenium (12)
  • css (12)
  • goruco (12)
  • bundler (12)
  • meetup (11)
  • railsconf (11)
  • nyc-standup (11)
  • capybara (10)
  • mac (10)
  • mojo (10)
  • chef (10)
  • api (10)
Subscribe to postgresql Feed
  • About
  • Case Studies
  • Team
  • Community
  • Careers
  • Contact
  • Labs
  • Events

Contact Us

contact@pivotallabs.com
+1 415-77-PIVOT
TwitterLinkedInFacebook

Pivotal Tracker

Tracker is the award-winning agile project management tool that enables real-time collaboration around a shared, prioritized backlog.
Visit pivotaltracker.com >