Pivotal Labs

Main menu

Skip to primary content
Skip to secondary content
  • About
  • Case Studies
  • Team
    • Executives
    • Locations
      • San Francisco (HQ)
      • Boston
      • Boulder
      • Denver
      • London
      • Los Angeles
      • New York
  • Community
    • Blogs
    • Tech Talks
    • Events
  • Careers
    • Lifestyle
    • Principles & Practices
    • Benefits
    • FAQ
    • Apply
  • Tools
  • Contact
    • Press Room
    • Press Releases
    • In The News
    • Press Kit
  • All
  • Labs
  • Standup
  • Tracker
Grant Hutchins

pg_search: How I Learned to Stop Worrying and Love PostgreSQL full-text search

Grant Hutchins
Friday, February 11, 2011

I’m a Pivotal Labs developer at our NYC offices working on the Casebook development team. Casebook is a child-welfare-focused web application used by governments and non-profit organizations. Our users are social workers, caseworkers, and their leadership who work with children, families, and the broader community to provide services that ensure children are safe and healthy.

Search worries

Our users need to quickly find accurate information about the people on their workload to respond appropriately in crises and keep a high quality written record of their work with the children and families.

Solr powered Casebook’s initial search engine. Solr is built in Java, so we set up our application servers to run Java alongside our Ruby on Rails web application. We maintained a real-time copy of our important searchable data, such as people’s names, in our Solr index.

Our Solr-based approach ran into a few problems. Sometimes users would see outdated search results or, even worse, errors. This was annoying and also potentially damaging to our users’ ability to keep up with emergency situations.

Keeping our data synched in multiple locations caused most of our problems with Solr. Some of our more complex code paths would update the database but not propagate those changes to the search index. Users saw search-related error messages when there were communication problems with our Solr instances.

We had some fail-safes in place.

We wrote code that automatically restarted the Solr instances when they crashed. When we found the search data diverged from our application data, we manually rebuilt the search index to get the two data stores back in sync. These solutions just managed our problems rather than solving them.

These problems aren’t unique to Solr. Other tools like Lucene, Ferret, and Sphinx have the same shortcomings when combined with Ruby on Rails.

Using the database itself as the search index

So the thought occurred to our team that we ought to try to make the database itself be the search index. We use a PostgreSQL database, and PostgreSQL 8.3 and later have built-in support for full-text search. PostgreSQL is a popular, mature SQL database solution that works great with Active Record. If you use Heroku, then you are already using a PostgreSQL 8.3 database that supports full-text search.

Since full-text search in PostgreSQL uses fairly complex SQL queries, we decided that the best approach would be to take advantage of Active Record’s scopes. The idea is to make it easy to write code that looks like this:

Book.search_title("Ruby").include(:author).where("created_at > ?", 1.year.ago).limit(10)

So, I am proud to introduce pg_search, a Ruby gem that makes it easy to build search scopes that work just like this.

Installing pg_search is easy. If you’re using Bundler, just add

gem "pg_search"

to your Gemfile and you’re good to go.

To use pg_search to build the Book.search_title scope above, you would write:

class Book < ActiveRecord::Base
  include PgSearch
  pg_search_scope :search_title, :against => [:title]
end

It’s as simple as that!

Adding more features

We took cues from the texticle gem to figure out how to generate our SQL code. Thanks to Aaron Patterson for this wonderful gem! However, our Solr solution had several features that texticle and the basic PostgreSQL full-text search alone don’t currently provide, like ignoring diacritical marks (accents like ü), searching for soundalikes, and searching for words that are misspelled.

We spent a day or two trying to hack texticle into something we could use, but realized that if we started from scratch we could more easily build a gem that could combine more than one PostgreSQL feature into a single search scope. That way, we could improve our Book.search_title scope by using unaccent to ignore accent marks, Double Metaphone to match soundalikes, and trigrams to match misspellings.

So with all of these features turned on, we get the following code:

class Book < ActiveRecord::Base
  include PgSearch
  pg_search_scope :search_title,
    :against => [:title],
    :using => [:tsearch, :dmetaphone, :trigrams],
    :ignoring => :accents
end

Except for :tsearch, the default-in full-text search implementation, the other features require you to install certain contrib packages into your database. For now, this is an exercise for the reader, but we hope to help automate this process soon.

Our gem development approach

We started by taking our application with the existing Solr-based search intact and boosting our test coverage as we could to cover all of the different cases (misspellings, soundalikes, etc.) for some of our most complicated searchable models. Once we were satisfied with our test coverage, we completely removed the Solr search code and were left with dozens of failing tests.

We then created a blank gem and starting adding features to it one-by-one to get each of our application’s tests to pass. First we made sure that simple situations were solid, such as when the search query string exactly matches the searchable text.
Then we moved on to the complicated parts.

Our existing application uses Ruby 1.8 and Rails 2.3, and at the same time we have a new second project that uses Ruby 1.9 and Rails 3. So we made sure that all of our code worked in both environments. I will write another blog post soon about how we used two instances of autotest to make this easy to do.

The great thing about this approach is that we were able to start by defining a set of behaviors based on what our real-world application needed. This kept our code lean. Also, we were able to define our own syntax for the pg_search_scope method. By mimicking the Active Record scope syntax, hopefully we have created something that is easy to pick up. We would just add a new option to one of our calls to pg_search_scope and code until it worked as desired.

User impact

Our users have noticed the difference after we deployed our updated search implementation. We had been rebuilding the search index or troubleshooting a search related bug a few times a week. We haven’t seen a search related help request from our users since we made the changes. In addition, our developers are happier because code deployments are much more reliable and easy to understand.

Overall, the project has been a resounding success!

Getting involved

pg_search isn’t complete yet (will it ever be?). There are many more features we’d like to have to improve performance, search quality, and overall user experience.

For example, right now our developers have to hand-build SQL indexes to improve query speed. pg_search should automatically generate those indexes for us based on which PostgreSQL features are in use.

That’s just one example. We’d love to hear more ideas from you about how pg_search can improve to meet users needs.

To learn more, read our documentation. We also have a public Pivotal Tracker project for requesting features and bugfixes, and a Google Group for discussing pg_search and other Case Commons open source projects.

Also, the Casebook team is currently hiring for an Agile Developer.

  • 0 Shares
  • Share on Facebook
  • Share on Twitter
Mike Gehard

Cucumber and Sunspot…

Mike Gehard
Tuesday, August 24, 2010

In continuing with my Cucumber themed posts, here is a great post about using Cucumber and Sunspot together…

http://opensoul.org/2010/4/7/cucumber-and-sunspot

  • 0 Shares
  • Share on Facebook
  • Share on Twitter
Pivotal Labs

Standup 04/07/2010: Passenger, Solr, Git, and rSpec timeouts

Pivotal Labs
Friday, May 7, 2010

Ask for Help

Passenger Memory Bloat

“We found one of our passenger workers is using around 900MB of memory. Has anyone has problem with Passenger memory usage? We are using REE 1.8.7-2009.10.”

Solr Master-Slave Replication

“We are interested in adding automatic failover to our Solr slave when the master fails. What are some strategies for doing this?”

Interesting Things

Git Push –force Blocked
If you find your git push being rejected, even when you use git push -f, it’s probably because your git server is configured to not allow non fast-forward pushes. You’ll need to change the server configuration to allow them.

spec –timeout
Be careful when running rspec with the –timeout option. When the timeout occurs the test process will be interrupted and it will print out a stack trace for wherever it was executing when it was interrupted. This can lead to a lot of confusion if you do not immediately realize it was the result of timing out and instead think that an exception actually occurred at that point.

  • 0 Shares
  • Share on Facebook
  • Share on Twitter
Pivotal Labs

Standup 04/05/2010: Cinco de Mayo

Pivotal Labs
Wednesday, May 5, 2010

Ask for Help

Paperclip Slowness

“In one web request we are collecting the file paths of about 250 objects that have attachments via Paperclip. Unfortunately this is really slow and takes a couple seconds to finish. Does anyone have thoughts on how we could speed this up? Is de-normalizing the file path a reasonable solution?”

Moderation of Solr Search Results

“One of our projects uses Solr and acts_as_solr to provide search results to users. One particular result is showing up far higher than we want. What is the best way to use boosting to downgrade the score of an individual result in Solr?”

Interesting Things

Bike to Work Day
May 13th is Bike to Work Day in San Francisco. We are hoping more people take advantage of this to try biking to work for the first time. To mix things up for those that normally bike to work we are planning a Bike to Lunch.

  • 0 Shares
  • Share on Facebook
  • Share on Twitter
Pivotal Labs

NYC Standup Roundup – Week of 4/19

Pivotal Labs
Saturday, April 24, 2010

Interesting

  • A Pivot noted a facepalm + headdesk moment when debugging an issue whose cause turned out to be related to two adjacent string literals being auto-concatenated by Ruby’s parser.

    >> "foo" "bar"
    => "foobar"
    

In this case, a missing comma in a method call went undetected because of this language characteristic. Whether or not this follows the principle of least surprise is an exercise left up to the reader.

  • Another pair warned that while this is valid syntax in Ruby 1.8.7 and beyond:

    define_method(:burninate) { |&block| block.call("burninating") }
    

.. in 1.8.6 you can’t use a block as a parameter of a block.

  • Another pair noted that exceptions with Sunspot can cause wider failures on a site than just those that touch Solr. The symptom on this project was that if Solr was inaccessible for any reason every page on the site would throw an error. Their fix was to use Sunspot’s SessionProxy to wrap methods with some exception handling love.

  • Lastly, GoRuCo — the Gotham Ruby Conference — will be held on May 22nd at Pace University’s downtown campus. The roster of talks is up and registration is open for business.

  • 0 Shares
  • Share on Facebook
  • Share on Twitter
Pivotal Labs

Boosting with Acts As Solr

Pivotal Labs
Monday, August 10, 2009

Probably my favorite feature of the Solr full text search engine and the acts_as_solr plugin is a feature called boosting. Boosting is a great tool that gives you the ability to wield some influence over how the results that are returned are going to be ordered. When boosting is applied properly the quality of the search results appears improve dramatically even though the same results are being returned, just in a different order. There are two different kinds of boosting that you need to be aware of: column boosting and document boosting.

Field Boosting

Field or column boosting allows you to specify that if a query matches on a boosted field, give that more weight than usual. In the app I am working on, I added a field boost to the name attribute because I want results that have the query string in the name to appear before those results that have it somewhere in their description or as a tag. Here is an example of how to do a field boost when using acts_as_solr.

acts_as_solr :fields => [{:name => {:boost => 3.0}}, :description, :tags]

Document Boosting

A document boost should be utilized there is a way of quantifying one result as being better than another result, regardless of the query. For example, there are two entries in my database that both have a tag of “twitter client”: Twitterific and Twitterfon. In the iPhone App Store, Twitterfon has a higher popularity rating than Twitterific so I want Twitterfon to appear above Twitterific if someone searches for “twitter client” within the app. To specify document boosting based on the app store popularity field I can pass a Proc object to acts_as_solr (rdoc) and return the member field that holds the popularity rating. A great thing about the Proc object is that I can execute any ruby code inside of it that I want. This is useful if the popularity score is not directly stored in the database and must be calculated on the fly.

acts_as_solr :fields => [:name, :description, :tags],
           :boost => Proc.new { |item| item.popularity_score.to_f }

Closing

If you are using Solr at all it is important to be aware of what boosting can accomplish. When using multiple boosts, finding the right boost values to produce the best search results is a bit of black magic. I have found that after achieving “pretty good” results the law of diminishing returns comes into play and slows down progress. With a single boost it is much easier because there is only one variable in play.

  • 0 Shares
  • Share on Facebook
  • Share on Twitter
David Stevenson

Standup for 2/3/2009: SOLR & rails fails with IPv6

David Stevenson
Wednesday, February 4, 2009

Interesting things

  • When we used localhost in our solr.yml configuration, we couldn’t run tests on our OSX 10.5.6 machines. Commenting out the IPv6 localhost entries in /etc/hosts fixed the problem. The better solution would probably be to use 127.0.0.1 in SOLR configuration.
  • 0 Shares
  • Share on Facebook
  • Share on Twitter
Adam Milligan

Solr demystified

Adam Milligan
Sunday, September 28, 2008

As I mentioned in this post, we’ve decided to set aside some of our weekly brown bags to spread around some knowledge on different technologies via a relatively informal presentation/discussion format. This past week we talked a bit about Solr.

This post covers much of what we discussed, ranging from the introductory to the somewhat arcane. If you’re a seasoned Solr user, this may not have much for you. But, you never know.

Solr: wtf?

For people who have never used Solr (me, for instance), I’ll start with the obvious question: what is it? At its most basic, Solr simply provides a web interface to the Lucene search engine. It’s written in Java and runs as a servlet inside a servlet container such as Tomcat or Jetty. The example application included in the distribution package includes Jetty, so you can get up and running relatively easily. You use Solr by sending your requests in the form of XML over HTTP; the responses also contain XML.

For those of you looking for sense in the world, I’m sorry: Solr isn’t an acronym, and to our knowledge doesn’t stand for anything in particular. It’s just a name with a vowel shortage.

You can find the home page for Solr here, a wiki for discussion of all things Solr here, and tutorial to get you started here. Finally, you can download the distribution (the current release version is 1.3.0) here.

But, why?

Let’s say you’re working on a site to help people find a physician. Users of this site might care about location, age, or gender of each physician. Your site might include how many pending malpractice suits each physician has, how patients have rated their bedside manner, or what magazines they stock in their waiting rooms. As a good citizen of the web community, you want to provide your users the ability to search for any combination of these criteria. You have all the information sitting in your database, so you should be able to search it, right?

Sure, no problem, but in order to ensure quick response times you’ll want to add indices on the columns in your physicians table. But, which indices to add? If your table has columns for age, gender, and rating, and you want to allow users to search on any combination of fields, then you need three indices to match all searches:

  1. age, gender, rating
  2. rating, age, gender
  3. gender, rating, age

Keep in mind that indices match from left to right, and will only match on columns included in the query. Thus, if you allow searching on another column you’ll need to have eight indices:

  1. age, gender, rating, mortality rate
  2. age, gender, mortality rate, rating
  3. age, rating, mortality rate, gender
  4. age, mortality rate, gender, rating
  5. gender, rating, mortality rate, age
  6. gender, mortality rate, age, rating
  7. rating, mortality rate, age, gender
  8. mortality rate, age, gender rating

So, we quickly discover that we need n! / (n – 1) indices to search n columns, and this doesn’t take into account range queries. This could quickly get out of hand; Solr to the rescue.

Solr will build your indices for you based on the columns you tell it you want to search, it will keep these indices up to date as you add or change records, and it will do it fast.

More accurately, Lucene will do these things for you. However, Solr allows you to put Lucene on its own server that your application talks to via HTTP. This way all of your production servers can share the same Solr server, keeping searches consistent for all instances of your application.

Next up, performance and such.

  • 0 Shares
  • Share on Facebook
  • Share on Twitter

Topics

  • agile (783)
  • rails (117)
  • testing (90)
  • ruby (86)
  • ruby on rails (71)
  • jobs (62)
  • javascript (59)
  • techtalk (44)
  • ironblogger (42)
  • rspec (39)
  • bloggerdome (34)
  • productivity (34)
  • activerecord (30)
  • rubymine (30)
  • git (29)
  • gogaruco (29)
  • nyc (27)
  • design (24)
  • mobile (23)
  • pivotal tracker (22)
  • process (21)
  • cucumber (21)
  • jasmine (19)
  • ios (18)
  • tracker ecosystem (17)
  • webos (17)
  • objective-c (17)
  • fun (16)
  • android (16)
  • palm (16)
  • ci (16)
  • "soft" ware (16)
  • bdd (15)
  • tdd (15)
  • cedar (15)
  • rails3 (14)
  • performance (14)
  • css (14)
  • gem (13)
  • mouse-free development (12)
  • selenium (12)
  • goruco (12)
  • bundler (12)
  • api (12)
  • keyboard (11)
  • meetup (11)
  • railsconf (11)
  • nyc-standup (11)
  • capybara (10)
  • mac (10)
Subscribe to solr Feed
  • About
  • Case Studies
  • Team
  • Community
  • Careers
  • Tools
  • Contact
  • Labs
  • Events

Contact Us

contact@pivotallabs.com
+1 415-77-PIVOT
TwitterLinkedInFacebook

Pivotal Tracker

Tracker is the award-winning agile project management tool that enables real-time collaboration around a shared, prioritized backlog.
Visit pivotaltracker.com >