As I mentioned in this post, we’ve decided to set aside some of our weekly brown bags to spread around some knowledge on different technologies via a relatively informal presentation/discussion format. This past week we talked a bit about Solr.
This post covers much of what we discussed, ranging from the introductory to the somewhat arcane. If you’re a seasoned Solr user, this may not have much for you. But, you never know.
For people who have never used Solr (me, for instance), I’ll start with the obvious question: what is it? At its most basic, Solr simply provides a web interface to the Lucene search engine. It’s written in Java and runs as a servlet inside a servlet container such as Tomcat or Jetty. The example application included in the distribution package includes Jetty, so you can get up and running relatively easily. You use Solr by sending it requests over HTTP: searches are plain GET requests with the query in the URL, index updates are XML documents that you POST, and the responses come back as XML by default.
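To make the request/response cycle a little more concrete, here is a minimal Python sketch. It assumes the example Jetty setup listening on localhost:8983 with the search handler at /solr/select; the field names and the sample response body are made up for illustration, and nothing here actually contacts a server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed location of the example Jetty instance; adjust for your install.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_select_url(query, rows=10):
    """Return the GET URL for a search; the query rides in the URL."""
    return SOLR_SELECT + "?" + urlencode({"q": query, "rows": rows})


def parse_response(xml_text):
    """Pull the hit count and the matching documents out of an XML response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    # Each <doc> holds typed children such as <str name="id">42</str>.
    docs = [{field.get("name"): field.text for field in doc}
            for doc in result.findall("doc")]
    return int(result.get("numFound")), docs


# Hand-written sample in the shape of a Solr XML response (hypothetical data).
SAMPLE = """<response>
  <lst name="responseHeader"><int name="status">0</int></lst>
  <result name="response" numFound="1" start="0">
    <doc><str name="id">42</str><str name="name">Dr. Example</str></doc>
  </result>
</response>"""

if __name__ == "__main__":
    print(build_select_url("name:example"))
    print(parse_response(SAMPLE))
```

The parser only handles single-valued fields, which is enough to show the shape of the response; multi-valued fields come back wrapped in an extra element and would need a little more code.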
For those of you looking for sense in the world, I’m sorry: Solr isn’t an acronym, and to our knowledge doesn’t stand for anything in particular. It’s just a name with a vowel shortage.
You can find the home page for Solr here, a wiki for discussion of all things Solr here, and a tutorial to get you started here. Finally, you can download the distribution (the current release version is 1.3.0) here.
Let’s say you’re working on a site to help people find a physician. Users of this site might care about location, age, or gender of each physician. Your site might include how many pending malpractice suits each physician has, how patients have rated their bedside manner, or what magazines they stock in their waiting rooms. As a good citizen of the web community, you want to provide your users the ability to search for any combination of these criteria. You have all the information sitting in your database, so you should be able to search it, right?
Sure, no problem, but in order to ensure quick response times you’ll want to add indices on the columns in your physicians table. But, which indices to add? If your table has columns for age, gender, and rating, and you want to allow users to search on any combination of fields, then you need three indices to match all searches:
- age, gender, rating
- rating, age, gender
- gender, rating, age
Keep in mind that an index matches columns from left to right, so it can only serve a query whose columns form a leading prefix of the index. Thus, if you allow searching on one more column, you could end up with eight indices:
- age, gender, rating, mortality rate
- age, gender, mortality rate, rating
- age, rating, mortality rate, gender
- age, mortality rate, gender, rating
- gender, rating, mortality rate, age
- gender, mortality rate, age, rating
- rating, mortality rate, age, gender
- mortality rate, age, gender, rating
A careful choice of column orders gets by with fewer (six is enough for four columns), but the minimum for n columns is the binomial coefficient C(n, ⌊n/2⌋), which still grows exponentially, and none of this takes range queries into account. This could quickly get out of hand; Solr to the rescue.
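The left-to-right matching rule is easy to check mechanically. The sketch below is plain Python and not tied to any particular database; `covers` and `uncovered` are names invented here. It treats an index as an ordered tuple of columns and assumes a search can use an index only when its columns are exactly some leading prefix of that index, which is a simplification of what a real query planner does.

```python
from itertools import combinations


def covers(index, wanted):
    """True if the columns in `wanted` are exactly a leading prefix of `index`."""
    return set(index[:len(wanted)]) == set(wanted)


def uncovered(indices, columns):
    """Return every non-empty column combination that no index can serve."""
    missing = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            if not any(covers(index, combo) for index in indices):
                missing.append(combo)
    return missing


# The three-column index set from the list above.
THREE = [
    ("age", "gender", "rating"),
    ("rating", "age", "gender"),
    ("gender", "rating", "age"),
]

# The four-column index set from the list above.
FOUR = [
    ("age", "gender", "rating", "mortality"),
    ("age", "gender", "mortality", "rating"),
    ("age", "rating", "mortality", "gender"),
    ("age", "mortality", "gender", "rating"),
    ("gender", "rating", "mortality", "age"),
    ("gender", "mortality", "age", "rating"),
    ("rating", "mortality", "age", "gender"),
    ("mortality", "age", "gender", "rating"),
]

if __name__ == "__main__":
    print(uncovered(THREE, ("age", "gender", "rating")))
    print(uncovered(FOUR, ("age", "gender", "rating", "mortality")))
    # Drop one index and see which searches lose their index.
    print(uncovered(THREE[:2], ("age", "gender", "rating")))
```

Both index sets leave nothing uncovered, while dropping a single index from the three-column set immediately strands the searches that start with gender.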
Solr will build your indices for you based on the columns you tell it you want to search; it will keep these indices up to date as you add or change records; and it will do it fast.
More accurately, Lucene will do these things for you. However, Solr allows you to put Lucene on its own server that your application talks to via HTTP. This way all of your production servers can share the same Solr server, keeping searches consistent for all instances of your application.
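As a rough sketch of what "talking to Solr via HTTP" looks like when you add or change a record, the Python below builds the XML update message and the POST request that would carry it. The URL assumes the example Jetty setup on localhost:8983, the physician fields are hypothetical and would have to exist in your schema, and the request is only constructed, never sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed update handler of the example Jetty instance.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_add(record):
    """Serialize one record as an <add><doc>...</doc></add> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in record.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_update_request(xml_body):
    """Wrap an update message in a POST request (constructed, not sent)."""
    return urllib.request.Request(
        UPDATE_URL,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


if __name__ == "__main__":
    # Hypothetical physician record; field names are illustrative only.
    body = build_add({"id": 42, "gender": "F"})
    print(body)
    request = build_update_request(body)
    print(request.get_method(), request.full_url)
    # Sending it would be urllib.request.urlopen(request), followed by a
    # second POST whose body is <commit/> to make the change searchable.
```

Every application server would post to the same URL, which is what keeps search results consistent across instances.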
Next up, performance and such.