If you pass user input to Solr (which I expect most projects using it do), you've probably noticed that you need to sanitize the query strings first. After reading a bunch of conflicting documentation and blog posts, I put together a simple little module to handle it for you. It should strip out every character that would cause Solr to throw an error on a query string. Let me know if it works for you or if I missed any corner cases!
module SolrStringSanitizer
  ILLEGAL_SOLR_CHARACTERS_REGEXP = /\+|\-|!|(|)|{|}|[|]|\^|\|"|~|*|\?|:|;|&&|\|\|/

  def self.sanitize(string)
    if string
      string.gsub(ILLEGAL_SOLR_CHARACTERS_REGEXP, "")
    end
  end
end

I was getting an error:
invalid regular expression; there's no previous pattern, to which '{' would define cardinality at 13: /\+|\-|!|(|)|{|}|[|]|\^|\|"|~|*|\?|:|;|&&|\|\|/):
So I changed the regex to this and it seems to work:
ILLEGAL_SOLR_CHARACTERS_REGEXP = /[\+\-!(){}[]\^\|"~*\?:;&]/
Basically escaped most of the characters, and put them in a character class rather than having all of the 'OR' pipes.
Wow, no markdown love. pastie to the rescue: http://pastie.org/550997
Feel free to delete the broken posts.
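For anyone who can't reach the pastie, here is a minimal sketch of the character-class approach described in the comment above. It is not the pastie's contents, only a reconstruction under assumptions: the square brackets and the hyphen are escaped inside the class, and `&` and `|` are stripped one character at a time, which also covers `&&` and `||`. The character list is the one from the original post.

```ruby
module SolrStringSanitizer
  # Characters taken from the original post. Inside a character class only
  # "-", "[" and "]" need escaping here; "^" is literal because it is not
  # the first character. Single "&" and "|" also cover "&&" and "||".
  ILLEGAL_SOLR_CHARACTERS_REGEXP = /[+\-!(){}\[\]^|"~*?:;&]/

  # Returns the string with every listed character removed,
  # or nil when given nil (same behaviour as the original module).
  def self.sanitize(string)
    return nil if string.nil?
    string.gsub(ILLEGAL_SOLR_CHARACTERS_REGEXP, "")
  end
end

# Usage: operators and quotes are dropped, words and spaces are kept.
puts SolrStringSanitizer.sanitize('title:"foo" AND (bar || baz)')
```

Note that stripping leaves the surrounding whitespace alone, so the example above ends up with two spaces between `bar` and `baz`.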
The regular expression prevents wildcard searching...
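One possible way around this, sketched under assumptions: leave `*` and `?` out of the stripped set so wildcard queries pass through, and strip them only on request. The module name `WildcardFriendlySanitizer` and the `allow_wildcards` keyword are hypothetical, made up for this example.

```ruby
# Hypothetical variant of the sanitizer that keeps wildcard characters.
module WildcardFriendlySanitizer
  # Same character list as the original post, minus "*" and "?".
  STRIPPED_CHARACTERS = /[+\-!(){}\[\]^|"~:;&]/

  # allow_wildcards is an assumed option: when false, "*" and "?" are
  # removed as well, matching the behaviour of the original module.
  def self.sanitize(string, allow_wildcards: true)
    return nil if string.nil?
    cleaned = string.gsub(STRIPPED_CHARACTERS, "")
    allow_wildcards ? cleaned : cleaned.delete("*?")
  end
end

puts WildcardFriendlySanitizer.sanitize("rub* (on) rails?")
puts WildcardFriendlySanitizer.sanitize("rub* (on) rails?", allow_wildcards: false)
```

Whether letting wildcards through is safe depends on how your Solr instance is configured, so treat this as a starting point rather than a drop-in fix.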
All of those characters are valid text too, escaping them seems more appropriate than removing them.
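A rough sketch of what escaping instead of removing could look like, assuming a backslash in front of each special character is enough for your query parser. The module name `SolrStringEscaper` is hypothetical, and the character list is the one from the original post plus the backslash itself, which is an assumption on my part.

```ruby
# Hypothetical escaper: prefixes special characters with a backslash
# instead of deleting them, so they are searched as literal text.
module SolrStringEscaper
  # Characters from the original post, plus "\" (assumed) so that
  # backslashes already present in the input are escaped too.
  SPECIAL_CHARACTERS = /[+\-!(){}\[\]^|"~*?:;&\\]/

  # The block form of gsub avoids the special meaning that backslashes
  # have inside a replacement string.
  def self.escape(string)
    return nil if string.nil?
    string.gsub(SPECIAL_CHARACTERS) { |char| "\\#{char}" }
  end
end

puts SolrStringEscaper.escape("(1+1):2")
```

The trade-off is that escaped input can no longer use any query syntax at all, wildcards included, since every operator is turned into literal text.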
I've also written an alternative to accepting raw user input, in the form of a Lucene query generator. We mainly used the library for constructing specific searches for views, but it also makes building "advanced" search interfaces easier.
http://github.com/jvoorhis/lucene_query/tree/master
Thanks to Mike Mangino of Elevated Rails for allowing me to release the library with an MIT license.