If you’re accepting user input for Solr (which I expect most projects using it are), you’ve probably noticed that you need to sanitize the queries you pass to it. After reading a bunch of conflicting documentation and blog posts, I put together a simple little module to handle it for you. It should strip out every character that would cause Solr to throw an error on a query string. Let me know if it works for you or if I missed any corner cases!
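module SolrStringSanitizer
  # Characters and operators with special meaning in the Solr/Lucene query syntax.
  # Braces are escaped too, since some Ruby regex engines reject a bare "{".
  ILLEGAL_SOLR_CHARACTERS_REGEXP = /\+|-|!|\(|\)|\{|\}|\[|\]|\^|\\|"|~|\*|\?|:|;|&&|\|\|/

  # Returns the string with those characters removed, or nil if given nil.
  def self.sanitize(string)
    if string
      string.gsub(ILLEGAL_SOLR_CHARACTERS_REGEXP, "")
    end
  end
end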
I was getting an error:
invalid regular expression; there's no previous pattern, to which '{' would define cardinality at 13: /\+|-|!|\(|\)|{|}|\[|\]|\^|\\|"|~|\*|\?|:|;|&&|\|\|/):
So I changed the regex to this and it seems to work:
ILLEGAL_SOLR_CHARACTERS_REGEXP = /[\+\-\!\(\)\{\}\[\]\^\|"~\*\?:;&]/
Basically, I escaped most of the characters and put them in a character class rather than having all of the ‘OR’ pipes.
July 18, 2009 at 11:03 pm
Wow, no markdown love. pastie to the rescue: http://pastie.org/550997
Feel free to delete the broken posts.
July 18, 2009 at 11:08 pm
The regular expression prevents wildcard searching…
July 20, 2009 at 3:08 am
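If wildcard searches need to keep working, one option is to leave `*` and `?` out of the pattern. The following is only a minimal sketch of that idea: the module name `WildcardFriendlySolrSanitizer` is hypothetical and not part of the original post, and it assumes the remaining characters should still be stripped rather than escaped.

```ruby
# Hypothetical variant of the sanitizer from the post: it strips the same
# special characters but leaves the wildcard characters * and ? in place.
module WildcardFriendlySolrSanitizer
  # Single special characters go in a character class; the two-character
  # operators && and || are matched as alternatives outside of it.
  SPECIAL_CHARACTERS_REGEXP = /[+\-!(){}\[\]^"~:;\\]|&&|\|\|/

  # Returns the string with special characters removed, or nil if given nil.
  def self.sanitize(string)
    return nil if string.nil?
    string.gsub(SPECIAL_CHARACTERS_REGEXP, "")
  end
end

puts WildcardFriendlySolrSanitizer.sanitize('foo* AND (bar?) "baz"')
# prints: foo* AND bar? baz
```

Whether user-supplied wildcards are acceptable is a separate decision, since they can make queries much more expensive.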
All of those characters are valid text too, so escaping them seems more appropriate than removing them.
July 20, 2009 at 9:45 am
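The escaping approach suggested above could look roughly like this. It is a sketch, not a tested drop-in replacement: the module name `SolrStringEscaper` is made up for illustration, and it assumes that prefixing each special character with a backslash is enough for the query parser in use.

```ruby
# Hypothetical escaper: instead of deleting special characters, it puts a
# backslash in front of each one so the query parser treats it as plain text.
module SolrStringEscaper
  # Each & and | is escaped individually, which also neutralizes the
  # two-character operators && and ||.
  SPECIAL_CHARACTERS_REGEXP = /[+\-!(){}\[\]^"~*?:;&|\\]/

  # Returns the escaped string, or nil if given nil.
  def self.escape(string)
    return nil if string.nil?
    # The block form sidesteps the backslash rules of gsub replacement strings.
    string.gsub(SPECIAL_CHARACTERS_REGEXP) { |char| "\\#{char}" }
  end
end

puts SolrStringEscaper.escape('title:"foo bar"')
# prints: title\:\"foo bar\"
```

Escaping keeps the user's text intact, at the cost of also disabling wildcards and field prefixes typed by the user.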
I’ve also written an alternative to accepting raw user input, in the form of a Lucene query generator. We mainly used the library for constructing specific searches for views, but it also makes building “advanced” search interfaces easier.
http://github.com/jvoorhis/lucene_query/tree/master
Thanks to Mike Mangino of Elevated Rails for allowing me to release the library with an MIT license.
August 3, 2009 at 10:15 am