FasterCSV, Ruby 1.8, and Character Encodings
Posted by Evan Farrar on Tuesday December 07, 2010 at 03:52PM

We had a bit of a head scratcher this week at the New York City office while working on Red Rover, a social directory for engaging students with their colleges and employees with their employers. We were adding CSV uploads to the application when the parser mysteriously failed on one file. We narrowed the failure down to a particular row containing strangely encoded international characters (though not every row with such characters was a problem):

Fuentes,Jesús,"Cribbage, Chess, and Bridge Club",Treasurer

But another row with the same character in the same encoding would import fine:

Johnson,Lúisa,Dodgeball Club,President

It turned out that this was due to a problem with how Ruby 1.8 finds character boundaries. If a miscalculated character boundary happens to fall where a quote mark begins in your CSV file, FasterCSV will hurl:

1.8.7> 'Jesús,"'.split(//)
=> ["J","e","s","\349s,\""]
1.9   > 'Jesús,"'.split(//)
=> ["J","e","s","ú","s",",","\""]
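To make the idea of a character boundary concrete, here is a small sketch for Ruby 1.9 or later that splits the same fragment by character and by raw byte. It assumes the fragment is UTF-8 (the "ú" is written as an escape so the snippet does not depend on the source file's encoding); `SAMPLE`, `characters`, and `raw_bytes` are names made up for this illustration.

```ruby
# Assumes Ruby 1.9 or later, where every string carries an encoding.
# SAMPLE is the fragment from the transcript above; "\u00FA" is "ú" in UTF-8.
SAMPLE = "Jes\u00FAs,\""

# Split on character boundaries, as split(//) does on 1.9.
def characters(str)
  str.chars.to_a
end

# Split on raw bytes, ignoring character boundaries (each byte as an Integer).
def raw_bytes(str)
  str.bytes.to_a
end

puts characters(SAMPLE).length  # 7 characters
puts raw_bytes(SAMPLE).length   # 8 bytes, because "ú" occupies two bytes in UTF-8
puts raw_bytes(SAMPLE).map { |b| b.to_s(8) }.inspect  # byte values in octal
```

A parser that gets the boundary wrong can treat the byte after an accented character as part of that character, which is how a following comma or quote mark can be swallowed.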

This is not a problem with FasterCSV on Ruby 1.9, or with the old-fashioned CSV class included in Ruby's standard library in 1.8.6. Hopefully this helps others who have this error staring them in the face despite having a perfectly valid CSV in every regard:

FasterCSV::MalformedCSVError: FasterCSV::MalformedCSVError
    from /opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/fastercsv-1.5.3/lib/faster_csv.rb:1623:in `shift'
    from /opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/fastercsv-1.5.3/lib/faster_csv.rb:1614:in `each'
    from /opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/fastercsv-1.5.3/lib/faster_csv.rb:1614:in `shift'
    from /opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/fastercsv-1.5.3/lib/faster_csv.rb:1581:in `loop'
    from /opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/fastercsv-1.5.3/lib/faster_csv.rb:1581:in `shift'
    from /opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/fastercsv-1.5.3/lib/faster_csv.rb:1526:in `each'
    from /opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/fastercsv-1.5.3/lib/faster_csv.rb:1537:in `to_a'
    from /opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/fastercsv-1.5.3/lib/faster_csv.rb:1537:in `read'
    from /opt/ruby-enterprise-1.8.7-2010.01/lib/ruby/gems/1.8/gems/fastercsv-1.5.3/lib/faster_csv.rb:1229:in `parse'
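If you hit this error, one thing worth ruling out is that the file is not valid UTF-8 in the first place. The following is a minimal sketch for Ruby 1.9 or later, not what we did on Red Rover: `normalize_to_utf8` is a hypothetical helper, and the ISO-8859-1 fallback is an assumption about where stray bytes might come from, not something established for this file.

```ruby
# Hypothetical helper: return the uploaded bytes as a valid UTF-8 string.
# Assumes Ruby 1.9 or later (String#force_encoding, String#valid_encoding?).
def normalize_to_utf8(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # Assumption: input that is not valid UTF-8 is ISO-8859-1 (Latin-1).
  # Every byte sequence is valid Latin-1, so this conversion cannot fail,
  # but it will produce the wrong characters if the guess is wrong.
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end
```

The returned string could then be handed to the CSV parser. On Ruby 1.8 strings carry no encoding, so this check does not apply there as written.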

Comments

  1. Joseph Palermo on December 07, 2010 at 11:03PM

    Is your $KCODE set to "U" in 1.8.7?

    Here are my results from 1.8.7 REE

    > $KCODE = "NONE"
    > 'Jesús,"'.split(//)
    => ["J", "e", "s", "\303", "\272", "s", ",", "\""]
    
    > $KCODE = 'U'
    > 'Jesús,"'.split(//)
    => ["J", "e", "s", "\303\272", "s", ",", "\""]
    

    For me, though, no $KCODE value produces the results you were seeing.

    I wonder if your input actually contains an invalid character encoding, and 1.9 is able to correct it but 1.8.7 is not.