Pivotal Labs

Main menu

Skip to primary content
Skip to secondary content
  • About
  • Case Studies
  • Team
    • Executives
    • Locations
      • San Francisco (HQ)
      • Boston
      • Boulder
      • Denver
      • London
      • Los Angeles
      • New York
  • Community
    • Blogs
    • Tech Talks
    • Events
  • Careers
    • Lifestyle
    • Principles & Practices
    • Benefits
    • FAQ
    • Apply
  • Contact
    • Press Room
    • Press Releases
    • In The News
    • Press Kit
  • All
  • Labs
  • Standup
  • Tracker
Pivotal Labs

Detecting invalid encoding in CSV uploads

Pivotal Labs
Friday, January 16, 2009

We ran into an odd bug using FasterCSV to import some data. We were requiring the CSV files to be UTF-8 encoded, but some users tried to upload files in other encodings. FasterCSV ended up choking on characters that weren’t valid UTF-8 and truncating the data to the end of the line and leaving fields blank. We didn’t want to ask the user to select an encoding, because they’d probably get it wrong anyway, so we decided to reject any files with characters that would cause problems. The trick then, is how to detect that.

First, the tests. We want to detect if an input string contains valid characters in UTF-8 encoding. And we need to deal with both strings and IO (File or StringIO) objects (more on that in a bit).

describe "::encoding_is_utf8? checks strings and IOs" do
  before do
    @utf8 = "This is a test with รง international characters"
  end

  it "returns true when all characters are valid" do
    Importer.encoding_is_utf8?(@utf8).should be_true
    Importer.encoding_is_utf8?(StringIO.new(@utf8)).should be_true
  end

  it "returns false when any characters are invalid" do
    bogus = Iconv.conv('ISO-8859-1', 'UTF-8', @utf8)
    Importer.encoding_is_utf8?(bogus).should be_false
    Importer.encoding_is_utf8?(StringIO.new(bogus)).should be_false
  end
end

Here’s the implementation:

class Importer
  def self.encoding_is_utf8?(file_or_string)
    file_or_string = [file_or_string] if file_or_string.is_a?(String)
    is_utf8 = file_or_string.all? { |line| Iconv.conv('UTF-8//IGNORE', 'UTF-8', line) == line }
    file_or_string.rewind if file_or_string.respond_to?(:rewind)
    is_utf8
  end
#...

So the meat of the check is that we are using the Iconv library to detect bad characters. We convert from an assumed UTF-8 to UTF-8, ignoring any characters that can’t be represented in UTF-8. If the output and input aren’t identical, that means there were bogus characters and the uploaded file should be rejected.

The #rewind is needed to reset the read position in the file so FasterCSV can start over from the beginning. Specs for that aren’t included here.

Then in our controller, we ensure the CSV doesn’t have any bad characters before we give it to FasterCSV. We extracted that check into its own method, shown here:

def require_utf8!(csv_content)
  unless Importer.encoding_is_utf8?(csv_content)
    raise "Import file must be UTF-8 only. You can paste non-UTF-8 CSV directly into the CSV Text field for automatic conversion."
  end
end

As you can read in the exception message (which ends up in the flash), the user can work around the encoding issue by pasting the CSV into a textarea input in the browswer, which automatically transcodes the data into UTF-8. Aren’t browsers awesome? The other option would be to transcode the CSV file, but the textarea is easier if the files aren’t gigundous. Anyway, since we can input CSV as either a file or a textarea string, that’s why #encoding_is_utf8? needs to check both files and strings.

This approach and implementation seem fine to me. I get the feeling there might be a much simpler way, though. Anyone got a better idea?

  • 0 Shares
  • Share on Facebook
  • Share on Twitter
Abhijit Hiremagalur

Standup 01/13/2009: daemons, encoding

Abhijit Hiremagalur
Tuesday, January 13, 2009

Ask for Help

“Daemon best practices in Ruby?”

  • We haven’t tried DaemonKit
  • SimpleDaemon is what we currently use, which we suspect of interfering with monit (had problems with multiple instances starting, and process not starting upon reboot).
  • A couple of people suggested looking at Daemonize
  • Always monitor daemons with sanity checks (e.g. memory usage); use Monit or God
  • Roll your own?

“cut doesn’t handle strange characters in large (5GB) text file, are there other unix commands for text file manipulation that are utf-8 compliant?”

  • Try awk/sed maybe
  • Try using od/hexdump to figure out what the weird characters are

UPDATE 01/14/2009: Chad’s corrections

  • 0 Shares
  • Share on Facebook
  • Share on Twitter

Topics

  • agile (780)
  • rails (113)
  • testing (88)
  • ruby (83)
  • ruby on rails (70)
  • jobs (62)
  • javascript (55)
  • techtalk (44)
  • rspec (38)
  • ironblogger (32)
  • productivity (30)
  • activerecord (29)
  • gogaruco (29)
  • git (28)
  • nyc (27)
  • rubymine (26)
  • bloggerdome (23)
  • mobile (22)
  • process (21)
  • pivotal tracker (20)
  • cucumber (20)
  • jasmine (19)
  • design (18)
  • ios (18)
  • webos (17)
  • objective-c (17)
  • android (16)
  • palm (16)
  • "soft" ware (16)
  • fun (15)
  • tracker ecosystem (15)
  • ci (15)
  • cedar (15)
  • rails3 (14)
  • performance (14)
  • bdd (14)
  • gem (13)
  • css (13)
  • tdd (13)
  • selenium (12)
  • goruco (12)
  • bundler (12)
  • meetup (11)
  • railsconf (11)
  • nyc-standup (11)
  • capybara (10)
  • mac (10)
  • mojo (10)
  • chef (10)
  • api (10)
Subscribe to character encoding Feed
  • About
  • Case Studies
  • Team
  • Community
  • Careers
  • Contact
  • Labs
  • Events

Contact Us

contact@pivotallabs.com
+1 415-77-PIVOT
TwitterLinkedInFacebook

Pivotal Tracker

Tracker is the award-winning agile project management tool that enables real-time collaboration around a shared, prioritized backlog.
Visit pivotaltracker.com >