Pivotal Labs


Marshal.dump vs YAML::dump

Colin Shield
Sunday, August 16, 2009

We are working on a project with a very large dataset: more than 2 million items. The dataset changes frequently, and each change has to be transported to the appropriate server, ready to be served to clients.
We decided to use a queuing architecture to distribute data. Objects are serialized and pushed to a queue. The large size of the dataset requires us to optimize as much as possible. There are only so many hours in a day and there is a lot of data to transport.
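A minimal sketch of that flow, assuming an in-process Queue as a stand-in for the real message broker (the post does not name one). The `change` hash and its fields are made up for illustration, and Marshal is only a placeholder serializer here:

```ruby
# Sketch only: Ruby's built-in Queue stands in for the real broker,
# and the record below is a hypothetical changed item.
queue = Queue.new

change = { :id => 42, :name => "widget", :attrs => { :price => 10 } }

# Producer side: serialize the object to a byte string and enqueue it.
queue << Marshal.dump(change)

# Consumer side: dequeue the payload and rebuild the object.
payload  = queue.pop
received = Marshal.load(payload)

puts received == change
```

Whatever serializer is chosen sits on both ends of the queue, so its dump and load cost is paid once per change on each side.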
At standup someone asked which serialization method is faster: YAML::dump or Marshal.dump. It seemed worth writing a quick script to find out which one suits our particular situation.
The objects we are serializing are simple hashes. I thought I’d write something that was representative of our situation in order to present a nice clear decision.
Here’s some code:

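require 'yaml'

# A small hash, representative of the objects we serialize.
obj = {:a => "hello", :b => "goodbye", :c => "new string", :d => {:da => 1, :db => 2}, :e => 1}

# Time YAML round trips. (0..10000) is an inclusive range, so this runs 10,001 times.
# Note: on Ruby 3.1+ YAML::load rejects symbols by default; use YAML.unsafe_load there.
start = Time.now
(0..10000).each do
  ser_obj = YAML::dump(obj)
  new_obj = YAML::load(ser_obj)
end
puts "YAML::dump time"
puts Time.now - start

# Time the same number of Marshal round trips.
start = Time.now
(0..10000).each do
  ser_obj = Marshal.dump(obj)
  new_obj = Marshal.load(ser_obj)
end
puts "Marshal.dump time"
puts Time.now - start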
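The same measurement can also be taken with Ruby's standard Benchmark module, which removes the manual Time.now bookkeeping. This is a sketch rather than the script that was actually run. It assumes Ruby 2.6 or newer, and it swaps in YAML.safe_load with Symbol permitted so that it still runs on Rubies where plain YAML.load refuses symbols:

```ruby
require 'benchmark'
require 'yaml'

obj = {:a => "hello", :b => "goodbye", :c => "new string", :d => {:da => 1, :db => 2}, :e => 1}
iterations = 10_000

# Assumption: safe_load with permitted_classes replaces the original YAML::load.
yaml_time = Benchmark.realtime do
  iterations.times { YAML.safe_load(YAML.dump(obj), permitted_classes: [Symbol]) }
end

marshal_time = Benchmark.realtime do
  iterations.times { Marshal.load(Marshal.dump(obj)) }
end

puts "YAML round trips:    #{yaml_time}"
puts "Marshal round trips: #{marshal_time}"
```

Absolute timings will differ by machine and Ruby version, so only the ratio between the two numbers is worth comparing.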

I think we all knew how the results would look. It was nice to see that for our particular case there was a clear winner.

YAML::dump time
5.397909
Marshal.dump time
0.280292
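To put those two numbers in proportion, here is the arithmetic as a short script. The timings are copied from the run above, and the iteration count assumes the inclusive range (0..10000), which is 10,001 round trips:

```ruby
# Timings copied from the benchmark output above.
yaml_time    = 5.397909
marshal_time = 0.280292
round_trips  = 10_001   # (0..10000) includes both endpoints

speedup = yaml_time / marshal_time
puts "Marshal is about #{speedup.round(1)}x faster"
puts "YAML:    #{(yaml_time / round_trips * 1000).round(3)} ms per round trip"
puts "Marshal: #{(marshal_time / round_trips * 1000).round(3)} ms per round trip"
```

That works out to Marshal being roughly 19 times faster for this hash.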

Seems fairly cut and dried to me.
I personally prefer YAML for test result comparison. Maybe we’ll put something in our spec_helper to use YAML for testing and Marshal for production.
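One way that switch might look, as a rough sketch. The Serializer module and the APP_ENV variable are hypothetical names invented for this example (a Rails app would more likely check Rails.env), and YAML.safe_load with Symbol permitted assumes Ruby 2.6 or newer:

```ruby
require 'yaml'

# Hypothetical helper, not from the project: picks a serializer by environment.
module Serializer
  def self.backend
    ENV["APP_ENV"] == "test" ? :yaml : :marshal
  end

  def self.dump(obj)
    backend == :yaml ? YAML.dump(obj) : Marshal.dump(obj)
  end

  def self.load(data)
    if backend == :yaml
      YAML.safe_load(data, permitted_classes: [Symbol])
    else
      Marshal.load(data)
    end
  end
end

obj = {:a => "hello", :d => {:da => 1}}

ENV["APP_ENV"] = "test"
puts Serializer.dump(obj)                          # readable YAML text for test diffs
puts Serializer.load(Serializer.dump(obj)) == obj

ENV["APP_ENV"] = "production"
puts Serializer.load(Serializer.dump(obj)) == obj  # same API, Marshal underneath
```

The trade-off is that tests would then exercise a different serializer from production, so at least one test should still cover the Marshal path.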


5 Comments

  1. Dan DeLeo says:

    I posed this question on the AMQP list, and ezmobius wrote that Marshal is the fastest, JSON is a close second, and YAML is way behind. I decided to just trust it ;) Anyway, is there some reason you can’t use JSON? If it’s fast enough to get the job done, seems like it would provide good readability and speed.

    Also, I personally haven’t seen a case where it didn’t work, but the Rdoc for Marshal or the Pickaxe (don’t remember which) warns that Marshal may change formats between VMs, i.e. Ruby 2.0 could potentially be unable to load Marshaled Ruby 1.9 or 1.8 objects. Seems to work fine with 1.8 and 1.9 though. Flag days are no fun, so it’s worth considering.

    August 16, 2009 at 7:45 pm

  2. Steve Conover says:

    For the record:

    http://www.pauldix.net/2008/08/serializing-dat.html

    August 16, 2009 at 8:27 pm

  3. Matthew O'Connor says:

    FWIW, if you do any weird eigenclass stuff to your hashes then Marshal won’t work:

    >> hash = {:foo => 1, :bar => 2}
    => {:bar=>2, :foo=>1}
    >> class << hash
    >>   def has_foo_key?
    >>     has_key?(:foo)
    >>   end
    >> end
    => nil
    >> hash.has_foo_key?
    => true
    >> Marshal.dump(hash)
    TypeError: singleton can't be dumped
    from (irb):8:in `dump'
    from (irb):8

    August 16, 2009 at 9:16 pm

  4. Steve Conover says:

    Stick that in a module and include it instead…

    August 17, 2009 at 8:22 pm

  5. Duncan Beevers says:

    For data portability, you might consider checking out the alternate YAML implementation ZAML. A naive benchmark pegs it at 1600% faster than YAML, but feel free to check it out for yourself.

    http://gnomecoder.wordpress.com/2008/09/27/yaml-dump-1600-percent-faster/

    August 20, 2009 at 7:30 am
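The singleton problem and the module workaround raised in the comments can be seen side by side in a short sketch. HasFooKey is a hypothetical module name, and extending the individual hash is one reading of "stick that in a module"; it is not necessarily what the commenter had in mind:

```ruby
# Hypothetical module name; any named module behaves the same way.
module HasFooKey
  def has_foo_key?
    has_key?(:foo)
  end
end

# A method defined directly on one object's singleton class blocks Marshal.
with_singleton = {:foo => 1, :bar => 2}
def with_singleton.has_foo_key?
  has_key?(:foo)
end

begin
  Marshal.dump(with_singleton)
rescue TypeError => e
  puts "singleton method: #{e.message}"
end

# Extending with a named module keeps the hash dumpable, and the
# loaded copy is extended with the same module again.
with_module = {:foo => 1, :bar => 2}.extend(HasFooKey)
copy = Marshal.load(Marshal.dump(with_module))
puts copy.has_foo_key?
```

The module must be defined under the same name on the loading side, since Marshal stores only that name with the data.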
