Jim Kingdon
New York Standup 9/30/2008
Posted by Jim Kingdon on Tuesday September 30, 2008 at 07:02PM
  • What's the best way to import a million records into a Postgres database via ActiveRecord (which is needed to implement some application-specific logic)? We anticipate waiting a second or so between inserts to avoid slowing down the production database (which is under load, almost entirely reads). If ActiveRecord has a feature that helps batch inserts together, no one knew about it. As for how long this will take (estimates range from 9 to 27 hours) and what the load on the production database will be, we planned to answer both with a trial run on a small number of these records.

  • We're thinking of having Capistrano deploy to two demo servers: one aimed at showing the application to prospective users, and the other mostly for story acceptance. The former would be hosted at a hosting company; the latter would be an internally run machine. Several people reported having done this on their projects with only minor problems, mostly to do with whether the deployed location (/u/apps/whatever or some such) differs between the two machines. The solution would be to use the Capistrano variables instead of hard-coded paths, but tracking down all the places that need the change could be an issue.

  • Erector tip of the day: in a Rails project, you can put a file (named edit.rb or edit.html.rb) in your view directory, and Rails/Erector will find the template implicitly (as it would for ERB, HAML, etc). It is not necessary to explicitly call render from your controller method.
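On the import question in the first item, a rough back-of-the-envelope calculation can help frame the trial run. The sketch below is plain Ruby; the function name `estimated_import_hours` and every number passed to it (batch size, per-batch insert time, pause length) are made-up assumptions, so its output should not be read as reproducing the 9 to 27 hour estimates. It only illustrates how the pause between inserts and the number of rows per insert interact.

```ruby
# Rough duration estimate for a throttled import.
# All figures used below are illustrative assumptions, not measurements.
def estimated_import_hours(total_records, batch_size, insert_seconds, pause_seconds)
  batches = (total_records.to_f / batch_size).ceil
  # Pause between batches only, so there is one fewer pause than batches.
  pauses = [batches - 1, 0].max
  total_seconds = batches * insert_seconds + pauses * pause_seconds
  total_seconds / 3600.0
end

# Hypothetical case 1: one row per insert, 0.05s per insert, 1s pause.
one_at_a_time = estimated_import_hours(1_000_000, 1, 0.05, 1.0)
# Hypothetical case 2: 100 rows per insert, 0.5s per insert, 1s pause.
in_hundreds = estimated_import_hours(1_000_000, 100, 0.5, 1.0)

puts format("one row per insert: %.1f hours", one_at_a_time)
puts format("100 rows per insert: %.1f hours", in_hundreds)
```

Under these assumed numbers the pause dominates the total, which is why the rows-per-insert figure measured in the trial run matters as much as the raw insert speed.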

Comments

  1. Strass on September 30, 2008 at 07:35PM

    Regarding your 2nd point, I usually see instances of this kind as new steps in the production chain. That's why I use the Capistrano multistage extension (gem install capistrano-ext) to define those new steps (possibly with their own environment files).

  2. Steve C on September 30, 2008 at 08:06PM

    Is there some non-AR way of loading records into Postgres that would meet your needs? I'm thinking of some equivalent of MySQL's "load data infile", which loads mass amounts of data 20x faster than any alternative.

  3. Steve C on September 30, 2008 at 08:08PM

    re: erector, I'd say "it's not necessary to use implicit templates, you can just call render directly". ha ha.

  4. Dan Kubb on September 30, 2008 at 09:02PM

    Have you thought about using DataMapper to handle the inserts instead of ActiveRecord? As of the most recent DataMapper benchmarks, DM is about 2x faster than AR when inserting records and performing most other operations.

    DataObjects would likely be faster still, since it is what DM uses under the hood to communicate with Postgres. It should be the fastest Ruby RDBMS driver available at the moment -- faster than what AR uses, including the recently released Neverblock drivers, and it works with Ruby 1.8.

  5. Karl on September 30, 2008 at 09:13PM

    Re: #1: Could you load the records into a copy of your production db on a local machine, and after all is done, do an export/import into the production machine? At least this way, if something goes wrong, there is much less of a chance of it munging up your production data. Not that that has ever happened to me.

  6. Chad Woolley on September 30, 2008 at 09:34PM

    re #2 - yep, Strass is right. That's what multistage was made for. Put all the differences in config/deploy/.rb

    Also, the story acceptance environment should be deployed after every CI build. On our projects, we already do that for a "local" localhost environment (check out Sandbox), so it should be straightforward to do the same thing for a "demo" (vs staging?) environment.

  7. Jonathan on September 30, 2008 at 11:37PM

    You guys seen http://www.jobwd.com/article/show/31 ? If possible, batch the inserts into groups of 10 or 100. In my simple tests, ar-extensions is at least 3 times faster.

  8. Chris Kilmer on October 01, 2008 at 12:47AM

    In response to your large dataset import question, we used the acts_as_importable plugin with great results. The plugin allows you to do pretty much everything as usual (validations, column discovery, SQL-escaping, etc.) except that instead of saving to the db, the plugin creates a SQL bulk import file which you can load into the db at your leisure.

    Sooooo much faster.

    Now, we were using MySQL. Not sure about the Postgres support, but it might be worth looking into.

  9. David Stevenson on October 01, 2008 at 12:50AM

    We use ar-extensions extensively to handle our large data imports. No problems to speak of. :validate => false is a useful option that speeds things up if you are okay skipping validations.

    ar-extensions also adds some really neat finder support.

  10. Toby Matejovsky on October 01, 2008 at 02:36AM

    http://rubypond.com/articles/2008/06/18/bulk-insertion-of-data-with-activerecord/ discusses ar-extensions for multiple inserts per query. (http://www.continuousthinking.com/tags/arext)
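Several comments above suggest grouping inserts so that each query carries multiple rows. As a minimal sketch of that general idea, the plain Ruby below builds one multi-row INSERT statement per batch. It is not the ar-extensions API: the helper names `quote` and `batched_inserts`, the `people` table, and its columns are all hypothetical, and the quoting is deliberately naive (real code should rely on the database adapter's own quoting).

```ruby
# Naive value quoting, for illustration only. Real code should use the
# database adapter's quoting rather than this hypothetical helper.
def quote(value)
  case value
  when nil then "NULL"
  when Numeric then value.to_s
  else "'" + value.to_s.gsub("'", "''") + "'"
  end
end

# Build one multi-row INSERT statement per batch of rows.
# table, columns and batch_size are supplied by the caller.
def batched_inserts(table, columns, rows, batch_size)
  rows.each_slice(batch_size).map do |batch|
    tuples = batch.map { |row| "(" + row.map { |v| quote(v) }.join(", ") + ")" }
    "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{tuples.join(', ')}"
  end
end

# Hypothetical sample data: three rows, grouped two per statement.
rows = [[1, "Ann"], [2, "O'Brien"], [3, nil]]
statements = batched_inserts("people", %w[id name], rows, 2)
statements.each { |sql| puts sql }
```

With batches of 10 or 100, as suggested above, a million rows would turn into 100,000 or 10,000 statements rather than a million, and a pause between statements could still be added to limit load on the production database.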