Pivotal Labs

Main menu

Skip to primary content
Skip to secondary content
  • About
  • Case Studies
  • Team
    • Executives
    • Locations
      • San Francisco (HQ)
      • Boston
      • Boulder
      • Denver
      • London
      • Los Angeles
      • New York
  • Community
    • Blogs
    • Tech Talks
    • Events
  • Careers
    • Lifestyle
    • Principles & Practices
    • Benefits
    • FAQ
    • Apply
  • Contact
    • Press Room
    • Press Releases
    • In The News
    • Press Kit
  • All
  • Labs
  • Standup
  • Tracker
Phil Goodwin

Standup 12/17/2010 Comparing indexed date columns, git subtrees, Headless Selenium for CI

Phil Goodwin
Friday, December 17, 2010

Helps

Is there a way to get MySql indexing to speed up queries involving greater and less than operators on date columns?

Postgres handles these operators a little bit better than MySql, but may not actually solve this problem.

Using millis instead of dates would give the DB the best chance of handling this scenario.

We are using Git’s subtree merge facility instead of submodules to stay synced to a different repository for part of our project. How do we push changes back to that repo?

See Tim Connor’s blog post “Git sub-tree merging back to the subtree for pushing to an upstream“. Early in that post is a pointer to an article describing the the subtree merge operation. Tim goes on to explain how to push your changes back through the chain.

Interesting

  • Some cloud environments leave the names of temp files visible even when their contents are not accessible. Be sure to use obfuscated names for your temporary files!

  • The “Headless” gem allows you to easily set up an alternate “display” that allows programs to execute in a headless environment. See this blog post about how to use Headless to run Selenium tests on a CI box: http://www.aentos.com/blog/easy-setup-your-cucumber-scenarios-using-headless-gem-run-selenium-your-ci-server

  • Ccrb will bog down to painfully slow levels if more than a couple of CC Tray clients are pinging it repeatedly during a build.

  • Cron will not honor your .rvmrc file unless you do some work to set up the environment. If you set up your cron job like this:

0 6 * * * /bin/bash -l -c 'echo /home/someuser/.rvm/bin/rvm rvmrc trust ... && cd ... 

the -l & -c parameters cause bash to load your environment as if your were logging in before running the specified commands. Someone also mentioned that rvm-shell can be used as a solution to this problem.

  • 0 Shares
  • Share on Facebook
  • Share on Twitter

Basic Ruby Webapp Performance Tuning (Rails or Sinatra)

Alex Chaffee
Wednesday, April 28, 2010

My company launched our app, Cohuman, a few weeks ago. The rush of finishing features, fixing bugs, and responding to user feedback has subsided a bit, and it’s time to go back and give the little baby a tune-up. I find that a good development process will ebb and flow, and as long as you don’t let something slide for too long, it’s perfectly acceptable to let bugs, or performance issues, or development chores pile up for a bit and then attack them concertedly for an entire day or two. A bug-fest or chore-fest or tuning-fest can actually increase efficiency as you get in a rhythm… and it feels really good at the end of the day when you see all the bugs you slayed or all the milliseconds you shaved.

In this article I’d like to describe some of my techniques. I make no claim of originality or great expertise; I just want to share what I know, and hear (in comments) what other people have learned. I’m using Sinatra and ActiveRecord, but not Rails; hopefully this discussion will help people no matter what framework they’re using.

Metrics and Logs

The first step, and often the most overlooked, is to gather metrics. Without knowing how it’s working now, how are you going to know what to improve? And how are you going to know whether you made things better or worse? Frequently I’ll make a change that I’m sure will improve performance, only to discover that it’s made no change, or helped in one place but hurt in another.

Where to begin? We’re using New Relic for live performance monitoring, so my decision of what to optimize was easy: I went to their Web Transactions panel and looked at the Most Time Consuming and Slowest Average Response Time reports. If you don’t have a flashing signpost like that, it’s easy enough to decide on a path to work on: either go with user reports, or click around your app and see what feels slow, or choose the most popular request (which is usually the home page).

I always pick a single path to work on, from request to controller to database to view, and work on the slowest parts. This demands more metrics! It’s a common mistake to jump in and start tuning the database when the view is actually taking twice as long. What’s the use of cutting the database access from 400 to 200 msec when the view is taking 1200 msec to render?

I also like to grab a copy of the production DB and bring it to my development machine so I can be sure I’m profiling real cases, and not being fooled by artifacts of generated data. We’re lucky that our app is currently small enough to do this; when the app gets bigger we’ll have to write a script that grabs only selected users’ data as a slice of the whole enchilada. (Note that there are some privacy concerns here: we are careful to only log in locally using our own accounts, and only to gather statistics in aggregate, not to look at details of user-entered data unless it’s to diagnose a specific user-reported issue or bug.)

Lots of in-app metrics tools exist (e.g. ruby-prof, benchmark), but I prefer the simple approach: I rolled my own Marker class that spits out basic msec timing information to the logs. In single-request performance tuning, what matters is relative timing between sections of code, so any objections to this technique on grounds of accuracy or detail are outweighed by its advantages: it’s simple, it shows where your bottlenecks are, and it divides the logs into sections so you can get a sense of who’s making what calls.

class Marker
  def self.mark(msg, logger = ActiveRecord::Base.logger)
    start = Time.now
    logger.info("#{start} --> starting #{msg} from #{caller[2]}:#{caller[1]}")
    result = yield
    finish = Time.now
    logger.info("#{finish} --< finished #{msg} --- #{"%2.3f sec" % (finish - start)}")
    result
  end
end

Usage is simple: pick a block you’re interested and wrap it in Marker.mark("foo") do...end. You can then scan the logs using “less” (or a text editor) and search for the name you gave the block. Marking your controller and your view is a natural place to start; later you can insert marks inside interesting blocks of domain code. In Sinatra, you can do something like this:

get '/foo/:id' do
  foo = Marker.mark("loading foo") do
    Foo.find(params[:id])
  end
  Marker.mark("rendering foo") do
    FooWidget.new(:foo => foo).to_s # Erector
  end
end

I’ve also got a nice little Rack middleware component that marks the time spent inside each request. Note here that you can put lots of fun information in the name that can be helpful for debugging.

class Marking
  def initialize(app)
    @app = app
  end

  def call(env)
    response = nil
    Marker.mark("#{env['REQUEST_METHOD']} #{env['SCRIPT_NAME']}#{env['PATH_INFO']}") do
      response = @app.call(env)
    end
    response
  end
end

Figuring out where a particular log message (especially a DB query) is coming from is essential. It’s important not to make assumptions. If you think you know where the call is coming from, put in a stack trace to make sure, and rerun the request to confirm. That’s why Marker is outputting caller — caller[0] is the code that names the mark, so you already know where that is; caller[1] is the line that called it, and caller[2] is the line that called caller[1]. If that’s not enough context, drop in a logger.info(caller.join(”nt”)) so you can scan the entire stack trace back up to the application code that you understand.

I’ve found that while ActiveRecord (2.3.5) tries to show where a result is coming from, it doesn’t always get it right, especially if you’re using plugins or gems that insert themselves into the call chain. So I monkey-patched AR to be a little smarter about its tracing:

module ActiveRecord
  module ConnectionAdapters
    class AbstractAdapter
      # strip library file pathnames from logged stack traces
      def log_info(sql, name, ms)
        if @logger && @logger.debug?
          c = caller.detect{|line| line !~ /(activerecord|active_support|__DELEGATION__|vendor|new_?relic)/i}
          c.gsub!("#{File.expand_path(File.dirname(RAILS_ROOT))}/", '') if defined?(RAILS_ROOT)
          name = '%s (%.1fms) %s' % [name || 'SQL', ms, c]
          @logger.debug(format_log_entry(name, sql.squeeze(' ')))
        end
      end
    end
  end
end

All this leads to log entries that look like this:

Thu Apr 15 11:09:17 -0700 2010 --> starting GET /app from /Users/cohumancomputer27inmac/dev/cohuman/lib/query_caching.rb:15:in `call':/Library/Ruby/Gems/1.8/gems/activerecord-2.3.5/lib/active_record/connection_adapters/abstract/query_cache.rb:34:in `cache'
  User Load (0.8ms) domain/user.rb:362:in `authenticate_from_login_token'   SELECT * FROM "users" WHERE ("users"."login_token" = E'abc123xyz') LIMIT 1
Thu Apr 15 11:09:17 -0700 2010 --> starting rendering ApplicationPage from /Users/cohumancomputer27inmac/dev/cohuman/controllers/app_controller.rb:4:in `GET /app':/Library/Ruby/Gems/1.8/gems/sinatra-0.9.4/lib/sinatra/base.rb:779:in `call'  Project Load (13.4ms) domain/user.rb:146:in `projects'   SELECT "projects".* FROM "projects" INNER JOIN "memberships" ON "projects".id = "memberships".project_id WHERE (("memberships".user_id = 2))
  User Load (3.2ms) domain/user.rb:135:in `coprojectmates'   SELECT "users".* FROM "users" INNER JOIN "memberships" ON memberships.user_id = users.id WHERE (memberships.project_id in (4,129,122,1,66,82,102,684,533,139,3,155,624,106,90,394,399,153) AND memberships.user_id != 2)   Email Load (2.1ms) domain/user.rb:135:in `coprojectmates'   SELECT "emails".* FROM "emails" WHERE ("emails".user_id IN (1,3,5,7,8,11,12,9,6,14,15,16,17,18,22,27,26,35,45,32,30,79,37,109,80,504,507,508,39,165,521,725,727,729,730,731,734,735,736,105,28,58,240,381,51,40,36,785,834,839,844,847,850,842,840,841,843,889))
  User Load (11.7ms) domain/user.rb:128:in `cohumans'   SELECT "users".* FROM "users" INNER JOIN "cohumanities" ON "users".id = "cohumanities".cohuman_id WHERE (("cohumanities".actor_id = 2))
  Email Load (19.4ms) domain/user.rb:128:in `cohumans'   SELECT "emails".* FROM "emails" WHERE ("emails".user_id IN (1,6,108,8,22,35,509,852,853,854,862,864,866,3,895,896,897,929,930,931,30,944,165,827,976,977,978,735,1024,2003,2004,59))
  SQL (0.5ms) domain/user.rb:177:in `temporary?'   SELECT count(*) AS count_all FROM "emails" WHERE ("emails".user_id = 2)
  Email Load (0.3ms) domain/user.rb:173:in `verified?'   SELECT * FROM "emails" WHERE ("emails".user_id = 2)
Thu Apr 15 11:09:17 -0700 2010 --< finished rendering ApplicationPage --- 0.307 sec
Thu Apr 15 11:09:17 -0700 2010 --< finished GET /app --- 0.311 sec

I know it can look daunting, but when scanning logs, it’s important to keep a clear head. Let’s examine this little burst of gibberish and try to make sense of it.

Line 1 says “–> starting GET /app” which means that the user has made a GET request for our main URL. We can skip ahead (search for “–< GET /app”) and see that the entire request took 0.311 seconds. This isn’t bad, but it could be better.

Line 3 says “–> starting rendering ApplicationPage” which means that all the other queries are happening from inside the rendering view code.

Note that the database queries are only taking 49.3 msec out of 311 msec, which means 84% of the time is spent either processing DB results or rendering them. This request is probably not a good candidate for DB-level tuning.

(How’d I add up all those scary milliseconds without an abacus? Piped my log text into this bad boy:

  ruby -e 'x = 0; STDIN.each do |line| if line =~ /(([0-9.]*)ms)/; then x += $1.to_f; end; end; puts x'

)

Indexes

Most (if not all) databases add an index for the primary key of a table. But a quick scan of the database logs will show many fields that are used in queries, and chances are you haven’t added indexes for them. (In fact, you probably shouldn’t add an index for a field until it shows up in the logs, since indexing slows down writes and takes up extra disk space. Not a lot, but it might add up.) In the above example, look at the User Load — every time a user hits the site we check to see if he’s logged in by querying the database for his login cookie. Adding an index for the “login_token” field in the users table sped up this query by a factor of 10. (Yes, that violates my “don’t fix what ain’t slow” dictum, since going from 10 ms to 1 ms isn’t really fixing much, but I figure it adds up over time since it happens on every single app request.)

Avoidance

The only perfect program is the one with zero lines of code. And the fastest code is that which is not run.

Sometimes you can optimize a section of code by removing unnecessary calls from your app layer. One nice trick these days is to move stuff behind an Ajax call. In Cohuman, we do this with some of our tabs: if you switch to a tab, and it hasn’t been loaded yet, it shows a spinny and starts an Ajax call to load it in. As long as we can keep each Ajax call under a second in length, the user-perceived delay is negligible.

Query Caching

ActiveRecord maintains a query cache, so if you run the same query (and I mean the same SQL), it won’t hit the database again. But if you’re not using Rails, query caching is disabled by default. So I wrote yet another Rack middleware so I don’t have to remember to wrap all my controllers in a ActiveRecord::Base.cache do block:

# a Rack middleware component that enables ActiveRecord query caching
# To use, put "use QueryCaching" in your Sinatra app.

class QueryCaching
  def initialize(app)
    @app = app
  end

  def call(env)
    if is_static_file?(env)
      @app.call(env)
    else
      response = nil
      ActiveRecord::Base.cache do
        response = @app.call(env)
      end
      response
    end
  end

  def is_static_file?(env)
     # if the path end with a dot-extension (e.g. 'foo.jpg') then we assume
     # it's a static file and don't enable the query cache. (This will only
     # work for some application URL schemes, naturally.)
    env['PATH_INFO'] =~ //[^/]*.[^/.]+$/
  end

end

Note that this is a query cache, not an object cache (see below).

Query Tuning

Once you’ve identified some troublesome queries, you need to decide how to optimize them. You’ve basically got two choices here; which to choose should be obvious from the logs. Are there many low-latency queries, or a few high-latency queries? High-latency queries are an obvious target, and you should do your best (with indexes and SQL) to cut them down to size, but don’t let them distract you. There are two hidden costs to low-latency queries:

They actually take longer than they say they do – the AR log line only displays the time for the database connector to return the raw data. It doesn’t show the time to create AR instances, build association “classes” (which takes an annoyingly long time, since all their methods are built on the fly for each instance), and run post-load initialization code. (I just did a little experiment loading ~3000 of our User objects, which have a fair number of associations; SELECT * FROM users took 21 msec but User.all took 547 msec. That’s about 25x as long!)

They stack up, and I’m not talking pancakes – chances are you’ve got a lot of webapp processes hitting a single database (or a small number of slaves). As traffic increases, the queries will stack up like airplanes requesting permission to land. At a certain point you’ll hit a cliff (sorry for the mixed metaphor — it’s not fun to imagine a plane hitting a cliff) and per-request latency will rise dramatically. Lowering the number of queries per web request will, um, raise the ceiling? Lengthen the runway? Lower the cliff? Anyway, it’ll make this problem, uh, less worse. It’s kind of counterintuitive, but the limiting factor for modern webapps is really the number of queries, not the amount of data returned by each query.

ActiveRecord associations (like has_many and belongs_to) are great for getting an app up and running, but as you peruse your logs you’ll notice some things they’re doing that aren’t very efficient. Our app loads a lot of objects, each of which has lots of associated objects, some of which associate to other objects. If we’re displaying a list of Users, and each user has associated Emails (via has_many :emails), and we want to render a list of users and their email addresses, we’ll probably see one query that loads all users, and then one query for each user loading his or her emails.

Adding an :include to the declaration is a good way to reduce these from N+1 to 2, but it doesn’t always work. I have never been able to comprehend AR’s alien logic, so my logs are often littered with queries despite my best efforts fiddling with the association declaration. Furthermore, AR is quite naive about object graphs: for example, user.emails.first.user will make an extra query and return a different user than the one you started with, even though they have the same id and you loaded the emails via :include.

So I’ve gotten good performance boosts by moving away from ActiveRecord and doing some nested queries by hand. Not by writing literal SQL, but by doing one query, extracting the necessary ids, and then doing the next query, and saving or plugging in values directly. This led naturally to Treasury (see below).

Sometimes, of course, writing SQL is unavoidable; fortunately, AR allows it, and there are many people who are much better at that than I, so I won’t embarrass myself by discussing it further here.

Object Caching and the Repository Pattern

During my first pass at tuning Cohuman several months ago, I took a little time to write a library that implements a Repository Pattern. The Treasury is a work in progress that sits in front of ActiveRecord (and eventually, other ORMs) and caches object instances as they pass through. If you then request an object via the Treasury, it will check in its cache and return a pointer to the existing object instead of making a query; if you specify a list of ids, then it will only query for the ones it doesn’t yet have. (There are other features I won’t go into here, including a DSL for building queries… expect an upcoming article to officially introduce Treasury to the world.)

I’ve heard that DataMapper has an object cache, but I haven’t yet dug into the details of how it works, so I don’t know if Treasury is redundant with it, or if it would make sense to plug in DM behind it. (I’ve also heard it solves the N+1 query problem gracefully. Anyone want to proselytize DM in the comments?)

All the caches I’ve mentioned only persist within a single request. This is probably a good thing, since allowing instances to persist between requests would open a can of data integrity, thread safety, and multi-host worms. But I can’t shake this vision I have of a sort of in-process memcache for Ruby objects, where multiple processes communicate changes to each other via TCP wormholes… Does anyone else share this vision, or am I doomed to wander the Ruby blog desert, mumbling incoherently at strangers?

  • 0 Shares
  • Share on Facebook
  • Share on Twitter

UTC vs Ruby, ActiveRecord, Sinatra, Heroku and Postgres

Alex Chaffee
Friday, January 22, 2010

Now that I’m starting to use DelayedJob to perform jobs in the future in my Heroku Sinatra app, its important that they happen at the scheduled time. But unless you pay attention, you’ll find that times get mysteriously changed — in my case, since I’m in San Francisco in the wintertime, by +/-8 hours — which means that some conversion to or from UTC is being attempted, but it’s only working halfway.

Trying to keep a handle on which libraries are attempting, and which are failing, to convert times is a losing battle, so I’m trying to do the right thing and save all my times in the database in UTC, and convert them to and from the user’s local time as close to the UI as possible. Unfortunately, a variety of gotchas in Ruby and ActiveRecord and PostgreSQL makes this trickier than it should be. Here’s a little catalog of my workarounds.


You must set both Time.zone = "UTC" and ActiveRecord::Base.default_timezone = :utc. Since I’m using Sinatra, not Rails, this stuff goes either in main (i.e. not inside any class) right after require 'active_record', or in a configure block in your app, depending on your preference.


When ActiveRecord creates queries — which are used for both reading and writing, mind you — it will only convert to UTC times that are instances of ActiveSupport’s proprietary TimeWithZone class. It will not convert regular Ruby Time objects, even though Time objects are perfectly aware of their time zones, and AR is perfectly aware that you’d prefer they be written as UTC (due to the default_timezone setting). This is clearly a bug IMHO, but the Rails core marked the bug as “will not fix”, so w/e. Here’s a monkey patch, courtesy of Peter Marklund:

  module ActiveRecord
    module ConnectionAdapters # :nodoc:
      module Quoting
        # Convert dates and times to UTC so that the following two will be equivalent:
        # Event.all(:conditions => ["start_time > ?", Time.zone.now])
        # Event.all(:conditions => ["start_time > ?", Time.now])
        def quoted_date(value)
          value.respond_to?(:utc) ? value.utc.to_s(:db) : value.to_s(:db)
        end
      end
    end
  end

When outputting timestamps to a UI — either inside HTML or in a JSON API — you’ll probably want to use Time#strftime. Beware: on Mac OS X under Ruby 1.8, the %z (lowercase Z) selector will emit the local time zone, not the zone of the Time object you’ve called strftime on. The solution is to either use %Z (capital Z) or just a plain Z which stands for Zulu Time. The latter is OK if you know you’re using UTC, which, if you’ve followed my advice, you probably do. This is a pretty annoying issue, since it’s much safer to use %z‘s hour offsets than %Z‘s three-letter codes, since the three-letter codes can be ambiguous, and in any case require an extra conversion to time offset, so you may as well just emit the offset.

Here are some methods on Time you may want to use that work around this %z issue:

  # Note: do NOT call this file 'time.rb' :-D

  require 'time'

  class Time
    def full_date_and_time
      strftime('%Y-%m-%d %H:%M:%S %Z')
    end

    def iso8601
      strftime('%Y-%m-%dT%H:%M:%SZ') # the final "Z" means "Zulu time" which is ok since we're now doing all times in UTC
    end
  end

That iso8601 method comes in really handy when you’re using the excellent timeago jQuery plugin by Ryan McGeary (@rmm5t).


By default PostgreSQL saves timestamps sans time zone, which means that ActiveRecord interprets them as being in the default_timezone. If you want to be extra clear and save them with time zone, you’ll have to change the Postgres adapter’s type mapping. ActiveRecord doesn’t let you configure this but here’s a monkey patch, courtesy of
Chirag Patel (with a couple of mods):

    require 'active_record/connection_adapters/postgresql_adapter'
    class ActiveRecord::ConnectionAdapters::PostgreSQLAdapter < ActiveRecord::ConnectionAdapters::AbstractAdapter
      def native_database_types
        {
          :primary_key => "serial primary key".freeze,
          :string      => { :name => "character varying", :limit => 255 },
          :text        => { :name => "text" },
          :integer     => { :name => "integer" },
          :float       => { :name => "float" },
          :decimal     => { :name => "decimal" },
          :datetime    => { :name => "timestamp with time zone" },
          :timestamp   => { :name => "timestamp with time zone" },
          :time        => { :name => "time" },
          :date        => { :name => "date" },
          :binary      => { :name => "bytea" },
          :boolean     => { :name => "boolean" }
        }
      end
    end

It turned out that I didn’t need this, so I ended up commenting it out. It may be that storing timestamps with time zones will cause a hiccup with some other random DB code, so watch out. If you do use it, and you’ve already got some data, make sure to write a migration that changes the types of all extant datetime and timestamp fields, and maybe a migration that shifts the times too.


That’s all I’ve got for right now. I’m sure some more problems will come up on March 14, 2010…

  • 0 Shares
  • Share on Facebook
  • Share on Twitter
Pivotal Labs

Parallelize Your RSpec Suite

Pivotal Labs
Friday, May 8, 2009

We all have multi-core machine these days, but most rspec suites still run in one sequential stream. Let’s parallelize it!

The big hurdle here is managing multiple test databases. When multiple specs are running simultaneously, they each need to have exclusive access to the database, so that one spec’s setup doesn’t clobber the records of another spec’s setup. We could create and manage multiple test database within our RDBMS. But I’d prefer something a little more … ephemeral, that won’t hang around after we’re done, or require any manual management.

Enter SQLite’s in-memory database, which is a full SQLite instance, created entirely within the invoking process’s own memory footprint.

(Note #1: the gist for this blog is at http://gist.github.com/108780)

(Note #2: The following strategy is relatively well-known, but I thought it might be useful for Pivots-and-friends to see exactly how one Pivotal project has used this tactic for a big speed win.)

Here’s the relevant section of our config/database.yml:

test-in-memory:
  adapter: sqlite3
  database: ':memory:'

Next, we need a way to indicate to the running rails process that it should use the in-memory database. We created an initializer file, config/intializers/in-memory-test.db:

def in_memory_database?
  ENV["RAILS_ENV"] == "test" and
    ENV["IN_MEMORY_DB"] and
    Rails::Configuration.new.database_configuration['test-in-memory']['database'] == ':memory:'
end

if in_memory_database?
  puts "connecting to in-memory database ..."
  ActiveRecord::Base.establish_connection(Rails::Configuration.new.database_configuration['test-in-memory'])
  puts "building in-memory database from db/schema.rb ..."
  load "#{Rails.root}/db/schema.rb" # use db agnostic schema by default
  #  ActiveRecord::Migrator.up('db/migrate') # use migrations
end

Note that in the above, we’re initializing the in-memory database with db/schema.rb, so make sure that file is up-to-date. (Or, you could uncomment the line that runs your migrations.)

Let’s give that a whirl:

$ IN_MEMORY_DB=1 RAILS_ENV=test ./script/console
Loading test environment (Rails 2.3.2)
connecting to in-memory database ...
building in-memory database from db/schema.rb ...
-- create_table("users", {:force=>true})
   -> 0.0065s
-- add_index("users", ["deleted_at"], {:name=>"index_users_on_deleted_at"})
   -> 0.0004s
-- add_index("users", ["id", "deleted_at"], {:name=>"index_users_on_id_and_deleted_at"})
   -> 0.0003s

...

>>

Super, we can see that the database is being initialized our of our schema.rb, and we get our console prompt. We’re ready to roll!

But, running this:

IN_MEMORY_DB=yes spec spec

will still only result in a single process, albeit one running off a database that’s entirely in-memory. We want parallelization!

The final step is a script that will run your spec suite for you. You may need to edit this for your particular situation, but then again, maybe not.

#  spec/suite.rb

require "spec/spec_helper"

if ENV['IN_MEMORY_DB']
  N_PROCESSES = [ENV['IN_MEMORY_DB'].to_i, 1].max
  specs = (Dir["spec/**/*_spec.rb"]).sort.in_groups_of(N_PROCESSES)
  processes = []

  interrupt_handler = lambda do
    STDERR.puts "caught keyboard interrupt, exiting gracefully ..."
    processes.each { |process| Process.kill "KILL", process }
    exit 1
  end

  Signal.trap 'SIGINT', interrupt_handler
  1.upto(N_PROCESSES) do |j|
    processes << Process.fork {
      specs.each do |array|
        if array[j-1]
          require array[j-1]
        end
      end
    }
  end
  1.upto(N_PROCESSES) { Process.wait }

else
  (Dir["spec/**/*_spec.rb"]).each do |file|
    require file
  end
end

Then, you simply run IN_MEMORY_DB=2 spec spec/suite.rb to run two parallel processes. Increase the number on larger machines for better results!

There’s room for improvement here, notably in the naive method used to allocate the spec files to processes, but even as simple as this method is, our spec suite runs in about half the time it used to, on a dual-core machine.

  • 0 Shares
  • Share on Facebook
  • Share on Twitter
Pivotal Labs

New York Standup 9/30/2008

Pivotal Labs
Tuesday, September 30, 2008
  • What’s the best way to import a million records into a postgres database via ActiveRecord (which is needed to implement some application-specific logic)? We anticipate waiting a second (or so) between inserts to avoid slowing down the production database (which is under load, almost entirely reads). If there is any ActiveRecord feature which helps batch together inserts, noone knew about it. As for generally how long this will take (estimates range from 9 to 27 hours), and what the load on the production database will be, we planned on answering that with a trial run of a small number of these records.

  • We’re thinking of having capistrano deploy to two demo servers, one particularly aimed at showing to prospective users of our application, and the other mostly for story acceptance. The former would be hosted at a hosting company; the latter an internally run machine. Several people reported they have done this on their projects, and the problems were minor, mostly having to do with whether the deployed location (/u/apps/whatever or some such) is different on the two machines (the solution would be to use the capistrano variables, but tracking down all the places that need to do that could be an issue).

  • Erector tip of the day: in a Rails project, you can put a file (named edit.rb or edit.html.rb) in your view directory, and Rails/Erector will find the template implicitly (as it would for ERB, HAML, etc). It is not necessary to explicitly call render from your controller method.

  • 0 Shares
  • Share on Facebook
  • Share on Twitter
Pivotal Labs

Standup 08/29/2008

Pivotal Labs
Friday, August 29, 2008
  • Using multiple buckets for Amazon S3. One of our sites has a lot of images (perhaps 30+ photos per page, different for each page and user) and got significant benefits from using four buckets instead of one. Multiple buckets allows browsers to fetch several images in parallel. Increasing it beyond four probably wouldn’t help, as browsers have a limit on how many parallel requests they will send.

  • Amazon S3 now has a copy command. This could be useful, for example, if you have a lot of data in a single bucket and want to move it to multiple buckets. Copy is faster than downloading and re-uploading all that data. The ruby S3 gem, however, only lets you copy in one bucket, so you’ll need to bypass the S3 gem.

  • We wrote a script to dump a local SQL database and copy it up to a remote server (for example, a demo or production server). This is in contrast with a script we wrote some months ago which copies from demo to a local workstation (for test data, reproducing data-driven bugs, etc). The push to remote feature was for a situation in which there was a bunch of data to be generated (based on some XML input files) and we could afford to bog down a workstation for half an hour, but not an overloaded (and perhaps underpowered) server.

  • Deprec is a set of capistrano recipes for setting up a remote server (in conjunction with deploying an application), for example creating accounts, ssh keys, init scripts, logrotate, etc.

  • Capistrano 2.3 has weird sudo issues (deleting old releases or something). Recommend Capistrano 2.5.

  • 0 Shares
  • Share on Facebook
  • Share on Twitter

Collapsing Migrations

Alex Chaffee
Wednesday, December 12, 2007

(6:30 pm: updated to use mysqldump)
(12/14/07: updated to remove db:reset since the Rails 2.0 version now does something different.)
(12/15/07: updated to not set ENV['RAILS_ENV'] since that gets passed down to child processes)

There was an old hacker who lived in a shoe; she had so many migrations she didn’t know what to do. Every time her build ran clean, she spent a whole minute staring at the screen.

Fortunately, she read this blog post and now her db:setup task is so fast she’s started building multiple test environments so she can run tests in parallel!

  • Figure out what migration to collapse to. This number should be less than or equal to the oldest deployed version of your app. E.g. if most of your deployments are on version 348 but there’s one client running a branch that’s only up to version 298, then pick 298 (or 297 if you’re afraid of off-by-one errors). For this example we will use 100.

  • Install lib/tasks/db.rake and lib/db_tasks.rb (source below)

  • Clear the development database by running

    rake db:clear

  • Dump the development structure by running

    rake db:dump

  • Delete all the migrations up to and including your target version. Here’s a sneaky awk script that deletes everything up to and including 100. (Go ahead and run it, it won’t bite, and you can always revert.)

    ls db/migrate/ | awk ‘{split($0, a, “_”); if(a[1]<=100) print $0}’ | xargs svn rm

  • Create a new migration called “100_collapsed_migrations.rb” using the following template.

100_collapsed_migrations.rb:

class CollapsedMigrations < ActiveRecord::Migration
  def self.up
    sql = <<-SQL
  # development_structure.sql goes here
    SQL

    execute("SET FOREIGN_KEY_CHECKS=0")
    sql.split(";").each do |statement|
      execute(statement)
    end
  ensure
    execute("SET FOREIGN_KEY_CHECKS=1")
  end

  def self.down
    raise IrreversibleMigration
  end
end
  • Open up db/development_dump.sql and copy its entire contents into your clipboard, then paste it above the “SQL” line in your new migration 100.

  • Search for the statement that creates the schema_info table and remove it.

Mine looks like this:

CREATE TABLE `schema_info` (
  `version` int(11) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
  • Set up your databases and run your tests.

    rake db:setup test

  • Congratulations! Your migrations are now blazingly fast, just like back in the (scaff)old days. You can run “rake db:setup” any time you get a svn update that looks like it may have done something funky to your schema, rather than shying away from that minute-long migration and just hoping your tests still pass.

Why do we need to use db:dump rather than db:schema:dump? Well, unfortunately, db:schema:dump doesn’t dump everything. It misses CONSTRAINT statements and also seems to get the charset wrong (although that may have been a function of how I constructed the db in my test). And db:structure:dump misses any data that may have been added by your migrations.

Here’s my current db.rake. Unfortunately, it only works with MySQL, but if you want to make it support your favorite DB (or even your least favorite) then please go right ahead.

Oh, and that part about multiple test environments and parallellized tests? Stay tuned… :-)

db.rake:

require "db_tasks"

namespace :db do
  def tasks
    (@db_tasks ||= DbTasks.new(self))
  end

  desc "Drop and recreate database"
  task :clear => :environment do
    tasks.clear
  end

  desc "Clear and migrate dev and test databases, and load fixtures into development db"
  task :setup => :environment do
    tasks.setup
  end

  desc "Dump the current environment's database schema and data to, e.g., db/development_dump.sql (optional param: FILE=foo.sql)"
  task :dump => :environment do
    if ENV['FILE']
      tasks.dump ENV['FILE']
    else
      tasks.dump
    end
  end

  desc "Load an sql file (by default db/development_dump.sql). (Optional param: FILE=foo.sql)"
  task :load => :environment do
    if ENV['FILE']
      tasks.load ENV['FILE']
    else
      tasks.load
    end
  end
end

db_tasks.rb:

# This creates a duplicate of the database config for a db config as defined in database.yml.
# For example, if the "test" database is named "myapp_test",
# for clone number 0, the new environment is named "test0", and the database is "myapp_test0".
# All other settings are preserved (esp. username and password).
module ActiveRecord
  class Base
    def self.clone_config(original_config, worker_number)
      original = configurations[original_config.to_s]
      raise "Could not find conguration '#{original_config}' to clone" if original.nil?
      worker_config = original.dup
      worker_config["database"] += worker_number.to_s
      configurations["#{original_config}#{worker_number}"] = worker_config
    end
  end
end

class DbTasks
  def initialize(rake)
    @rake = rake
  end

  def init
    connect_to('development')
    clear_database
    migrate_database
    dump
    test_environments.each do |test_db|
      if test_db =~ /([0-9]+)$/
        clone_test_config($1.to_i)
      end
      connect_to(test_db)
      clear_database
      load
    end
  end

  # db:clear -> drop and create db for RAILS_ENV
  def clear
    clear_database
  end

  # db:setup -> drop, create, and migrate dbs for test and development environments, and import fixtures into development
  def setup
    init
    connect_to 'development'
    load_fixtures
  end

  def dump(file = "#{RAILS_ROOT}/db/#{environment}_dump.sql")
    puts "Dumping #{database} into #{file}"
    system "mysqldump #{database} -u#{username} #{password_parameter} --default-character-set=utf8 > #{file}"
  end

  def load(sql_file = "#{RAILS_ROOT}/db/development_dump.sql")
    puts "Loading #{sql_file} into #{database}"
    query('SET foreign_key_checks = 0')
    sql_file = File.expand_path(sql_file)
    IO.readlines(sql_file).join.split(";").each do |statement|
      query(statement.strip) unless statement.strip == ""
    end
    query('SET foreign_key_checks = 1')
  end

  protected

  def clone_test_config(worker_num)
    ActiveRecord::Base.clone_config("test", worker_num)
  end

  def connect_to(environment)
    ActiveRecord::Base.establish_connection(environment)
    @environment = environment
    Object.const_set(:RAILS_ENV, environment)
    # Note: don't set ENV['RAILS_ENV'] since that gets passed down to invoked tasks (including 'rake test')
  end

  def environment
    (@environment ||= RAILS_ENV)
  end

  def test_environments
    environments = ['test']
    if Object.const_defined?(:TEST_WORKERS)
      TEST_WORKERS.times do |worker_num|
        environments << "test#{worker_num}"
      end
    end
    environments
  end

  def load_fixtures
    puts "Loading fixtures into #{environment}"
    Rake::Task["db:fixtures:load"].invoke
  end

  def clear_database
    puts "Clearing #{environment} database"
    sql = "drop database if exists #{database}; create database #{database} character set utf8;"
    cmd = %Q|mysql -u#{username} #{password_parameter} -e "#{sql}"|
    # puts "executing #{cmd.inspect}"
    system(cmd)
  end

  def migrate_database
    puts "Migrating #{environment} database"
    ActiveRecord::Migration.verbose = false
    Rake::Task["db:migrate"].invoke
  end

  def config(env = environment)
    ActiveRecord::Base.configurations[env]
  end

  def query(sql)
    ActiveRecord::Base.connection.execute(sql)
  end

  def database
    config["database"]
  end

  def username
    config["username"]
  end

  def password
    config["password"]
  end

  def password_parameter
    if password.nil? || password.empty?
      ""
    else
      "-p#{password}"
    end
  end

  def execute(cmd)
    puts "t#{cmd}"
    unless system(cmd)
      puts "tFailed with status #{$?.exitstatus}"
    end
  end

  def system(cmd)
    @rake.send(:system, cmd)
  end
end
  • 0 Shares
  • Share on Facebook
  • Share on Twitter

Topics

  • agile (783)
  • rails (117)
  • testing (90)
  • ruby (85)
  • ruby on rails (71)
  • jobs (62)
  • javascript (59)
  • techtalk (44)
  • ironblogger (42)
  • rspec (39)
  • bloggerdome (34)
  • productivity (34)
  • activerecord (30)
  • rubymine (30)
  • git (29)
  • gogaruco (29)
  • nyc (27)
  • design (24)
  • mobile (23)
  • pivotal tracker (22)
  • process (21)
  • cucumber (21)
  • jasmine (19)
  • ios (18)
  • tracker ecosystem (17)
  • webos (17)
  • objective-c (17)
  • fun (16)
  • android (16)
  • palm (16)
  • ci (16)
  • "soft" ware (16)
  • bdd (15)
  • tdd (15)
  • cedar (15)
  • rails3 (14)
  • performance (14)
  • css (14)
  • gem (13)
  • mouse-free development (12)
  • selenium (12)
  • goruco (12)
  • bundler (12)
  • api (12)
  • keyboard (11)
  • meetup (11)
  • railsconf (11)
  • nyc-standup (11)
  • capybara (10)
  • mac (10)
Subscribe to database Feed
  • About
  • Case Studies
  • Team
  • Community
  • Careers
  • Contact
  • Labs
  • Events

Contact Us

contact@pivotallabs.com
+1 415-77-PIVOT
TwitterLinkedInFacebook

Pivotal Tracker

Tracker is the award-winning agile project management tool that enables real-time collaboration around a shared, prioritized backlog.
Visit pivotaltracker.com >