Josh Susser's blog



Josh Susser
Becoming more flexible
Posted by Josh Susser on Wednesday July 01, 2009 at 02:32PM

Our friends at Engine Yard have just launched the beta of their new cloud hosting product, Flex. If you're familiar with their Solo product you'll find Flex to be pretty similar, just more... flexible. Where Solo lets you run your Ruby on Rails application on an Engine Yard stack on a single Amazon EC2 instance, Flex lets you run it on a cluster of EC2 instances.

In the last month I've put a handful of applications up on Solo, mostly demo or staging servers for doing story acceptance and release testing. Solo is great for that. After you've gone through the setup process once, you can easily spin up a server for a few hours on just a few minutes' notice, then turn it off when it's not needed anymore. Flex gives you that same kind of adaptability, allowing you to add instances to your cluster to match traffic demands as needed.

Last week I got my first production application running on Flex. The Pivotal Labs company website is now hosted on Flex, and it's humming along quite nicely. There were a few rough spots to work through since we were working with a pre-release product, but I'm pretty happy with our setup now. Since Flex is new, I thought it might be useful to share some of the things my fellow pivots and I learned getting things running there.

One-stop media shop
Posted by Josh Susser on Monday April 13, 2009 at 04:19PM

When Leah Silber and I first sat down to talk about putting on Golden Gate Ruby Conference, there were a few things we wanted to do but weren't sure we could manage, given it was our first year and we didn't know if we could afford everything. The biggest of those things, and the one we get the most questions about, was whether we'd be able to record videos of the presentations. Well, I'm happy to say we'll be doing just that. Pivotal Labs, already one of our platinum sponsors, stepped up and offered to cover not only the cost of producing videos of all the sessions but also the hosting of them. This makes me pretty happy, as Pivotal has been recording its own tech talk series for many months and Chris Odell does a great job with the videos.

We also wanted to guarantee there would be good live-blogging of the conference. Often a good blogger can make a big difference in capturing the feel of a conference, but mostly those things just happen by accident. We decided we wanted to help it happen, so we arranged for an official blogger for the conference. Chad Woolley will be leading a team of pivots to chronicle the conference as it happens.

And to top everything off, the conference will be streamed live by justin.tv. Did you know justin.tv was one of the biggest Ruby-powered websites on the internet? Not many people realize that. Anyway, they'll have someone there covering all the conference presentations and streaming it live. If you're watching that way, you might want to get on IRC on freenode and follow along on the #gogaruco channel.

That's a lot of stuff, but there's only one place you have to go for it all: pivotallabs.com/gogaruco

Avoid collisions by naming asset packages
Posted by Josh Susser on Monday March 16, 2009 at 09:50PM

Rails has a handy feature to easily package multiple CSS or JavaScript files into a single asset. You can use the :cache option with stylesheet_link_tag or javascript_include_tag to bundle several files into a single file (requires config.action_controller.perform_caching to be set to true). This is good because it reduces page download times by eliminating the latency from multiple requests, among other things.

For example:

<%= stylesheet_link_tag "main", "nav", "blog" %>

The above snippet creates the following HTML:

<link href="/stylesheets/main.css?1234567890" media="screen" rel="stylesheet" type="text/css" />
<link href="/stylesheets/nav.css?1234567890" media="screen" rel="stylesheet" type="text/css" />
<link href="/stylesheets/blog.css?1234567890" media="screen" rel="stylesheet" type="text/css" />

But this:

<%= stylesheet_link_tag "main", "nav", "blog", :cache => true %>

produces:

<link href="/stylesheets/all.css?1234567890" media="screen" rel="stylesheet" type="text/css" />

Using the :cache => true option packaged all those files into a single one, and generated only a single link tag to use it on the page. However, this is not exactly what you want to do. Consider this...

views/layouts/application.html.erb:

<%= stylesheet_link_tag "main", "nav", "blog", :cache => true %>

views/layouts/admin.html.erb:

<%= stylesheet_link_tag "main", "nav", "admin", :cache => true %>

It makes total sense to want to create several bundled packages of the same kind of asset. In this example, an application may have a generic user style, and a different set of styles for the admin console. But using :cache => true will get you in trouble, since each layout will try to generate an all.css file with its own set of CSS files, and both layouts end up pointing at the same all.css even though they need different contents. That's why you should always use a string for the option value to give a particular name to the package file.

views/layouts/application.html.erb:

<%= stylesheet_link_tag "main", "nav", "blog", :cache => "blog_all" %>

views/layouts/admin.html.erb:

<%= stylesheet_link_tag "main", "nav", "admin", :cache => "admin_all" %>

Now each layout links to its own package file, blog_all.css or admin_all.css. For the admin layout, that creates a link like:

<link href="/stylesheets/admin_all.css?1234567890" media="screen" rel="stylesheet" type="text/css" />

I've taken to naming packages with a name like LAYOUT_all.css (where LAYOUT is the name of the layout template) to make it easy to tell what's going on.
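To make the collision concrete, here is a minimal sketch in plain Ruby. It assumes a simplified model of how the :cache value maps to a file name (true becomes "all", a string is used as given); package_file and layout_package are hypothetical helper names for illustration, not Rails methods.

```ruby
# Simplified model (an assumption, not Rails source): how the :cache option
# value turns into the name of the bundled stylesheet file.
def package_file(cache_option)
  "#{cache_option == true ? 'all' : cache_option}.css"
end

# Hypothetical convention helper: LAYOUT_all, as described above.
def layout_package(layout)
  "#{layout}_all"
end

layouts = %w[application admin]

# With :cache => true every layout maps to the same file, so they collide.
with_true = layouts.map { |_layout| package_file(true) }.uniq

# With a per-layout name, each layout gets its own package file.
with_names = layouts.map { |layout| package_file(layout_package(layout)) }.uniq

puts with_true.inspect
puts with_names.inspect
```

In a layout, the same convention would read something like :cache => "admin_all" for admin.html.erb.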

Detecting invalid encoding in CSV uploads
Posted by Josh Susser on Friday January 16, 2009 at 07:43PM

We ran into an odd bug using FasterCSV to import some data. We were requiring the CSV files to be UTF-8 encoded, but some users tried to upload files in other encodings. FasterCSV ended up choking on characters that weren't valid UTF-8, truncating the data to the end of the line and leaving fields blank. We didn't want to ask the user to select an encoding, because they'd probably get it wrong anyway, so we decided to reject any files with characters that would cause problems. The trick, then, is how to detect that.

First, the tests. We want to detect if an input string contains valid characters in UTF-8 encoding. And we need to deal with both strings and IO (File or StringIO) objects (more on that in a bit).

describe "::encoding_is_utf8? checks strings and IOs" do
  before do
    @utf8 = "This is a test with ç international characters"
  end

  it "returns true when all characters are valid" do
    Importer.encoding_is_utf8?(@utf8).should be_true
    Importer.encoding_is_utf8?(StringIO.new(@utf8)).should be_true
  end

  it "returns false when any characters are invalid" do
    bogus = Iconv.conv('ISO-8859-1', 'UTF-8', @utf8)
    Importer.encoding_is_utf8?(bogus).should be_false
    Importer.encoding_is_utf8?(StringIO.new(bogus)).should be_false
  end
end

Here's the implementation:

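require 'iconv'

class Importer
  # Accepts a String or an IO (File, StringIO); true only if every line is valid UTF-8.
  def self.encoding_is_utf8?(file_or_string)
    # Wrap a bare string so it can be iterated the same way as an IO's lines.
    file_or_string = [file_or_string] if file_or_string.is_a?(String)
    # //IGNORE drops invalid byte sequences, so any difference means bad input.
    is_utf8 = file_or_string.all? { |line| Iconv.conv('UTF-8//IGNORE', 'UTF-8', line) == line }
    # Reset the read position so the caller can read the IO from the start.
    file_or_string.rewind if file_or_string.respond_to?(:rewind)
    is_utf8
  end
#...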

So the meat of the check is that we are using the Iconv library to detect bad characters. We convert from an assumed UTF-8 to UTF-8, ignoring any characters that can't be represented in UTF-8. If the output and input aren't identical, that means there were bogus characters and the uploaded file should be rejected.

The #rewind is needed to reset the read position in the file so FasterCSV can start over from the beginning. Specs for that aren't included here.

Then in our controller, we ensure the CSV doesn't have any bad characters before we give it to FasterCSV. We extracted that check into its own method, shown here:

def require_utf8!(csv_content)
  unless Importer.encoding_is_utf8?(csv_content)
    raise "Import file must be UTF-8 only. You can paste non-UTF-8 CSV directly into the CSV Text field for automatic conversion."
  end
end

As you can read in the exception message (which ends up in the flash), the user can work around the encoding issue by pasting the CSV into a textarea input in the browser, which automatically transcodes the data into UTF-8. Aren't browsers awesome? The other option would be to transcode the CSV file, but the textarea is easier if the files aren't gigundous. Anyway, since we can input CSV as either a file or a textarea string, that's why #encoding_is_utf8? needs to check both files and strings.

This approach and implementation seem fine to me. I get the feeling there might be a much simpler way, though. Anyone got a better idea?
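One possibly simpler option, assuming Ruby 1.9 or later (where every string carries an encoding): String#valid_encoding? can do the check without Iconv. This is only a sketch of the same idea; utf8_encoded? is a hypothetical name and not part of our Importer.

```ruby
require "stringio"

# Sketch for Ruby 1.9+ (an assumption; the Iconv code above targets Ruby 1.8).
# Accepts a String or an IO and returns true only if every line is valid UTF-8.
def utf8_encoded?(file_or_string)
  lines = file_or_string.is_a?(String) ? [file_or_string] : file_or_string
  # force_encoding only relabels the bytes; dup keeps the caller's string untouched.
  valid = lines.all? { |line| line.dup.force_encoding(Encoding::UTF_8).valid_encoding? }
  # Reset the read position so a CSV parser can start from the beginning.
  file_or_string.rewind if file_or_string.respond_to?(:rewind)
  valid
end

good  = "fran\u00E7ais"     # c-cedilla encoded as UTF-8
bogus = "fran\xE7ais".b     # the same character as a single ISO-8859-1 byte

puts utf8_encoded?(good)
puts utf8_encoded?(StringIO.new(bogus))
```

Like the Iconv version, it stops at the first bad line and rewinds the IO before returning.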

Hacking a subselect in ActiveRecord
Posted by Josh Susser on Wednesday October 29, 2008 at 08:08PM

This week, Damon and I were doing a performance optimization for some slow queries. The most performant solution involved denormalizing some data into a join table and doing a subselect to get the ids of the records we wanted. Not rocket science, but also a bit ugly to construct the SQL by hand. Our solution was to cheat a tiny bit and use an ActiveRecord internal method to generate the SQL for us.

def favorite_posts(options={})
  subselect = Favorite.send(
                :construct_finder_sql,
                  :select => "post_id",
                  :conditions => {:blog_id => self.id},
                  :order => "published_at DESC",
                  :limit => options[:limit] || 10, :offset => options[:offset])
  Post.find(:all, :conditions => "posts.id IN (#{subselect})", :order => "published_at DESC")
end

That code uses the private method Favorite.construct_finder_sql to build the subselect. For a blog with id 42 and an :offset of 10, the whole query comes out as the following SQL:

SELECT * FROM posts WHERE posts.id IN (
    SELECT post_id FROM favorites WHERE blog_id = 42 ORDER BY published_at DESC LIMIT 10 OFFSET 10
  ) ORDER BY published_at DESC

The Ruby may look like more code than the SQL, and in that form it is... but if you go the hack-up-a-string route, once you start using string operations or interpolation to deal with the variable parts of the query it gets ugly pretty fast. Using the ActiveRecord code to put it all together keeps it nice and clean, and even makes sure things are sanitized and quoted properly too.
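For comparison, here is roughly what the hand-rolled string route might look like. It is a sketch only: favorites_subselect_sql and favorite_posts_sql are hypothetical helpers, and the Integer() coercion is a crude stand-in for the sanitizing and quoting that ActiveRecord would otherwise handle.

```ruby
# Hypothetical hand-built version of the subselect, for comparison only.
def favorites_subselect_sql(blog_id, options = {})
  # Integer() raises on anything that is not a whole number, which keeps
  # arbitrary strings out of the interpolated SQL.
  limit = Integer(options[:limit] || 10)
  sql = "SELECT post_id FROM favorites WHERE blog_id = #{Integer(blog_id)} " \
        "ORDER BY published_at DESC LIMIT #{limit}"
  sql += " OFFSET #{Integer(options[:offset])}" if options[:offset]
  sql
end

# Wraps the subselect in the outer query over posts.
def favorite_posts_sql(blog_id, options = {})
  "SELECT * FROM posts WHERE posts.id IN (#{favorites_subselect_sql(blog_id, options)}) " \
    "ORDER BY published_at DESC"
end

puts favorite_posts_sql(42, :offset => 10)
```

Every new condition or optional clause adds another branch of string surgery like the OFFSET line, which is the ugliness the finder options avoid.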

Brown Bag: Blaine Cook on Starling
Posted by Josh Susser on Saturday January 26, 2008 at 02:05AM

This week we were treated to a lunchtime tech-talk by Blaine Cook of Twitter. He came to talk to us about Starling, the all-Ruby message queue system that runs much of Twitter. Blaine spoke about the history and motivation for creating Starling, then showed how it worked, and talked about possible future enhancements and directions for further development.

Starling looks quite simple to use. The Starling server speaks the memcache protocol, so to talk to it you just need to load up the memcache-client gem and create a client instance. Note, the Starling server doesn't use memcached for its implementation at all; it just speaks the protocol.
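To illustrate what "speaks the protocol" means, here is a toy sketch of queue semantics layered on memcache text-protocol framing: a set pushes onto a named queue and a get pops from it. This is not Starling's code. ToyQueue is a hypothetical class, and StringIO objects stand in for the network connection so the example runs on its own.

```ruby
require "stringio"

# Toy illustration only (NOT Starling's implementation): memcache-style
# "set" pushes onto a named queue and "get" pops the oldest item from it.
class ToyQueue
  def initialize
    @queues = Hash.new { |hash, name| hash[name] = [] }
  end

  # Reads memcache text-protocol requests from `input`, writes replies to `output`.
  def serve(input, output)
    while (line = input.gets("\r\n"))
      command, name, flags, _exptime, bytes = line.split
      case command
      when "set"
        data = input.read(bytes.to_i)
        input.read(2) # skip the \r\n that terminates the data block
        @queues[name] << [flags, data]
        output << "STORED\r\n"
      when "get"
        flags, data = @queues[name].shift
        output << "VALUE #{name} #{flags} #{data.bytesize}\r\n#{data}\r\n" if data
        output << "END\r\n"
      else
        output << "ERROR\r\n"
      end
    end
  end
end

# Two pushes followed by three pops; the third pop finds the queue empty.
requests = StringIO.new(
  "set jobs 0 0 5\r\nfirst\r\n" \
  "set jobs 0 0 6\r\nsecond\r\n" \
  "get jobs\r\nget jobs\r\nget jobs\r\n"
)
replies = StringIO.new
ToyQueue.new.serve(requests, replies)
puts replies.string
```

Because the framing is plain memcache set and get, any memcache client library can act as the queue client, which is why the memcache-client gem is all you need on the client side.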

Some interesting bits about why Blaine built Starling: it basically comes down to every other solution having some problem that made it unsuitable for them to use. Here's the list:

  • rq (by Ara Howard) - nfs/disk based, high latency
  • DRb - not robust under load
  • Rinda - very slow! O(n) for take operations
  • Apache ActiveMQ - super complex
  • RabbitMQ - Erlang dependency

In the last few months we've seen a lot of Starling-like things appear, some inspired by Starling itself.

  • beanstalkd - memcached-style in-memory storage, not persistent or recoverable
  • bj - database backed
  • thruqueue - uses Thrift protocol, ugly
  • sparrow - Starling imitator
  • ap4r - full-featured

Interesting new directions for Starling... Currently Starling has some overhead from polling on both the client and server sides. Kevin Clark and Chris Wanstrath have hacked it to run using EventMachine to eliminate polling. It's not clear what happens if a client dies while its request is waiting to be filled. Also, some issues with load balancing and starvation need to be looked at. And there are opportunities to build a richer client API.