To make this whole process as simple as possible, a variation of the third approach (Rack middleware + Selenium Webdriver + no caching) is available here as a Gem. Drop it in your project, install its dependencies, and may the SEO gods bless you!
The whole story
Luckily, there are several ways to work around certain crawlers’ lack of faith in JavaScript – there’s obviously no “one size fits all” solution, so let’s take a minute to go through three of the most commonly used approaches, highlighting the pros and cons of each.
Render everything in two “modes” (the no script approach)
This strategy consists of rendering your app normally, BUT with pieces of “static” content already baked in (usually inside a
<noscript> block). In other words, the client-side templates your app serves to be rendered will have at least some level of server-side logic in them – hopefully, very little.
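A minimal sketch of what this looks like in a Rails ERB template – the `@article` variable and markup are illustrative, not from any particular app:

```erb
<div id="app">
  <%# The client-side framework renders the live content here %>
</div>

<noscript>
  <%# Server-rendered fallback for bots and no-JS clients – this is the %>
  <%# duplicated, hand-maintained content this approach requires. %>
  <h1><%= @article.title %></h1>
  <%= simple_format(@article.body) %>
</noscript>
```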
It’s worth noting that you should NOT rely on rendering just bits of content when serving pages to bots, as some of them state that they expect the full content of the page. It’s also worth pointing out that Google uses the snapshots it takes when crawling to compose the tiny thumbnails you see in search results, so you want these to be as close to the real thing as possible – which only compounds the maintenance burden of this approach.
The hash fragment approach
This technique is supported primarily by Googlebot (with limited support from a few other bots – Facebook’s crawler works too, for instance) and is explained in detail here.
In short, the process happens as follows:
The search bot detects that there are hash-bang parameters in your URL (eg.:
www.example.com/something#!foo=bar).
The search bot then makes a SECOND request to your server, passing a special parameter (
_escaped_fragment_) back. Eg.:
www.example.com/something?_escaped_fragment_=foo=bar – it’s now up to your server-side implementation to return a static HTML representation of the page.
Notice that for pages that don’t have a hash-bang on their URL (eg.: your site root), this also requires that you add a meta tag to your pages, allowing the bot to know that those pages are crawlable.
<meta name="fragment" content="!">
Notice the meta tag above is mandatory if your URLs don’t use hash fragments (which is becoming the norm these days, thanks to the widespread browser adoption of the HTML5 History API) – conversely, this is probably the only one of these three techniques that will work if your site depends on hash-bang URLs (please don’t!).
You still have to figure out a way to handle the
_escaped_fragment_ request when it arrives. Which leads us directly to the third approach…
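As a rough illustration, the translation from an _escaped_fragment_ query string back to the original hash-bang URL can be sketched in plain Ruby – the helper name is ours, not part of any library:

```ruby
require "cgi"

# Rebuild the hash-bang URL the client-side router understands from the
# _escaped_fragment_ query Google sends (illustrative helper, not a gem API).
def restore_hash_bang(path, query_string)
  key = "_escaped_fragment_"
  pairs = query_string.to_s.split("&").map { |p| p.split("=", 2) }
  fragment = pairs.find { |k, _| k == key }&.last
  return path unless fragment

  # Keep any regular query parameters, then re-attach the unescaped fragment.
  others = pairs.reject { |k, _| k == key }.map { |k, v| "#{k}=#{v}" }.join("&")
  url = path.dup
  url << "?#{others}" unless others.empty?
  url << "#!" << CGI.unescape(fragment)
end
```

So a request for `www.example.com/something?_escaped_fragment_=foo=bar` maps back to `www.example.com/something#!foo=bar`, which is the page your server-side renderer needs to produce statically.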
Crawl your own content when the client is a bot
The idea is simple: if the requester of your content is a search bot, spawn a SECOND request to your own server, render the page using a headless browser (thankfully, there are many, many options to choose from in Ruby) and return that “pre-compiled” content back to the search bot. Boom.
The biggest advantage of this approach is that you don’t have to duplicate anything: with a single intercepting script, you can render any page and serve it as needed. Another positive point of this approach is that search bots will see exactly the content a real user would.
You can implement this rendering in several ways (from worst to best):
- Have a before_filter on your controllers check the user-agent making the request, then fetch the desired content and return it. PROS: an all-vanilla-Rails approach. CONS: you’re hitting the entire Rails stack TWICE.
- Have a Rack middleware detect the user-agent and initiate the request for the rendered content. PROS: still self-contained in the app. CONS: you need to be careful about which content gets served, since the middleware will intercept all requests.
- Have the web server (nginx, apache) inspect the user-agent and send requests to a different server/endpoint on your server (eg.: example.com/static/original/route/here) that will serve the static content. PROS: only one request hits your app. CONS: requires poking around the underlying server infrastructure.
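The middleware option above can be sketched as follows – `BotRenderer`, the bot pattern, and the injected `renderer` object are all illustrative; a real implementation would delegate `renderer.render` to a headless browser (selenium, capybara-webkit, etc.):

```ruby
# Sketch of a Rack middleware that serves pre-rendered HTML to search bots.
class BotRenderer
  # Hypothetical, non-exhaustive list of crawler user-agent substrings.
  BOT_PATTERN = /googlebot|bingbot|yandex|baiduspider/i

  def initialize(app, renderer)
    @app = app
    @renderer = renderer
  end

  def call(env)
    if env["HTTP_USER_AGENT"].to_s =~ BOT_PATTERN
      # Second "request": ask the headless renderer for a static snapshot.
      html = @renderer.render(env["PATH_INFO"])
      [200, { "Content-Type" => "text/html" }, [html]]
    else
      # Regular users go through the app untouched.
      @app.call(env)
    end
  end
end
```

In a Rails app this would be registered with something like `config.middleware.use BotRenderer, renderer` – early enough in the stack that bot requests never hit the full framework.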
As for how to store the server-side rendered content (again, from worst to best):
- Re-render things on every request. PROS: no cache-invalidation hell. CONS: performance.
- Cache rendered pages, optimally with a reasonable expiration time. PROS: much faster than re-fetching/rendering pages every time. CONS: cache maintenance might be an issue.
- Cache rendered pages in a way that the web server can fetch them directly (eg.: save them as temp files). PROS: INSANELY FAST. CONS: huge cache maintenance overhead.
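As a toy illustration of the middle option, an in-memory cache of rendered pages with an expiration time might look like this – the class name and default TTL are ours, not from any library:

```ruby
# Minimal TTL cache for server-side-rendered pages (illustrative only;
# a production setup would more likely use memcached/Redis or temp files).
class RenderedPageCache
  Entry = Struct.new(:html, :expires_at)

  def initialize(ttl_seconds: 300)
    @ttl = ttl_seconds
    @store = {}
  end

  # Return the cached HTML for a URL, re-rendering via the block only
  # when the entry is missing or has expired.
  def fetch(url)
    entry = @store[url]
    return entry.html if entry && Time.now < entry.expires_at

    html = yield
    @store[url] = Entry.new(html, Time.now + @ttl)
    html
  end
end
```

The expiration window is the knob to tune: long enough that bots rarely trigger a full headless render, short enough that stale content doesn’t linger.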
There are a few obvious issues you have to keep in mind when using this approach:
- Every request made by a search bot will consume two processes on your web server (since you’re calling yourself back to get the content to return to the bot).
- The render time is a major factor when search engines rank your content, so fast responses here are paramount (caching the static versions of pages is therefore very important).
- Some hosting services (I’m looking at you, Heroku) might not support the tools you use to render the content on the server side (capybara-webkit, selenium, etc). Either switch servers or simply host your “staticfying” layer somewhere else.