There was a feature to generate some documents and email them as a zip file on a weekly basis. After a lively discussion, it was decided this was the MVP to solve the need, which only affected one end user performing business administration tasks. My MVP would have minimized maintenance with an index page that the zips get dumped to, but it was decided we didn’t want to store the files.
It was actually relatively simple to implement, and we were leveraging existing code to do the document generation for us. We tested zipping up 500 documents and sending the result, and it all ran within specifications.
In the first week in production, the user reported that the email didn’t arrive. Instantly, ideas rush into your head about the usual suspects for a scheduled email task failure. It was none of them.
After a good deal of digging around and rerunning the job on a staging server, with sporadic failures and annoyingly no backtrace, we ran some queries and discovered that there were over 1000 documents that needed to be generated. Our email attachment limit budgeted for 400. This was not the problem, however, but it did lead to the next discovery.
Deep within the document generation code, Ruby’s Tempfile was being used and the files were never being closed. The OS borked at 1024 open file handles and, worse, part of the document generation Kernel.system’ed out, which made tracing the proper point of failure difficult. The code actually failed silently, and the failure only surfaced later in the process. Half a day on a refactor.
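For illustration, here is a minimal sketch of the shape such a refactor can take. It is not the actual fix from our codebase: `with_rendered_document` and `run!` are hypothetical helper names. The first always closes and unlinks its Tempfile in an `ensure` block, and the second wraps `Kernel.system` so that a failed command raises instead of passing quietly.

```ruby
require "tempfile"

# Hypothetical helper: write one document to a temp file, yield its path,
# and always release the file handle, even if the block raises.
def with_rendered_document(contents)
  file = Tempfile.new(["document", ".txt"])
  begin
    file.write(contents)
    file.flush
    yield file.path
  ensure
    file.close   # give the file descriptor back to the OS
    file.unlink  # remove the file from disk
  end
end

# Hypothetical wrapper: Kernel.system returns false on a non-zero exit and
# nil when the command could not be run, so treat both as failures.
def run!(*command)
  ok = system(*command)
  raise "command failed: #{command.join(' ')} (#{$?.inspect})" unless ok
  true
end

# Well past 1024 documents, with only one handle open at a time.
sizes = (1..2000).map do |i|
  with_rendered_document("doc #{i}") { |path| File.size(path) }
end
puts "rendered #{sizes.length} documents"
```

Because each handle is released before the next document is rendered, the number of open files stays flat no matter how many documents the job has to produce.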
Running the job on development through Resque, it worked a charm. We cherry-picked the fix onto staging and reran the job. Failed. Head in my hands, I couldn’t help but be mildly amused. Our workers apparently didn’t load the zip library by default in any environment except development. I’m sure there’s a good reason for this, but it was unexpected given that it worked on development. With a simple require directive, the job was on its way and worked.
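The actual fix was a single require line. A slightly more defensive variant, sketched here as an assumption rather than what we shipped, is to have the job require everything it depends on up front and fail loudly if anything is missing. `require_job_dependencies!` is a hypothetical helper, and the standard-library names `tempfile` and `zlib` stand in for the real dependency list.

```ruby
# Hypothetical boot-time check: require every library the job needs and
# raise one clear error listing the ones that cannot be loaded.
def require_job_dependencies!(libraries)
  missing = libraries.reject do |name|
    begin
      require name
      true
    rescue LoadError
      false
    end
  end
  unless missing.empty?
    raise LoadError, "job dependencies not loadable: #{missing.join(', ')}"
  end
  true
end

# Stand-in list; a real job would name its own libraries here.
require_job_dependencies!(%w[tempfile zlib])
```

This way a worker that loads a different set of libraries from development fails at boot with a readable message, instead of partway through the job.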
Reflecting on this, again what failed me was not the complexity of the task, but my assumptions. The key one was that there weren’t going to be more than 400 documents to generate. However, the first run was going to have many more than subsequent weeks, because one of the criteria for document generation was that some large set of objects had related documents which hadn’t yet been generated.
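One way to make an assumption like this easier to peel back is to state it in code, so that breaking it produces an error with the real number in it. The following is only a sketch: `check_attachment_budget!` and `AttachmentBudgetExceeded` are hypothetical names, and the default of 400 is the budget mentioned above.

```ruby
# Hypothetical error class for a broken attachment-count assumption.
class AttachmentBudgetExceeded < StandardError; end

# Hypothetical guard: return the count when it fits the budget,
# otherwise raise with both numbers in the message.
def check_attachment_budget!(document_count, budget: 400)
  return document_count if document_count <= budget

  raise AttachmentBudgetExceeded,
        "#{document_count} documents to attach, budget is #{budget}"
end

check_attachment_budget!(350) # within budget, returns the count

begin
  check_attachment_budget!(1000)
rescue AttachmentBudgetExceeded => e
  warn e.message
end
```

A guard like this would not have fixed the file handle leak, but it would have put the document count in front of us on the first failed run rather than after a round of queries.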
Assumptions are a good thing. They’re useful heuristics that allow you to keep just enough in your head to make reasonable calculations. Just be prepared to peel them back methodically when you’re debugging.