On a crisp October morning, fellow Pivot Oli Peate and I strolled into the Shoreditch Village Hall to participate in Span, a one-day, one-track conference about the craft of scaling data and the systems which process it. This is the first time Span has been held, but it was a roaring success, featuring interesting talks, and stimulating conversation with other attendees.
One clear idea ran through a remarkable number of the talks and conversations: streaming. For much of its history, big data has been synonymous with map-reduce, and the batch-oriented architecture that entails, but as several speakers made clear, a streaming architecture is now an option that demands consideration.
In the first talk, Richard Conway discussed about his experience of deploying Spark in the cloud. I now finally understand that Spark is functional programming blown up to a giant size. The bread and butter of the functional programmer is code which takes some collection of data and applies a chain of operations to it: some combination of mapping each value to some other value using a function, filtering values using a predicate, reducing many values to one using an operator, and so on, each operation feeding into the next. Spark simply – really, almost embarrassingly simply – takes the idea of a chain of operations and makes each operation a standalone job which be deployed to a compute cluster, with the values being streamed between them over the network. The result is a data-processing framework which is both easy to use for anyone with experience of functional programming, and flexible – so much so that it subsumes the familiar map-reduce paradigm as just one possible topology.
In Conway’s experience, Spark works rather well. It defends the programmer against node failure and memory pressure by representing each stage in the data flow as a resilient distributed dataset (RDD), which contains a definition of a computation and an optional cache of its results. If computation fails, it can be repeated according to the definition. If memory is short, any cached results can be spilled to disk, or discarded, then computed again later. This does make Spark rather demanding of its hardware. Richard’s advice was to give Spark enough memory to hold at least 40% of its dataset in memory, as well as fast storage, which in his case was provided by Windows Azure Storage. Still, when its demands are met, it seems that Spark can produce results one or two orders of magnitude faster than Hadoop.
Two speakers talked about the implication of streaming for the rest of the architecture: Alex Dean told us why our companies need unified logs, and Martin Kleppmann tried to help us stay agile in the face of data deluge. Both were preaching from the gospel according to Jay Kreps, telling us how to restore sanity to our ever-growing data-centric businesses by building them around a unified log.
For those who haven’t heard the good news about the unified log, this is a single, central, unopinionated pipe which consumes all the events which pass through an organisation the moment they are received or generated, and makes them available to downstream systems for processing. Consider an e-commerce operation: a shop front takes orders, an analytics system tracks page impressions, a fulfilment system reports shipments and returns, an inventory system monitors deliveries and stock counts, etc. Events ultimately flow from some of these systems to others, but also to systems which calculate stock levels, forecast demand, entice users with tailored emails, tally revenues and marketing expenditures, and so on. Now imagine connecting each of these flows without a unified log: you make a point-to-point integration between the producing and consuming systems; from your N systems now spring on the order of N^2 integrations. The yawning pit of despair you feel opening up inside you is the reason you need a unified log. With a unified log, there are only N integration points, between each system and the log; the systems simply throw a description of everything which happens into the log, and look out for events which are relevant to their interests emerging from it. The log does nothing but collect the events and make them available; it is simply a stream.
Alex Dean focused on issues around our confidence in our data. He pointed out that a unified log also solved the “Rashomon problem”, where the N point-to-point integrations to a producing system can each tell a subtly different story about its data, according to how they are implemented. Different consumers might see different sets of events, events in a different order, or different subsets of attributes of an event. With a unified log, there is a single authoritative record. He explained how his company Snowplow Analytics, brings order to the torrent of JSON-encoded events passing through their log by imposing schemas using Iglu, easing the creation of new consuming applications. He cautioned that the software used to implement a unified log – right now, the default choice is Kafka – might not provide the durability or delivery semantics some systems might require. The current solution to this problem would be to use an alternative channel, or layer another protocol on top of the log.
Martin Kleppmann outlined the use of Samza as a processing framework for data passing through Kafka, and suggested change data capture as a simple way of generating events from an existing database-backed application by mining its transaction log; this is currently possible for a number of proprietary databases, and might soon be possible for PostgreSQL via logical log streaming replication.
So where does this leave batch processing? As Javier Ramirez related, in the form of a cloud-based database-as-a-service like BigQuery, it can be virtually effortless to use, and the lambda architecture gives us a way of using batch and streaming architectures together. But do we need to? As tools like Spark and Samza mature, they can be used with approaches like CQRS and event sourcing to make batch processing entirely redundant.
Data, of course, is useless without processing, and several of the speakers looked at this side of the equation. Here, the actor model was the common thread. We heard about the genesis of Erlang from Robert Virding. When he and his colleagues created Erlang, over twenty years ago, data processing was the furthest thing from their minds; he explained that although they took inspiration from various sources, they had exactly one goal in mind: to write the software for Ericsson’s AXD telephone switch. Moreover, the language and its runtime system were written at the same time as the switch software. The powerful and immediate feedback from that experience led its creators to give Erlang its capabilities for concurrency, error recovery, and hot code loading. As Virding emphatically put it:
We were NOT trying to implement a functional language
We were NOT trying to implement the actor model
WE WERE TRYING TO SOLVE THE PROBLEM
As devotees of agile development, we at Pivotal Labs already know how powerful that approach is! The resulting simplicity in Erlang’s design is why it was able to be adapted to its new use in event-based data processing.
At the more recent end of the history of the actor model, Richard Astbury told us about Microsoft’s new platform, Orleans. Orleans is a programming model and associated runtime for highly available fine-grained concurrent components. You might think that sounds like Erlang, but Orleans does more than that: it provides on-demand activation and deactivation of components, dispatching of messages to methods, remote messaging within the cluster, and persistence of components’ state, all aimed at creating a more comfortable programming experience. In that regard, it actually reminded me much more of another technology.
At a larger scale, the Guardian’s Phil Wills explained how their architecture has evolved over time to handle its ever-rising load. A carefully-configured content delivery network routes all requests to one of a number of microservices which serve up specific features of the site. Each microservice has its own AWS autoscaling group, so each feature of the site can be scaled independently. When overload does cause failure or performance degradation, it is confined to just the overloaded microservice. In both respects, the microservices present a macrocosm of the actor model.
It was with that faintly Hermetic thought about the symmetry between actors and microservices, as between functional programming and streaming architecture, that I wandered across Hoxton Square to join my fellow Spanners for a bite to eat in a local restaurant. Judging by the conversation, they enjoyed it as much as I did; I’m certainly keen to go again in future, and I recommend that anyone else interested in the future of big data does too.