Data, Big and Small
As part of Pivotal, Labs is increasingly associated with “Big Data” discussions, tools, and methods. This is very exciting. Your applications serve users and deliver immediate value, and at the same time a whole world of past data is unlocked, providing deep insight into your business.
A high-quality view into your application’s past is often not something you can simply graft on top, separate from the application’s logic. Many services that your application provides to the user, request by request, tap by tap, are also influenced by previous input from the user(s).
In this post I want to share some of the works and presentations that had a strong influence on my understanding of how applications can respond to incoming data and provide the maximum possible insight.
These 4 videos (~1 hr each) and 1 book pack in a lot of information and invite a lot of thought and ideas. Here at Pivotal Labs, many of us have been discussing these resources, and we keep finding new applications for these ideas. I’ll provide a few of my thoughts with each link.
Most of the message is in the vocabulary. More on that later, but these works draw their power from giving names to relationships in software that we might otherwise find too subtle to discern.
Suspend what you know about performance. Some of the points made in these resources rely on performance characteristics. It’s important not to get hung up on a particular component seeming unfeasible because of what may or may not be true about its performance or that of related components.
NOTE: All of the speakers below, like many humans, apply a straw man fallacy in presenting their point. What they draw as the opposition to their principles is exaggerated. The dismissal of the OO paradigm in particular is disappointing, because some of the concepts have fantastic implications for object design. It’s a difficult fallacy to avoid, and I urge you not to judge the speakers for misrepresenting other positions. They have clearly thought deeply about their respective subjects and offer exceptional insight.
Nathan Marz and the Lambda Architecture
Out of all these resources, this is the most practical. Nathan Marz put these approaches to work at BackType and Twitter. The book was extracted from real experience.
- The first 3 chapters do a great job of presenting the issues and approaches of working with data. A lot of fantastic vocabulary and distinctions are introduced here.
- Chapters 4-6 dive into the details of working with Hadoop and some very custom libraries written by Nathan. If you’re only interested in the broader philosophy, feel free to skip them.
- Chapters 7-8 pick the insight back up, analyzing the performance of databases and drawing interesting distinctions. They also paint a world where denormalization vs. normalization is not a tradeoff and you get the best of both. Not bad.
- Future chapters (as yet unreleased) promise to re-examine the paradigm of queues and workers, which is very interesting.
The key to this book is to suspend what we may assume about performance. If you work with interactive systems (CMSs, e-commerce, etc.) like I do, and are not used to delayed computation and eventual accuracy, the message may feel inapplicable at times. I felt that way too, but the message is still very applicable.
The influences on the Lambda Architecture are clearer in this presentation. Dedication and commitment to immutability are emphasized. The presentation also feels very accessible, and many of the issues are presented in familiar terms.
Both the book and the presentation mention benefits of immutability that are not related to performance or concurrency. This is important. The implications for the simplicity, maintenance, and reliability of your system are substantial. To me, that was the whole point! When your data is small, these “side” benefits become primary goals and gain you productivity and clarity.
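To make the small-data case concrete, here is a minimal sketch in Python. It is my own illustration, not code from the book or the presentation, and every name in it (`PageView`, `append`, `views_per_url`) is hypothetical. It assumes the simplest possible setup: facts are immutable records, the log only grows, and any view is recomputed from the full log by a pure function.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a recorded fact can never be modified
class PageView:
    user_id: str
    url: str
    timestamp: int


def append(log, fact):
    """Return a new log with the fact added; the old log is left untouched."""
    return log + (fact,)


def views_per_url(log):
    """Derive a view from the facts. A pure function of the whole log."""
    return Counter(fact.url for fact in log)


# Hypothetical sample data, for illustration only.
log = ()
log = append(log, PageView("alice", "/home", 1))
log = append(log, PageView("bob", "/home", 2))
log = append(log, PageView("alice", "/pricing", 3))

counts = views_per_url(log)
```

Because the view is derived rather than stored and mutated, a bug in `views_per_url` can be fixed by correcting the function and recomputing; the facts themselves were never damaged. That is the kind of maintenance and reliability benefit that has nothing to do with scale.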
Rich Hickey talks
Nathan Marz has, in turn, been influenced by Clojure and Rich Hickey, who pushes even harder for immutability in all systems: databases, big data, and local single-machine applications.
This is probably the most popular of these presentations. Some of the concepts of values, simplicity, and immutability are presented very well here, even if with a bit of straw man bashing. Dating from 2011, it’s a classic, and I have even enjoyed rewatching it.
By far one of the most theoretical, technical, and academic resources on the list. But the vocabulary and distinctions are invaluable. Much effort goes into re-examining the most fundamental concepts: value, state, identity, and time, followed by process, transactions, indexing, and query.
Building on those blocks, Datomic is presented as a dramatically simpler and more powerful database experience. I look forward to seeing this product evolve and teach us more and more of its ways.
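One rough way to hold these four words apart is the Python sketch below. It reflects my own reading of the vocabulary, not code from the talk, and the `Identity` class and its methods are hypothetical. The assumption is that a value is immutable, an identity is a name attached to a succession of values, and a state is the value an identity has at a given time.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class Identity:
    """A stable name for a series of immutable values over time."""
    name: str
    _history: List[Tuple[int, Any]] = field(default_factory=list)

    def advance(self, time: int, value: Any) -> None:
        """Record a new value at a later time; earlier entries are never edited."""
        if self._history and time <= self._history[-1][0]:
            raise ValueError("time must move forward")
        self._history.append((time, value))

    def state_at(self, time: int) -> Any:
        """The state is the latest value recorded at or before the given time."""
        current = None
        for recorded_at, value in self._history:
            if recorded_at <= time:
                current = value
        return current


# Hypothetical example: a shopping cart whose contents are immutable values.
cart = Identity("cart")
cart.advance(1, frozenset({"book"}))
cart.advance(5, frozenset({"book", "pen"}))

past = cart.state_at(3)
present = cart.state_at(5)
```

Asking for the cart at time 3 still gives the earlier value, because advancing the identity added a new value instead of changing the old one in place.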
Getting further away from immediate big data and performance applications, this talk zeroes in on defining simplicity even more.
It’s interesting to note the presentation style and tone in this one. The talk is at Conj, a Clojure conference. The speaker is at ease and shares a message with friends. Other, later talks I’ve seen by Halloway err on the side of hostile and adversarial. It’s unfortunate, because the content is a good gift to almost any programmer.
These resources have started great conversations at Pivotal Labs, and I’d love to hear your thoughts about them too. What would you add to this list? Which of these is your favorite? Let me know in the comments.