The Data-Centric Gambit

October 20th, 2008 by Peter Lee Post a comment »

Things always change fast in computing. But the rate of change seems to be on a major uptick recently. In this post, I want to focus on an accelerating driver of that change, a looming crisis on the horizon, and a surprising link between the two that may have big promise. In the spirit of blog discourse, let’s lay this out in broad strokes:

  • The Industrial Revolution of Data. Today’s world-wide web remains a staggering tribute to the typing abilities of the human race. But even with a growing global population, typists are not a scalable source of bit-production going forward. We are entering an era where the overwhelming majority of information will not be hand-crafted. It will be stamped out by machines: software logs, cameras, microphones, GPS transceivers, sensor networks, RFID readers, and so on. This is inevitable. It has already begun to change the computing marketplace: most organizations of size now realize they can afford to save and mine all their logs, and are looking for inexpensive ways to do so. The startup world has responded with a flurry of parallel database and data analytics companies.
  • The Crisis of the Three C’s: Coders, Clouds and Cores. Meanwhile, it’s no news that software development is far, far too difficult. In his Turing Award talk a decade ago, Jim Gray identified radical improvements in programming among his 13 remaining long-term challenges for computing — alongside passing the Turing test and building Vannevar Bush’s Memex. What’s changed on this front since 1998 is the rapid rise of parallelism that my colleagues have been blogging about here. Cloud computing infrastructure, with its “shared-nothing” clusters of machines, demands parallel and distributed programs today. Manycore architectures will demand parallelism at a finer grain in the next few years. The pressing need for parallel software — and armies of fluent software developers to build it — raises both the difficulty and the stakes of the Grand Challenge that Jim Gray highlighted in 1998.

Given this background, what excites me these days is that the trend may bring some new solutions to the crisis, in a surprisingly organic way. 

For over twenty years, “Big Data” has been a sustained bright spot in parallel computing. SQL has been a successful, massively parallel programming language since the late 1980′s, when Teradata (a survivor that evaded Dave Patterson’s Rolls of the Dead) first commercialized parallel database research from projects like Gamma and Bubba. In recent years, SQL has been joined by Google’s MapReduce framework, which is bringing algorithmicists into massive data processing in a way that SQL never did. Both SQL and MapReduce will likely thrive, and may well converge: two parallel database startup companies recently announcedintegrated implementations of SQL and MapReduce. (Full disclosure: I advise Greenplum, one of those companies.)

SQL and MapReduce programmers do not think much about parallelism. Rather than trying to unravel an algorithm into separate threads, they focus on chopping up sets of input data into pieces, which get pumped through copies of a single sequential program running asynchronously on each processor. In parallel programming jargon, this kind of code is sometimes dismissively referred to as being “embarrassingly parallel”. But very often, the simplest ideas are the most fertile. Programmers “get” these approaches to parallelism. And remember: the Coders are part of the Crisis of the Three C’s, and the key is to make lots of them happy and productive.

But can those programmers lead us anywhere interesting? The most intriguing part of this story is that in the last 5-10 years, the set-oriented, data-centric approach has been gaining footholds well outside of batch-oriented data parallelism. There has been a groundswell of work on “declarative”, data-centric languages for a variety of domain-specific tasks, mostly using extensions of Datalog. These languages have been popping up in networking and distributed systemsnatural language processingcompiler analysismodular roboticssecurity, and video games, among other applications. And they are being proposed for tasks that are not embarrassingly parallel. It turns out that focusing on the data can make a broad class of programs simpler — much simpler! — to express.

Networking and distributed coordination protocols are one good example. They run in parallel, and are themselves a key to cloud services. In our work on Declarative Networking over the last years, we showed that a wide range of network and distributed coordination protocols are remarkably easy to express in a data-centric language. For example, our version of the Chord Distributed Hash Table (DHT) protocol is 47 lines of our Overlog language; the reference implementation is over 10,000 lines of C++. (DHTs are a key component of cloud services like Amazon’s Dynamo.) Meanwhile, graduate students at Harvard prototyped a simple version of the tricky Paxos consensus protocol in an alpha edition of Overlog in 44 rules. (Paxos is a key component in cloud infrastructure like Google’s GFS file system and Chubby lock manager.) We have other examples where our declarative programs are line-for-line translations of pseudo-code from research papers. These are the kinds of scenarios where the quantitative differences are best captured qualitatively. You can print out our Chord implementation on one sheet of paper, take it down to the coffee shop, and figure it out. Doing that with 10,000 lines of C++ would be a superhuman feat of Programmer-Fu, and a big waste of paper.

Machine Learning is another area where data-centric declarative programming seems to help with parallelism and distribution. A group at Stanford pointed to a range of standard machine learning tasks that can be expressed almost trivially as MapReduce programs, without any requirement for parallel programming expertise. More deeply, a number of the fundamental algorithms driving Machine Learning center on “message-passing” algorithms like Belief Propagation and Junction Trees that work on a computational model of explicit dataflow, rather than shared memory. Research in ML over sensornets showed how to overlay that logical communication onto a physical network. And these inference networks — much like DHTs — turn out to be a good fit to Overlog. (Distributed Junction Trees in 39 rules, anyone?) The Dyna language is another good example, with a focus on (currently single-node) Natural Language Processing.

I’ll be the first to admit that Datalog syntax is horrible, and it is not a reasonable language for developers. There is much to be done before adoption of complex data-centric languages can occur. But what excites me here is that the main positive trend in parallel programming — the one driven by the Industrial Revolution of Data, the one with programmer feet on the street — that trend feeds into this promising new generation of much richer data-centric languages. If MapReduce is the boot camp for a next round of parallel languages, those languages are likely to be data-centric. And there’s growing reason to believe that the data-centric approach will suit a wide range of tasks.

– Joe Hellerstein is a Professor of Computer Science at the University of California, Berkeley. His research focuses on data management and networking.

  • anonymous

    Teradata actually predates Gamma and Bubba. Teradata was founded in 1979 and shipped their first commercial system in 4Q1983. Possible Gamma and Bubba research was used in later releases.

  • http://alchemy.cs.washington.edu/ Pedro Domingos

    It gets better. When you couple a declarative language with machine learning from data, the compactness of your programs improves again by orders of magnitude, because you only have to specify the basic structure of the program, and you can let the rest of it “grow” from the data. We’ve developed a language called Markov logic that’s a probabilistic extension of Datalog for this, and a series of learning and inference algorithms for it that are available in the open-source Alchemy package (http://alchemy.cs.washington.edu). Alchemy has been used with spectacular success on a number of very hard problems. For example, we can build a complete information extraction system with an Alchemy program that fits on a single slide.

  • Jean Richard

    Since the dawn of humanity (pardon the cliché), r&d have accelerated. Most people assume it goes by a geometric progression from now on instead of continuing as a square law. Curiuosly, Moore’s law of computing devices say the computers will double power every 12 or 15 or 18 months. This is the square law. Computer evolution is only a way to mesure it.

    The real challenge of 40 years ago was to be able to cope with greater amount of data with the same brain we used 10,000 years ago. The Computer helped us there but the actual problem is to be able to find significant data in all this. The answers are partly there with data mining tools, data warehouse, etc. The next problem is to be able to make the next generation of tools for us to be able to develop… new tools. We have new more complex computers but we lack the tool to do the fundamental objective of computing science; adapting these tools to human beings. The problem is even more interresting since it was for computer scientist to adapt the system for users. It’s become adapting the system for ourselves so we can adapt it to users.

    Interresting future ahead…

  • Pingback: … BLOG ~ BeLOG ~ Belong … « Data Cronopio

  • Pingback: … BLOG ~ BeLOG ~ Belong … « Data Snacks

  • Pingback: 2 Blog ~ 2 Belong « Data Beta

  • Pingback: Cognizant Transmutaion » The Commoditization of Massive Data Analysis

  • Pingback: Dana Gardner’s BriefingsDirect mobile edition

  • Pingback: MapReduce-scale analytics change BI game as enterprises need to mine ever-expanding data sets - Dana Gardner's BriefingsDirect

  • Pingback: EMC += Greenplum « Data Beta

  • http://www.modularhomesnetwork.com/ modular homes

    there are some more good information!