Things always change fast in computing. But the rate of change seems to be on a major uptick recently. In this post, I want to focus on an accelerating driver of that change, a looming crisis on the horizon, and a surprising link between the two that may have big promise. In the spirit of blog discourse, let’s lay this out in broad strokes:

  • The Industrial Revolution of Data. Today’s world-wide web remains a staggering tribute to the typing abilities of the human race. But even with a growing global population, typists are not a scalable source of bit-production going forward. We are entering an era where the overwhelming majority of information will not be hand-crafted. It will be stamped out by machines: software logs, cameras, microphones, GPS transceivers, sensor networks, RFID readers, and so on. This is inevitable. It has already begun to change the computing marketplace: most organizations of size now realize they can afford to save and mine all their logs, and are looking for inexpensive ways to do so. The startup world has responded with a flurry of parallel database and data analytics companies.
  • The Crisis of the Three C’s: Coders, Clouds and Cores. Meanwhile, it’s no news that software development is far, far too difficult. In his Turing Award talk a decade ago, Jim Gray identified radical improvements in programming among his 13 remaining long-term challenges for computing — alongside passing the Turing test and building Vannevar Bush’s Memex. What’s changed on this front since 1998 is the rapid rise of parallelism that my colleagues have been blogging about here. Cloud computing infrastructure, with its “shared-nothing” clusters of machines, demands parallel and distributed programs today. Manycore architectures will demand parallelism at a finer grain in the next few years. The pressing need for parallel software — and armies of fluent software developers to build it — raises both the difficulty and the stakes of the Grand Challenge that Jim Gray highlighted in 1998.

Given this background, what excites me these days is that the trend may bring some new solutions to the crisis, in a surprisingly organic way. 

For over twenty years, “Big Data” has been a sustained bright spot in parallel computing. SQL has been a successful, massively parallel programming language since the late 1980’s, when Teradata (a survivor that evaded Dave Patterson’s Rolls of the Dead) first commercialized parallel database research from projects like Gamma and Bubba. In recent years, SQL has been joined by Google’s MapReduce framework, which is bringing algorithmicists into massive data processing in a way that SQL never did. Both SQL and MapReduce will likely thrive, and may well converge: two parallel database startup companies recently announcedintegrated implementations of SQL and MapReduce. (Full disclosure: I advise Greenplum, one of those companies.)

SQL and MapReduce programmers do not think much about parallelism. Rather than trying to unravel an algorithm into separate threads, they focus on chopping up sets of input data into pieces, which get pumped through copies of a single sequential program running asynchronously on each processor. In parallel programming jargon, this kind of code is sometimes dismissively referred to as being “embarrassingly parallel”. But very often, the simplest ideas are the most fertile. Programmers “get” these approaches to parallelism. And remember: the Coders are part of the Crisis of the Three C’s, and the key is to make lots of them happy and productive.

But can those programmers lead us anywhere interesting? The most intriguing part of this story is that in the last 5-10 years, the set-oriented, data-centric approach has been gaining footholds well outside of batch-oriented data parallelism. There has been a groundswell of work on “declarative”, data-centric languages for a variety of domain-specific tasks, mostly using extensions of Datalog. These languages have been popping up in networking and distributed systemsnatural language processingcompiler analysismodular roboticssecurity, and video games, among other applications. And they are being proposed for tasks that are not embarrassingly parallel. It turns out that focusing on the data can make a broad class of programs simpler — much simpler! — to express.

Networking and distributed coordination protocols are one good example. They run in parallel, and are themselves a key to cloud services. In our work on Declarative Networking over the last years, we showed that a wide range of network and distributed coordination protocols are remarkably easy to express in a data-centric language. For example, our version of the Chord Distributed Hash Table (DHT) protocol is 47 lines of our Overlog language; the reference implementation is over 10,000 lines of C++. (DHTs are a key component of cloud services like Amazon’s Dynamo.) Meanwhile, graduate students at Harvard prototyped a simple version of the tricky Paxos consensus protocol in an alpha edition of Overlog in 44 rules. (Paxos is a key component in cloud infrastructure like Google’s GFS file system and Chubby lock manager.) We have other examples where our declarative programs are line-for-line translations of pseudo-code from research papers. These are the kinds of scenarios where the quantitative differences are best captured qualitatively. You can print out our Chord implementation on one sheet of paper, take it down to the coffee shop, and figure it out. Doing that with 10,000 lines of C++ would be a superhuman feat of Programmer-Fu, and a big waste of paper.

Machine Learning is another area where data-centric declarative programming seems to help with parallelism and distribution. A group at Stanford pointed to a range of standard machine learning tasks that can be expressed almost trivially as MapReduce programs, without any requirement for parallel programming expertise. More deeply, a number of the fundamental algorithms driving Machine Learning center on “message-passing” algorithms like Belief Propagation and Junction Trees that work on a computational model of explicit dataflow, rather than shared memory. Research in ML over sensornets showed how to overlay that logical communication onto a physical network. And these inference networks — much like DHTs — turn out to be a good fit to Overlog. (Distributed Junction Trees in 39 rules, anyone?) The Dyna language is another good example, with a focus on (currently single-node) Natural Language Processing.

I’ll be the first to admit that Datalog syntax is horrible, and it is not a reasonable language for developers. There is much to be done before adoption of complex data-centric languages can occur. But what excites me here is that the main positive trend in parallel programming — the one driven by the Industrial Revolution of Data, the one with programmer feet on the street — that trend feeds into this promising new generation of much richer data-centric languages. If MapReduce is the boot camp for a next round of parallel languages, those languages are likely to be data-centric. And there’s growing reason to believe that the data-centric approach will suit a wide range of tasks.

– Joe Hellerstein is a Professor of Computer Science at the University of California, Berkeley. His research focuses on data management and networking.

viagra
free viagra
buy viagra online
generic viagra
how does viagra work
cheap viagra
buy viagra
buy viagra online inurl
viagra 6 free samples
viagra online
viagra for women
viagra side effects
female viagra
natural viagra
online viagra
cheapest viagra prices
herbal viagra
alternative to viagra
buy generic viagra
purchase viagra online
free viagra without prescription
viagra attorneys
free viagra samples before buying
buy generic viagra cheap
viagra uk
generic viagra online
try viagra for free
generic viagra from india
fda approves viagra
free viagra sample
what is better viagra or levitra
discount generic viagra online
viagra cialis levitra
viagra dosage
viagra cheap
viagra on line
best price for viagra
free sample pack of viagra
viagra generic
viagra without prescription
discount viagra
gay viagra
mail order viagra
viagra inurl
generic viagra online paypal
generic viagra overnight
generic viagra online pharmacy
generic viagra uk
buy cheap viagra online uk
suppliers of viagra
how long does viagra last
viagra sex
generic viagra soft tabs
generic viagra 100mg
buy viagra onli
generic viagra online without prescription
viagra energy drink
cheapest uk supplier viagra
viagra cialis
generic viagra safe
viagra professional
viagra sales
viagra free trial pack
viagra lawyers
over the counter viagra
best price for generic viagra
viagra jokes
buying viagra
viagra samples
viagra sample
cialis
generic cialis
cheapest cialis
buy cialis online
buying generic cialis
cialis for order
what are the side effects of cialis
buy generic cialis
what is the generic name for cialis
cheap cialis
cialis online
buy cialis
cialis side effects
how long does cialis last
cialis forum
cialis lawyer ohio
cialis attorneys
cialis attorney columbus
cialis injury lawyer ohio
cialis injury attorney ohio
cialis injury lawyer columbus
prices cialis
cialis lawyers
viagra cialis levitra
cialis lawyer columbus
online generic cialis
daily cialis
cialis injury attorney columbus
cialis attorney ohio
cialis cost
cialis professional
cialis super active
how does cialis work
what does cialis look like
cialis drug
viagra cialis
cialis to buy new zealand
cialis without prescription
free cialis
cialis soft tabs
discount cialis
cialis generic
generic cialis from india
cheap cialis sale online
cialis daily
cialis reviews
cialis generico
how can i take cialis
cheap cialis si
cialis vs viagra
levitra
generic levitra
levitra attorneys
what is better viagra or levitra
viagra cialis levitra
levitra side effects
buy levitra
levitra online
levitra dangers
how does levitra work
levitra lawyers
what is the difference between levitra and viagra
levitra versus viagra
which works better viagra or levitra
buy levitra and overnight shipping
levitra vs viagra
canidan pharmacies levitra
how long does levitra last
viagra cialis levitra
levitra acheter
comprare levitra
levitra ohne rezept
levitra 20mg
levitra senza ricetta
cheapest generic levitra
levitra compra
cheap levitra
levitra overnight
levitra generika
levitra kaufen

Comments

9 Responses to “The Data-Centric Gambit”

  1. anonymous on October 21st, 2008 11:16 am

    Teradata actually predates Gamma and Bubba. Teradata was founded in 1979 and shipped their first commercial system in 4Q1983. Possible Gamma and Bubba research was used in later releases.

  2. Pedro Domingos on October 21st, 2008 5:52 pm

    It gets better. When you couple a declarative language with machine learning from data, the compactness of your programs improves again by orders of magnitude, because you only have to specify the basic structure of the program, and you can let the rest of it “grow” from the data. We’ve developed a language called Markov logic that’s a probabilistic extension of Datalog for this, and a series of learning and inference algorithms for it that are available in the open-source Alchemy package (http://alchemy.cs.washington.edu). Alchemy has been used with spectacular success on a number of very hard problems. For example, we can build a complete information extraction system with an Alchemy program that fits on a single slide.

  3. Jean Richard on October 23rd, 2008 7:40 am

    Since the dawn of humanity (pardon the cliché), r&d have accelerated. Most people assume it goes by a geometric progression from now on instead of continuing as a square law. Curiuosly, Moore’s law of computing devices say the computers will double power every 12 or 15 or 18 months. This is the square law. Computer evolution is only a way to mesure it.

    The real challenge of 40 years ago was to be able to cope with greater amount of data with the same brain we used 10,000 years ago. The Computer helped us there but the actual problem is to be able to find significant data in all this. The answers are partly there with data mining tools, data warehouse, etc. The next problem is to be able to make the next generation of tools for us to be able to develop… new tools. We have new more complex computers but we lack the tool to do the fundamental objective of computing science; adapting these tools to human beings. The problem is even more interresting since it was for computer scientist to adapt the system for users. It’s become adapting the system for ourselves so we can adapt it to users.

    Interresting future ahead…

  4. … BLOG ~ BeLOG ~ Belong … « Data Cronopio on October 29th, 2008 4:24 am

    [...] to guest blog at CCCBlog and [...]

  5. … BLOG ~ BeLOG ~ Belong … « Data Snacks on October 30th, 2008 6:05 am

    [...] to guest blog at CCCBlog and [...]

  6. 2 Blog ~ 2 Belong « Data Beta on November 7th, 2008 12:29 am

    [...] to guest blog at CCCBlog and [...]

  7. Cognizant Transmutaion » The Commoditization of Massive Data Analysis on November 20th, 2008 4:42 am

    [...] calls all of this a renaissance in computer science research and calls for folks to look towards standardizing the upper layers of the Hadoop hierarchy, [...]

  8. Dana Gardner’s BriefingsDirect mobile edition on December 16th, 2008 12:23 pm

    [...] rapidly, even though we have harsh economic times. To provide an in-depth look at how parallelism, modern data infrastructure, and MapReduce technologies come together in the new age, BriefingsDirect’s Dana Gardner recently [...]

  9. MapReduce-scale analytics change BI game as enterprises need to mine ever-expanding data sets - Dana Gardner's BriefingsDirect on December 16th, 2008 12:29 pm

    [...] provide an in-depth look at how parallelism, modern data infrastructure, and MapReduce technologies come together in the new age, BriefingsDirect’s Dana Gardner [...]

Leave a Reply