Archive for October, 2008

 

CRA and CCC Promote “Research Highlight of the Week”

October 31st, 2008

As recently announced on the Computing Research Policy Blog, the CRA and CCC web sites are now providing a weekly feature called “Computing Research Highlight of the Week.” If you are doing computing research, you are invited to submit your own work for possible inclusion in this weekly feature.

These highlights are designed to provide easily digestible, compelling nuggets of computing research work. Members of Congress, the Administration, and funding agency managers and directors are some of the main audiences for these web pages. We believe the highlights should also prove to be useful for the entire research community. The highlights can be accessed directly, received by email, RSS feed, or even embedded in your own web page.

The current Computing Research Highlight of the Week describes a new algorithm from UCSD researchers that performs route computation in a way that may lead to major improvements in network efficiency. Check it out — it is punchy, informative, and makes good use of some simple graphics while at the same time providing links to the scientific publication and full press release.

So, please submit your own highlights! The response thus far has been very good, and we expect that many people outside of our community, including key decision makers, will make good use of the information.

The Data-Centric Gambit

October 20th, 2008

Things always change fast in computing. But the rate of change seems to be on a major uptick recently. In this post, I want to focus on an accelerating driver of that change, a looming crisis on the horizon, and a surprising link between the two that may have big promise. In the spirit of blog discourse, let’s lay this out in broad strokes:

  • The Industrial Revolution of Data. Today’s world-wide web remains a staggering tribute to the typing abilities of the human race. But even with a growing global population, typists are not a scalable source of bit-production going forward. We are entering an era where the overwhelming majority of information will not be hand-crafted. It will be stamped out by machines: software logs, cameras, microphones, GPS transceivers, sensor networks, RFID readers, and so on. This is inevitable. It has already begun to change the computing marketplace: most organizations of size now realize they can afford to save and mine all their logs, and are looking for inexpensive ways to do so. The startup world has responded with a flurry of parallel database and data analytics companies.
  • The Crisis of the Three C’s: Coders, Clouds and Cores. Meanwhile, it’s no news that software development is far, far too difficult. In his Turing Award talk a decade ago, Jim Gray identified radical improvements in programming among his 13 remaining long-term challenges for computing — alongside passing the Turing test and building Vannevar Bush’s Memex. What’s changed on this front since 1998 is the rapid rise of parallelism that my colleagues have been blogging about here. Cloud computing infrastructure, with its “shared-nothing” clusters of machines, demands parallel and distributed programs today. Manycore architectures will demand parallelism at a finer grain in the next few years. The pressing need for parallel software — and armies of fluent software developers to build it — raises both the difficulty and the stakes of the Grand Challenge that Jim Gray highlighted in 1998.

Given this background, what excites me these days is that the trend may bring some new solutions to the crisis, in a surprisingly organic way. 

For over twenty years, “Big Data” has been a sustained bright spot in parallel computing. SQL has been a successful, massively parallel programming language since the late 1980′s, when Teradata (a survivor that evaded Dave Patterson’s Rolls of the Dead) first commercialized parallel database research from projects like Gamma and Bubba. In recent years, SQL has been joined by Google’s MapReduce framework, which is bringing algorithmicists into massive data processing in a way that SQL never did. Both SQL and MapReduce will likely thrive, and may well converge: two parallel database startup companies recently announcedintegrated implementations of SQL and MapReduce. (Full disclosure: I advise Greenplum, one of those companies.)

SQL and MapReduce programmers do not think much about parallelism. Rather than trying to unravel an algorithm into separate threads, they focus on chopping up sets of input data into pieces, which get pumped through copies of a single sequential program running asynchronously on each processor. In parallel programming jargon, this kind of code is sometimes dismissively referred to as being “embarrassingly parallel”. But very often, the simplest ideas are the most fertile. Programmers “get” these approaches to parallelism. And remember: the Coders are part of the Crisis of the Three C’s, and the key is to make lots of them happy and productive.

But can those programmers lead us anywhere interesting? The most intriguing part of this story is that in the last 5-10 years, the set-oriented, data-centric approach has been gaining footholds well outside of batch-oriented data parallelism. There has been a groundswell of work on “declarative”, data-centric languages for a variety of domain-specific tasks, mostly using extensions of Datalog. These languages have been popping up in networking and distributed systemsnatural language processingcompiler analysismodular roboticssecurity, and video games, among other applications. And they are being proposed for tasks that are not embarrassingly parallel. It turns out that focusing on the data can make a broad class of programs simpler — much simpler! — to express.

Networking and distributed coordination protocols are one good example. They run in parallel, and are themselves a key to cloud services. In our work on Declarative Networking over the last years, we showed that a wide range of network and distributed coordination protocols are remarkably easy to express in a data-centric language. For example, our version of the Chord Distributed Hash Table (DHT) protocol is 47 lines of our Overlog language; the reference implementation is over 10,000 lines of C++. (DHTs are a key component of cloud services like Amazon’s Dynamo.) Meanwhile, graduate students at Harvard prototyped a simple version of the tricky Paxos consensus protocol in an alpha edition of Overlog in 44 rules. (Paxos is a key component in cloud infrastructure like Google’s GFS file system and Chubby lock manager.) We have other examples where our declarative programs are line-for-line translations of pseudo-code from research papers. These are the kinds of scenarios where the quantitative differences are best captured qualitatively. You can print out our Chord implementation on one sheet of paper, take it down to the coffee shop, and figure it out. Doing that with 10,000 lines of C++ would be a superhuman feat of Programmer-Fu, and a big waste of paper.

Machine Learning is another area where data-centric declarative programming seems to help with parallelism and distribution. A group at Stanford pointed to a range of standard machine learning tasks that can be expressed almost trivially as MapReduce programs, without any requirement for parallel programming expertise. More deeply, a number of the fundamental algorithms driving Machine Learning center on “message-passing” algorithms like Belief Propagation and Junction Trees that work on a computational model of explicit dataflow, rather than shared memory. Research in ML over sensornets showed how to overlay that logical communication onto a physical network. And these inference networks — much like DHTs — turn out to be a good fit to Overlog. (Distributed Junction Trees in 39 rules, anyone?) The Dyna language is another good example, with a focus on (currently single-node) Natural Language Processing.

I’ll be the first to admit that Datalog syntax is horrible, and it is not a reasonable language for developers. There is much to be done before adoption of complex data-centric languages can occur. But what excites me here is that the main positive trend in parallel programming — the one driven by the Industrial Revolution of Data, the one with programmer feet on the street — that trend feeds into this promising new generation of much richer data-centric languages. If MapReduce is the boot camp for a next round of parallel languages, those languages are likely to be data-centric. And there’s growing reason to believe that the data-centric approach will suit a wide range of tasks.

– Joe Hellerstein is a Professor of Computer Science at the University of California, Berkeley. His research focuses on data management and networking.

Update on NetSE

October 13th, 2008

One of the visioning activities supported by the CCC is exploring the possibility of a compelling research agenda in the theoretical, experimental, and societal aspects of “network science and engineering” (NetSE). A NetSE Council has been established.  It’s chair, Ellen Zegura, provides this brief status report on the NetSE Council’s activities.

Thanks for the opportunity to update the community on what has been happening recently with the Network Science and Engineering (NetSE) effort, from my perspective as chair of the NetSE Council.

Let me explain my take on NetSE with an anecdote from my Georgia Tech colleague Mike Best based on a recent trip he made to Africa. Mike and his group met with a group of chiefs of the Acholi people in Northern Uganda. This is an area that has suffered through profound conflict and lacks for essentially any communication technology. Mike and his team wanted to engage in participatory design to understand the existing communication needs, unmet needs and requirements, and latent requirements.

They were very cautious not to influence the conversation towards modern communication technologies so they did not mention specific systems. But after about thirty minutes of this exercise one of the chiefs finally stated, “We want the internet. Unless you have something better.”

To me, NetSE is about the potential for something better. That isn’t to take away from how incredible the Internet is, but that success has led to a dependence on an infrastructure that we understand surprisingly little about. Figuring out what “better” means and how we might get there is a challenge that is intellectual, economic, political and social. In other words, hard, but incredibly important.

The last couple of months have been busy for the NetSE community. Five workshops and meetings have taken place since mid-June covering Network Design and X, where X has been Network Science, Societal Values, Theoretical Computer Science, Behavioral Economics, and Network Engineering. The goal of these activities has been to add to all the good work on research opportunities done under the auspices of GENI, but without the yoke of justifying a large facility.

NetSE is shaping up to be strongly disciplinary AND interdisciplinary. There remain major challenges and opportunities in the core disciplines of networking and distributed systems, as well as across disciplines in and out of CISE. For example, technology advances are producing the ability to program all the way down to the photon or RF wavelength. How can and should future networks take advantage of programmability at this extreme? In the interdisciplinary vein, there are important and exciting opportunities at the intersection of human behavior and network behavior. How should home networks be structured so that mere mortals can deploy and manage them?

Over the next couple of months, we will be synthesizing the output of the various activities into a NetSE research agenda that will include recommendations to funding agencies about what is needed to advance the agenda. You can watch for updates on the NetSE page hosted by the CCC at www.cra.org/ccc/netse.php.

Ellen Zegura is Professor and Chair of Computer Science, School of Computer Science, College of Computing, at the Georgia Institute of Technology.