Dec
19
Today’s main message is: Check out http://www.cra.org/ccc/initiatives. Please! And tell your friends and colleagues! (Any reactions or suggestions can be posted here as comments on this article.)
Now, the full story:
The CCC’s mission is “to foster exciting new research visions in the computing community which attract support.” Looking back at what has transpired over the past year, community participation has been tremendous. Many dozens of people have stepped up to propose workshops, make presentations, write articles for this blog, and chip in with thoughtful feedback and ideas. It’s been productive and, well, fun.
Of course, the name of the game is to turn research visions into reality, and one of the core strategies for doing this is to “improve public and policymaker understanding of the importance of computing and computing research in our society”. This seems particularly important right now, as our nation makes a historic transition, hopefully ushering in a new era in the government’s approach to research support.
Without really knowing what kinds of results we would get, we put out a challenge to a small number of people to write very briefly (we asked for two pages) on “computing research initiatives for the 21st century.” What does the new government need to know about the value of computing research? What are some of the most promising and exciting research opportunities in the field? What computing capabilities are critical for the nation today and into the future?
Well, the response has been tremendous. A sample of what we received is now posted on the CCC web site at http://www.cra.org/ccc/initiatives. There are essays on the central role that computing research has in our economy, ideas for research/education infrastructure, “re-envisioning DARPA”, and proposals for research initiatives in personalized medicine, transportation, “big data” computing, computer architecture, networking, cyber-physical systems, and more. WIth this treasure trove of thoughtful inputs, we are now using available channels and the CRA and CCC’s resources to get these noticed by as many policymakers as possible.
We have been so heartened by the response that we are now talking about having a more organized process for soliciting and publishing these sorts of idea-pieces. Stay tuned here and on the CCC web site for more details, some time early in 2009. We’ll also be asking some of the authors of these writeups to post followup discussion pieces on this blog.
Again, thanks for all of your support and participation. 2009 is looking like a truly exciting year.
– Peter Lee and Ed Lazowska
Nov
17
In previous posts on this blog, Berkeley’s David Patterson, Intel’s Andrew Chien, and Microsoft’s Dan Reed presented their views on why research advances are needed to overcome the problems posed by multicore processors. In this piece — the fourth (and possibly final) entry in the series -– Marc Snir from UIUC argues that there are major challenges facing us but yet, the sky is not falling.
–
The CCC blog has published a couple of articles on the multi-core challenge, all emphasizing the difficulty of making parallel programming prevalent and, hence, the difficulty of leveraging multi-core systems in mass markets. The challenge is, indeed, significant and requires important investments in research and development; but, at UPCRC Illinois, we do not believe that the sky is falling.
Parallel programming, as currently practiced, is hard: Programs, especially shared memory programs, are prone to subtle, hard-to-find synchronization bugs and parallel performance is elusive. One can reach two possible conclusions from this situation: It is possible that parallel programming is inherently hard, in which case, indeed the sky is falling. An alternative view is that, intrinsically, parallel programming is not significantly harder than sequential programming; rather, it is hampered by the lack of adequate languages, tools and architectures. In this alternative view, different practices, supported by the right infrastructure, can make parallel programming prevalent.
This alternative, optimistic view is based on many years of experience with parallel programming. While some concurrent code, e.g., OS code, is often hard to write and debug, there are many forms of parallelism that are relatively easy to master: Many parallel scientific codes are written by scientists with limited CS education; the time spent handling parallelism is a small fraction of the time spent developing a large parallel scientific code. Parallelism can be hidden behind an SQL interface and exploited by programmers with little difficulty. Many programmers develop GUI’s that are, in effect, parallel programs, using specialized frameworks. Parallelism can be exposed using scripted object systems such as Squeak Etoys in ways that enable young children to write parallel programs. These examples seem to indicate that it is not parallelism per se that is hard to handle; rather it is the unstructured, unconstrained interaction between concurrent threads that result in code that is hard to understand both from a correctness and performance view, hence hard to debug and tune.
The state-of-the-art in parallel programming is what sequential computing was several decades ago. A major reason for this situation is that parallel programming has been an art exercised by a group of experts whose small population did not justify major investments in programming environments aimed at making their life easier. This reason disappears as parallelism becomes available on all platforms. Furthermore, we can make faster progress now because we understand well the principles it takes to make programming easier — principles such as safety, encapsulation, modularity, or separation of concerns; we also have more experience in developing sophisticated IDE’s.
What will it take to bring these principles of computer science to parallel programming? It will require a broad based attack across the system stack. As has been said in these blogs, we need research in languages, compilers, runtime, libraries, tools, hardware … What has not been said explicitly is that none of these areas are likely to produce the silver bullet on their own. The solution that will work eventually will be one that brings together technologies from all these areas to bear on each other. However, we do not have the luxury of doing this via incremental and reactive changes over decades. The research truly needs to be interdisciplinary and the idea of co-design needs to be internalized. Unfortunately, the mainstream systems community has all but abandoned this mode of research in the last several years. Language researchers are locked into mechanisms that will only be supported by commodity hardware and hardware researchers are locked into a mode that requires supporting the lowest common denominator software. It is imperative that we break out of these shells and get the research community into a mindset that we are truly looking to define a new age of computing — a mindset that nurtures research where a clean system slate is an acceptable starting point.
The sky is not falling, but the ground is shifting rapidly. The multi-core challenge requires a concerted effort of academia and industry to generate new capabilities. We are confident that in the future, as in the past, new capabilities will breed new applications. Multi-core parallelism can be leveraged to develop human-centered consumer products that provide more intelligent and more intuitive interfaces through better graphics and vision, better speech and text processing and better modeling of the user and the environment.
The task of providing better performance is shifting from the hardware to the software. This is an exciting time for Computer Science.
Marc Snir
4323 Siebel Center, 201 N Goodwin, IL 61801
Tel (217) 244 6568
Web http://www.cs.uiuc.edu/homes/snir
Oct
20
The Data-Centric Gambit
Filed Under research horizons | 9 Comments
Things always change fast in computing. But the rate of change seems to be on a major uptick recently. In this post, I want to focus on an accelerating driver of that change, a looming crisis on the horizon, and a surprising link between the two that may have big promise. In the spirit of blog discourse, let’s lay this out in broad strokes:
- The Industrial Revolution of Data. Today’s world-wide web remains a staggering tribute to the typing abilities of the human race. But even with a growing global population, typists are not a scalable source of bit-production going forward. We are entering an era where the overwhelming majority of information will not be hand-crafted. It will be stamped out by machines: software logs, cameras, microphones, GPS transceivers, sensor networks, RFID readers, and so on. This is inevitable. It has already begun to change the computing marketplace: most organizations of size now realize they can afford to save and mine all their logs, and are looking for inexpensive ways to do so. The startup world has responded with a flurry of parallel database and data analytics companies.
- The Crisis of the Three C’s: Coders, Clouds and Cores. Meanwhile, it’s no news that software development is far, far too difficult. In his Turing Award talk a decade ago, Jim Gray identified radical improvements in programming among his 13 remaining long-term challenges for computing — alongside passing the Turing test and building Vannevar Bush’s Memex. What’s changed on this front since 1998 is the rapid rise of parallelism that my colleagues have been blogging about here. Cloud computing infrastructure, with its “shared-nothing” clusters of machines, demands parallel and distributed programs today. Manycore architectures will demand parallelism at a finer grain in the next few years. The pressing need for parallel software — and armies of fluent software developers to build it — raises both the difficulty and the stakes of the Grand Challenge that Jim Gray highlighted in 1998.
Given this background, what excites me these days is that the trend may bring some new solutions to the crisis, in a surprisingly organic way.
For over twenty years, “Big Data” has been a sustained bright spot in parallel computing. SQL has been a successful, massively parallel programming language since the late 1980’s, when Teradata (a survivor that evaded Dave Patterson’s Rolls of the Dead) first commercialized parallel database research from projects like Gamma and Bubba. In recent years, SQL has been joined by Google’s MapReduce framework, which is bringing algorithmicists into massive data processing in a way that SQL never did. Both SQL and MapReduce will likely thrive, and may well converge: two parallel database startup companies recently announcedintegrated implementations of SQL and MapReduce. (Full disclosure: I advise Greenplum, one of those companies.)
SQL and MapReduce programmers do not think much about parallelism. Rather than trying to unravel an algorithm into separate threads, they focus on chopping up sets of input data into pieces, which get pumped through copies of a single sequential program running asynchronously on each processor. In parallel programming jargon, this kind of code is sometimes dismissively referred to as being “embarrassingly parallel”. But very often, the simplest ideas are the most fertile. Programmers “get” these approaches to parallelism. And remember: the Coders are part of the Crisis of the Three C’s, and the key is to make lots of them happy and productive.
But can those programmers lead us anywhere interesting? The most intriguing part of this story is that in the last 5-10 years, the set-oriented, data-centric approach has been gaining footholds well outside of batch-oriented data parallelism. There has been a groundswell of work on “declarative”, data-centric languages for a variety of domain-specific tasks, mostly using extensions of Datalog. These languages have been popping up in networking and distributed systems, natural language processing, compiler analysis, modular robotics, security, and video games, among other applications. And they are being proposed for tasks that are not embarrassingly parallel. It turns out that focusing on the data can make a broad class of programs simpler — much simpler! — to express.
Networking and distributed coordination protocols are one good example. They run in parallel, and are themselves a key to cloud services. In our work on Declarative Networking over the last years, we showed that a wide range of network and distributed coordination protocols are remarkably easy to express in a data-centric language. For example, our version of the Chord Distributed Hash Table (DHT) protocol is 47 lines of our Overlog language; the reference implementation is over 10,000 lines of C++. (DHTs are a key component of cloud services like Amazon’s Dynamo.) Meanwhile, graduate students at Harvard prototyped a simple version of the tricky Paxos consensus protocol in an alpha edition of Overlog in 44 rules. (Paxos is a key component in cloud infrastructure like Google’s GFS file system and Chubby lock manager.) We have other examples where our declarative programs are line-for-line translations of pseudo-code from research papers. These are the kinds of scenarios where the quantitative differences are best captured qualitatively. You can print out our Chord implementation on one sheet of paper, take it down to the coffee shop, and figure it out. Doing that with 10,000 lines of C++ would be a superhuman feat of Programmer-Fu, and a big waste of paper.
Machine Learning is another area where data-centric declarative programming seems to help with parallelism and distribution. A group at Stanford pointed to a range of standard machine learning tasks that can be expressed almost trivially as MapReduce programs, without any requirement for parallel programming expertise. More deeply, a number of the fundamental algorithms driving Machine Learning center on “message-passing” algorithms like Belief Propagation and Junction Trees that work on a computational model of explicit dataflow, rather than shared memory. Research in ML over sensornets showed how to overlay that logical communication onto a physical network. And these inference networks — much like DHTs — turn out to be a good fit to Overlog. (Distributed Junction Trees in 39 rules, anyone?) The Dyna language is another good example, with a focus on (currently single-node) Natural Language Processing.
I’ll be the first to admit that Datalog syntax is horrible, and it is not a reasonable language for developers. There is much to be done before adoption of complex data-centric languages can occur. But what excites me here is that the main positive trend in parallel programming — the one driven by the Industrial Revolution of Data, the one with programmer feet on the street — that trend feeds into this promising new generation of much richer data-centric languages. If MapReduce is the boot camp for a next round of parallel languages, those languages are likely to be data-centric. And there’s growing reason to believe that the data-centric approach will suit a wide range of tasks.
– Joe Hellerstein is a Professor of Computer Science at the University of California, Berkeley. His research focuses on data management and networking.


