Nov
4
We’d like your help with a brainstorming exercise: Identify about a dozen game-changing advances from computing research conducted in the past 20 years. Here’s what we mean:
- The advance needs to be “game changing,” in the sense of dramatically altering how we think about computing and its applications.
- The importance of the advance needs to be obvious and easily appreciated by a wide audience.
- There needs to be a clear tie to computing research (or to infrastructure initiatives that build upon research and were sponsored by computing research organizations).
- We’re particularly interested in highlighting the impact of federally-funded university-based research.
We’re focusing on work carried out in the past 20 years or so, in part because of the upcoming 20-year celebrations for the CISE directorate at NSF. Of course, lots of great fundamental research can take more than 20 years before the impact becomes obvious, but even in such cases there is usually continuing influences on more recent research that can be cited here.
To get your juices flowing, here are four game-changers that we definitely think belong on the list. Use these to think about others that belong on the list, or feel free to argue with our choices.
The Internet and the World Wide Web as we know them today
In 1988 — 20 years ago — ARPANET became NSFNET. At the time, there were only about 50,000 hosts spread across only about 150 networks. In 1989, CNRI connected MCImail to the Internet — the first “commercial use.” In 1992, NCSA Mosaic triggered the explosive growth of the World Wide Web. In 1995, full commercialization of the Internet was achieved, with roughly 6,000,000 hosts spread across roughly 50,000 networks. Today, there are more than half a billion Internet hosts, and an estimated 1.5 billion Internet users.
While many of the underlying technologies (digital packet switching, ARPANET, TCP/IP) predate the 20-year window, the transition from the relatively closed ARPANET to the totally open Internet and World Wide Web as we know them today falls squarely within that window. NSF-supported contributions included CSnet, NSFNET, and NCSA Mosaic.
The Internet and the World Wide Web are game-changers.
Where once we filed, today we search
The vast majority of the world’s information is available online today, and we find what we need — whether across the continent or on our own personal computer — by searching, rather than by organizing the information for later retrieval.
Research on the retrieval of unstructured information is based on decades of fundamental research in both computer science theory and AI. But the paradigm shift that is web crawling and indexing and desktop search is much more recent. It traces its roots to university projects such as WebCrawler, MetaCrawler, Lycos, Excite, Inktomi, and the NSF Digital Libraries Initiative research which begat Google.
Search is a game-changer.
Cluster computing
At the risk of offending our many computer architect friends, we’re going to assert that cluster computing is the most significant advance in computer architecture in the past 20 years.
A decade ago, Jeff Bezos was featured in magazine advertisements for the DEC AlphaServer, because that’s what Amazon.com ran on — the biggest shared-memory multiprocessor that could be built. Similarly, the AltaVista search engine was designed to showcase the capabilities of big SMP’s with 64-bit addressing.
Today, this seems laughable. Companies such as Google and Amazon.com replicate and partition applications across clusters of tens of thousands of cheap commodity single-board computers, using a variety of software techniques to achieve reliability, availability, and scalability.
The notion of hardware “bricks” probably can be traced to Inktomi, a byproduct of the Berkeley Networks of Workstations project. The software techniques are drawn from several decades of research on distributed algorithms.
Cluster computing is a game-changer.
The transformation of science via computation
The traditional three legs of the scientific stool are theory, experimentation, and observation. In the past 20 years, computer simulation has joined these as a fundamental approach to science, driven largely by the NSF Supercomputer Centers and PACI programs. Entire branches of physics, chemistry, astronomy, and other fields have been transformed.
Today, a second transformation is underway — a transformation to data-centered eScience, which requires semi-automated discovery in enormous volumes of data using techniques such as data mining and machine learning, much of which is based on years of basic research in statistics, optimization theory, and algorithms.
Computational science is a game-changer.
Some non-inclusions
Quantum computing. There is huge potential here, but the impact hasn’t been felt yet.
Simultaneous multithreading. We claim that this, and many other important advances in computer architecture, are dominated by cluster computing. (Remember, we’re trying to be provocative here! Blame Dave Ditzel, who put this idea into Ed’s head.)
Your part goes here!
What’s your reaction to the four game-changers that we’ve identified? Do you agree that they belong on the list? If not, why not? If so, what do you think were the principal components of each — the key contributing research results?
Even more importantly, give us eight more! What are your nominees for game-changing advances from computing research conducted in the past 20 years?
Give us your thoughts!
– Ed Lazowska and Peter Lee
Oct
31
As recently announced on the Computing Research Policy Blog, the CRA and CCC web sites are now providing a weekly feature called “Computing Research Highlight of the Week.” If you are doing computing research, you are invited to submit your own work for possible inclusion in this weekly feature.
These highlights are designed to provide easily digestible, compelling nuggets of computing research work. Members of Congress, the Administration, and funding agency managers and directors are some of the main audiences for these web pages. We believe the highlights should also prove to be useful for the entire research community. The highlights can be accessed directly, received by email, RSS feed, or even embedded in your own web page.
The current Computing Research Highlight of the Week describes a new algorithm from UCSD researchers that performs route computation in a way that may lead to major improvements in network efficiency. Check it out — it is punchy, informative, and makes good use of some simple graphics while at the same time providing links to the scientific publication and full press release.
So, please submit your own highlights! The response thus far has been very good, and we expect that many people outside of our community, including key decision makers, will make good use of the information.
Oct
20
The Data-Centric Gambit
Filed Under research horizons | 7 Comments
Things always change fast in computing. But the rate of change seems to be on a major uptick recently. In this post, I want to focus on an accelerating driver of that change, a looming crisis on the horizon, and a surprising link between the two that may have big promise. In the spirit of blog discourse, let’s lay this out in broad strokes:
- The Industrial Revolution of Data. Today’s world-wide web remains a staggering tribute to the typing abilities of the human race. But even with a growing global population, typists are not a scalable source of bit-production going forward. We are entering an era where the overwhelming majority of information will not be hand-crafted. It will be stamped out by machines: software logs, cameras, microphones, GPS transceivers, sensor networks, RFID readers, and so on. This is inevitable. It has already begun to change the computing marketplace: most organizations of size now realize they can afford to save and mine all their logs, and are looking for inexpensive ways to do so. The startup world has responded with a flurry of parallel database and data analytics companies.
- The Crisis of the Three C’s: Coders, Clouds and Cores. Meanwhile, it’s no news that software development is far, far too difficult. In his Turing Award talk a decade ago, Jim Gray identified radical improvements in programming among his 13 remaining long-term challenges for computing — alongside passing the Turing test and building Vannevar Bush’s Memex. What’s changed on this front since 1998 is the rapid rise of parallelism that my colleagues have been blogging about here. Cloud computing infrastructure, with its “shared-nothing” clusters of machines, demands parallel and distributed programs today. Manycore architectures will demand parallelism at a finer grain in the next few years. The pressing need for parallel software — and armies of fluent software developers to build it — raises both the difficulty and the stakes of the Grand Challenge that Jim Gray highlighted in 1998.
Given this background, what excites me these days is that the trend may bring some new solutions to the crisis, in a surprisingly organic way.
For over twenty years, “Big Data” has been a sustained bright spot in parallel computing. SQL has been a successful, massively parallel programming language since the late 1980’s, when Teradata (a survivor that evaded Dave Patterson’s Rolls of the Dead) first commercialized parallel database research from projects like Gamma and Bubba. In recent years, SQL has been joined by Google’s MapReduce framework, which is bringing algorithmicists into massive data processing in a way that SQL never did. Both SQL and MapReduce will likely thrive, and may well converge: two parallel database startup companies recently announcedintegrated implementations of SQL and MapReduce. (Full disclosure: I advise Greenplum, one of those companies.)
SQL and MapReduce programmers do not think much about parallelism. Rather than trying to unravel an algorithm into separate threads, they focus on chopping up sets of input data into pieces, which get pumped through copies of a single sequential program running asynchronously on each processor. In parallel programming jargon, this kind of code is sometimes dismissively referred to as being “embarrassingly parallel”. But very often, the simplest ideas are the most fertile. Programmers “get” these approaches to parallelism. And remember: the Coders are part of the Crisis of the Three C’s, and the key is to make lots of them happy and productive.
But can those programmers lead us anywhere interesting? The most intriguing part of this story is that in the last 5-10 years, the set-oriented, data-centric approach has been gaining footholds well outside of batch-oriented data parallelism. There has been a groundswell of work on “declarative”, data-centric languages for a variety of domain-specific tasks, mostly using extensions of Datalog. These languages have been popping up in networking and distributed systems, natural language processing, compiler analysis, modular robotics, security, and video games, among other applications. And they are being proposed for tasks that are not embarrassingly parallel. It turns out that focusing on the data can make a broad class of programs simpler — much simpler! — to express.
Networking and distributed coordination protocols are one good example. They run in parallel, and are themselves a key to cloud services. In our work on Declarative Networking over the last years, we showed that a wide range of network and distributed coordination protocols are remarkably easy to express in a data-centric language. For example, our version of the Chord Distributed Hash Table (DHT) protocol is 47 lines of our Overlog language; the reference implementation is over 10,000 lines of C++. (DHTs are a key component of cloud services like Amazon’s Dynamo.) Meanwhile, graduate students at Harvard prototyped a simple version of the tricky Paxos consensus protocol in an alpha edition of Overlog in 44 rules. (Paxos is a key component in cloud infrastructure like Google’s GFS file system and Chubby lock manager.) We have other examples where our declarative programs are line-for-line translations of pseudo-code from research papers. These are the kinds of scenarios where the quantitative differences are best captured qualitatively. You can print out our Chord implementation on one sheet of paper, take it down to the coffee shop, and figure it out. Doing that with 10,000 lines of C++ would be a superhuman feat of Programmer-Fu, and a big waste of paper.
Machine Learning is another area where data-centric declarative programming seems to help with parallelism and distribution. A group at Stanford pointed to a range of standard machine learning tasks that can be expressed almost trivially as MapReduce programs, without any requirement for parallel programming expertise. More deeply, a number of the fundamental algorithms driving Machine Learning center on “message-passing” algorithms like Belief Propagation and Junction Trees that work on a computational model of explicit dataflow, rather than shared memory. Research in ML over sensornets showed how to overlay that logical communication onto a physical network. And these inference networks — much like DHTs — turn out to be a good fit to Overlog. (Distributed Junction Trees in 39 rules, anyone?) The Dyna language is another good example, with a focus on (currently single-node) Natural Language Processing.
I’ll be the first to admit that Datalog syntax is horrible, and it is not a reasonable language for developers. There is much to be done before adoption of complex data-centric languages can occur. But what excites me here is that the main positive trend in parallel programming — the one driven by the Industrial Revolution of Data, the one with programmer feet on the street — that trend feeds into this promising new generation of much richer data-centric languages. If MapReduce is the boot camp for a next round of parallel languages, those languages are likely to be data-centric. And there’s growing reason to believe that the data-centric approach will suit a wide range of tasks.
– Joe Hellerstein is a Professor of Computer Science at the University of California, Berkeley. His research focuses on data management and networking.


