NPR’s Diane Rehm Show on Monday featured an hour-long discussion among several thought leaders — titled “The New World of Massive Data Mining” — about the Federal government’s new Big Data R&D Initiative:
Every time you go on the Internet, make a phone call, send an email, pass a traffic camera or pay a bill, you create [electronic data]. In all, 2.5 quintillion bytes of data are created each day. This massive pile of information from all sources is called “Big Data.” It gets stored somewhere, and every day the pile gets bigger. Government and industry are finding new ways to analyze it. Last week the administration announced an initiative to aid the development of Big Data computing. A panel of experts join guest host Tom Gjelten to discuss the opportunities — for business, science, medicine, education, and security … but also the privacy concerns.
Among the discussants:
- Suzanne Iacono, co-chair, Federal Big Data Senior Steering Group, and senior science adviser, Directorate for Computer and Information Science and Engineering (CISE) at the National Science Foundation (NSF);
- Daphne Koller, professor of computer science, Stanford University Artificial Intelligence Laboratory;
- John Villasenor, senior fellow at the Brookings Institution and professor of electrical engineering at University of California, Los Angeles; and
- Michael Leiter, senior counselor, Palantir Technologies, and former director, National Counterterrorism Center.
Leiter laid out the challenge:
It’s not just the volume of the data… [but] it’s also the speed with which it’s coming in, and also the variety of forms of that data. It can be text, it can be weblog records, it can be video, it can be pictures — all of that data becomes more and more overwhelming. And the difficulty of course is trying to stay in front of that — trying to make sure you know what you have and how different pieces within different data sets are correlated with one another…
It requires, first of all, integrating that data — it’s not just looking at one stovepipe of information; it’s comparing one source of information with other sources and seeing where there are correlations that are meaningful. Second, it’s being able to do so in a very flexible, agile way, so a human being can manipulate and play with that data… You’re not just relying on a set of algorithms that supposedly spit out an answer; [rather] people can crawl through that data and identify what is meaningful, test hypotheses, and then look in other areas.
Iacono described the interests of the U.S. government in Big Data:
“Several years ago, the science leaders in the Administration recognized that ‘Big Data’ really is the next big thing. And by that I mean, it’s high time that we make significant investments to do just what you’re saying — really get our arms around the Big Data and really make a difference to the country. There are obvious impacts economically.
“We’re seeing a huge transformation in science from small data science to big data science, and we have opportunities to address national challenges like clean energy and cyberlearning in completely new ways that we’ve never thought about before. But the challenges are really great… we’ve got to be able to integrate these heterogeneous databases to really be able to make a difference.
“So about a year ago, under the auspices of the National Science and Technology Council, the Office of Science and Technology Policy chartered … a Big Data Senior Steering Group to go about and get a research and education agenda in place.”
“There’s research and development going on in private firms and also in academia, but in the private firms, the research and development is mostly the ‘D’ part that’s going on — they’re trying to develop and engineer products that they’re going to put out in the market. The government, however, takes a much longer term view and really is investing in the long-term research that’s going to enable discoveries that are going to matter years down the road and ends up in the kinds of products that we can’t even imagine today.
“For example, emergency preparedness and public safety… Imagine being able to predict a plume — some kind of nuclear disaster — and being able to integrate that with census data and which kids are at school, and being able to develop an evacuation plan that actually allows for people to get out safely and [as a result] reduce deaths…
“We cannot do that today. We do not have the underlying tools and techniques to make heterogeneous data seem more homogeneous so that there can be this kind of an evacuation plan done on the fly in real time.”
…and the implications of Big Data on privacy and security:
“But today, there’s a whole new field called privacy by design. And what we’re interested in as computer scientists is trying to figure out what it is that people want and building it into our systems right from the get-go. So we’re not waiting until after the system hits the streets and people are using it to figure out, ‘Oh no, my privacy has been invaded’. Let’s figure that out when we start to design our systems, figure [it] out all during the production of these systems — in the programming languages that are used, in the policies that are embedded. Let’s figure out what those systems should look like so that they actually help people maintain their information privacy and make choices.”
The discussion spanned a variety of areas that stand to benefit from advances in core tools and techniques for Big Data — e.g., data mining, machine learning, and predictive modeling — including health and wellness, intelligence and national security, national defense, and education. For example, Koller described Stanford’s experiments in online and personalized education:
“At Stanford, in the fall, we initiated a project of massive online education in which three of the Stanford classes, initially just in the computer science department, were provided to students anywhere around the world for free. An important difference between this and previous efforts is that this was not just video modules that provide the students with content but were also integrated with a significant amount of online assessment that allows students to practice the material, achieve mastery, and move on, and ultimately at the end attain what’s called ‘a statement of accomplishment’ indicating that they really did master the material in the course…
“Admittedly, one could make use of perhaps smaller data sets, but I think the ability to track student behavior over very large numbers provides unique opportunities… For each of those classes, we had an enrollment of about 100,000 students or higher. And that gives you numbers that allow you to detect patterns that otherwise you would never be able to find.
“So for example, in one of the assignments in one of those courses, there was a case where 5,000 students submitted the exact same wrong answer. So the teaching assistants looked at what the answer was and they realized that the students had inverted the order of two steps that they had to use in one of the algorithms that they had to implement. And that allowed us to detect this misconception, provide the students with a targeted error message, as well as realize that perhaps one could have explained that material a little bit better the first time to avoid that misconception in the first place.
“Now if you were teaching a class to 200 people and five of them would have gotten the exact same wrong answer, nobody would have noticed.
“So the availability of these really large amounts of data provides us with insights into how people learn, what they understand, what they don’t understand, what are the factors that cause some students to get it and others not, that is unprecedented, I think, in the realm of education.”
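The pattern Koller describes, many students converging on the same wrong answer, can be illustrated with a simple frequency count. The sketch below is not Stanford's actual pipeline, which the discussion did not describe; the function name `common_wrong_answers`, the `min_count` threshold, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def common_wrong_answers(submissions, correct_answer, min_count):
    """Return (answer, count) pairs for wrong answers seen at least min_count times.

    Hypothetical helper: tallies every incorrect submission and keeps only
    the answers that recur often enough to suggest a shared misconception.
    """
    wrong = Counter(ans for ans in submissions if ans != correct_answer)
    return [(ans, n) for ans, n in wrong.most_common() if n >= min_count]


# Made-up data: six correct answers, one wrong answer repeated five times,
# and three one-off mistakes.
submissions = ["42"] * 6 + ["24"] * 5 + ["7", "13", "99"]

flagged = common_wrong_answers(submissions, correct_answer="42", min_count=3)
print(flagged)  # [('24', 5)]
```

With only a handful of submissions, a repeated mistake is hard to tell apart from noise. At an enrollment of 100,000, the same tally can surface a cluster of thousands, which is the point Koller makes about scale.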
As Villasenor noted, “There’s almost no application [area] either in government or in the private sector that can’t benefit from some of this Big Data.”
(Contributed by Erwin Gianchandani, CCC Director)