NIST’s BIG DATA Workshop:
Too Much Data, Not Enough Solutions

June 21st, 2012 by Erwin Gianchandani

“How is the general population of researchers and institutions to meet [the needs of] ‘Big Data’?” That was the question posed last week by Ian Foster, director of the Computation Institute at Argonne National Laboratory, before a packed auditorium at the National Institute of Standards and Technology (NIST) just outside Washington, DC. Foster was delivering one of the keynotes at NIST’s BIG DATA Workshop, a two-day event that assembled leading experts from academia, industry, and government to explore key topics in support of the Federal government’s recently announced $200 million Big Data R&D Initiative.

Foster’s answer?

Accelerate discovery and innovation worldwide by providing “research IT as a service,” Foster argued. In other words, leverage the cloud to give millions of researchers unprecedented access to powerful tools, massively shorten cycle times in time-consuming research processes, and reduce research IT needs.

Along the way, Foster said, focus on new standards and best practices, especially for the handling of sensitive data such as HIPAA-protected data and human transplant data, along with an established end-to-end pipeline of data acquisition, management, and analysis that facilitates the necessary discovery and innovation processes. In this context, the key challenges to be resolved are the distributed implementation of services and the reproducibility of results. After all, as multiple speakers agreed, there is no “one size fits all” solution.

“Let the data drive the research,” echoed Howard Wactlar, division director for Information and Intelligent Systems (IIS) within the National Science Foundation’s (NSF) Computer and Information Science and Engineering (CISE) directorate. Wactlar described the paradigm shift that is taking place before our eyes, from hypothesis-driven research to data-driven discovery. As an example, he urged the audience to consider the incredible success of predictive analytics at online commercial powerhouse Amazon: the company’s purchase recommendation feature relies upon rapid analysis of multiple high-volume data streams, such as when and how often individuals purchase specific items and which interests customers share.

Like others at the workshop, Foster noted that ‘Big Data’ isn’t a new problem: there was an NSF workshop on the topic back in 1997. But the unparalleled size, scale, and complexity (including heterogeneity and noise) of the data being generated across all sectors of science, engineering, and society in recent years are triggering equally unprecedented demand for novel approaches and solutions from government agencies and corporations.

So for “the general population of researchers and institutions” that Foster referenced, where does someone interested in ‘Big Data’ start? Amazon has made free datasets available to the public for exploration, and one can also use Amazon’s pay-for-use DynamoDB system, a NoSQL database. There are also myriad open source platforms that don’t cost a penny, such as HPCC Systems from LexisNexis, essentially a data-intensive supercomputing platform designed for the enterprise to solve big data problems.

To learn more about the workshop, check out the NIST website.

And did you attend, or do you have thoughts about some of the themes described above? Post a comment in the space below and discuss with your colleagues.

(Contributed by Kenneth Hines, CCC Program Associate)