Nearly a year ago, tech writer John Markoff published a story in The New York Times about Open Source Indicators (OSI), a new program by the federal government’s Intelligence Advanced Research Projects Activity (IARPA) that seeks to automatically collect publicly available data (including Web search queries, blog entries, Internet traffic flows, financial market indicators, traffic webcams, and changes in Wikipedia entries) to understand patterns of human communication, consumption, and movement. According to Markoff:
It is intended to be an entirely automated system, a “data eye in the sky” without human intervention, according to the program proposal. The research would not be limited to political and economic events, but would also explore the ability to predict pandemics and other types of widespread contagion, something that has been pursued independently by civilian researchers and by companies like Google.
This past April, IARPA issued contracts to three research teams, providing funding for potentially up to three years, with continuation beyond the first year contingent upon satisfactory progress. At least two of these contracts are now public:
The first effort — led by Virginia Tech computer scientist Naren Ramakrishnan together with colleagues from the University of Maryland, Cornell University, Children’s Hospital Boston, San Diego State University, the University of California at San Diego, Indiana University, CACI International Inc., and Basis Technology — is called early model-based event recognition using surrogates, or EMBERS:
Surrogates are accessible pieces of information that mirror or precede events of interest. The team intends to organize a huge database of surrogates predictive of real events and to apply these surrogates to public data sources.
The focus of the IARPA program is on Latin American countries. A key theme in the EMBERS project is the use of models to capture population-level behavioral changes in these countries. Tracking or identifying individuals is strictly excluded from the research.
“The models must be expressive enough to capture many important behaviors. For instance, how many people and what other factors result in a protest becoming violent? When do a few reported cases of dengue fever become an epidemic? But we do not want a model that is so complex that it becomes intractable. So finding the right balance is important,” said Madhav Marathe, professor of computer science … at the Virginia Bioinformatics Institute, and EMBERS co-investigator.
The Virginia Tech-led effort is funded as a potential $13.36 million three-year contract.
A second contract went to BBN Technologies and, according to BBN, key elements of its approach are:
- Multi-Source Data Exploitation: We analyze a wide variety of data sources that include social media (Twitter, Facebook, YouTube, blogs), Internet search terms, and structured sources (e.g. WorldBank, financial, and NGOs).
- High-Throughput Multimodal Feature Extraction: We apply BBN’s language-independent topic and sentiment analysis technologies to create a “semantic time series,” which is a compact representation of evolving event precursors over time. We additionally extract features from social media metadata (e.g. Twitter hashtags), and from a wide variety of structured data sources (e.g. financial data).
- Causality and Time Series Analysis: We fuse information over different modalities using techniques such as fast cross-correlation across time series, hierarchical clustering of correlated features for dimensionality reduction, and feature aggregation to discover signals in noisy data.
- Predictive Modeling: We are applying a suite of prediction models, including logistic regression and novel dynamic graphical models. Formal probabilistic reasoning is used to fuse knowledge derived from human experts with automatically discovered features and patterns.
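To make the idea of cross-correlation across time series a little more concrete, here is a minimal sketch in Python. It is not BBN's method: the function name `lagged_correlation`, the synthetic data, and the choice of Pearson correlation are all assumptions made for illustration. It simply asks at which time lag a candidate precursor series lines up best with a later target series.

```python
import numpy as np


def lagged_correlation(precursor, target, max_lag):
    """Return {lag: Pearson r}, comparing precursor[t] with target[t + lag].

    Hypothetical helper for illustration; both series must have equal length.
    """
    precursor = np.asarray(precursor, dtype=float)
    target = np.asarray(target, dtype=float)
    scores = {}
    for lag in range(max_lag + 1):
        # Drop the last `lag` points of the precursor and the first `lag`
        # points of the target so the two slices stay aligned.
        x = precursor[: len(precursor) - lag]
        y = target[lag:]
        scores[lag] = float(np.corrcoef(x, y)[0, 1])
    return scores


# Synthetic example (made-up data): the target is the precursor delayed by
# three time steps, plus a small amount of noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
true_lag = 3
delayed = np.concatenate([np.zeros(true_lag), signal[:-true_lag]])
target = delayed + 0.1 * rng.normal(size=200)

scores = lagged_correlation(signal, target, max_lag=7)
best_lag = max(scores, key=scores.get)
print("best lag:", best_lag)
```

In a setting like the one described above, the precursor might be a daily count of some topic in social media and the target a daily count of reported events; a strong correlation at a positive lag would suggest, but not prove, that the first series carries early signal about the second.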
According to IARPA program manager Jason Matheny:
“Research shows that many significant societal events are preceded by population-level changes in communication, consumption, and movement. Some of these changes may be indirectly observable from diverse, publicly available data, but few methods have been developed for anticipating or detecting unexpected events by fusing such data. OSI’s methods, if proven successful, could provide early warnings of emerging events around the world.”
Each OSI research team is required to issue a number of warnings or alerts, which will be judged on three criteria: lead time, or how early the alert was made; accuracy, such as whether the alert got the where, when, and what right; and the probability associated with the alert, for example, high vs. very high.
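The three judging criteria can be sketched in code. The following Python example is purely illustrative: the program's actual scoring rules are not given here, so the `Alert` and `Event` classes, the one-day date tolerance, the equal weighting of where/when/what, and the quadratic probability score are all assumptions, and the alert and event values are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Alert:
    issued: date          # when the warning was made
    predicted_date: date  # the "when"
    location: str         # the "where"
    event_type: str       # the "what"
    probability: float    # stated confidence, between 0 and 1


@dataclass
class Event:
    occurred: date
    location: str
    event_type: str


def score_alert(alert, event, date_tolerance_days=1):
    """Score one alert against the event it was matched to (hypothetical rules)."""
    # Lead time: how many days before the event the alert was issued.
    lead_time_days = (event.occurred - alert.issued).days
    # Accuracy: fraction of where/when/what that the alert got right.
    matches = [
        alert.location == event.location,
        abs((alert.predicted_date - event.occurred).days) <= date_tolerance_days,
        alert.event_type == event.event_type,
    ]
    accuracy = sum(matches) / len(matches)
    # Probability: quadratic (Brier-style) score, given that the event occurred.
    probability_score = 1.0 - (1.0 - alert.probability) ** 2
    return {
        "lead_time_days": lead_time_days,
        "accuracy": accuracy,
        "probability_score": probability_score,
    }


# Placeholder values, not real alerts or events.
alert = Alert(date(2012, 6, 1), date(2012, 6, 10), "City A", "protest", 0.9)
event = Event(date(2012, 6, 11), "City A", "protest")
result = score_alert(alert, event)
print(result)
```

A fuller evaluation would also have to penalize false alarms and missed events, which this sketch does not attempt.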
(Contributed by Erwin Gianchandani, CCC Director)