Computing Community Consortium Blog

The goal of the Computing Community Consortium (CCC) is to catalyze the computing research community to debate longer range, more audacious research challenges; to build consensus around research visions; to evolve the most promising visions toward clearly defined initiatives; and to work with the funding organizations to move challenges and visions toward funding initiatives. The purpose of this blog is to provide a more immediate, online mechanism for dissemination of visioning concepts and community discussion/debate about them.

Google Releases Data Set for Research

May 18th, 2012 / in research horizons, Research News, resources / by Erwin Gianchandani

From Google’s Research Blog this afternoon:

Google Releases Data Set for Research [image courtesy Google].Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google’s core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas [more after the jump]…


The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data.

We hope that this release will fuel numerous creative applications that haven’t been previously thought of!

Learn more here.

(Contributed by Erwin Gianchandani, CCC Director)

Google Releases Data Set for Research