The following is a special contribution to this blog by Carlos Guestrin, who will be joining the faculty of the University of Washington computer science and engineering department this fall. Carlos led the organization of the First GraphLab Workshop on Large-scale Machine Learning in San Francisco, CA.
The scale and complexity of data on the web continue to grow at a tremendous rate. A recent New York Times article compared Big Data to an economic asset for companies, like currency and gold. But to extract value from 6 billion Flickr images, 900 million Facebook users, 24 million Wikipedia articles, or the 72 hours of video uploaded to YouTube per minute, we need machine learning techniques that can scale to these huge datasets. The First GraphLab Workshop on Large-scale Machine Learning, held in San Francisco on July 9th, sought to bring together folks from industry and academia to explore the state of the art on this fundamental challenge.
About 320 participants attended 15 talks and around the same number of demos, covering systems, abstractions, languages and algorithms for large-scale data analysis. GraphLab provides a novel abstraction for large-scale machine learning, which in many applications yields 1-2 orders of magnitude performance improvements over Hadoop. GraphLab is particularly suited to problems with dependencies in the data, which cannot be easily or efficiently separated into independent subproblems, such as collaborative filtering, graph analytics and clustering. In the opening keynote, I announced two new major releases for the GraphLab family. First, I announced the release of GraphLab 2.1, an updated abstraction that significantly increases the scalability of GraphLab, especially in Cloud settings. The second release was GraphChi, which is able to solve web-scale problems on a single personal computer, such as a Mac Mini, through a smart data structure that minimizes random disk accesses. (Following the tradition of animal mascots in Cloud systems, the “lab” in GraphLab corresponds to a “labrador”, while the “chi” in GraphChi refers to a chihuahua.)
The remainder of the talks addressed a number of exciting developments in Cloud-based Big Data analytics. The longer talks included:
- Ted Willke from Intel announced the development of GraphBuilder, a new distributed system using Hadoop to address the crucial gap between unstructured data and the formation of the graph of dependencies in this data.
- Joe Hellerstein from UC Berkeley discussed Bloom, a language for distributed programming. Bloom exploits the CALM principle to minimize the amount of coordination needed in distributed systems.
- Sam Madden from MIT described their Schism system, which takes workloads into account when optimizing database partitioning and replication in the Cloud, and thus improves the scalability of distributed systems.
- Jeff Heer from Stanford presented a number of impressive approaches for Big Data visualization and interactive analysis, including the D3.js visualization library and DataWrangler for data cleaning and transformation.
- Alex Smola, who is currently at Google but on his way to CMU, outlined a set of algorithmic templates for distributed machine learning, including topic modeling and matrix factorization, with an “eventual consistency” approach to parameter learning.
- Jure Leskovec from Stanford presented an exciting sequence of insights in the social analysis of large-scale social networks, including answering the question: “Is the enemy of my enemy my friend?”
- Ted Dunning from MapR discussed an insightful case study in large-scale clustering, along with a scalable single-pass k-Means algorithm and its application to fast neighbor finding.
There were also a number of exciting short talks from Yahoo!, Twitter, Stanford, Netflix, Pandora, IBM, and One Kings Lane.
The energy and excitement of the workshop participants reflected the timeliness of the topic and the need for new approaches for large-scale machine learning and Big Data analytics. Presentations and videos are being posted on the workshop webpage.
Editor’s note: GraphChi is the subject of an article published in the Technology Review this week.