July 26-30, 2010
Visit the public course archive for the course schedule, course content, a discussion forum, and more.
Arkansas High Performance Computing Center, University of Arkansas, Fayetteville
Electronic Visualization Laboratory, University of Illinois at Chicago
Indiana University, Bloomington
Institute for Digital Research and Education, University of California, Los Angeles
Michigan State University, East Lansing
Pennsylvania State University, University Park
University of Iowa, Iowa City
University of Minnesota Supercomputing Institute, Minneapolis
University of Notre Dame, Notre Dame, Indiana
University of Texas at El Paso
Humans are generating, sensing, and harvesting massive amounts of digital data, and many of these unprecedentedly large data sets will be archived in their entirety. We find ourselves surrounded by huge volumes of "data at rest," that is, data written once and destined to live forever. Data movement will become the exception rather than the rule.
Digital data owners will control the data distribution channels via "cloud computing" infrastructure where data is unstructured and devoid of schema, begging for semantic metadata, preservation, and curation. The familiar notions of sequential or random access files no longer apply in the cloud. Instead, developers will write code that mines this mass of unstructured data, extracts what is of interest, and then inserts the resulting data subset into a relational database or other structured data store, where it will be analyzed and visualized.
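As a rough illustration of the mine, extract, and load workflow described above, the following Python sketch pulls a structured subset out of free-text lines and stores it in a relational table. Everything in it is an assumption made for illustration: the sample notes, the `station=`/`temp=` line format, the table name `readings`, and the choice of an in-memory SQLite database are hypothetical and are not taken from the course materials.

```python
import re
import sqlite3

# Hypothetical unstructured input: free-text observation notes (invented for illustration).
RAW_NOTES = [
    "2010-07-26 station=A12 temp=21.5C clear skies, light wind",
    "operator note: sensor recalibrated, no reading",
    "2010-07-27 station=B07 temp=19.0C overcast",
    "2010-07-27 station=A12 temp=23.1C",
]

# Pattern for the subset of interest: date, station id, temperature in Celsius.
READING = re.compile(r"(\d{4}-\d{2}-\d{2}) station=(\w+) temp=(-?\d+(?:\.\d+)?)C")


def extract_readings(lines):
    """Yield (date, station, temp_c) tuples from matching lines; skip the rest."""
    for line in lines:
        match = READING.search(line)
        if match:
            date, station, temp = match.groups()
            yield date, station, float(temp)


def load_readings(rows):
    """Insert extracted rows into an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (date TEXT, station TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def mean_temp_by_station(conn):
    """Analyze the structured subset with an ordinary SQL aggregate."""
    cursor = conn.execute(
        "SELECT station, AVG(temp_c) FROM readings GROUP BY station ORDER BY station"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = load_readings(extract_readings(RAW_NOTES))
    for station, mean_temp in mean_temp_by_station(conn):
        print(station, round(mean_temp, 2))
```

Once the subset of interest sits in a table, conventional SQL queries and visualization tools can take over. At real scale, the extraction step would more plausibly run where the data lives rather than in a single local process.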
The disciplines at the forefront of this paradigm shift are astroscience, bioscience, geoscience, and the social sciences. Science communities will learn how to manage this morass of data by refining the techniques pioneered by Google and Facebook and, more importantly, by inventing new techniques that meet the specific demands of their scientific disciplines.
As the computing landscape becomes increasingly data-centric, computational scientists will employ new tools based on new models of computation. In a data-intensive world where the sheer volume of data demands new approaches and techniques, the inclination is to move the computation to the data, a basic theme underlying this course. Called the "fourth paradigm" (after theory, experiment, and computation), data-intensive computing is poised to transform scientific research.
Students will learn about:
Participants will get hands-on programming experience with data-intensive programming models such as MapReduce.
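For readers unfamiliar with MapReduce, here is a minimal single-process sketch of the model in plain Python, using the customary word-count example. It is not the course's own exercise and uses no MapReduce framework. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a real system would distribute these steps across many machines close to where the data is stored.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    docs = ["big data for science", "move the computation to the data"]
    print(word_count(docs))
```

The programmer supplies only the map and reduce functions. Grouping by key, and in a real deployment scheduling and fault handling, are the framework's job, which is what lets the computation run next to the data.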
Instructors:
Geoffrey C. Fox, distinguished scientist and director, Community Grids Lab, Pervasive Technology Institute, Indiana University
Judy Qiu, assistant director, Community Grids Lab, Pervasive Technology Institute, Indiana University
Prerequisites:
Course outline:
NOTE: Students are required to provide their own laptops.