VSCSE - Virtual School of Computational Science and Engineering

Big Data for Science

July 26–30, 2010

Course Archive

Visit the public course archive for the schedule, course content, a discussion forum, and more.

Sites

Arkansas High Performance Computing Center, University of Arkansas, Fayetteville

Electronic Visualization Laboratory, University of Illinois at Chicago

Indiana University, Bloomington

Institute for Digital Research and Education, University of California, Los Angeles

Michigan State University, East Lansing

Pennsylvania State University, University Park

University of Iowa, Iowa City

University of Minnesota Supercomputing Institute, Minneapolis

University of Notre Dame, Notre Dame, Indiana

University of Texas at El Paso

Humans are generating, sensing, and harvesting massive amounts of digital data, and many of these unprecedentedly large data sets will be archived in their entirety. We find ourselves surrounded by huge volumes of "data at rest," that is, data written once and destined to live forever. Data movement will become the exception rather than the rule.

Digital data owners will control the data distribution channels via "cloud computing" infrastructure, where data is unstructured and devoid of schema, begging for semantic metadata, preservation, and curation. The familiar notions of sequential or random access files no longer apply in the cloud. Instead, developers will write code that mines this mass of unstructured data, extracts what is of interest, and then inserts the resulting data subset into a relational database or other structured data store, where it will be analyzed and visualized.
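To make this extract-then-load pattern concrete, here is a minimal sketch of a map-only Hadoop job, written in Java against the org.apache.hadoop.mapreduce API of that era. The class names, paths, and the "SUPERNOVA" match string are illustrative assumptions, not part of the course materials.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExtractEvents {

  // Map-only job: scan unstructured text in place and keep only the
  // records of interest; a pure filter needs no reduce phase.
  public static class FilterMapper
      extends Mapper<Object, Text, Text, NullWritable> {
    @Override
    protected void map(Object key, Text line, Context context)
        throws IOException, InterruptedException {
      // "SUPERNOVA" is a stand-in for whatever pattern the analysis needs.
      if (line.toString().contains("SUPERNOVA")) {
        context.write(line, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "extract events");
    job.setJarByClass(ExtractEvents.class);
    job.setMapperClass(FilterMapper.class);
    job.setNumReduceTasks(0);                    // filter only, no aggregation
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw data in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // extracted subset
    // The much smaller output can then be bulk-loaded into a relational
    // database for analysis and visualization.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because each mapper runs where its block of the data lives, the full scan never crosses the network; only the much smaller extracted subset moves, which is exactly the "move the computation to the data" theme discussed below.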

The disciplines at the forefront of this paradigm shift are astroscience, bioscience, geoscience, and the social sciences. Science communities will learn how to manage this morass of data by refining the techniques pioneered by Google and Facebook and, more importantly, by inventing new techniques that meet the specific demands of their scientific disciplines.

As the computing landscape becomes increasingly data-centric, computational scientists will employ new tools based on new models of computation. In a data-intensive world where the sheer volume of data demands new approaches and techniques, the inclination is to move the computation to the data, a basic theme underlying this course. Called the "fourth paradigm" (after theory, experiment, and computation), data-intensive computing is poised to transform scientific research.

Students will learn about:

  • The notion of "data at rest" and its impact on data movement and computation
  • The role of cloud infrastructure in data-intensive computing
  • The need for semantic metadata, preservation, and curation of digital data

Participants will get hands-on programming experience with data-intensive programming models such as MapReduce.
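The canonical first exercise in this model is word count: the map phase emits a (word, 1) pair for every token, and the reduce phase sums the counts for each word. The following is a minimal sketch against the Hadoop Java API current at the time of the course; the actual lab exercises may target other platforms (e.g., Twister or Azure, as listed in the outline below).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word; the framework has already
  // grouped all values by key across the cluster.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it would typically be run with something like: hadoop jar wordcount.jar WordCount <input dir> <output dir>.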

Instructors:
Geoffrey C. Fox, distinguished scientist and director, Community Grids Lab, Pervasive Technology Institute, Indiana University

Judy Qiu, assistant director, Community Grids Lab, Pervasive Technology Institute, Indiana University

Prerequisites:

  • Experience working in a Unix environment
  • Experience developing and running scientific codes written in C, C++, Java, or a similar high-level programming language

Course outline:

  • Opening Keynote: Data-intensive Computing
  • Data Movement & Storage
  • Data Mining
  • Semantic Web
  • Keynote: Distributed Data-Parallel Computing
  • Cloud Computing Platforms (e.g., Hadoop, Azure)
  • MapReduce for Big Data
  • Hybrid Approaches to Big Data (e.g., Twister, HadoopDB, Sector/Sphere)
  • MapReduce vs. SQL
  • Performance Considerations
  • Visualization of Large Data Sets
  • Case Studies:
    • Astronomy
    • Bioinformatics
    • Earth Science
  • Hands-on Lab

NOTE: Students are required to provide their own laptops.