Presenters: Ilkay Altintas, Ph.D.; and Shweta Purawat, Data Science Workflow Specialist, San Diego Supercomputer Center (SDSC)

Date: April 26, 2017




In the Big Data era, valuable information often gets buried in voluminous amounts of data. Scalability is becoming a prerequisite for applications that must process large-scale datasets efficiently. This is where scientific workflows (software applications composed of computational steps and data tools that scale up to run on high-performance computers, distributed environments, or commercial cloud systems) can make the critical difference. Workflows give you confidence in the accuracy of your results. They are science accelerators because they reduce the time to those results.

Participants will learn how to turn their scientific computing applications into scalable workflows by analyzing the available options, techniques, and tools. The focus will be on teaching methodologies for creating efficient, scientifically rigorous, scalable workflow applications. Participants will also learn about Kepler, a comprehensive environment of reusable and extensible components that supports distributed analysis of large-scale data. In particular, participants will learn about:

  • Distributed platforms and systems
  • Cloud and Big Data
  • Scalable workflow tools
  • How to make your science reproducible
  • Kepler tools to build scalable scientific workflows

The Kepler scientific workflow system is an open-source, collaborative platform that serves scientists of all disciplines. Kepler has been used successfully in a wide variety of projects to manage, process, and analyze scientific data. Kepler provides a graphical user interface (GUI) for designing scientific workflows, each of which is a structured set of tasks linked together to implement a computational solution to a scientific problem. Kepler is a powerful and easy-to-use framework to facilitate high-performance, high-throughput, and Big Data applications in scientific workflow systems. Kepler’s modular development approach allows users to build workflows in any domain with minimal effort. Users can leverage Kepler’s workflow composition and management capabilities to deploy algorithms on large-scale distributed platforms. Kepler is continuously upgraded to support the latest Big Data programming paradigms, such as MapReduce, and to enhance its deployment capabilities on modern execution engines such as Hadoop, Spark, and Stratosphere.

The demo session will familiarize the audience with the key features of the open-source Kepler workflow system, enabling them to kick-start their workflow programming journey. We will then explain how workflow systems can help with rapid development of distributed and parallel applications on top of common computing platforms, including NSF XSEDE high-performance computing resources, the Amazon cloud, and Hadoop. We will illustrate this with a Kepler-based Molecular Dynamics workflow that runs on an XSEDE HPC cluster (SDSC Comet). The Molecular Dynamics Computer Aided Drug-Discovery (MDCADD) workflow integrates AMBER software and HPC resources using the Kepler scientific workflow system.

Target Audience:

  • Researchers, developers and graduate students who are interested in learning new computational tools and techniques for scientific computing and data science applications.
  • Researchers, developers and graduate students in any scientific domain who are interested in saving time, optimizing and scaling up their computations to produce results faster, or communicating their research results more effectively.
  • Researchers and graduate students who are responsible for building computational and data science workflows, are evaluating workflow systems as a means to conduct reproducible research, or are curious to learn more about what workflows can help with. The hands-on examples will be selected from a variety of scientific domains, and the content on workflow-driven reproducible science is set up to benefit a larger multidisciplinary audience.

Prerequisites: None.

User Base: Over 1000 scientists across many domains & business personnel aspiring to leverage data science

Software Availability: Free and open source

Software Requirements: Oracle VirtualBox to run a pre-installed virtual machine.

Use Cases:

  • Kepler’s Getting-started demo workflows
  • Molecular Dynamics Computer Aided Drug-Discovery (MDCADD) workflow on XSEDE HPC cluster (SDSC Comet)

Training and Reference Materials:


Ilkay Altintas is SDSC’s Chief Data Science Officer and the founder and director of the Workflows for Data Science (WorDS) Center of Excellence. Since joining SDSC in 2001, Altintas has worked on different aspects of scientific workflows as a principal investigator and in other leadership roles across a wide range of cross-disciplinary NSF, DOE, NIH, and Moore Foundation projects. She is a co-initiator of and an active contributor to the popular open-source Kepler Scientific Workflow System, and a co-author of publications related to computational data science and e-Sciences at the intersection of scientific workflows, distributed computing, bioinformatics, conceptual data querying, and software modeling.

Shweta Purawat is a Data Science Workflow Specialist for the WorDS Center of Excellence at the San Diego Supercomputer Center, UCSD. She develops compute- and data-intensive workflows and actors for distributed platforms that involve cloud and grid environments. She is excited about Cloud Computing, Data Science Workflows, Scientific Workflow design, Large Scale Data Intensive Computing, and Predictive Analytics. Her interest lies in designing core technology tools and applications that act as a catalyst for innovation. She holds an M.Tech. degree from IIT Bombay, India. Prior to joining SDSC, Shweta was a Design Engineer at Intel Corporation.