The Data Science with Spark Workshop addresses high-level parallelization for data analytics workloads using the Apache Spark framework. Participants will learn how to prototype with Spark and how to exploit large HPC machines like the Piz Daint CSCS flagship system.
“The work we do involves capturing and analyzing huge environmental data sets so that the government can make informed policy decisions that protect humans and the environment,” said Ron Hines, Associate Director for Health at the EPA’s National Health and Environmental Effects Research Laboratory in Research Triangle Park, N.C. “We have collaborated with the NCDS on some of its initiatives in the past and having a seat at its leadership table will help us connect with leading data researchers, access data resources and infrastructure, and contribute to the development of future NCDS strategies.”
“This talk will describe one new effort to embed best practices for reproducible scientific computing into traditional university curriculum. In particular, a set of open source, liberally licensed, IPython (now Jupyter) notebooks are being developed and tested to accompany a book “Effective Computation in Physics.” These interactive lecture materials lay out in-class exercises for a project-driven upper-level undergraduate course and are accordingly intended to be forked, modified and reused by professors across universities and disciplines.”
“CDSW’s organizers are professional programmers and data scientists and several of us have experience teaching data science in more traditional university and corporate settings. Our talk will describe how “democratized” data science is similar to — and sometimes extremely different from — these more traditional approaches. We will talk about some of the challenges we have faced and highlight some of our most inspirational successes.”