The Data Science with Spark Workshop addresses high-level parallelization for data analytics workloads using the Apache Spark framework. Participants will learn how to prototype with Spark and how to exploit large HPC machines such as Piz Daint, the CSCS flagship system.
“Do you need to compress your software development cycles for services deployed at scale and accelerate your data-driven insights? Are you delivering solutions that automate decision making & model complexity using analytics and machine learning on Spark? Find out how a pre-integrated analytics platform that’s tuned for memory-intensive workloads and powered by the industry leading interconnect will empower your data science and software development teams to deliver amazing results for your business. Learn how Cray’s supercomputing approach in an enterprise package can help you excel at scale.”
“This talk will describe Monotasks, a new architecture for the core of Spark that makes performance easier to reason about. In Spark today, pervasive parallelism and pipelining make it difficult to answer even simple performance questions like ‘what is the bottleneck for this workload?’ As a result, it’s difficult for developers to know what to optimize, and it’s even more difficult for users to understand what hardware to use and what configuration parameters to set to get the best performance.”
“Managing the work on each node can be referred to as Domain parallelism. During the run of the application, the work assigned to each node can be generally isolated from other nodes. The node can work on its own and needs little communication with other nodes to perform the work. The tools that are needed for this are MPI for the developer, but can take advantage of frameworks such as Hadoop and Spark (for big data analytics). Managing the work for each core or thread will need one level down of control. This type of work will typically invoke a large number of independent tasks that must then share data between the tasks.”
“The pharmaceutical industry trend toward joint ventures and collaborations has created a need for new platforms in which to work together. We’ll dive into architectural decisions for building collaborative systems. Examples include how such a platform allowed Human Longevity, Inc. to accelerate software deployment to production in a fast-paced research environment, and how Celgene uses AWS for research collaboration with outside universities and foundations.”
“We took the Aries system interconnect from our supercomputers, the industry-standard architecture of our clusters, the scalable graph engine from the Urika-GD appliance, and the pre-integrated, open infrastructure of our Urika-XA system and combined them into one agile analytics platform. The Urika-GX gives our customers the tool they need to overcome their most advanced analytics challenges today, and the platform to bridge to tomorrow.”
Hadoop and Spark clusters have a reputation for being extremely difficult to install, configure, and tune, but help is on the way. The good folks at Cluster Monkey are hosting a crash course titled Apache Hadoop with Spark in One Day. “After completing the workshop attendees will be able to use and navigate a production Hadoop cluster and develop their own projects by building on the workshop examples.”
In this special guest feature from Scientific Computing World, Andrew Jones from NAG looks ahead at what 2016 has in store for HPC and finds people, not technology, to be the most important issue. “A disconcertingly large proportion of the software used in computational science and engineering today was written for friendlier and less complex technology. An explosion of attention is needed to drag software into a state where it can effectively deliver science using future HPC platforms.”
“What we’re previewing here today is a capability to have an overarching software, resource scheduler and workflow manager that takes all of these disparate sources and unifies them into a single view, making hundreds or thousands of computers look like one, and allowing you to run multiple instances of Spark. We have a very strong Spark multitenancy capability, so you can run multiple instances of Spark simultaneously, and you can run different versions of Spark, so you don’t obligate your organization to upgrade in lockstep.”
Today LBNL announced that a team of scientists from Berkeley Lab’s Computational Research Division has been awarded a grant by Intel to support their goal of enabling data analytics software stacks—notably Spark—to scale out on next-generation high performance computing systems.