Processing Big Data can be extremely challenging, even for seasoned researchers. This workshop focuses on the whole lifetime of large datasets, from job prep, to jobs, to analysis. To learn more, I caught up with the program coordinator, Fernanda Foertter from OLCF.
insideHPC: What prompted you to put this workshop together?
Fernanda Foertter: Many converging factors. As an HPC User Assistance Specialist at ORNL, I see users struggle with data in numerous ways. When Titan came online, applications improved in both time-to-solution and resolution; in other words, they produced more data. But even before Titan came online, the conversation about the next machine to replace Titan in 2017 had started, along with hardware requirements and specifications for the OLCF’s data analysis and capacity computing clusters. We are learning that users need better solutions to handle large HPC-related datasets. And just this year, INCITE proposals were invited to include information on their data management plans. Understanding needs for data management is more important than ever now that DOE’s Office of Science will require, beginning in September 2013, data management plans to be included in all of the R&D grant proposals they receive. It became clear that we needed a workshop to discuss how various communities have dealt with large scientific data.
insideHPC: Who is the target audience for this workshop and what challenges are they facing in their day jobs?
Fernanda Foertter: Our target audience is our current and future users. And yes, that is a broad group. Sometimes the challenge is bringing data into our site. Other times it’s writing data out from 6,000 nodes (thus taking down our storage), or the data communication overhead that prevents applications from scaling. Analysis has become an issue too: as datasets get large, gaining any meaningful insight becomes time-consuming. We have noticed some communities are not leveraging tools other communities have already created. There’s a gap between emerging large-data communities and those that have been around a while. How to present or share this data with interested parties in a meaningful way is also a problem. Ultimately, the target audience is anyone and everyone struggling with data in HPC, and we would like to understand the needs of this population so we can serve them better.
insideHPC: You are bringing in researchers from places like CERN and JGI with very large datasets. What kinds of best practices will they be sharing?
Fernanda Foertter: I’ve asked them to share anecdotes about large data problems at their institutions. I’ve also asked them to share the pain: what didn’t work and why. Some best practices include leveraging and extending tools built for other purposes, in-situ analysis, use of I/O libraries, and sharding of data to improve shared access.
insideHPC: What do you think will be the biggest takeaway from the workshop for attendees?
Fernanda Foertter: The big takeaway is that it’s important to plan for the lifetime of the data. It’s no longer enough to think just about the science if it’ll take several years to analyze the results. CERN is perhaps the best example of this mindset: it throws away much of the data it generates, because it can’t possibly analyze it all, let alone pay to store it.
insideHPC: Will the event be recorded for those unable to travel?
Fernanda Foertter: The event will be recorded and broadcast via WebEx.
insideHPC: Can you tell us more about Knoxville and the venue for the workshop?
Fernanda Foertter: Knoxville is a great town, with a lot of history and a major university at its heart. We also found during the last Titan training event in February that holding the workshop in Knoxville was more convenient for our attendees. Due to the popularity of the event, we’ve moved the workshop to the Hilton in Knoxville, as it’s very close to great eateries and entertainment. Though I suspect the attendees will hurry back to their rooms to try some of our hands-on tutorials instead!
Registration for this Workshop is now open, so check out the Event Page for more information.