In this video, Rick Janowski from IBM describes how the company is doubling down on Apache Spark for Big Data analytics.
“At the core of this commitment, IBM plans to embed Spark into its industry-leading Analytics and Commerce platforms, and to offer Spark as a service on IBM Cloud. IBM will also put more than 3,500 IBM researchers and developers to work on Spark-related projects at more than a dozen labs worldwide; donate its breakthrough IBM SystemML machine learning technology to the Spark open source ecosystem; and educate more than one million data scientists and data engineers on Spark.”
insideHPC: What is IBM doing with Apache Spark?
Rick Janowski: IBM has made a major commitment to Apache Spark. We’re seeing a huge amount of interest out there in the Apache community and in the industry, and we’re very much supporting that. Back at the Spark Summit in San Francisco earlier this year, we made a number of major announcements. Among those was the donation of a machine learning platform to the open source community. We set up a Spark Technology Center in San Francisco to provide Apache Spark distributions and contribute to the overall community. We also have plans, within IBM, to put 3,500 developers and researchers to work on Spark projects. And perhaps most excitingly, with our partners we’re talking about training one million data scientists and engineers on Spark over the coming months and years.
insideHPC: I was going to ask, are there one million data scientists out there working? There are a lot of university programs.
Rick Janowski: If there aren’t now, then we plan to make it so [laughter], but yes.
insideHPC: Great. Terrific, because with Apache Spark I’ve got to tell you, as a journalist covering this space, the hits go through the roof whenever we talk about Apache Spark. Are you seeing that kind of interest out there as well?
Rick Janowski: Absolutely. First, in quantitative terms: I was just looking today at the Apache website, and I’m seeing that over the past 12 months there have been almost 10,000 commits to the Apache Spark code base, compared with slightly fewer than 3,000 for Hadoop MapReduce. That reflects the transition we’re seeing from MapReduce to Spark, which is what a lot of this excitement is about. As another data point, I also looked at Google search statistics: at the moment there are four times more searches for Apache Spark than for Apache Hadoop. And then anecdotally, Platform Computing obviously has many customers that we have great relationships with, and we’ve been talking to them about their technology plans and roadmaps. We are seeing a lot of Spark activity. It’s early days yet, and it’s on the rise, but we are seeing a huge amount of research and testing of Spark systems as the next technology platform for big data analytics.
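The MapReduce-to-Spark transition Janowski describes is largely about programming model: Spark expresses jobs as chained in-memory transformations rather than rigid map-and-reduce phases. As a rough illustration only, here is the classic word-count pipeline written in plain Python (no Spark installation assumed), with comments noting the Spark RDD operations each stage corresponds to:

```python
from collections import Counter

def word_count(lines):
    """Plain-Python sketch of the classic Spark word-count pipeline.

    In Spark's RDD API the same pipeline reads roughly:
        sc.parallelize(lines).flatMap(str.split)
          .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    Here each stage is emulated with ordinary Python so the example
    runs without a Spark cluster.
    """
    words = (w for line in lines for w in line.split())  # flatMap
    return dict(Counter(words))                          # map + reduceByKey

print(word_count(["spark beats mapreduce", "spark is fast"]))
# → {'spark': 2, 'beats': 1, 'mapreduce': 1, 'is': 1, 'fast': 1}
```

In actual Spark the same chained style runs distributed across a cluster with intermediate results cached in memory, which is where the performance gap with disk-based MapReduce comes from.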
insideHPC: So I want to ask you about where this is going, because there has to be a really rapid set of deliveries coming out with all these people working on the code. It must be hard to keep up with that.
Rick Janowski: Well, as I said, IBM has 3,500 people working on this, so I can’t represent all of them. But certainly within Platform Computing we have a 20-plus-year track record of managing distributed applications, including big data applications. We’re doing a technology preview of a product coming out by the end of 2015 that will address some of the concerns we see as our customers move Spark from experimental to production environments. Those concerns include, obviously, a whole new tool set to learn, new skills, new workflows, and the whole question of how you integrate Spark into an existing environment. Also, with all of these experimental nodes, you’re getting scattered silos of Spark clusters, and that’s not an optimal way of using an organization’s distributed computing power. And with the activity I’m describing in the Apache open source community, we’re seeing new versions of Spark coming out every several weeks rather than every several months. That becomes a version-control nightmare in the context of an enterprise.
What we’re previewing here today is an overarching software layer, a resource scheduler and workflow manager, that takes all of these disparate resources and unifies them into a single view, making hundreds or thousands of computers look like one and allowing you to run multiple instances of Spark. We have a very strong Spark multitenancy capability, so you can run multiple instances of Spark simultaneously, and you can run different versions of Spark, so you don’t obligate your organization to upgrade in lockstep. If one group wants a newer version and another group wants an older version, that’s okay; we support that. Underlying that, we also incorporate IBM Spectrum Scale FPO, which is built on GPFS technology. It’s a very viable alternative to open source HDFS: it has a smaller footprint, it does not have a single point of failure as HDFS does, and it is POSIX compliant, which HDFS is not. The product also incorporates a full Apache Spark distribution developed by IBM, out of the Spark Technology Center I mentioned.
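To make the multi-version idea concrete, here is a minimal and purely hypothetical sketch in Python (the tenant names, paths, and function are illustrative assumptions, not IBM’s actual product API) of the kind of tenant-to-version routing table a multitenant scheduler maintains so that different groups can run different Spark versions side by side:

```python
# Hypothetical illustration only: a routing table mapping tenants to
# separate Spark installations, so each group launches jobs against
# its own pinned version instead of upgrading in lockstep.
TENANT_SPARK_HOMES = {
    "risk-analytics": "/opt/spark-1.5.1",  # wants the newest release
    "etl-nightly":    "/opt/spark-1.3.1",  # pinned to a validated older release
}

def spark_home_for(tenant, default="/opt/spark-1.4.1"):
    """Return the Spark installation a tenant's jobs should launch against."""
    return TENANT_SPARK_HOMES.get(tenant, default)

print(spark_home_for("risk-analytics"))  # → /opt/spark-1.5.1
print(spark_home_for("new-team"))        # → /opt/spark-1.4.1 (cluster default)
```

The real scheduler of course does far more (resource sharing, workflow management, a unified view of the cluster), but version pinning per tenant is the piece that removes the lockstep-upgrade problem described above.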
insideHPC: We’re at the supercomputing conference, of course. Are the HPC guys glomming onto Spark as well? Are we starting to see that?
Rick Janowski: Yes, we are starting to see that. As elsewhere, it’s early days, but I’ve had conversations here with academics and with industrialists, and I’m actually quite encouraged by the sophistication and the level of interest in solutions that will help them move from the experimental stage to the production stage. So the kind of solution I’m describing here today, a single end-to-end enterprise-grade solution for deploying Spark with IBM services and support, is getting a huge amount of resonance at the show this week.
Learn more in this IBM Webcast with IDC Analyst Carl Olofson.