

Artificial Intelligence: The Next Industrial Revolution

In this guest article, Scot Schultz, Sr. Director AI/HPC and Technical Computing at Mellanox Technologies, explores how artificial intelligence is shaping up to launch the next industrial revolution.

Scot Schultz from Mellanox

It has been said that artificial intelligence will create the next industrial revolution, the fourth that modern society has experienced since the dawn of mechanical production and steam power, dated to 1784. Next on the timeline of society’s pivotal transformations came electrical energy and mass production, while the third revolution began around 1969 with electronics and evolved to include the widespread adoption of internet technologies. Today, many agree that the next wave of disruptive technology, blurring the lines between the digital, the physical and even the biological, will be the fourth industrial revolution of AI. The fusion of state-of-the-art computational capabilities, extensive automation and extreme connectivity is already affecting nearly every aspect of society, driving global economics and extending into our daily lives.

“The last 10 years have been about building a world that is mobile-first. In the next 10 years, we will shift to a world that is AI-first,” said Sundar Pichai, CEO of Google, in October 2016.

The most influential technology firms, including Google, Microsoft, Facebook and Amazon, are highlighting their enthusiasm for artificial intelligence.

“It’s hard to overstate,” Amazon CEO Jeff Bezos wrote, “how big of an impact AI is going to have on society over the next 20 years.”

We have all grown accustomed to many of the new terms and elements of modern applied technology, such as 3-D printing, pedestrian detection, lane-departure detection, autonomous vehicles, drones and the IoT (Internet of Things). Even customer service, email spam filters and Siri are examples that many people recognize and use every day, though not necessarily with an awareness of how computer systems already leverage big data, deep learning and advanced automation techniques that affect our daily lives. AI has already taken medical research, autonomous vehicles and homeland security to new levels. Perhaps AI is even already playing a role in your 401k investment strategy?


There are many views and opinions on artificial intelligence, including the fear that AI could evolve beyond our understanding and take over the world or perhaps destroy humanity. AI has even been described as mankind’s biggest threat, perhaps even bigger than the threat of nuclear war. In contrast, most believe we have already embarked on a journey of profound positive social impact, leveraging AI to improve our quality of life, find new cures for life’s most threatening illnesses and gain a deeper understanding of our own evolution. Regardless of your view, artificial intelligence is here to stay, making an impact in nearly every aspect of our lives.


Similar to humans and nearly any evolved biological life-form, artificial intelligence is highly dynamic, rational and aware of its environment, taking one or more actions to maximize its chance of success at a given goal. Unlike humans in many respects, it can focus clearly on statistical data, facts, and predictive and reinforced outcomes at an astonishing rate, and it doesn’t require a good eight hours of shut-eye every day. IBM’s Watson, for example, can process more than 500 gigabytes of information per second, roughly the equivalent of a million books per second.

The applications for artificial intelligence are vast, from image recognition to fully autonomous systems for self-driving vehicles and defense, to advanced weather prediction, personalized medicine, cancer research and more. Common to every use case is data, and lots of it.

So understandably, the ability to move data is ultimately the most important design choice for machine learning systems, and this is where Mellanox plays such a critical role.

The Network is Critical to AI Performance

Mellanox has long been the leader and de facto standard in high performance computing, accelerating the world’s most powerful supercomputers and consistently developing new capabilities that improve the performance of the most challenging simulations for scientific research. But how is this important for more modern workloads, such as deep learning?

As it turns out, high performance computing and artificial intelligence have very similar requirements. Important to both is the ability to move data, exchange messages and computed results from thousands of parallel processes fast enough to keep the compute resources running at peak efficiency. This is where the Mellanox interconnect plays such an important role, for both HPC and AI.

Mellanox’s capabilities are specifically designed around an ‘offload network architecture’, meaning that the interconnect removes the overhead of CPU involvement in network communication, which is critical to achieving scalable performance.

Today’s most popular frameworks, such as TensorFlow, Caffe2, Microsoft’s Cognitive Toolkit and Baidu PaddlePaddle, among many others, are used in the development and deployment of deep learning and cognitive computing applications. They have evolved to be easier to develop with and to deploy into production at scale. What’s more, they natively include support for the advanced network accelerations from Mellanox.

Beyond the core frameworks used to implement AI, additional bolt-on libraries are also used to accelerate AI, and these too are heavily optimized to take full advantage of the underlying Mellanox hardware. The tools used for artificial intelligence adapt easily to advanced GPU accelerators and modern CPUs, and they depend upon the most advanced interconnect solutions from Mellanox to dramatically increase performance for the most challenging AI workloads.

The Challenges of Scalable AI

So, what is meant by scalable performance? One can think of scalable performance in terms of adding processing capability to a problem set or a simulation. One would expect the problem to be solved faster by adding more CPUs or GPUs, but too often, when the network has not been considered, the opposite happens. More parallelization means more communication and data movement between the independent tasks. Such communication between parallel tasks is known as collective communication. With most legacy interconnects, such as Ethernet or OmniPath, the processor, and often the operating system, needs to be involved in all of this communication. Copying data between device buffers and application (user) space is rarely talked about because it is largely hidden from the end user, but in fact it has a significant impact on application performance. Native to the Mellanox interconnect is the ability to perform RDMA (remote direct memory access).
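To make the scaling argument concrete, here is a minimal sketch of a toy cost model, not a measurement of any real system. It assumes compute time divides evenly across workers while communication cost grows linearly with the number of peers; the function names and the default figures (`compute_seconds`, `comm_seconds_per_peer`) are hypothetical and chosen only for illustration.

```python
def time_to_solution(workers, compute_seconds=1000.0, comm_seconds_per_peer=0.5):
    """Toy model: compute shrinks with more workers, communication grows with them.

    All numbers are illustrative assumptions, not benchmark results.
    """
    compute = compute_seconds / workers
    communication = comm_seconds_per_peer * (workers - 1)
    return compute + communication


def scaling_table(worker_counts, **model):
    """Return (workers, modeled seconds) pairs for a list of worker counts."""
    return [(n, time_to_solution(n, **model)) for n in worker_counts]


if __name__ == "__main__":
    for n, seconds in scaling_table([1, 8, 32, 64, 128, 256]):
        print(f"{n:4d} workers -> {seconds:8.2f} s")
```

Under these assumed figures the modeled time falls until a few dozen workers and then climbs again, which is the "opposite effect" described above. Lowering the per-peer communication cost pushes that turning point further out.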


RDMA supports what is known as zero-copy networking by enabling the network adapter to move data directly to or from application memory. This eliminates the involvement of both the operating system and the processor, so it is much faster, enables much higher message rates and is paramount for HPC and AI workloads. One reason legacy interconnects depend on the CPU is that they are designed too cheaply to offer a real hardware implementation of RDMA, so there is no actual CPU or OS bypass. Simply put, all legacy Ethernet and OmniPath interconnect hardware is designed to on-load network communications onto the system’s resources, using software semantics that mimic what Mellanox has been doing in hardware for generations. So sure, they can often claim throughput of 100Gb/s on the wire, but at what expense? You probably guessed already: that inexpensive solution you thought was a great deal now leaves your applications with poor scalability. Why? Because the added parallelism that you thought would shorten the time to solution now brings even more network communication, and the overhead of non-native RDMA networks leaves the software starving for processing resources.

Now that you’re “in the know” about the fundamental importance of a well-designed, intelligent interconnect for artificial intelligence and HPC applications, there is even more that Mellanox has built upon to offer exceptional scalability and performance: more than 25 years of experience in well-thought-through design and state-of-the-art offload capabilities that far exceed what any other interconnect could ever offer.


Mellanox OFED GPUDirect RDMA is another widely adopted capability used to accelerate data movement and to increase performance and scalability for GPU-bound workloads, particularly in AI and HPC. GPUDirect RDMA allows data to move directly to and from remote GPUs over the Mellanox network fabric, again removing the processor and system memory from any involvement in the data movement. It is one of the most popular techniques in both HPC and AI today when scaling beyond a single compute node.

The leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and TensorFlow already make extensive use of NCCL (NVIDIA Collective Communications Library) to deliver near-linear scaling for training on multi-GPU systems. Today, both native RDMA and GPUDirect RDMA are also supported by the popular NCCL 2. NCCL 2 is specifically designed to accelerate collectives: it provides optimized routines on the latest NVIDIA GPUs for all-gather, all-reduce, broadcast, reduce and reduce-scatter, and it natively supports InfiniBand verbs. If you would like a deeper dive, you can watch a great video from NVIDIA’s GTC conference to get a sense of how this powerful library is being used.
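For readers new to these collectives, the following sketch shows what each one computes, using plain Python lists to stand in for the buffer held by each rank. It is a single-process illustration of the semantics only; it is not the NCCL API, it moves no data over a network, and every function name here is hypothetical. It also assumes summation as the reduction and, for reduce-scatter, a buffer length divisible by the number of ranks.

```python
from functools import reduce as fold
from operator import add


def elementwise_sum(buffers):
    """Sum the per-rank buffers position by position."""
    return [fold(add, column) for column in zip(*buffers)]


def broadcast(buffers, root=0):
    """Every rank ends up with a copy of the root's buffer."""
    return [list(buffers[root]) for _ in buffers]


def reduce_to_root(buffers, root=0):
    """Only the root receives the summed buffer; other ranks receive nothing."""
    total = elementwise_sum(buffers)
    return [total if rank == root else None for rank in range(len(buffers))]


def all_reduce(buffers):
    """Every rank receives the full summed buffer."""
    total = elementwise_sum(buffers)
    return [list(total) for _ in buffers]


def all_gather(buffers):
    """Every rank receives the concatenation of all buffers, in rank order."""
    gathered = [value for buf in buffers for value in buf]
    return [list(gathered) for _ in buffers]


def reduce_scatter(buffers):
    """The summed buffer is split into equal chunks, one chunk per rank."""
    total = elementwise_sum(buffers)
    chunk = len(total) // len(buffers)
    return [total[r * chunk:(r + 1) * chunk] for r in range(len(buffers))]


if __name__ == "__main__":
    ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
    print("all_reduce    ", all_reduce(ranks))
    print("reduce_scatter", reduce_scatter(ranks))
    print("all_gather    ", all_gather(ranks))
```

In data-parallel training, all-reduce is typically the operation that combines gradients so that every GPU applies the same update, which is why its speed matters so much as the number of GPUs grows.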


Mellanox SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is another profound acceleration capability for HPC and AI systems. SHARP is the base implementation of today’s in-network computing capabilities and improves the performance of collective operations by executing them at the network switch level. This eliminates the need to send data multiple times between endpoints, so it not only decreases the amount of data traversing the network, it also dramatically reduces the time collective operations take to complete. For AI, message sizes are typically much larger than in traditional HPC communication, and SHARP extends the scalability of deep learning beyond today’s techniques, reducing model training time while allowing a greater degree of accuracy in the model’s representation.
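The general idea of aggregating inside the network can be sketched with a toy comparison. This is not the SHARP protocol and says nothing about how Mellanox switches implement it; it only contrasts a flat reduction, where one endpoint receives and sums every vector, with a hierarchical one, where intermediate nodes (standing in for switches) each sum a small group and forward a single result. The function names and the `radix` parameter are hypothetical.

```python
def flat_reduce(vectors):
    """One endpoint receives every other vector and does all the summing.

    Returns (result, number of vectors that endpoint had to receive).
    """
    result = list(vectors[0])
    received = 0
    for vec in vectors[1:]:
        result = [a + b for a, b in zip(result, vec)]
        received += 1
    return result, received


def tree_reduce(vectors, radix=4):
    """Sum in stages: each intermediate node combines up to `radix` inputs.

    Returns (result, largest number of vectors any single node had to combine).
    """
    if radix < 2:
        raise ValueError("radix must be at least 2")
    level = [list(v) for v in vectors]
    max_fan_in = 1
    while len(level) > 1:
        groups = [level[i:i + radix] for i in range(0, len(level), radix)]
        max_fan_in = max(max_fan_in, max(len(group) for group in groups))
        level = [[sum(column) for column in zip(*group)] for group in groups]
    return level[0], max_fan_in


if __name__ == "__main__":
    data = [[i, 2 * i] for i in range(64)]
    print("flat:", flat_reduce(data))
    print("tree:", tree_reduce(data, radix=4))
```

Both approaches produce the same sum, but in the hierarchical version no single node ever handles more than `radix` vectors, whereas the flat version funnels all of them through one endpoint.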

What’s next?

The past year has already proven to be a game-changer in next-generation performance for artificial intelligence, as the race continues to solve the challenges of scalable deep learning. Most popular deep learning frameworks today can scale to multiple GPUs within a server, but as we have seen, scaling across multiple GPU servers is much more difficult.

This challenge in particular is precisely where Mellanox has been the clear leader, as the only interconnect solution able to deliver the performance and offload capabilities needed to unlock the power of scalable AI. As the network continues to prove to be a pillar of application performance even at 100Gb/s today, it will undoubtedly open new avenues for ground-breaking research and life-changing technologies as we move into the era of 200Gb/s connectivity in 2018.

IBM Research recently announced unprecedented performance and close to ideal scaling with new distributed deep learning software, which achieved record-low communication overhead and 95 percent scaling efficiency on the Caffe deep learning framework with Mellanox InfiniBand across 256 NVIDIA GPUs in 64 IBM Power systems.

With the IBM DDL (Distributed Deep Learning) library, it took just 7 hours to train ImageNet-22K using ResNet-101. Going from 16 days down to just 7 hours not only changes the workflow timescale for data scientists, it changes the game entirely.
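The figures quoted above can be put in perspective with a little arithmetic. The sketch below computes the reduction in training time from 16 days to 7 hours and shows what a 95 percent scaling efficiency would mean for 256 GPUs, under the assumption that efficiency is defined as measured speedup divided by ideal (linear) speedup. The article does not state what hardware the 16-day baseline ran on, so the two results are shown separately rather than combined; the function names are hypothetical.

```python
def speedup(baseline_hours, new_hours):
    """How many times faster the new run is than the baseline."""
    return baseline_hours / new_hours


def scaling_efficiency(measured_speedup, ideal_speedup):
    """Fraction of ideal (linear) speedup actually achieved."""
    return measured_speedup / ideal_speedup


def speedup_at_efficiency(workers, efficiency):
    """Speedup over one worker implied by a given scaling efficiency."""
    return workers * efficiency


if __name__ == "__main__":
    reduction = speedup(16 * 24, 7)
    print(f"16 days -> 7 hours: about {reduction:.1f}x less training time")
    implied = speedup_at_efficiency(256, 0.95)
    print(f"95% efficiency on 256 GPUs: about {implied:.1f}x one GPU")
```

Put another way, a job that used to occupy more than two weeks of calendar time now fits within a working day, which is the change in workflow the paragraph above describes.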

The world is anxiously awaiting the completion of the world’s largest, fastest and smartest supercomputer ever built. Oak Ridge National Laboratory has been working with Mellanox since 2014 to develop Summit, which will deliver more than five times the performance of the Titan system that is currently in use. It has been said that Summit could be the last stop before exascale arrives. While all this is amazing, what is not generally discussed is the profound fact that this will also be the largest system ever deployed for artificial intelligence research.

Expect Mellanox, The Artificial Intelligence Interconnect Company, to continue to innovate and advance the most efficient methods to use and move data, and to shape the future of how data is applied.

Each of Summit’s nodes will contain multiple IBM POWER9 CPUs and NVIDIA GPUs, all connected together with dual-rail Mellanox EDR 100Gb/s InfiniBand. By leveraging the world’s most intelligent interconnect, which delivers the highest throughput, ultra-low latency and in-network computing, this system will advance science, energy and medical discoveries never thought possible.

As we embark on the fourth industrial revolution, we should expect the world to undergo an amazing transformation in how we interact with computers. Autonomous self-driving vehicles, humanitarian research, personalized medicine, homeland security and even seamless interaction as a global society regardless of language or location are just a few of the exciting elements we will experience within our lifetime. This is just the tip of the proverbial iceberg that will advance our knowledge and understanding of our place in the universe for generations to come.

Scot Schultz is Sr. Director AI/HPC and Technical Computing at Mellanox Technologies.

 
