In this video, Chris Fougner from Baidu Research describes how the company is accelerating Deep Neural Networks with GPU clusters.
“Deep neural networks are increasingly important for powering AI-based applications like speech recognition. Baidu’s research shows that adding GPUs to the data center makes deploying big deep neural networks practical at scale. Deep learning based technologies benefit from batching user requests in the data center, which requires a different software architecture than traditional web applications.”
We caught up with Chris Fougner to learn more.
insideHPC: Deep neural networks are increasingly important for powering AI-based applications like speech recognition. Is this work purely research, or is Baidu already using this technology for everyday computing?
Chris: Baidu is heavily invested in deep neural network technologies. We’re currently using them in applications such as speech recognition, image search and medical diagnostics, which are deployed to millions of users.
insideHPC: We tend to think of speech recognition in terms of customer service telephone systems and services like Siri. Will GPUs deployed at data center scale enable new types of applications, or just the ability to service more customers?
Chris: Placing GPUs in data centers allows us to deploy larger neural network models, at lower latencies, more efficiently. We’ve mostly focused on translating these gains into more accurate results for our users at lower cost for Baidu. That being said, deep learning is finding uses in more and more applications, and the trend has been to use larger networks. I believe GPUs will make it easier to transition new types of applications from a research setting into data centers.
insideHPC: What metrics do you use to measure scale?
Chris: Baidu has hundreds of millions of users, so when we talk about deploying AI at scale, we mean reaching all of them.
insideHPC: What are the challenges of resource allocation for GPUs in a batch environment in the cloud?
Chris: The challenges aren’t as much about resource allocation as about maximizing resource utilization. The typical software architecture for deploying web applications is poorly suited to high-throughput computation like deep neural networks. There’s tension between maximizing throughput and keeping end-user latency low, and figuring out how to do both within the context of a web-scale application is a challenge. In order to do this, we’ve had to employ batching, reduced-precision arithmetic, and custom matrix kernels for deployment (see http://svail.github.io for details).
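To make the throughput-versus-latency tension concrete, here is a minimal Python sketch of one possible batching policy. It is not Baidu's implementation: the names `Request` and `form_batches` and the parameters `max_batch_size` and `max_wait_ms` are assumptions made up for illustration. A batch is closed either when it is full or when the next request arrives more than `max_wait_ms` after the oldest request already waiting in it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    user_id: str       # hypothetical identifier for the caller
    arrival_ms: float  # arrival time in milliseconds


def form_batches(requests: List[Request],
                 max_batch_size: int,
                 max_wait_ms: float) -> List[List[Request]]:
    """Group requests into batches, offline, from recorded arrival times.

    A batch is closed when it reaches max_batch_size, or when the next
    request arrives more than max_wait_ms after the batch's oldest request.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        # The oldest waiting request has hit its latency budget: dispatch.
        if current and req.arrival_ms - current[0].arrival_ms > max_wait_ms:
            batches.append(current)
            current = []
        current.append(req)
        # The batch is full: dispatch immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 1, 2, 3, 20, 21]
requests = [Request(f"user{i}", t) for i, t in enumerate(arrivals)]
batches = form_batches(requests, max_batch_size=3, max_wait_ms=5)
print([[r.user_id for r in b] for b in batches])
# [['user0', 'user1', 'user2'], ['user3'], ['user4', 'user5']]
```

Raising `max_batch_size` or `max_wait_ms` in this sketch favors throughput, while lowering them favors latency. A live server would dispatch on a timer rather than waiting for the next arrival, which this offline simulation does not model.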
insideHPC: What are you using for job scheduling software?
Chris: We use standard data center infrastructure and software to make sure each request is sent to the appropriate server. Once a request reaches a server, it is propagated through a neural network. We wrote our own software to handle batching of these requests under latency constraints. In streaming applications like speech recognition, there is a lot of intermediate state to keep track of, which our software handles in order to provide correct results as data streams in from many users.
Batching adds more complexity since state needs to be combined and separated dynamically.
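As a rough illustration of state being combined and separated dynamically, the following Python sketch keeps one recurrent state per user, gathers the states of whichever users sent data into a batch, applies a single batched update, and scatters the results back. Everything here is hypothetical: `StreamingStateStore`, `toy_recurrent_step`, `STATE_SIZE` and `DECAY` are illustrative names, and the update rule is a toy stand-in for a real recurrent network layer.

```python
from typing import Dict, List

STATE_SIZE = 2  # hypothetical size of each user's recurrent state
DECAY = 0.5     # hypothetical constant for the toy update rule


def toy_recurrent_step(states: List[List[float]],
                       inputs: List[List[float]]) -> List[List[float]]:
    """Stand-in for one batched network step: new = DECAY * old + input."""
    return [[DECAY * s + x for s, x in zip(state, inp)]
            for state, inp in zip(states, inputs)]


class StreamingStateStore:
    """Keeps one state vector per user between streaming chunks."""

    def __init__(self) -> None:
        self._states: Dict[str, List[float]] = {}

    def step(self, chunks: Dict[str, List[float]]) -> Dict[str, List[float]]:
        user_ids = list(chunks)
        # Combine: gather the state of every user present in this batch.
        batch_states = [self._states.get(uid, [0.0] * STATE_SIZE)
                        for uid in user_ids]
        batch_inputs = [chunks[uid] for uid in user_ids]
        new_states = toy_recurrent_step(batch_states, batch_inputs)
        # Separate: scatter each updated row back to its owner.
        for uid, state in zip(user_ids, new_states):
            self._states[uid] = state
        return dict(zip(user_ids, new_states))

    def finish(self, user_id: str) -> None:
        """Drop a user's state once their stream ends."""
        self._states.pop(user_id, None)


store = StreamingStateStore()
store.step({"a": [1.0, 2.0], "b": [4.0, 0.0]})  # users a and b batched together
store.step({"a": [1.0, 1.0]})                   # only a sends data this round
print(store.step({"b": [0.0, 1.0], "c": [2.0, 2.0]}))
# {'b': [2.0, 1.0], 'c': [2.0, 2.0]}
```

The membership of each batch changes from step to step, yet user `b` resumes from exactly the state it had after its first chunk, which is the bookkeeping the answer above refers to.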
insideHPC: What about programmer productivity? Are there building blocks that you use or does each AI application have to be written from scratch?
Chris: We have a generic framework for batching user requests through neural networks, which can be re-used in different applications. By using GPUs in production as well as during training, we actually increase programmer productivity, since much of the work can be shared between the two tasks.