Optimizing in a Heterogeneous World: Algorithms × Devices

In this guest article, our friends at Intel discuss why CPUs prove better for some important deep learning workloads. Here’s why, but keep your GPUs handy! Heterogeneous computing ushers in a world where we must consider permutations of algorithms and devices to find the best platform solution. No single device will win all the time, so we need to constantly reassess our choices and assumptions.

NVIDIA TensorRT 6 Breaks the 10-Millisecond Barrier for BERT-Large

Today, NVIDIA released TensorRT 6, which includes new capabilities that dramatically accelerate conversational AI, speech recognition, 3D image segmentation for medical applications, and image-based applications in industrial automation. TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low-latency, high-throughput inference for AI applications. “With today’s release, TensorRT continues to expand its set of optimized layers, provides highly requested capabilities for conversational AI applications, and delivers tighter integrations with frameworks to provide an easy path to deploy your applications on NVIDIA GPUs. In TensorRT 6, we’re also releasing new optimizations that deliver inference for BERT-Large in only 5.8 ms on T4 GPUs, making it practical for enterprises to deploy this model in production for the first time.”