
Google's dedicated TensorFlow processor, or TPU, crushes Intel, Nvidia in inference workloads

Google has revealed new benchmark results for its custom TensorFlow processing unit, or TPU. In inference workloads, the company's ASIC positively smokes hardware from Intel and Nvidia.
By Joel Hruska

Several years ago, Google began working on its own custom software for machine learning and artificial intelligence workloads, dubbed TensorFlow. Last year, the company announced that it had designed its own tensor processing unit (TPU), an ASIC built for high throughput of low-precision arithmetic. Now, Google has released performance data for its TPU and how it compares against Intel's Haswell CPUs and Nvidia's K80, the Kepler-based dual-GPU data center card.

Before we dive into the data, we need to talk about the workloads Google is discussing. All of Google's benchmarks measure inference performance, as opposed to initial neural network training. Nvidia has a graphic that summarizes the differences between the two:

[Image: Nvidia's deep learning training vs. inference graphic]

Teaching a neural network what to recognize and how to recognize it is referred to as training, and these workloads are still typically run on CPUs or GPUs. Inference refers to the neural network's ability to apply what it learned in training to new data. Google makes it clear that it's only interested in low-latency operations, and that it imposed strict responsiveness criteria on the benchmarks we'll discuss below.
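To make that distinction concrete, here's a minimal TensorFlow sketch of where training ends and inference begins. It uses the Keras API bundled with modern TensorFlow (which postdates the tooling in Google's paper), and the model, data, and shapes are purely illustrative, not anything Google benchmarked:

```python
import numpy as np
import tensorflow as tf

# Toy stand-in data; shapes and sizes are illustrative only.
x_train = np.random.rand(1000, 784).astype("float32")
y_train = np.random.randint(0, 10, size=1000)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Training: repeated forward and backward passes that adjust the weights.
# This is the phase that still typically runs on CPUs or GPUs.
model.fit(x_train, y_train, epochs=2, verbose=0)

# Inference: a single forward pass through the now-fixed weights.
# This is the phase Google's TPU benchmarks measure.
predictions = model.predict(x_train[:5])
```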

Google's TPU design, benchmarks

The first part of Google's paper discusses the various types of deep neural networks it deploys and the specific benchmarks it uses, and offers a diagram of the TPU's physical layout, pictured below. The TPU is specifically designed for 8-bit integer workloads, and it prioritizes consistently low latency over raw throughput (CPUs and GPUs, particularly GPUs, tend to prioritize throughput over latency).

[Image: TPU block diagram]
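The 8-bit focus deserves a quick illustration. A common way to run a trained network at 8-bit precision is to quantize its floating-point weights onto the integer range; the symmetric scheme below is a minimal sketch of the idea, not Google's actual quantization pipeline:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric linear quantization of float32 weights to int8.

    Deliberately simple for illustration; production quantizers also
    handle zero points, per-channel scales, and activation ranges.
    """
    scale = np.max(np.abs(weights)) / 127.0
    return np.round(weights / scale).astype(np.int8), scale

def dequantize(q, scale):
    # Approximate recovery of the original float values.
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, scale))))
```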

Google writes (PDF): "Rather than be tightly integrated with a CPU, to reduce the chances of delaying deployment, the TPU was designed to be a coprocessor on the PCIe I/O bus, allowing it to plug into existing servers just as a GPU does. Moreover, to simplify hardware design and debugging, the host server sends TPU instructions for it to execute rather than fetching them itself. Hence, the TPU is closer in spirit to an FPU (floating-point unit) coprocessor than it is to a GPU."

[Image: TPU board]

Each TPU also has an off-chip 8GiB DRAM pool, which Google calls Weight Memory, while intermediate results are held in a 24MiB pool of on-chip memory (the Unified Buffer in the diagram above). The TPU has a four-stage pipeline and executes CISC instructions, some of which take thousands of clock cycles to complete, unlike a typical RISC design, where each pipeline stage takes one clock cycle. The table below shows how the E5-2699 v3 (Haswell), Nvidia K80, and TPU compare against each other across various metrics.

[Table: E5-2699 v3 (Haswell), Nvidia K80, and TPU compared]
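As a rough illustration of the arithmetic those instructions drive (the software sketch below says nothing about the systolic hardware itself), an 8-bit matrix multiply has to accumulate its products in a much wider integer type, which is why designs like this pair narrow multipliers with wide accumulators:

```python
import numpy as np

def int8_matmul(a, b):
    """8-bit matrix multiply with 32-bit accumulation.

    Two int8 values multiply out to as much as 127 * 127 = 16,129, and
    summing hundreds of such products overflows int8 and int16 almost
    immediately, so the running sums are kept in int32.
    """
    assert a.dtype == np.int8 and b.dtype == np.int8
    # Promote before multiplying so products and sums don't wrap around.
    return a.astype(np.int32) @ b.astype(np.int32)

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
b = rng.integers(-128, 128, size=(256, 256), dtype=np.int8)
c = int8_matmul(a, b)  # result dtype: int32
```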

Before we hit the benchmark results, there are a few things we need to note. First, Turbo Boost and GPU Boost were disabled for both the Haswell CPUs and the Nvidia GPUs, not to artificially tilt the scores in favor of the TPU, but because Google's data centers prioritize dense hardware packing over raw performance. Higher turbo clock rates on the v3 Xeon depend on not using AVX, which Google's neural networks all tend to use. As for Nvidia's K80, the test server in question deployed four K80 cards with two GPUs per card, for a total of eight GPUs. Packed that tightly, the only way to run the GPUs at their boost clocks without overheating would have been to remove two of the K80 cards. Since the clock frequency increase isn't nearly as potent as doubling the total number of GPUs in the server, Google leaves boost disabled on these server configurations.

Google's benchmark figures all use the roofline performance model. The advantage of this model is that it creates an intuitive picture of overall performance: the flat portion of the roofline represents theoretical peak performance, while the various data points show real-world results.

In this case, the Y-axis is integer operations per second, while the "Operational Intensity" X-axis corresponds to integer operations per byte of weights read (emphasis Google's). The gap between an application's observed performance and the curve directly above it shows how much additional performance might be gained if the benchmark were better optimized for the architecture in question, while data points on the slanted portion of the roofline indicate the benchmark is running into memory bandwidth limitations. The slideshow below shows Google's results in various benchmarks for its CPU, GPU, and TPU tests.

[Slideshow: roofline results for the Haswell CPU, K80 GPU, and TPU across Google's benchmarks]
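For readers who want the model behind these charts, the roofline boils down to a single min(): attainable throughput is the lesser of the compute ceiling and memory bandwidth times operational intensity. The sketch below uses round numbers in the neighborhood of the TPU figures Google reports (roughly 92 tera-ops/sec peak and 34GB/s weight-memory bandwidth); treat them as illustrative:

```python
def roofline(peak_ops, mem_bw_bytes, oi):
    """Attainable ops/sec under the roofline model.

    peak_ops:     compute ceiling (the flat roof), in ops/sec
    mem_bw_bytes: memory bandwidth, in bytes/sec
    oi:           operational intensity, in ops per byte of weights read
    """
    return min(peak_ops, mem_bw_bytes * oi)

# Illustrative numbers near Google's reported TPU figures.
PEAK = 92e12   # ~92 tera-ops/sec (8-bit)
BW = 34e9      # ~34 GB/s weight-memory bandwidth

for oi in (10, 100, 1000, 10000):
    print(f"OI={oi:>5} ops/byte -> {roofline(PEAK, BW, oi) / 1e12:.2f} TOPS")
# Low-OI workloads land on the slanted, bandwidth-bound part of the roof;
# only at high OI does the flat compute ceiling take over.
```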

Google's TPU isn't just a high-performance engine; it offers substantially improved performance per watt as well, both for the original TPU and for the improved variants Google has modeled (TPU').

[Chart: relative performance per watt for CPU, GPU, TPU, and TPU']

The chief factor standing between Google's TPU and even higher performance is memory bandwidth. Google's models show TPU performance improving 3x if memory bandwidth is increased 4x over current designs. No other enhancement, including clock rate improvements, larger accumulators, or even a combination of multiple factors, has much of an impact on performance.
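Reusing the roofline() sketch above, it's easy to see qualitatively how a 4x bandwidth bump could buy roughly 3x rather than 4x: a memory-bound workload scales linearly with bandwidth only until it runs into the compute ceiling. The operational intensity below is chosen purely to reproduce that shape, not taken from Google's model:

```python
# An OI chosen so the 4x-bandwidth case just crosses the compute roof,
# qualitatively matching the ~3x figure from Google's model.
OI = 900  # ops per byte of weights read; illustrative only

base = roofline(92e12, 34e9, OI)         # bandwidth-bound
boosted = roofline(92e12, 4 * 34e9, OI)  # clipped by the compute roof
print(f"4x bandwidth -> {boosted / base:.1f}x performance")  # ~3.0x
```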

The last section of Google's paper is dedicated to dispelling various fallacies and correcting misunderstandings, a number of which relate to the choice of the K80 GPU. One section is particularly worth quoting:
Fallacy: CPU and GPU results would be comparable to the TPU if we used them more efficiently or compared to newer versions. We originally had 8-bit results for just one DNN on the CPU, due to the significant work to use AVX2 integer support efficiently. The benefit was ~3.5X. It was less confusing (and space) to present all CPU results in floating point, rather than having one exception, with its own roofline. If all DNNs had similar speedup, performance/Watt ratio would drop from 41-83X to 12-24X. The new 16-nm, 1.5GHz, 250W P40 datacenter GPU can perform 47 Tera 8-bit ops/sec, but was unavailable in early 2015, so isn’t contemporary with our three platforms. We also can’t know the fraction of P40 peak delivered within our rigid time bounds. If we compared newer chips, Section 7 shows that we could triple performance of the 28-nm, 0.7GHz, 40W TPU just by using the K80’s GDDR5 memory (at a cost of an additional 10W).

This kind of announcement isn't the sort of thing Nvidia will be happy to hear. To be clear, Google's TPU results apply to inference workloads, not the initial task of training the neural network; that's still done on GPUs. But with respect to both Nvidia and AMD, we've seen this kind of cycle play out before. Once upon a time, CPUs were the unquestioned kings of cryptocurrency mining. Then, as difficulty rose, GPUs became dominant thanks to vastly higher hash rates. In the long run, however, custom ASICs took over the market.

Nvidia recently added, and AMD has announced, support for 8-bit operations to improve total GPU throughput in deep learning and AI workloads, but it will take significant improvements over and above these steps to blunt the advantage ASICs would possess if they start moving into these markets. That's not to say we expect custom ASIC designs to own the market; Google and Microsoft may be able to afford to build their own custom hardware, but most customers won't have the funds or expertise to take that on.
