serendip - Tuesday, August 22, 2017 - link
Waaaaay over my head, I couldn't understand most of what was written. How exactly does Google use these for neural networks? What aspects of Google Search lend themselves to neural network processing? And how would these cards compare to neuron-like chips being developed by IBM?

Threska - Tuesday, August 22, 2017 - link
Something, something, alien math, something. :-p

Yojimbo - Tuesday, August 22, 2017 - link
Google does a lot more than perform searches. Google uses neural networks for image recognition, translation services, speech recognition, recommendations, ad targeting, self-driving cars, YouTube censorship, beating the world's top Go players, and probably other things. As far as search goes, they are possibly used in interpreting the search strings you type in and in judging the relevance of search results.

IBM's TrueNorth chip, part of a class called neuromorphic computers, is very different from a TPU. Traditional computers have the processing units together in one place (such as a CPU), the memory together in another (a RAM chip), and the communication functions together in yet another. The TPU works this way, even if its designers optimize the data flow by trying to store relevant data close to the processing units. Neuromorphic devices are instead broken up into units, with each unit having processing, memory, and communications functionality. The units, which are modeled after neurons, are connected together in a network so that one of the inputs of a unit might be the output of several others. When an input arrives at a unit, the neuron "spikes" and sends an output based on the inputs and, I believe, whatever internal state the unit had. The neurons are event driven instead of running on a clock like a traditional processor.
So, put another way, neural networks are abstract structures modeled after the way the brain works. TPUs are processors designed to take these neural networks and perform manipulations on them, but they do so using a traditional computing architecture, not by following the model of our brain. Neuromorphic devices like IBM's TrueNorth try to operate closer to how the neurons in the brain actually operate (they generally don't try to replicate the way the brain works; they just take inspiration from key facets of its operation). Obviously, in theory, neuromorphic devices should be very good candidates for running neural networks, provided you can make an effective one and you can program it effectively.
The advantage of a neuromorphic device is that it avoids the bottleneck associated with moving data between storage and processing units, which uses a whole lot of power. However, programming for them is very different, and so I think we don't really know how to get them to do what we want yet.
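If it helps make the "event-driven units with local state" idea concrete, here is a toy sketch in Python. It is not TrueNorth's actual design (and nobody programs TrueNorth this way); it's just an illustration of units that keep their own weights and state locally and only do work when a spike arrives. All the names and numbers are made up for the example.

```python
from collections import deque

class Unit:
    """Toy integrate-and-fire unit: weights and state live inside the unit."""
    def __init__(self, weights, threshold=1.0):
        self.weights = weights      # weight per input source, the unit's local "memory"
        self.potential = 0.0        # local internal state
        self.threshold = threshold

    def receive(self, source):
        # An incoming spike nudges the internal state; there is no global clock.
        self.potential += self.weights.get(source, 0.0)
        if self.potential >= self.threshold:
            self.potential = 0.0
            return True             # the unit "spikes"
        return False

# A tiny network: unit "c" listens to units "a" and "b".
units = {"c": Unit({"a": 0.6, "b": 0.6})}
fanout = {"a": ["c"], "b": ["c"]}   # who each unit's output feeds into

events = deque(["a", "b"])          # external inputs arrive as events
while events:
    src = events.popleft()
    for dst in fanout.get(src, []):
        if units[dst].receive(src):
            print(dst, "spiked")
            events.append(dst)      # its spike becomes a new event for its own fan-out
```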
Yojimbo - Wednesday, August 23, 2017 - link
BTW, I am no expert on neuromorphic devices; this is just my rudimentary understanding of them. It's possible I've made some errors or left out some key points.

WatcherCK - Wednesday, August 23, 2017 - link
Don't sell yourself short, that explanation really helped with understanding what the TPU is about, and if I can find the time I am encouraged to go and look for more information about this. It's why my first read of the day is still AnandTech after all these years, partly for the articles and partly for the comments, trolls, gnomes, and goblins included. Authors and commenters, keep up the hard work :)

Xajel - Wednesday, August 23, 2017 - link
Damn, that card looks sexy, although I can't use it for any of my work and needs :D

edzieba - Wednesday, August 23, 2017 - link
Lots of comparisons to the GK210 rather than GP100/GV100. I understand Google wants to say "those were 2015 era and we made the TPU in 2015", but things have changed quite a bit in the GPU realm since then.

Yojimbo - Wednesday, August 23, 2017 - link
It certainly made sense when they did the tests in 2015. It was iffy when they released the paper publicly in 2016. But I find it disappointing that they are still seemingly presenting the data as if it's currently relevant in 2017.

The K80 is a dinosaur in terms of machine learning. The problem is they compare their "TPU" to "GPU" and "CPU" in their 2017 presentation. "GPU" has continued on and changed massively for machine learning workloads since their tests, but their labeling gives the impression that their presented results somehow extrapolate to the GPU lineage and so are still relevant. By now they should not be trying to make such generalizations from results obtained with hardware that is three generations old.
In my opinion, either they should cast their talk as a historical background on the development and introduction of their TPU, and probably change their charts to say K80 instead of "GPU", or, if they are going to give the talk as if it's currently relevant, they should update their tests with current comparisons.
jospoortvliet - Tuesday, August 29, 2017 - link
They had a 10x difference versus the CPU and GPU; that won't have been wiped out by the 3%/year improvements in CPUs or 50%/year in GPUs... I don't disagree that a more current comparison would have been nice, but this is far from irrelevant.

Yojimbo - Tuesday, August 29, 2017 - link
You are underestimating the improvements in GPUs specifically for this task. What you are forgetting is that the K80 operated on 32-bit floats, whereas the P40 can operate efficiently on 8-bit integers. That alone allows 4 times the theoretical throughput, without accounting for more cores or faster clocks. From the K80 to the P40, GPUs improved theoretical throughput more than 6x for deep learning inference, with an improvement much greater than that if latency is taken into account.

That's why the K80 was a dinosaur for these workloads and why trotting it out as a comparison in 2017 is only relevant for a historical perspective.
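For a rough sense of scale, here is the back-of-the-envelope version of that argument as a tiny Python calculation. The peak figures are approximate numbers from memory, so treat them as illustrative rather than exact:

```python
# Rough, back-of-the-envelope peak numbers (approximate, from memory):
k80_fp32_tflops = 5.6      # dual GK210 at base clocks; closer to ~8.7 with max boost
p40_int8_tops   = 47.0     # Pascal 8-bit integer dot-product path

precision_factor = 32 // 8                       # 4x just from narrower operands
overall_ratio = p40_int8_tops / k80_fp32_tflops  # precision + cores + clocks combined

print(f"from precision alone: {precision_factor}x")
print(f"overall peak inference throughput ratio: ~{overall_ratio:.0f}x")
```

Depending on whether you count the K80's base or boost clocks, that lands somewhere around 5-8x peak, consistent with the "over 6x" figure above.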
willis936 - Wednesday, August 23, 2017 - link
It looks like a big fat 8-bit ALU. Peak 92 trillion operations per second? When can I get one of these for NNEDI3 so I can watch my cartoons in upscaled 4K?

shinpil - Wednesday, August 23, 2017 - link
Did Google really say that the precision of the matmul is 8-bit by 8-bit integer? I think it is 8-bit by 8-bit float, not integer. Google highlights high FLOPS performance on the slide, not integer operations.
Yojimbo - Wednesday, August 23, 2017 - link
Where do they? I only see it in the roofline model slide, and that's just explaining the model. Unless I'm missing something, in the data graphs they call it TeraOps/sec. That's because the TPU was doing 8-bit integer operations and the GPU was doing 32-bit float operations. I forget what the CPU was doing; they noted it in the paper they released last year.
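For anyone who hasn't seen the roofline model the slide is based on: attainable throughput is just the minimum of the compute peak and memory bandwidth times operational intensity. A minimal sketch, with placeholder numbers that are only roughly what I recall for the TPU:

```python
def attainable_ops_per_s(peak_ops, mem_bw_bytes, operational_intensity):
    # Roofline model: performance is capped either by the flat compute "roof"
    # or by the slanted memory-bandwidth roof, whichever limit is hit first.
    return min(peak_ops, mem_bw_bytes * operational_intensity)

peak = 92e12   # ~92 trillion 8-bit ops/s (the TPU's quoted peak)
bw = 34e9      # ~34 GB/s to the off-chip weight memory; placeholder figure

for oi in (10, 100, 1000, 10000):   # operational intensity: ops per byte fetched
    tops = attainable_ops_per_s(peak, bw, oi) / 1e12
    print(f"intensity {oi:>5} ops/byte -> {tops:5.1f} TeraOps/s")
```
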
shinpil - Wednesday, August 23, 2017 - link
I have seen that deep learning models like CNNs and DNNs use floating point operations, not fixed point operations, so I'm confused. An 8-bit integer only covers -128 to 127. I think this range is too narrow to handle lots of applications. I wonder how Google solved this problem.
Yojimbo - Thursday, August 24, 2017 - link
Well, actually, researchers even make networks with 1 or 2 bits. 8-bit is generally not good for training because the gradient descent algorithm doesn't work too well with such low precision, I think. But researchers have had success with taking a trained network and quantizing it to represent the weights with 8-bit integers. Remember that there are millions of weights in a deep neural network. There is plenty of freedom to fit a target function with only a limited number of states for each weight.

Nenad - Thursday, August 24, 2017 - link
It is INT8 for their first-generation TPU, at least. You can see that in the Google article that explains their 1st-gen TPU in more detail: https://cloud.google.com/blog/big-data/2017/05/an-...
They mention "we apply a technique called quantization that allows us to work with integer operations instead"... but I think in their 2nd-gen TPU, Google is using floats ( https://www.nextplatform.com/2017/05/17/first-dept... ).
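To illustrate what that quantization means in practice, here is a minimal sketch of symmetric post-training weight quantization in Python/NumPy. It is not Google's exact scheme (their article goes into more detail), just the basic idea of mapping trained float weights onto the int8 range:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric post-training quantization of a float weight tensor to int8."""
    scale = np.abs(w).max() / 127.0                       # map the largest |weight| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32) * 0.05  # stand-in for trained weights
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"worst-case rounding error: {err:.5f} (vs. weight scale ~{np.abs(w).max():.3f})")
```

Each individual weight loses precision, but with millions of weights the network as a whole can still represent the function well, which is the point made above.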
Also, those comparisons with GPUs on their slides are outdated - the new NVIDIA Volta GPUs have tensor arithmetic integrated into the GPU and achieve 120 TFLOPS, which is comparable even to the new Google TPU2, which has 45 TFLOPS per module (or 180 TFLOPS per 'card' consisting of 4 connected modules).
p1esk - Friday, August 25, 2017 - link
So, the Volta V100 chip does 120 TFLOPS, while the TPU2 chip does only 45 TFLOPS. Not sure how Google is planning to compete.

Yojimbo - Friday, August 25, 2017 - link
Well, one can't compare those peak theoretical numbers to get a good idea of how the various options perform. One needs to compare real-world benchmarks. It's complicated because scalability and power efficiency play a part, as well.

p1esk - Saturday, August 26, 2017 - link
You're right, benchmarks are the only way to compare. Still, the TPU seems too specialized for DL computation, while the V100 does very well in FP32 and FP64 as well. Can you imagine how much faster the V100 would be for DL if they filled its entire die area with "tensor cores"?

skavi - Friday, August 25, 2017 - link
Price and power usage.

p1esk - Saturday, August 26, 2017 - link
I'm not sure about price: unlike Google, Nvidia enjoys massive economies of scale.

tuxRoller - Sunday, August 27, 2017 - link
First, by not selling it. Second, by building an ASIC with a team of superstars (take a look at the authors of the original TPU paper on arXiv).
Third, let's see how things shake out once the volts hit the floor.
tuxRoller - Monday, August 28, 2017 - link
OMT, if you look at the ratios, the DGX-1 is about 4x faster than the K80 across various inference benchmarks: https://www.tensorflow.org/performance/benchmarks#...
Yojimbo - Monday, August 28, 2017 - link
From what I see, those are training benchmarks, not inference benchmarks.

If the data were relevant (i.e., if it were inference data), one other thing to keep in mind would be that those benchmarks were taken in March 2017, with more recent libraries than the tests from 2014 presented in Google's talk. One could not just equate the 2014 K80 results with the 2017 K80 results to extrapolate an accurate comparison of P100 performance against the TPU.
Yojimbo - Monday, August 28, 2017 - link
Oh, some other things to keep in mind.

One is the low-precision capabilities of NVIDIA's newer cards. The P100 is not NVIDIA's best Pascal inference card; the P40 is. The P100 has double-rate FP16 capability, but no special 8-bit integer capability. The P40 has quad-rate 8-bit integer capability, making it a good choice to compare with the TPU, since the TPU uses 8-bit integers as well. The K80's fastest mode is to use 32-bit floats.
Secondly, latency seems to be very important in inferencing. Most discussions of inferencing I have seen treat it as a key metric. It seems odd to me that Google hasn't presented any latency comparisons. Maybe Google had a specific use in mind for these TPUs where latency isn't an issue, but I think it's more likely the TPU was an experimental chip and latency was an issue in its practical operation. Ian did quote the presenter as saying "In retrospect, inference prefer latency over throughput" and "K80 poor at inference vs capability in training". NVIDIA came out and claimed their P40 could get twice the inferences per second of the TPU while maintaining less than 10 ms of latency, but I never saw any detailed explanation of that claim.
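To make the latency point concrete, here is a toy serving model (all numbers invented) showing why batching for throughput works against the per-request latency that inference cares about:

```python
def per_request_latency_ms(batch_size, arrivals_per_ms, compute_ms):
    # A request waits for the batch to fill, then for the whole batch to run.
    fill_time = batch_size / arrivals_per_ms
    return fill_time + compute_ms

for batch in (1, 8, 64, 256):
    compute = 2.0 + 0.05 * batch        # made-up cost: fixed overhead + per-item cost
    throughput = batch / compute        # inferences per ms of accelerator time
    latency = per_request_latency_ms(batch, arrivals_per_ms=10.0, compute_ms=compute)
    print(f"batch={batch:4d}  latency ~{latency:6.1f} ms  throughput ~{throughput:5.1f} inf/ms")
```

Past some batch size the per-request latency blows through a budget like the 10 ms NVIDIA quoted, even though raw throughput keeps climbing.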
NVIDIA has to this point (Pascal generation) not convinced anyone to use their GPUs rather than CPUs for inferencing at any significant scale. I think Google uses mostly CPUs themselves. CPUs are latency optimized. Various people (Microsoft and Baidu, for example) are experimenting with using FPGAs for inferencing because they can maintain flexibility for different algorithms better than ASICs and supposedly do so with good latency. NVIDIA has supposedly improved their inferencing throughput and latency on their latest Volta generation GV100, but it remains to be seen if they actually convince anyone to use that card for inferencing.
HollyDOL - Friday, August 25, 2017 - link
I am really curious whether this TPU platform has enough life force in it. I remember PhysX cards being a big thing and "everybody is going to have one"... and now the cards are long gone and only the software part remains, bought by NVIDIA...

jospoortvliet - Tuesday, August 29, 2017 - link
Well, Google has no intention to sell - they ARE the market, and certainly big enough...

Yojimbo - Tuesday, August 29, 2017 - link
They aren't the market. They are part of the market. They aren't even the biggest cloud provider. Amazon is. There are various big players and several of them are trying to design and build their own deep learning accelerators. There are also several startups that are trying to bring deep learning accelerators to market, including one that is now owned by Intel (Nervana). It doesn't look like there's anything really special about the TPU in the scheme of things. It's interesting, it's experimental and they are continuing to try to develop it, but various different companies are experimenting with things. I am sure Google has a lot more GPUs than they have TPUs.

tuxRoller - Sunday, August 27, 2017 - link
So, is that TPU2 write-up coming? If not, any chance of providing a link to the slides?