
  • blanarahul - Monday, January 6, 2014 - link

    Quick question: Is it possible to build a 32-bit ARMv8 CPU core, i.e., an ARMv8 core capable of running a 32-bit OS without using a hypervisor? That would really ease the transition to 64-bit for Android.
  • blanarahul - Monday, January 6, 2014 - link

    Anand, where do you find information about the revisions of Cortex cores??
  • blanarahul - Monday, January 6, 2014 - link

    Found it: http://infocenter.arm.com/help/index.jsp?topic=/co...
  • klmx - Monday, January 6, 2014 - link

    The real-time version of ARMv8 (ARMv8-R) is still 32-bit, and is capable of running a rich OS like Android, but I guess that's not the solution you had in mind
  • droopaloop - Monday, January 6, 2014 - link

    ARMv8 supports two instruction sets: AArch32 and AArch64. AArch32 is basically the ARMv7 instruction set (read: 32-bit). It is possible to run Android in AArch32. It is also possible to run the kernel in AArch64 and have AArch32 apps running on top, which should help ease the transition to 64-bit.
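    To make the 32/64-bit coexistence concrete, here is a small illustrative sketch (my own, not from the article; it uses Linux's <elf.h>): every ELF binary records its class in its header, and that is the kind of information an AArch64 kernel's loader uses to start a process in either AArch32 or AArch64 state.

    /* Illustrative only: report whether a binary is 32-bit or 64-bit by reading
     * the ELF class byte. An AArch64 kernel's ELF loader consults this same
     * header information when deciding how to run a program. */
    #include <stdio.h>
    #include <elf.h>   /* EI_NIDENT, EI_CLASS, ELFCLASS32, ELFCLASS64 (Linux) */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <binary>\n", argv[0]);
            return 1;
        }
        unsigned char ident[EI_NIDENT];
        FILE *f = fopen(argv[1], "rb");
        if (!f || fread(ident, 1, EI_NIDENT, f) != EI_NIDENT) {
            perror(argv[1]);
            return 1;
        }
        fclose(f);

        if (ident[EI_CLASS] == ELFCLASS32)
            puts("32-bit (AArch32-style) binary");
        else if (ident[EI_CLASS] == ELFCLASS64)
            puts("64-bit (AArch64) binary");
        else
            puts("not a recognizable ELF class");
        return 0;
    }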
  • Krysto - Monday, January 6, 2014 - link

    How would using a 32-bit-only CPU ease the transition to 64-bit? ARMv8 supports both 32-bit and 64-bit modes by default (you can make a core 64-bit-only later, though), so that's all that's needed for the "transition to be easier" - OEMs just jump to ARMv8, even if the OS remains 32-bit.
  • phoenix_rizzen - Monday, January 6, 2014 - link

    This is exactly what Apple did with their A7 SoC. It's an ARMv8 CPU. It can run either a 64-bit or 32-bit version of iOS. If running the 64-bit version, it can run either 32-bit or 64-bit apps.
  • Krysto - Monday, January 6, 2014 - link

    That's what everyone will do. That's what I'm saying. It's the default. OEMs just need to switch to ARMv8, and that's it. It's up to Google to make Android 64-bit, and then to app devs to make their apps 64-bit, but neither is "necessary" for the transition to 64-bit CPUs.
  • BMNify - Tuesday, January 7, 2014 - link

    Remember, the Cortex-A15 and above also introduced full hardware-accelerated virtualization, so you could run several versions of an OS concurrently if you wished and were skillful enough to code your apps' base libraries with good, generic, and fast inter-process communication (xcore style).
  • deltatux - Thursday, January 9, 2014 - link

    Why would you need to build a 32-bit ARMv8 core? Android is highly portable; just recompile the entire thing for the aarch64 instruction set and you're good to go. Nearly all apps run on Dalvik anyway, which means you won't have to recompile them, and ARMv8 itself is backwards compatible with 32-bit, just like x86-64 is backwards compatible with the original 32-bit x86. So even if there are 32-bit native apps, they can easily run on 64-bit Android.

    NVIDIA has shown off Android 4.4 running on Denver in 64-bit mode, so 64-bit Android does exist and works out of the gate.
  • davidjin - Thursday, January 23, 2014 - link

    Not true. v8 introduces a new MMU design, a new page table format, and other enhanced features, which are not readily compatible with a straightforwardly "re-compiled" kernel.

    App-wise, you are right: re-compilation will almost do the trick. However, without a working kernel, how do you run the apps?
  • deltatux - Saturday, January 25, 2014 - link

    This is no different from the AMD64 implementation: the Linux kernel will gain a few improvements to make it work well on that specific architecture extension, and the same is likely going to happen for aarch64. That means there's no reason to make a 32-bit version of ARMv8; ARMv8 will be able to encompass and execute both 64-bit and 32-bit software.

    As shown by NVIDIA's own tech demo, Android is already 64-bit capable, thus, no need to continually run in 32-bit mode.
  • Laxaa - Monday, January 6, 2014 - link

    Kepler K1 with Denver in Surface 3?
  • nafhan - Monday, January 6, 2014 - link

    I'd be more interested in a Kepler Steam Box.
  • B3an - Monday, January 6, 2014 - link

    ... You can already get Kepler Steam boxes. They use desktop Kepler GPUs.

    Surface 3 with K1 would be much more interesting, and unlike Steam OS, actually useful and worthwhile.
  • Alexvrb - Monday, January 6, 2014 - link

    Agreed. The Surface 2 is a well done piece of hardware. It just needs some real muscle now. As the article says, it would be a good opportunity to showcase X360-level games like GTA V.
  • name99 - Monday, January 6, 2014 - link

    Right. That's the reason Surface 2 isn't selling, it doesn't have "real muscle". Hell, put a POWER8 CPU in there with 144GB and some 10GbE connectors and you'll have a product that takes over the world...
  • stingerman - Tuesday, January 7, 2014 - link

    I doubt Microsoft will sell a device that competes against the 360... It would signal to investors what they already suspect: Gaming Consoles are end of line. Problem is that Apple is now in a better position...
  • klmx - Monday, January 6, 2014 - link

    I love how Nvidia's marketing department is selling the dual Denver cores as "supercores"
  • MonkeyPaw - Monday, January 6, 2014 - link

    They are at least extra big compared to an A15, so it's at least accurate by a size comparison. ;)
  • Nenad - Monday, January 13, 2014 - link

    That is not a real picture of the GPU/CPU dies; it is photoshopped, so we do not know the relative size of the A15 and Denver cores.
  • chizow - Monday, January 6, 2014 - link

    Of course they would, designating it as simply dual-core would intimate it's a downgrade when it clearly is not.
  • MrdnknnN - Monday, January 6, 2014 - link

    "As if that wasn’t enough, starting now, all future NVIDIA GeForce designs will begin first and foremost as mobile designs."

    I guess I am a dinosaur because this makes me want to cry.
  • nathanddrews - Monday, January 6, 2014 - link

    Why? It was the best thing that ever happened to Intel (Core). Desktop graphics are in a rut. Too expensive, not powerful enough for the coming storm of high frame rate 4K and 8K software and hardware.
  • HammerStrike - Monday, January 6, 2014 - link

    From a gaming perspective, Intel's focus on mobile has led to 10-15% performance increases in their desktop line whenever they release a new chip series. That's pretty disappointing, even though I understand why they are focusing there.

    Also, I disagree with you on desktop graphics - this is a golden time for them. The competition in the $200-$300 card range is fierce, and there is a ton of great value there. Not sure why you think there is a "storm" of 4K and 8K content coming any time soon, as there isn't, but even 2x R9 290, $800 at MSRP (I know the mining craze has distorted that, but it will correct), can drive 4K today. Seeing as most decent 4K monitors are still $3000+, I'd argue it is the cost of the displays, and not the GPUs, that is holding back wider adoption.

    As long as nVidia keeps releasing competitive parts I really don't care what their design methodology is. That being said, power efficiency is the #1 priority in mobile, so if they are going to be devoting mindshare to that, my concern is that top-line performance will suffer in desktop apps, where power is much less of an issue.
  • OreoCookie - Monday, January 6, 2014 - link

    Since Intel includes relatively powerful GPUs in their CPUs, discrete GPUs are needed only for special purposes (gaming, GPU compute and various special applications). And the desktop market has been contracting for years in favor of mobile computers and devices. In the notebook space, thanks to Intel finally including decent GPUs in their CPUs, only high-end notebooks come with discrete GPUs. Hence, the market for discrete GPUs is shrinking (which is one of the reasons why nVidia and AMD are both in the CPU game as well as the GPU game).
  • MrSpadge - Monday, January 6, 2014 - link

    > From a gaming perspective, Intel's focus on mobile has led to 10-15% performance increases in their desktop line whenever they release a new chip series. That's pretty disappointing

    That's not because of their power-efficiency-oriented design; it's because their CPU designs are already pretty good (difficult to improve upon) and there's no market pressure to push harder. And as socket 2011 shows us: pushing six of these fat cores flat out still requires 130+ W, making these PCs the dinosaurs of old again (-> not mass-market material).
  • Sabresiberian - Monday, January 6, 2014 - link

    I think you are misunderstanding the situation here. What will go in a mobile chip will be the equivalent of one SMX core, while what will go in the desktop version will be as many as they can cool properly. With K1 and Kepler we have the same architecture, but there is one SMX in the coming mobile solution, and 15 SMXs in a GeForce 780Ti. So, 15x the performance in the 780Ti (roughly) using the same design.

    Maxwell could end up being made up of something like 20 SMXs designed with mobile efficiency in mind; that's a good thing for those of us playing at the high end of video quality. :)
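    For rough numbers behind that 15x figure, here is a back-of-the-envelope sketch of my own (the clock speeds are approximations for illustration, not vendor specs):

    /* Peak FP32 throughput for Kepler-style parts: 192 CUDA cores per SMX,
     * each retiring one fused multiply-add (2 FLOPs) per clock.
     * Clocks below are rough assumptions. */
    #include <stdio.h>

    static double fp32_gflops(int smx_count, double clock_ghz)
    {
        const int cores_per_smx = 192;
        const int flops_per_core = 2;   /* one FMA per clock */
        return smx_count * cores_per_smx * flops_per_core * clock_ghz;
    }

    int main(void)
    {
        printf("Tegra K1, 1 SMX @ ~0.95 GHz:   ~%.0f GFLOPS\n", fp32_gflops(1, 0.95));
        printf("GTX 780 Ti, 15 SMX @ ~0.9 GHz: ~%.0f GFLOPS\n", fp32_gflops(15, 0.9));
        return 0;
    }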
  • MrSpadge - Monday, January 6, 2014 - link

    This just means they'll be optimized for power efficiency first. Which makes a lot of sense - look at Hawaii: it cannot even hold its "normal" clock speeds with the stock cooler because it eats so much power. Improving power efficiency automatically means higher performance becomes achievable via bigger dies. What they decide to offer us is a different story altogether.
  • kpb321 - Monday, January 6, 2014 - link

    My initial reaction was a little like MrdnknnN's, but when I thought about it I realized that may not be a bad thing. Video cards at this point are primarily constrained at the high end by power and cooling limitations more than anything else; the R9 is a great example of this. Optimizing for mobile should result in a more efficient design, which can scale up to good desktop and high-end performance by adding the appropriate memory interfaces and putting down enough "blocks" (SMXs, in nvidia's case). They already do this to cover the range from barely-better-than-integrated video cards to top-end $500+ cards. I don't think the mobile focus is too far below today's low-end cards to cause major problems here.
  • HighTech4US - Monday, January 6, 2014 - link

    Quoting CharLIE and using his Nvidia-hate-filled speculative drivel as somehow being gospel taints your article.
  • MrSpadge - Monday, January 6, 2014 - link

    He is right, occasionally - you've got to give him that. The problem is you never know before it happens, which makes reading the Inq pretty pointless.
  • HighTech4US - Wednesday, January 8, 2014 - link

    A broken clock is right twice a day and wrong the rest of the time which is pretty much charLIE's track record on Nvidia.
  • OreoCookie - Monday, January 6, 2014 - link

    If you're familiar with the style of The Register and The Inquirer (at least in the past, when I was reading them regularly), they often packaged excellent tech reporting in their quirky and snarky way of writing things. (I read both back in the day to learn more about the inner workings of CPU designs and such; Charlie Demerjian hails from The Inquirer.) And Anand did not treat Demerjian as gospel - he said that it looks as if he may have been spot on.
  • Xavierx78 - Monday, January 6, 2014 - link

    Would really like to see the K1 in the next version of OUYA!
  • silenceisgolden - Monday, January 6, 2014 - link

    I'm noticing a lack of talk about LTE/radio support still
  • fafa1971 - Monday, January 6, 2014 - link

    They can bundle a discrete modem, NVIDIA i500 (LTE cat. 4, 150 Mbps):

    http://www.nvidia.com/object/i500-cellular-modems-...
  • Rayb - Monday, January 6, 2014 - link

    Apparently Nvidia decided to avoid the US market for lack of CDMA. It has been certified for AT&T and will be marketed globally, just not in the US because Verizon/Sprint don't use the GSM baseband.

    http://www.fiercewireless.com/story/nvidia-not-tar...
  • Krysto - Monday, January 6, 2014 - link

    Decent showing from Nvidia with K1, but the real game-changer will be K2, or whatever the hell they'll call it then, with Denver and Maxwell by default, and made at 16nm FinFET. Hopefully it will arrive no later than early spring next year.

    Normally that chip should get ~700 GFLOPS, but if they can push it to 3x K1 - around 1 TFLOPS - that would really give them a lot of buzzwords in the media: "Denver", "Maxwell", "16nm FinFET", "1 Teraflops", etc.

    I hope they don't blow it. Qualcomm is already getting lazy because they have too much domination in the market. We need another strong competitor.
  • ddriver - Monday, January 6, 2014 - link

    Got to love Shang Tsung's marketing: K1, the 192-core chip, now coming with either 2 or 4 cores.
  • da_asmodai - Monday, January 6, 2014 - link

    This article says first, first, first for a Kepler core in mobile, but it's not out yet, and I believe everything that's claimed as a first in this article is also supported by the Adreno 420 in the already-announced Snapdragon 805. I'd like to see a side-by-side spec comparison of Kepler, Adreno 420, and PowerVR Series 6XT.
  • dwforbes - Monday, January 6, 2014 - link

    "FP64 support is also present, at 1/24 the FP32 rate"

    Should this be 1/2 the FP32 rate, or is it really so crippled?
  • Ryan Smith - Monday, January 6, 2014 - link

    No, 1/24 is correct. It's so that you have native FP64 when you need it, but you aren't wasting die space on precision you aren't going to use.
  • ddriver - Monday, January 6, 2014 - link

    nvidia being cheap once again, deliberately ruining compute performance like they did with desktop GPUs for years. And let me guess, no OpenCL support either? Thanks but no thanks; I'm going to stick to Qualcomm and Samsung mobile platforms and AMD/Radeon on the desktop. And for what? To push their ridiculously and shamelessly overpriced "professional" products?

    GTX 780 DP @ 1/24 SP
    R9 290 DP @ 1/8 SP
    R9 280 DP @ 1/4 SP
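    For context, those ratios translate into roughly the following peak FP64 numbers (a quick sketch using approximate published FP32 peaks - treat the inputs as ballpark figures, not official specs):

    /* What the FP64 ratios above work out to, given rough FP32 peak figures. */
    #include <stdio.h>

    struct gpu { const char *name; double fp32_gflops; int dp_divisor; };

    int main(void)
    {
        const struct gpu parts[] = {
            { "GTX 780 (1/24)", 4000.0, 24 },
            { "R9 290  (1/8) ", 4800.0,  8 },
            { "R9 280  (1/4) ", 3300.0,  4 },
        };
        for (unsigned i = 0; i < sizeof parts / sizeof parts[0]; i++)
            printf("%s  ->  ~%.0f GFLOPS FP64\n",
                   parts[i].name, parts[i].fp32_gflops / parts[i].dp_divisor);
        return 0;
    }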
  • Loki726 - Monday, January 6, 2014 - link

    Adding big double precision units has real area and power costs (you typically can't rail gate off individual functional units). If you put full-rate double precision units in a mobile SoC it would just sit there draining your battery life.
  • ddriver - Monday, January 6, 2014 - link

    Unfortunately, power efficiency is just the excuse to deliberately cripple the compute performance of consumer products. As you can see, AMD has no problem providing DP support with a lower penalty, which is the reason my compute farm runs Radeons exclusively: the performance-per-dollar ratio completely destroys prosumer products. I do realize I am a very specific and narrow case, since I couldn't care less about gaming and graphics performance (I only use it to rasterize the compute output), but still... why not go for a higher-performing design, considering it is not so much efficiency as the greed for the fat profit margins of Teslas and Quadros that motivates nvidia to cripple DP performance to such a horrendous extent?
  • Loki726 - Monday, January 6, 2014 - link

    AMD doesn't release a mobile GPU part, and the Qualcomm parts which are based off of the old AMD VLIW5 design that they bought from AMD don't include double precision. Every little bit of power matters in mobile.
  • ddriver - Monday, January 6, 2014 - link

    The 1/24 DP rate is not a mobile-dedicated design decision; even the GTX 780 is crippled this way, even though it is an enthusiast part, where power efficiency is the least concern.
  • Loki726 - Monday, January 6, 2014 - link

    They are different strategies. Neither one is ideal for everyone.

    Putting double precision hardware into consumer parts is effectively asking gamers to pay for extra area and power. I do agree that this is less of an issue in desktop parts compared to mobile, but it is still an issue, and GPUs have enough ALUs in them that you would notice it if every one of them got 10% less efficient in benchmarks.

    AMD builds one chip and sells it into both compute and graphics markets. In order to make it appealing for compute applications they add double precision. They make gamers pay for this even though they never use it, but they don't have to pay the design costs of building multiple chips. NVIDIA builds two different designs. One chip for compute and another one for graphics (although this round they also sold the compute chip into the graphics market - Titan). Presumably they do this because they think that they can recoup the extra cost of building a different chip that only compute users buy, and by doing so they don't make gamers pay extra for double precision.

    The compute chip has extra features like more registers, ECC, more double precision units, dynamic parallelism, etc. Chip design is expensive. Think hundreds of millions of dollars for a new design. If there were just as many users of compute GPUs as there are gamers who buy graphics cards, the prices would probably come down a lot.

    I'm with you that I would like to have cheaper compute parts with full double precision support, but I think the only real way to drive down chip prices is to make them a commodity. It won't happen until there is a killer compute app that makes every desktop owner want to go out and buy a compute GPU.
  • ddriver - Tuesday, January 7, 2014 - link

    And how exactly do consumers "pay extra" for the better DP performance when AMD GPUs are overall much better value than nvidia GPUs? It seems to me that if the extra cost is as high as you believe it is (which it really isn't), then it is AMD that pays it out of its profit margins.
  • jerrylzy - Tuesday, January 7, 2014 - link

    Exactly. I don't see Loki726's point of gamers paying extra $ for Double Precision. AMD Cards are generally much cheaper at the same performance level, though at the cost of power consumption.
  • Loki726 - Tuesday, January 7, 2014 - link

    I mean compared to a world where AMD decided to rip out the double precision units. There are obviously many (thousands of) other factors that go into the efficiency of a GPU.
  • jerrylzy - Tuesday, January 7, 2014 - link

    Unfortunately, instead of using VLIW5, Qualcomm implemented a new scalar architecture way back in the Adreno 320.
  • Loki726 - Tuesday, January 7, 2014 - link

    Yep, they have improved on it, but they started with the AMD design. My point was that the Qualcomm GPU is a better comparison point for a Tegra SoC than an AMD desktop part.
  • ddriver - Wednesday, January 8, 2014 - link

    The decision to choose Qualcomm over Tegra would be based entirely on the absence of OpenCL support in Tegra. Exclusive CUDA? Come on, who would want to invest in writing a parallel, accelerated, high-performance routine that only works on no more than 5% of the devices capable of running it? Not me, anyway.

    The mention of the Radeon was regarding a completely different point - that nvidia sacks DP performance even where it makes no sense to, and it is IMO criminal to do so. The "gain" of such a terrible DP implementation is completely outweighed by the loss of potential performance and the possibility of accelerating a lot of professional workstation software. And for what? So the only spared parts - the "professional" products - can have their ridiculous prices better "justified"? Because it is such a sweet deal to make a product 10% more expensive to make and ask 5000% more money for it.

    Which is the reason AMD offers so much more value. While limp and non-competitive in CPU performance, where computation is really needed - professional workstation software that can greatly benefit from parallelization - the much cheaper desktop enthusiast product actually delivers more raw computational power than the identical but more conservatively clocked FireGL analog. Sure, the FireGL still has its perks - ECC, double the memory - but those advantages shine in very rare circumstances; in most computation-demanding professional software the desktop part is still an incredibly lucrative investment, something you just don't get with nvidia because of what they decided to do over the last few years. Coincidentally, the move to cripple DP performance to 1/24 coincided with the re-pimping of the Quadros into the Tesla line. I think it is rather obvious that nvidia decided to shamelessly milk the parallel supercomputing professional market, something that will backfire in their face, especially stacking with the downplaying of OpenCL in favor of a vendor-exclusive API to use the hardware.
  • Loki726 - Wednesday, January 8, 2014 - link

    Agreed with the point about code portability, but that's an entirely different issue. I'd actually take the point further and say that OpenCL is too vendor specific -> it only runs on a few GPUs and has shaky support on mobile. Parallel code should be a library like pthreads, C++ (or pick your favorite language) standard library threads, or MPI. Why program in a new language that is effectively C/C++, except that it isn't?

    I personally think that if a company artificially inflates the price of specific features like double precision, then they leave themselves open to being undercut by a competitor and they will either be forced to change it or go out of business. As I said, AMD's design choice penalizes gamers, but helps users who want compute features, and NVIDIA's choice benefits gamers, but penalizes desktop users who want the best value for some compute features like double precision.

    I have a good understanding of circuit design and VLSI implementation of floating point units and I can say that the area and power overheads of adding in 768 extra double precision units to a Kepler GPU or 896 double precision units to a GCN GPU would be noticeable, even if you merged pairs of single precision units together and shared common logic (which would create scheduling hazards at the uArch level that could further eat into perf, and increase timing pressure during layout).

    Take a look at this paper from Mark Horowitz (an expert) that explores power and area tradeoffs in floating point unit design if you don't believe me. It should be easy to verify. http://www.cpe.virginia.edu/grads/pdfs/August2012/... . Look at the area and power comparisons in Table 1, scale them to 28nm, and multiply them by ~1000x (to get up to 1/2 or 1/4 of single precision throughput).

    Double precision units are big, and adding a lot of them adds a lot of power and area.
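    If anyone wants to run that exercise, the shape of the arithmetic looks something like this (a sketch only - the per-unit area and power values below are placeholders that must be replaced with the Table 1 figures from the linked paper):

    /* Sketch of the scaling exercise described above. The per-unit numbers are
     * PLACEHOLDERS, not values from the paper: substitute the double-precision
     * FMA area/power reported in Table 1 at its published process node. */
    #include <stdio.h>

    int main(void)
    {
        const double src_node_nm   = 45.0;   /* node the paper reports at (placeholder) */
        const double dst_node_nm   = 28.0;   /* target node */
        const double unit_area_mm2 = 0.05;   /* one DP FMA at the source node (placeholder) */
        const double unit_power_mw = 20.0;   /* one DP FMA at the source node (placeholder) */
        const int    unit_count    = 896;    /* count quoted above for a GCN-class GPU */

        /* First-order scaling: area shrinks with the square of the feature size;
         * take power as roughly tracking area for a very coarse estimate. */
        const double scale = (dst_node_nm / src_node_nm) * (dst_node_nm / src_node_nm);

        printf("Extra die area : ~%.1f mm^2\n", unit_area_mm2 * scale * unit_count);
        printf("Extra power    : ~%.1f W\n",    unit_power_mw * scale * unit_count / 1000.0);
        return 0;
    }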
  • Krysto - Saturday, January 11, 2014 - link

    I want to believe OpenCL was left out because they've been trying to squeeze so much in this time-frame already. But since they fully ported everything in one swoop, I still find it hard to believe they didn't omit it on purpose. Hopefully, they'll support OpenCL 2.0 in Maxwell, because OpenCL 2.0 also offers some great parallelism features, which Maxwell could take advantage of.
  • Andromeduck - Wednesday, January 8, 2014 - link

    Isn't that what the GTX Titan is for?
  • Jon Tseng - Monday, January 6, 2014 - link

    Sounds very interesting. The Q for me, though, as you allude to at the end, is whether they can recruit devs to utilise this. Especially when mobile games are a freemium-dominated world, the temptation is to code for the lowest common denominator/max audience, probably with a Samsung label on it (I'm not complaining - it's what has enabled me to run World of Tanks happily on my Bay Trail T100!).

    World-beating GPU tech is no use unless people are utilising it. Interesting thought about getting MSFT on board, though - I guess the downer is that Windows Phone is a minority sport still, and tablet-wise it would have to be Windows RT... :-x
  • nicolapeluchetti - Monday, January 6, 2014 - link

    The processing power might be the same as the Xbox 360/PS3, but doesn't using DirectX incur a performance hit?
  • eddman - Monday, January 6, 2014 - link

    I'm wondering the same thing.

    Xbox and PS are gaming machines, running specialized OSes, and programmers can utilize low-level APIs and extract as much performance as possible.

    Tegra K1 might be powerful, but it'll still be running general-purpose OSes like Android and Windows RT.

    Is there any way to know the performance gain of going low-level vs. high-level APIs for a video game? How much is it really? 5%? 15%? 40%?!
  • Krysto - Monday, January 6, 2014 - link

    Xbox360 and PS3 support DirectX9 and OpenGL ES 2.0+extensions. Many developers can and have made games with those APIs. Not all games on the consoles are "bare metal". So the "overall" difference in gaming, is probably not going to be very different.

    The real "problem" is that, those games will need to come to devices that have 1080p or even 2.5k resolutions, which will cut the graphics performance of the games by 2-4x, compared to Xbox/PS3. This is why I hate the OEMs for being so dumb and pushing resolutions even further on mobile.

    It's a waste of component money (could be used for different stuff instead), of battery life, and also of GPU resources.
  • nicolapeluchetti - Monday, January 6, 2014 - link

    I guess really good games use low-level APIs; I mean, GTA V looks amazing and the specs of the Xbox 360 are what they are.
    I agree with you that resolution will be a problem, but actually I really like the added resolution in everyday use. I recently switched from a Note 2 to a Nexus 5 and the extra resolution is fantastic.
    They will probably have to upscale things: render at 720p and output at 1080p.
  • Krysto - Tuesday, January 7, 2014 - link

    AMD said Mantle is pretty much a bare-metal, console-style API. And they said at their CES conference that Battlefield 4 is 50 percent faster with it. So the difference is not huge, but it is significant.

    By far the biggest impact will be made by the resolution of the device. While games on the Xbox 360 run at 720p, most devices with Tegra K1 will probably have at least a 1080p resolution, which is more than twice as many pixels, so it cuts the performance (or the graphics quality) roughly in half.
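    The raw pixel arithmetic behind those "2x" and "2-4x" figures (straightforward math, using common panel resolutions rather than any specific device):

    /* Pixel counts relative to the Xbox 360's typical 720p render target. */
    #include <stdio.h>

    int main(void)
    {
        const double base_720p = 1280.0 * 720.0;
        printf("1920x1080 vs 720p: %.2fx the pixels\n", (1920.0 * 1080.0) / base_720p);
        printf("2560x1440 vs 720p: %.2fx the pixels\n", (2560.0 * 1440.0) / base_720p);
        printf("2560x1600 vs 720p: %.2fx the pixels\n", (2560.0 * 1600.0) / base_720p);
        return 0;
    }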
  • TheJian - Sunday, January 12, 2014 - link

    Link please, and at what point in the vid do they say it (because some of those vids are 1hr+ for conferences)? I have seen only ONE claim and by a single dev who said you might get 20% if lucky. It is telling that we have NO benchmarks yet.

    But I'm more than happy to read about someone using Mantle actually saying they expect 45-50% IN GAME over a whole benchmark (not some specific operation that might only be used once). But I don't expect it to go over 20%.

    Which makes sense given AMD shot so low with their comment of "we wouldn't do it for 5%". If it was easy to get even 40% wouldn't you say "we wouldn't do it for 25%"? Reality is they have to spend to get a dev to do this at all, because they gain NOTHING financially for using Mantle unless AMD is paying them.

    I'll be shocked to see BF4 over 25% faster than with it off (I only say 25% for THIS case because this is their best case I'm assuming, due to AMD funding it big time as a launch vehicle). Really I might be shocked at 20% but you gave me such a wide margin to be right saying 50%. They may not even get 20%.

    Why would ANY dev do FREE work to help AMD, and when done, be able to charge ZERO over the cost of the game to everyone else that doesn't have Mantle? It would be easier to justify its use if devs could charge Mantle users, say, $15 extra per game. But that just won't work here. So you're stuck with AMD saying "please dev, I know it's more work and you won't ever make a dime from it, but it would be REALLY nice for us if you did this work free"... Or "Hi, my name is AMD, here's $8 million, please use Mantle". Only the 2nd option works at all, and even then Mantle goes on the back burner the second the game needs to be fixed for everyone else (BF4, for instance - all Mantle work went on the back burner until BF4 was fixed for regular users). This story is no different than PhysX etc.
  • nicolapeluchetti - Tuesday, January 7, 2014 - link

    Mantle is said to give a 45% performance bonus compared to DirectX in Battlefield. Those are the rumours.
  • OreoCookie - Monday, January 6, 2014 - link

    It's great to see that finally the SoC makers are being serious about GPU compute, now it's up to software developers to take advantage of all that compute horsepower. Given Apple's focus on GPU performance in the past, I'm curious to see what their A8 looks like and how it stacks up against Tegra K1 (in particular the Denver version).
  • timchen - Monday, January 6, 2014 - link

    The Denver speculation really needs some justification.

    Doesn't common sense say that the same task is always more power efficient done with hardware rather than software? It would at least need a paragraph or two to explain how OoO or speculative execution or ILC can be more power efficient in software.

    Now if it is just that you need to build different binaries specifically for these cores, it then sounds a lot more like a compute GPU actually-- but as far as I understand so far general tasks are not suitable to run on those configurations, and parallelization for general problems is pretty much a dead horse (similar to P=NP?) now.
  • KAlmquist - Monday, January 6, 2014 - link

    That speculation didn't make a lot of sense to me, either.

    One of the reasons that out of order execution improves performance is that cache misses are expensive. In an out of order processor, when a cache miss occurs the processor can defer the instructions that need that particular piece of data, and execute other instructions while waiting for the read to complete. To create "nice bundles of instructions that are already optimized for peak parallelism," you have to know how long each memory read is going to take.

    The writers mention the Transmeta Efficeon processor, which translated x86 instructions to native instructions and then executed them on an in-order processor. That was a fairly effective approach, but doesn't demonstrate that an in-order processor can compete with a modern out of order processor. After all, ARM started out producing in-order processors, which were very energy efficient, but eventually they had to produce an out of order design in order to increase performance without increasing the clock rate.
  • Loki726 - Monday, January 6, 2014 - link

    Transmeta didn't have an in-order design in the same way that a normal CPU is in order. See their CGO paper: http://people.ac.upc.edu/vmoya/docs/transmeta-cgo....

    Here's the relevant text:

    "Compilers typically deal with recovery from speculation by generating compensation code, which re-
    executes incorrectly sequenced operations, performs operations omitted from the speculative code path, and
    corrects mismatches in register assignments (Freudenberger et al. [13]). With this approach, hardware
    support is required to defer faults of potentially faulting instructions moved above branches (e.g.,
    boosting,Smith et al. [23]), to detect overlapping memory operations scheduled out of sequence, and to branch to the
    compensation code (e.g., memory conflict buffers, Gallagher et al. [14], or the Intel IA-64 ALAT[18]).

    In contrast, Crusoe native VLIW processors provide an elegant hardware solution that supports arbitrary kinds of
    speculation and subsequent recovery and works hand-in-hand with the Code Morphing Software [8]. All registers
    holding x86 state are shadowed; that is, there exist two copies of each register, a working copy and a shadow
    copy. Normal atoms only update the working copy of the register. If execution reaches the end of a translation, a
    special commit operation copies all working registers into their corresponding shadow registers, committing the
    work done in the translation. On the other hand, if any exceptional condition, such as the failure of one of CMS’s
    translation assumptions, occurs inside the translation, the runtime system undoes the effects of all molecules
    executed since the last commit via a rollback operation that copies the shadow register values (committed at the
    end of the previous translation) back into the working registers.

    Following a rollback, CMS usually interprets the x86 instructions corresponding to the faulting translation, executing
    them in the original program order, handling any special cases that are encountered, and invoking the x86
    exception-handling procedure if necessary.

    Commit and rollback also apply to memory operations. Store data are held in a gated store buffer, from which they
    are only released to the memory system at the time of a commit. On a rollback, stores not yet committed can
    simply be dropped from the store buffer. To speed the common case of no rollback, the mechanism was designed so
    that commit operations are effectively “free”[27], while rollback atoms cost less than a couple of branch mispredictions."
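    In code form, the shadow-register scheme described in that passage boils down to something like the toy model below (structure only; the names and sizes are mine, not Transmeta's):

    /* Toy model of Crusoe-style commit/rollback: working registers are updated
     * speculatively, commit copies them into the shadow set at the end of a
     * translation, and rollback restores the last committed state while any
     * stores still sitting in the gated buffer are simply dropped. */
    #include <string.h>

    #define NREGS     64
    #define STORE_BUF 32

    struct machine {
        long working[NREGS];                 /* speculative architectural state    */
        long shadow[NREGS];                  /* last committed architectural state */
        struct { long addr, data; } store_buf[STORE_BUF];
        int pending_stores;                  /* stores gated until commit          */
    };

    static void commit(struct machine *m)
    {
        memcpy(m->shadow, m->working, sizeof m->shadow);
        /* ...release the gated stores to the memory system here... */
        m->pending_stores = 0;
    }

    static void rollback(struct machine *m)
    {
        memcpy(m->working, m->shadow, sizeof m->working);
        m->pending_stores = 0;               /* uncommitted stores are dropped */
    }

    int main(void)
    {
        struct machine m = {0};
        m.working[0] = 42;      /* speculative update inside a translation            */
        commit(&m);             /* translation completed: state becomes architectural */
        m.working[0] = 99;      /* next translation hits a fault...                   */
        rollback(&m);           /* ...so restore the committed value (42)             */
        return m.working[0] == 42 ? 0 : 1;
    }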
  • name99 - Monday, January 6, 2014 - link

    This is not especially new (though it might have been in Transmeta's time).

    Given the existence of robust and generally accurate branch prediction, a number of architectures have been proposed that are based on checkpoints and rollbacks rather than a ROB. There are a number of ways you can slice this, with the newest, richest, ideas having names like CFP (Continuous Flow Processing) and DOE (Distributed OutOfOrder Execution), both created by folks with Intel affiliations.

    What these architectures do is help you with long memory latency delays because (in spite of what the above author said) OoO doesn't help much there. OoO covers L1 delays, most L2 delays, some L3 delays if you're lucky, and very little of the main memory delay. That's why prefetching is still an active area of research (e.g. there were some minor but cute improvements to prefetch in Ivy Bridge). The problem is the length of the ROB limits how far you can cover latency in a ROB architecture, and you can't make the ROB much larger because that increases the size of (and slows down) the register file. Checkpoint architectures are not constrained in this way.

    HOWEVER all this is neither here nor there.
    There are three interesting claims being made about Denver
    - it uses a checkpoint architecture. Interesting if true, because this type of architecture has the potential to be the general replacement for ROB OoO; even if the first implementation is only equivalent of ROB OoO, there are many new optimizations it opens up
    - it uses some sort of "Code Morphing". Who knows WTF this means. Could be anything from rewriting ARM assembly to an internal ISA (like Apple have done many times, from 68K->PPC to Rosetta; likewise DEC did this to run x86 binaries on Alpha) to PPro style µOps to something very minor like the way POWER "cracks" a few instructions to simpler instructions.
    - it is "7-wide". If this is an issue width, it's a bullshit measure that no-one who knows anything cares about. If this is a Decode/Rename/Dispatch width, it is a major leap forward, and the only likely way it is doable at such low power is through use of a trace cache which records dependency and remap information. If nVidia has this, it would be very cool.

    Given that this is nVidia, my betting would be that every one of these is underwhelming. The exciting checkpoint architecture is in fact a standard ROB (with standard ROB limitations). The code morphing is minor cracking of a few "hard" instructions. The 7-wide refers to issue width so, ho-hum.
  • Loki726 - Tuesday, January 7, 2014 - link

    "This is not especially new."

    Agreed. I mainly posted it for reference in case someone had not seen it before.
  • Da W - Monday, January 6, 2014 - link

    For that matter, I would prefer a Kabini Surface mini, and for AMD to follow Nvidia into game streaming (from the PC or from the Xbox One).
  • chizow - Monday, January 6, 2014 - link

    Great write-up guys, you're right, this is the most exciting announcement I've seen in the CPU/GPU/SoC space in a very long time, similar to A7 Cyclone but 2x that due to both CPU and GPU bombshells. It's probably the first analysis I've read in full because everything was just that interesting relative to what the rest of the industry is doing.

    One burning question that I did not see touched upon at all, here or elsewhere:

    ****What does Tegra K1 do for Nvidia's Kepler IP tech licensing prospects?

    It seems to me, even if Tegra itself is not a smash hit for Nvidia in terms of design wins, the GPU technology is so disruptive that even if it gets into a few major designs (Surface 3, Nexus 7 2014, Asus Transformer for example) it may very well *FORCE* the other major industry players (Intel, Samsung, Apple) that don't have their own in-house graphics IP to license Kepler to remain competitive?

    What do you all think? Any buzz on that front at CES?
  • OreoCookie - Friday, January 10, 2014 - link

    As far as I can tell, nVidia only compared the GPU performance of the A7 to Tegra K1 but not the CPU performance. I'd be very curious to see how the Denver cores compare to Apple's Cyclone cores, though.

    Also, given Tegra's release date, it'll compete with Apple's A8.
  • Krysto - Saturday, January 11, 2014 - link

    Based on the (limited) technical description and how massive those cores are, along with clock speeds that are almost twice as high as what Apple typically uses, I'd say they will beat Apple's A8 (probably just an upgraded Cyclone) pretty easily - unless Nvidia did something stupid with that software translation that adds too much overhead and cuts the performance too much.

    But since we don't know exactly what's going on inside of those CPU cores, we'll have to wait for more details or a direct comparison (and hopefully Denver actually arrives this fall, and not next year).
  • OreoCookie - Sunday, January 12, 2014 - link

    Initially, I thought so, too, but knowing it's a Transmeta Crusoe-like design, I'd be much more cautious about performance. At the same clockspeed, the Crusoe was about half or a third as fast as a Pentium III. The advantage was that the cpus consumed much less power.

    Of course that tells us nothing of a comparison between the A7 or A8 and a Denver-based K1 other than that the architectures are not directly comparable.
  • name99 - Monday, January 6, 2014 - link

    "
    We’ve seen code morphing + binary translation done in the past, including famously in Transmeta’s offerings in the early 2000s, but it’s never been done all that well at the consumer client level.
    "

    Actually we've seen a few different versions of it which have worked just fine.
    One obvious example (not consumer, but transparent) was IBM's switch over from custom cores to POWER cores for i-Series.
    More on the consumer end, Apple have been doing this for years if you use OpenCL on their products --- they convert, on the fly, a byte code version of the GPU instructions to the target GPU. And of course anything that uses a JIT, whether it's targeting Java or JS (or Dalvik for that matter) is doing a similar sort of thing.

    There may be uniquely painful aspects to doing this for x86/Windows, especially 15 years ago, but I don't think Transmeta's failure tells us anything - this is mainstream-ish tech. Especially now, in a world with hypervisors, where you have a more well-defined "space" for control code to run and bring up the OS step by step.
  • ruthan - Tuesday, January 7, 2014 - link

    OK, maybe they have enough GPU performance in this chip on paper. But what is the final TDP/SoC power consumption for the 64-bit part?
    And if you want real PS3 or Xbox 360 performance - which has been advertised/promised since the original iPad - we still aren't there at all.
    Another problem is game engine/middleware performance, because ~80% of mobile games use the Unity3D engine, which in my experience is much more resource-hungry and inefficient (C# has automatic garbage collection, everything in Unity runs in a single thread, GUI performance is terrible, the PhysX implementation is single-threaded) than console development kits.
    Back to the problem: the GPU is maybe OK, but for overall performance you also need a CPU with desktop-like performance to feed the GPU with data, and I don't think these weak ARM cores are nearly there.

    So overall I don't agree with these big desktop-like performance promises at all; it would be nice, but for now they are only empty words.
  • kwrzesien - Tuesday, January 7, 2014 - link

    I think nVidia has finally done it with a great SoC/GPU! I hope they get a few very solid design wins; it could change a lot.

    Looking at those beautiful chip diagrams, I think they have the CPU/GPU balance just right.
  • easp - Wednesday, January 8, 2014 - link

    So, it seems to me that 8 of these Denver cores would offer similar general purpose compute performance to a dual socket server from ~5-6 years ago, and yet, would make up a minuscule % of die area on a Tesla-class GPU die...
  • Krysto - Saturday, January 11, 2014 - link

    Some also say a Denver core should equal a Sandy Bridge core in performance, which would be quite impressive. That's what I have in my laptop, and it was pretty high-end two years ago.
  • OreoCookie - Sunday, January 12, 2014 - link

    Who wrote that, can you provide a link? I haven't seen any such claims. And I'm fairly sure nVidia would have mentioned that during the press event. Apple's A7 packs about the same punch as a Core 2 Duo, so it'd not be out of the question, but I'd be more cautious, especially seeing how high Intel's cpus turbo these days.
  • PC Perv - Saturday, January 11, 2014 - link

    How can you make so many definitive statements about what was essentially a PR pitch? It's too bad there are no critics or ombudsmen to hold these bloggers accountable over time. (Granted, that is also why these bloggers will never garner respect from the mainstream media.) These bloggers seemingly get away with anything they say as long as they keep their industry friends happy.

    If anyone wants to know what I am talking about, go back 2-3 years and check these clowns' articles. And check if they ever, I mean EVER, acknowledge their misjudgments or stupidity.
  • PC Perv - Saturday, January 11, 2014 - link

    For instance, do you guys have any follow up on Tegra 4i?

    http://www.anandtech.com/show/6787/nvidia-tegra-4-...

    Or is it just the way it is with you guys? Just blow fanfare whenever an OEM does a press conference, and completely forget about it in less than a year?

    Have you no shame?
  • TheJian - Tuesday, January 14, 2014 - link

    What fanfare? T4i is a Q1 product and the modem just got certified on AT&T last month or so. The whole point of the T4i is the modem and phones, so what is the problem? NV already showed it doing 150 Mbps (an update from the 100 Mbps preview info) and this hasn't even been rolled out yet (anybody else running this besides Hong Kong/Singapore?). What do you want them to report? This product has been PULLED IN along with K1 at the cost of some T4 delay and sales. This is not news, and everyone (even this NV-hating site) has reported it :) If the T4i is late at all, it is only because it was waiting on modem certification, which, after checking, happened in early November.

    Not sure this new modem speed is even interesting with caps we have today. At 50mbps on cable I can clock ~325GB or so pegged all day (that's north of 10TB/month). Even Hong Kong has a 10GB cap which is what, like 5x USA caps at 2GB usually? Even in HK that's only ONE 1080p flick and some browsing? I hope we start seeing Cell phone bill lawsuits soon that tie up these CAPPED companies so badly they are forced to stop this crap just due to litigation cost fears. But I think this is probably a pipe dream until google or someone else can offer unlimited mobile.

    IE, google mentions rolling out Gbit internet in Austin, and ATT goes on immediate defense announcing huge speed upgrades (20x faster to 300mbps) and a future upgrade past that on the books not long after. So it is terribly expensive and not doable before google, but the same week google announces their roll-out, ATT can suddenly roll-out a huge upgrade and BEAT google's roll-out dates...LOL. But to match google's prices ($70) you have to OK them spying on you...ROFL. At least Google forced the updates.
    http://www.theverge.com/2013/12/11/5200250/at-t-be...
    Then claims they can deny google access to poles a few days later:
    http://arstechnica.com/tech-policy/2013/12/why-att...
    We can only hope the city votes on 23rd (next week) to allow low pole access pricing. Hard to say google would lose offering free internet to 100 libraries and public joints in the city that the CITY chooses, but they already delayed so maybe they're stupid or bribed heavily. :)

    Maybe google just needs to announce everywhere and get ATT etc to announce matching $70 pricing then just say "just kidding". :) At worst they seem to easily force monopolies to respond as shown here. I hope they do the same in phones, amazon and apple too (heck MS also). We need all these big tech dogs to bark at cell providers big time and threaten their business models in any way they can. Competition from outsiders is sorely needed in cell or we'll be stuck with Verizon/ATT etc caps forever.
  • phoenix_rizzen - Thursday, January 16, 2014 - link

    Rogers in Canada has 150 Mbps LTE using their 2600 MHz spectrum. It's been live for about a year now.

    They ran a speedtest competition around the time they lit up the first 2600 MHz towers in Ontario, and there were a *lot* of entries showing over 90 Mbps. It's listed somewhere on their Redboard blog.

    My phone only does 100 Mbps LTE, and our town doesn't yet officially have LTE (there are 2 towers with it enabled out of the dozen or so towers in town), but I can get a consistent 40 Mbps on speedtests, with the occasional jump over 70.

    So, if backward old Canada can get 150 Mbps LTE working, anywhere should be able to. :)

    Oh, and 6 GB data plans are very common up here.
  • tipoo - Thursday, November 6, 2014 - link

    I wonder if the code morphing has anything to do with the Nexus 9's performance inconsistency? It does amazingly in most singular benchmarks, but when thrown multitasking or unpredictable code it chokes.
