Obi-Wan_ - Tuesday, October 13, 2020 - link
Is this something that could transition to the desktop consumer space at some point? Either the Imagination IP or just the pull/push change.
yeeeeman - Tuesday, October 13, 2020 - link
I am also wondering about this. To me, it seems a bit strange that they don't take this step, given that their revenues from mobile are getting slimmer and slimmer.
Arsenica - Tuesday, October 13, 2020 - link
Imagination Technologies may be British, but it's owned via shell companies by the Chinese government. Finding viable revenue streams no longer matters to them.
This product may eventually transition to China-only PCs, but in the meantime (as is implied in Innosilicon's PR) it's targeted at Chinese data centers.
Threska - Wednesday, October 14, 2020 - link
Getting the Apple contract is indeed a big deal.
https://technode.com/2020/04/15/imagination-techno...
Otritus - Tuesday, October 13, 2020 - link
I would imagine the considerations are money and competition. Imagination has no mindshare in the consumer market, and it lacks many gaming-centric features (and good drivers), which would essentially eliminate adoption. The MGPU seems like it will be the future of GPU computing, but right now putting multiple slabs of silicon together and efficiently connecting them is expensive and power hungry, which is undesirable.
myownfriend - Tuesday, October 13, 2020 - link
What gaming-centric features are they missing, and how do you know they have shitty drivers?
Otritus - Wednesday, October 14, 2020 - link
To my knowledge, their GPUs do not currently support APIs such as DX12, and they lack a FreeSync/G-Sync competitor, reduced input latency, game streaming, etc. In terms of drivers, they have no public drivers for Windows or Linux, meaning their drivers are unoptimized for desktop usage.
myownfriend - Wednesday, October 14, 2020 - link
You're conflating a few things. Adaptive sync or FreeSync is a feature of the display controller, not the GPU. As has been stated, Apple has used Imagination's GPU IP for years, and that includes the iPad Pro, which supports dynamic refresh rates.
Game streaming is just software using a GPU's hardware decoder. All it would require is something like Moonlight supporting it.
I don't know what reduced input latency is supposed to refer to here since that has to do with a bunch of things that aren't the GPU.
I can't say for certain whether they support DirectX 12, but they have supported OpenGL, Vulkan, and DirectX 11 for a long time. Considering many DirectX 11 GPUs were able to support DX12 via a driver update, it's very likely they could do the same. If they can't, they would still be an option for Linux users.
Lastly, Imagination makes GPU IP, meaning they don't manufacture actual chips. They license out designs, so you're not gonna go to Imagination's website and see a download list for their drivers. You can, however, look at Imagination's YouTube channel and see their cards running on Linux and Windows desktops, so those drivers exist.
Kangal - Friday, October 16, 2020 - link
He's "technically wrong," but from the consumer's viewpoint he's entirely right. There's a lot of work Imagination Technologies has to do to get their PowerVR GPUs into the desktop space, then catch up and compete against the likes of Nvidia's Ampere GPUs (the industry standard).
Normally, I would say that's easy. PowerVR has actually been an innovator and sometimes a leader in the GPU field, albeit from the shadows. But now? I doubt it.
Rant:
Firstly, there's 2020 being what it is. More importantly, the company is shifting massively, with their CEO ousted, board members changing, and their senior engineers leaving after the aggressive Chinese buyout. The British seem to be angry about the whole thing, but it was so predictable; they were too naive. I'm not sure what's left of PowerVR will have the leadership and talent to get the job done anymore. Perhaps the A-Series and B-Series are the last of their hard work, and in a few years' time they will hit a wall (kind of like Intel did after Skylake). Being short-handed won't affect Apple much, since they have their own GPU designers and don't rely on PowerVR much anymore. So I can see these valuable staff being poached by the Apple SoC, ARM Mali, or even Qualcomm Adreno teams. Hopefully they don't strengthen the Nvidia/AMD duopoly by joining those teams.
myownfriend - Friday, October 16, 2020 - link
I can agree with all that. It would be very difficult for them to break into a field that's been team red and team green for some people's entire lives. If they do fall apart, I'd say Nvidia would be the worst-case scenario, as their recent purchase of ARM is allowing them to wield a gross amount of power.
Threska - Wednesday, October 14, 2020 - link
"The MGPU seems like it will be the future of GPU computing, but right now putting multiple slabs of silicon together and efficiently connecting them is expensive and power hungry, which is undesirable."
https://www.digitaltrends.com/computing/google-sta...
lightningz71 - Tuesday, October 13, 2020 - link
Judging by their past performance with respect to drivers for Windows platform systems, I wouldn't touch their desktop products with a ten-foot pole until they have thoroughly demonstrated a willingness to:
Provide a set of fully functioning drivers
Provide regular updates to those drivers to address bugs and odd behavior
Provide at the absolute least some framework for FOSS drivers to be built for Linux
Provide a driver that works for standard Linux desktops
Continue to provide these things for a period of several years
Because, up until now, they've done none of the above. Their last desktop product was an integrated GPU for Intel on an early Atom product. They, and Intel, promptly abandoned the product within months of releasing it, and never provided a single functional driver for the very next edition of Windows, which was released within six months of that processor. The drivers that were released for the existing Windows version had rather basic functionality, were still buggy, and offered no useful video acceleration for any (at the time) modern video compression format.
So, they can release it all they want to, but I'm certainly not going near it for a long time.
29a - Tuesday, October 13, 2020 - link
They have tried before, but they didn't do very well. I remember this card having a lot of hype before it was released.
https://www.anandtech.com/show/735
myownfriend - Tuesday, October 13, 2020 - link
And a lot of people still talk about that card. That was one hell of a review.
Threska - Wednesday, October 14, 2020 - link
Had the previous one, and yes, they were for the time period.
Alexvrb - Wednesday, October 14, 2020 - link
The Kyros were great cards for their day, affordable and powerful.
Lindegren - Wednesday, October 14, 2020 - link
They did, in 2001 - https://www.anandtech.com/show/735... but they focused on mobile instead, and that is why they are still here, unlike 3dfx and S3.
Kel Ghu - Wednesday, October 14, 2020 - link
Imagination Tech's PowerVR tech started on PC. They left the market because their cards were not powerful enough and refocused on the mobile market, where their tech was more power-efficient. It's funny now to hear people asking them to come back to PC. And I'm all for it!
myownfriend - Thursday, October 15, 2020 - link
That's not why they left. Imagination doesn't work like Nvidia or AMD. They don't make chips that others buy and place on boards for sale. Imagination makes source code for GPUs, and in order for someone to make a graphics card out of it, someone first needs to license that code and make a physical chip.
Nobody made a successor to the chip used in the Kyro II, so nobody was able to release cards that used anything new after that.
AnandTech's own article on the Kyro II shows it outperforming cards that cost twice as much, use twice the power, and have three times the fill rate. Its limitation was that it didn't have a hardware T&L unit at the time, but the next design would have.
EthiaW - Tuesday, October 13, 2020 - link
Why is the company still making futile investments in large GPUs such as the BXT-16/32 without any foreseeable customers? Just confused.
HVAC - Tuesday, October 13, 2020 - link
You don't win market share without a product offering in the market. If we only did things for which we could be certain of success, we would all still be huddled into a teeming mass near central Asia.
EthiaW - Tuesday, October 13, 2020 - link
Since there has been no application of Imagination's top-tier configurations after Apple abandoned them (from the Furian generation, 2017), only the market for small budget GPUs, likely those in wearable devices, remains justifiable.
They have been working on high-end mobile GPUs for some 5 years without market share, sadly.
Zingam - Wednesday, October 14, 2020 - link
This way of thinking, Mr.... that's why you are no Bill Gates.
Yojimbo - Tuesday, October 13, 2020 - link
Because the Chinese have a lot of money to throw around and they desperately want to develop market-controlling technologies. They don't control that many Western IP companies. Might as well try to make use of the few they do control.
Zingam - Wednesday, October 14, 2020 - link
I doubt that the Chinese have that much money... They are just big. When you are big, you also spend more to support yourself.
Yojimbo - Wednesday, October 14, 2020 - link
They have a huge amount of money. And they have a planned economy, so they can strategically put it where they want. If they want to dominate 5G, they can develop 5G IP, steal 5G IP, subsidize 5G equipment makers, etc., for example. They also have a yearly trade surplus of more than $400 billion. And they have an isolated financial system in which they can rack up a lot of debt, and foreign investors have been very happy to loan them money anyway. They do spend a huge amount of money building useless infrastructure to pump up their GDP numbers and employ their citizens, but at the same time they are strategically developing, leeching, and stealing high-tech design and manufacturing capabilities to attempt to dominate future technologies.
dotjaz - Wednesday, October 14, 2020 - link
They don't have a planned economy. It's largely market-driven. You are kidding yourself if you think they can create and control the world's biggest single market.
Certain industries are government-controlled, yes, just like in many Western countries before privatization of utility services became a thing. And their sovereign fund is much bigger.
colinisation - Tuesday, October 13, 2020 - link
Still not sure why IMGTech was never just bought out by Apple or ARM. They seem to have some interesting tech.
GC2:CS - Tuesday, October 13, 2020 - link
To my knowledge, IMGTech got buyout bids from Apple twice. But they refused.
EthiaW - Tuesday, October 13, 2020 - link
Such a stubborn, engineer-led company.
Yojimbo - Tuesday, October 13, 2020 - link
I didn't know Xi Jinping was an engineer...
EthiaW - Wednesday, October 14, 2020 - link
The Chinese have only recently managed to oust the former corporate leaders. The shift away from the engineering culture will take time, if it isn't reversed by the UK government.
Yojimbo - Wednesday, October 14, 2020 - link
They had the stubbornness not to be bought by Apple, only to be bought by the Chinese government. And through what will or method is the UK government going to change the culture of the company?
Yojimbo - Tuesday, October 13, 2020 - link
Hey, you're right. He studied chemical engineering. I knew that, but forgot.
melgross - Tuesday, October 13, 2020 - link
With Apple being 60% of their sales and 80% of their profits, they demanded $1 billion from Apple, which refused that ridiculous price.
The company is likely worth no more than $100 million, if that, considering their sales are now just about $20 million a year.
colinisation - Tuesday, October 13, 2020 - link
Well, if not Apple, why not ARM? I know ARM tried to buy them at some point in the past.
But once Apple left, their valuation would have taken a pretty substantial hit. ARM's GPU IP is successful, but I don't think it is the most area/power efficient, so it looked to me like something ARM would have explored; both companies would have been in the same country, and maybe it would have spurred ARM into providing a more viable alternative to Qualcomm in the smartphone GPU space.
CiccioB - Tuesday, October 13, 2020 - link
"Whereas current monolithic GPU designs have trouble being broken up into chiplets in the same way CPUs can be, Imagination’s decentralised multi-GPU approach would have no issues in being implemented across multiple chiplets, and still appear as a single GPU to software."There's not problem in splitting today desktop monolithic GPUs into chiplets.
What is done here is to create small chiplets that have all the needed pieces as a monolithic one. The main one is the memory controller.
Splitting a GPU over chiplets all having their own MC is technically simple but makes a mess when trying to use them due to the NUMA configuration. Being connected with a slow bus makes data sharing between chiplets almost impossible and so needs the programmer/driver to split the needed data over the single chiplet memory space and not make algorithms that share data between them.
The real problem with MCM configuration is data sharing = bandwidth.
You have to allow for data to flow from one core to another independently of its physical location on which chiplet it is. That's the only way you can obtain really efficient MCM GPUs.
And that requires high power+wide buses and complex data management with most probably very big caches (= silicon and power again) to mask memory latency and natural bandwidth restriction as it is impossible to have buses as fast as actual ones that connect 1TB/s to a GPU for each chiplet.
As you can see to make their GPUs work in parallel in HCP market Nvidia made a very fast point-to-point connection and created very fast switches to connect them together.
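To put rough numbers on that gap, here's a back-of-the-envelope Python sketch; both bandwidth figures below are illustrative assumptions for the example, not specs of any particular product:

    # Illustrative sketch: why naive NUMA-style chiplet splits hurt.
    local_vram_bw_gbs = 1000     # ~1 TB/s from a high-end GPU to its own VRAM (assumed)
    die_to_die_bw_gbs = 50       # rough order of magnitude for a package-level link (assumed)

    # Fraction of a chiplet's memory traffic that may target the other
    # chiplet before the interconnect, not the VRAM, becomes the bottleneck.
    remote_fraction = die_to_die_bw_gbs / local_vram_bw_gbs
    print(f"remote accesses must stay under ~{remote_fraction:.0%} of total traffic")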
hehatemeXX - Tuesday, October 13, 2020 - link
That's why Infinity Cache is big. The bandwidth limitation is removed.
Yojimbo - Tuesday, October 13, 2020 - link
Anything Infinity is big, except compared to a bigger Infinity.
CiccioB - Tuesday, October 13, 2020 - link
It is only removed for data that fits in the cache. If you need more data than that, you'll still run into the bandwidth limitation, with the big cache's latency now added on top.
If it were so easy to remove bandwidth limitations, everyone would just add a big enough cache... the fact is that there is no cache big enough for the immense quantity of data GPUs work with, unless you want all your VRAM to be cache (but then it wouldn't be connected over such a limited bus).
myownfriend - Tuesday, October 13, 2020 - link
Yeah, like if the back buffer were drawn with on-chip memory... like a tile-based GPU.
anonomouse - Tuesday, October 13, 2020 - link
Probably works out-ish a bit better with a tile-based deferred renderer, since the active data for a given time will be more localized and more predictable.
myownfriend - Tuesday, October 13, 2020 - link
The thing with tile-based GPUs is that they have less data to share between cores, since the depth, stencil, and color buffers for each tile are stored on-chip. Since screen-space triangles are split into tiles and one triangle can potentially turn into thousands of fragments, it becomes less bandwidth-intensive to distribute work like that. All the work that Imagination in particular has put into HSR to reduce texture bandwidth, as well as the texture pre-fetch stuff, would also benefit them in multi-GPU configurations.
SolarBear28 - Tuesday, October 13, 2020 - link
This tech seems very applicable to ARM Macs, although Apple is probably using in-house designs.
Luke212 - Tuesday, October 13, 2020 - link
Why would I want to see 2 GPUs as 1 GPU? It's a terrible idea. It's NUMA x 100.
myownfriend - Tuesday, October 13, 2020 - link
On an SoC or even in a chiplet design, they wouldn't necessarily have separate memory controllers. We're talking about GPUs as blocks on an SoC.
CiccioB - Tuesday, October 13, 2020 - link
It simplifies things compared to seeing them as 2 separate GPUs.
myownfriend - Sunday, June 6, 2021 - link
I'm gonna be a weirdo and add to this something like half a year later. I'm not sure why seeing two or, in this case, four GPUs would be preferable to seeing one in situations where all the GPUs are tile-based and on the same chip.
Let me think out loud here.
At the vertex processing stage, you could toss triangles at each GPU, and they'll transform them to screen space, then clip, project, and cull them. Their respective tiling engines then determine which tiles each triangle is in and append that to the parameter and geometry buffer in memory. I can't think of many reasons why they would really need to communicate with each other when making this buffer. After that's done, the fragment shading stage would consist of each GPU requesting tiles and textures from memory, shading and blending them in their own tile memory, and writing out the finished pixels to memory. I can't really find much in that example that makes all four GPUs work differently from one larger one.
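To make that concrete, here's a toy Python sketch of that flow under my own assumptions (triangles already reduced to the tiles they cover, round-robin tile claiming); it illustrates the idea, not how Imagination actually schedules work:

    # Toy model: four tile-based GPUs sharing one parameter buffer in memory.
    from collections import defaultdict

    NUM_GPUS = 4

    def geometry_pass(triangles):
        """Each GPU bins its share of triangles into a shared per-tile buffer."""
        parameter_buffer = defaultdict(list)
        for i, tri in enumerate(triangles):
            gpu = i % NUM_GPUS              # any split of the geometry works
            for tile in tri["tiles"]:       # tiles this triangle covers
                parameter_buffer[tile].append((gpu, tri["color"]))
        return parameter_buffer

    def fragment_pass(parameter_buffer):
        """GPUs claim whole tiles; shading happens in their own tile memory."""
        framebuffer = {}
        for i, (tile, tri_list) in enumerate(parameter_buffer.items()):
            gpu = i % NUM_GPUS              # round-robin tile claim, no cross-GPU pixel traffic
            framebuffer[tile] = tri_list[-1][1]   # stand-in for HSR + shading
        return framebuffer

    tris = [{"tiles": [(0, 0), (0, 1)], "color": "red"},
            {"tiles": [(0, 1)], "color": "blue"}]
    print(fragment_pass(geometry_pass(tris)))   # {(0, 0): 'red', (0, 1): 'blue'}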
I can see why that might be preferable with IMR GPUs, though. If we were to just toss triangles at each GPU, they would transform them to screen space and clip, project, and cull them just like a TBDR. After this, a single IMR GPU would do an early-z test and, if it passes, proceed with the fragment pipeline. This is where the first big issue comes up in a multi-GPU configuration, though: overlapping geometry. Each GPU will be transforming different triangles, and some of these triangles may overlap. It would be really useful for GPU0 to know if GPU1 is going to write over the pixels it's about to work on. This would require sharing the z-values of the current pixels between both GPUs. They could just compare z-values at the same stages, but unless they were synced with each other, that wouldn't prevent GPU0 from working on pixels that already passed GPU1's early z-test and are about to be written to memory. Obviously, that would result in a lot of unnecessary on-chip traffic, very un-ideal scaling, and possibly pixels being drawn to buffers that shouldn't have been.
What might help is to do typical dual-GPU stuff like alternate-frame or split-frame rendering, so those z-comparisons would only have to happen between the pixels on each chip. The latter raises another problem, though. Neither GPU can know what a triangle's final screen-space coordinates are until AFTER it transforms it. This means if GPU0 is supposed to be rendering the top slice of the screen and it gets a triangle from the bottom of the screen or across the divide, then it has to know how to deal with that. It could just send that triangle to GPU1 to render. Since they both share the same memory, it has a second option, which is to do the z-comparison thing from before, and GPU0 could render the pixels to the bottom of the screen anyway.
Obviously, you could also bin the triangles like a TBDR, or give each GPU a completely separate task, like having one work on the G-buffer while the other creates shadow maps, or have each render a different program. Because there are so many ways to use two or more IMRs together and each has its drawbacks, it makes sense to expose them as two separate GPUs. It puts the burden of parallelizing them in someone else's hands. TBDRs don't need to do that because they work more like they normally would. That's why PowerVR Series 5 GPUs pretty much just scaled by putting more full GPUs on the SoC.
Obviously, these both become a lot more complicated when they're chiplets, especially if they have their own memory controllers, but I won't get into that.
brucethemoose - Tuesday, October 13, 2020 - link
Andrei, could you ask Innosilicon for one of those PCIe GPUs?
Even if it only works for compute workloads, another competitor in the desktop space would be fascinating.
Also, that is a *conspicuously* flashy and desktop-oriented shroud for something that's ostensibly a cloud GPU.
myownfriend - Tuesday, October 13, 2020 - link
I was thinking the same thing about the shroud.
lucam - Tuesday, October 13, 2020 - link
Andrei,
Apple A14 uses IMG Series A.
myownfriend - Tuesday, October 13, 2020 - link
Where did you hear that?
lucam - Tuesday, October 13, 2020 - link
https://www.imgtec.com/news/press-release/imaginat...
myownfriend - Wednesday, October 14, 2020 - link
That doesn't necessarily mean that they're using the A-Series outright. I've seen speculation that Apple's solution is more or less Imagination's GPU but with redone shader clusters where there's more emphasis on FP16 performance.
Andrei Frumusanu - Wednesday, October 14, 2020 - link
They don't.lucam - Thursday, October 15, 2020 - link
They do, but with an Apple-proprietary custom design. You should check your sources.
Andrei Frumusanu - Thursday, October 15, 2020 - link
You've got no idea what you're talking about. A-Series has nothing to do with the Apple GPU.
Kangal - Friday, October 16, 2020 - link
Any plans to get your hands on an A-Series or B-Series IMG GPU? Like, I don't know if there are any current consumer devices on the market, or any coming in the future.
myownfriend - Saturday, October 17, 2020 - link
This article mentions a desktop graphics card that's coming out that uses a B-Series GPU.
myownfriend - Saturday, October 17, 2020 - link
What sources?
AMDSuperFan - Tuesday, October 13, 2020 - link
I would like to see some benchmarks of this product against Big Navi to help me make a good decision. So far, nothing seems to measure up.
myownfriend - Tuesday, October 13, 2020 - link
The only place where it could really compare with Big Navi is if there's a game with a lot of overdraw that a maxed-out B-Series GPU would be able to rid itself of.
persondb - Tuesday, October 13, 2020 - link
Honestly, I don't fully understand how this GPU is supposed to compete in the high-performance computing market. AFAIK, that market is hungry for TFLOPS (as well as fast memory), yet this does not seem to deliver enough TFLOPS.
In fact, it's very disappointing, especially given the tradeoffs that you would be forced to make in a multi-chiplet design. There are also a bunch of design decisions that seem like they would hurt latency and possibly performance as well.
The article mentions that it has two possible configurations: one where a 'primary' GPU works as a 'firmware processor' to divide the workload across the other GPUs, which would seem to me to add some latency and overhead over a more traditional GPU, while the other configuration lacks a firmware processor but is completely limited by the primary GPU's geometry unit.
Funnily enough, it doesn't seem like they have provided any detail about the memory controllers or the cache, possibly because it would be obvious that such a configuration has severe tradeoffs? There is also nothing about the interconnect that would link the GPUs together; this is an important one, as it can have a great impact on latency. You need some data sharing between them, unless each GPU would only use the data it can get through its own memory controller, but that could lead to problems too.
There are a bunch of things there that can increase latency and hamper performance. I personally would be skeptical about this until it releases and there is information on how it actually performs.
myownfriend - Wednesday, October 14, 2020 - link
This design is not forcing anybody to go with chiplets. These multiple GPUs can be placed on one chip. As far as peak TFLOPS and fast memory go, my guess is that you aren't familiar with what tile-based rendering, specifically Imagination's Tile-Based Deferred Renderer, does.
AMD or Nvidia GPUs just pull vertices from external memory, then transform, rasterize, and fragment shade them, then write the results back to external memory. All color and depth reads and writes happen in external memory. To make the most of that memory bandwidth and shader performance, game designers need to sort draw calls from near to far and maybe do a depth pre-pass to get rid of overdraw, though there will also be overdraw that occurs within each draw call.
A tile-based GPU pulls vertices from external memory, transforms them into screen space, clips and culls them, then writes them back out to external memory as compressed bins that represent different tiles on the screen. It then reads them back a few tiles at a time and applies hidden surface removal on opaque geometry to remove overdraw completely. That process creates an on-chip depth/stencil buffer which ensures that only pixels that contribute to the final image get submitted for fragment shading. It then attempts to create the back buffer for each tile completely within a small amount of on-chip color buffer memory, so that it only needs to write the finalized tile back to external memory.
The depth buffer never even has to get written to external memory, all overdraw for opaque geometry is completely removed regardless of the order it was submitted to the GPU, and the 6 TFLOPS that the B-Series can theoretically achieve are spread among fewer pixels. It has far less reliance on external memory bandwidth, since it only really needs it to read textures and geometry data and to store the tile list. Since on-chip memory can potentially use very wide buses and run in sync with the GPU, it could easily provide more bandwidth than GDDR6X memory. At 1500 MHz with four GPUs and four tiles per GPU, a 256-bit bus to each tile buffer would feed them with 768 GB/s of low-latency memory bandwidth. If you're curious how large each tile is, previous Imagination GPUs used 32x32-pixel tiles or smaller, so a 256-bit G-buffer would only require about 32 KB per tile and can just use SRAM.
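A quick Python sanity check of those figures; the 1500 MHz clock, 256-bit per-tile bus, and 32x32 tiles are the assumptions stated above, not published specs:

    # Back-of-the-envelope check of the on-chip tile memory numbers above.
    clock_hz = 1.5e9                   # assumed 1500 MHz GPU clock
    bus_bits = 256                     # assumed bus width per tile buffer
    gpus, tiles_per_gpu = 4, 4

    per_tile_bw = clock_hz * bus_bits / 8            # bytes/s per tile buffer
    total_bw    = per_tile_bw * gpus * tiles_per_gpu
    print(f"{total_bw / 1e9:.0f} GB/s aggregate on-chip bandwidth")   # 768 GB/s

    tile_pixels  = 32 * 32             # 32x32-pixel tile
    gbuffer_bits = 256                 # assumed G-buffer bits per pixel
    tile_bytes   = tile_pixels * gbuffer_bits / 8
    print(f"{tile_bytes / 1024:.0f} KB of SRAM per tile")             # 32 KB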
That tiling process also provides a simple way of dividing up the work among the GPUs. You mentioned the configuration that uses a primary GPU. The article says that would be the only one with a firmware processor and a geometry front end, so it would be the one that handles all vertex processing and tiling, but after those tiles are written to main memory, the only communication it needs to make with the other GPUs is to tell them to read some of those tiles. They can then work almost completely independently of each other.
To my knowledge, even pure compute workloads can take advantage of those on-chip buffers as a kind of scratchpad RAM.
Threska - Wednesday, October 14, 2020 - link
I remember this discussion with my Apocalypse and PowerVR. Shame the idea never really took hold.
persondb - Friday, October 16, 2020 - link
Modern GPUs since Vega and Maxwell already use a hybrid of tile-based rendering and more traditional rendering methods (you can see this old-ish video on Nvidia's approach: https://www.youtube.com/watch?v=Nc6R1hwXhL8). I don't know why you commented on it, since that means it's not a straight comparison of tiled vs. traditional methods.
I mentioned TFLOPS for compute workloads and not graphics workloads; compute was described in the article as one of the potential markets for this. As far as I am aware, a GPU being tile-based wouldn't change much for pure compute workloads.
Also, about that memory: I really doubt that with that kind of configuration the latency would be that low. Probably higher than you expect. Though of course, GPUs generally aren't as latency-sensitive as CPUs. But it's also not that much higher than something with GDDR6X (or just GDDR6 with some 16 Gbps chips).
I must say that I am not familiar with tile-based GPUs, but take textures, for example. Obviously you need to get them from external memory, as there is no SRAM in the world that could store all the textures you need. This would obviously complicate the memory controller issue I was talking about in my original post. The same goes for, say, compute with large memory requirements.
Obviously, if each of the chiplets has its own (say, 64-bit) memory controller, then they will need an interconnect to share data. And that is what I was talking about: such a thing would increase latency. And again, the article does not say how the memory controllers are set up for those chiplets.
myownfriend - Friday, October 16, 2020 - link
Tiled caching and tile-based rendering are still very different. Tiled caching can't do that, and I believe it only works on a small buffer of geometry at a time, as it works completely within the 2MB of L2 cache, which is not enough to store the geometry for a whole scene. It's enough to have reduced the required external memory bandwidth by quite a bit. Tile-based rendering creates the primitive/parameter buffer in external memory before pulling it back on-chip tile by tile, and it has the ability to create the entire back buffer for each tile completely within on-chip memory.
https://www.imgtec.com/blog/a-look-at-the-powervr-...
In modern games, textures and geometry take up a lot of space in RAM but account for very little of the used memory bandwidth. Textures are generally compressed to 4 bits per pixel (though ASTC allows bit rates from 8 to 0.89 bits per pixel) and are read once to a few times at the beginning of the frame. The majority of bandwidth is needed for the back buffer which, for an Ultra HD game with a 256-bit G-buffer, would only take up about 265 MB of space but would be written to and read from multiple times per pixel per frame. That's why the Xbox One had 32MB of ESRAM with a max bandwidth of 205-218GB/s and 8GB of DDR3 over a 256-bit bus with only 68.3GB/s of bandwidth. The ESRAM was large enough to store a 128-bit 1080p G-buffer, while the DDR3 stores textures and geometry and acts as work RAM for the CPU. If a game has low-quality textures these days, it's generally blamed on the amount of RAM, not its speed. As resolution increases, the average number of texture samples per pixel goes down. In other words, if the chiplets do need to fetch texture samples from each other, then it wouldn't require all that much chiplet-to-chiplet bandwidth. Using the XBO as an example, that's 17.075 GB/s per 64-bit memory controller, without accounting for the fact that the CPU is using some of that. That's 40% of the die-to-die bandwidth between AMD's Zeppelin chiplets. TBDR would decrease texture bandwidth even more because it only fetches texels for visible fragments.
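Here's the same arithmetic as a small Python sketch; the Ultra HD resolution, the 256-bit G-buffer, and the XBO's 68.3 GB/s split across four 64-bit controllers are the figures assumed in the paragraph above:

    # Size of an Ultra HD back/G-buffer at the assumed 256 bits per pixel.
    width, height = 3840, 2160
    gbuffer_bits_per_pixel = 256
    gbuffer_mb = width * height * gbuffer_bits_per_pixel / 8 / 1e6
    print(f"{gbuffer_mb:.0f} MB G-buffer")          # ~265 MB, reused many times per frame

    # Per-controller share of the XBO's DDR3 bandwidth quoted above.
    ddr3_total_gbs, controllers = 68.3, 4           # 256-bit bus = four 64-bit controllers
    print(f"{ddr3_total_gbs / controllers:.3f} GB/s per 64-bit controller")   # 17.075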
I'm not sure what you mean by "Also, about that memory: I really doubt that with that kind of configuration the latency would be that low. Probably higher than you expect." Are you talking about how I said that the on-chip memory is ultra low-latency and high-bandwidth? Because it would absolutely be low-latency. We're talking about a very local pool of SRAM that's running at the same clock as the ALUs. Meanwhile, GDDR6X would have to be accessed via requests from memory controllers that are being shared by the whole chip, then traveling off-package, through the motherboard, and into a separate package and back.
You're right about compute workloads not being all that different between TBRs and IMRs, so a compute load that generally needs a lot of high-bandwidth memory would still have the same requirements for external memory. However, I'm led to believe that the reason GPUs are used for compute workloads is that those compute workloads, like graphics workloads, are considered embarrassingly parallel, so I don't really know how much data moves horizontally in the GPU. It is very possible that most GPU compute workloads could be modified to make some use of the on-chip storage to reduce reliance on external memory.
I'm also curious to see how those memory controllers are set up, too, but I'm a bit more confident in a TBRs ability to scale in a multi-GPU set up.
myownfriend - Wednesday, October 14, 2020 - link
The last time Imagination's GPUs were in the desktop space was in 2001 with the Kyro II. GPUs then and now are very different, but this article can still give you a sense of what sort of gains a TBDR GPU could potentially provide.
https://www.anandtech.com/show/735/10
Compared to the GeForce2 Ultra, the Kyro II used under half the transistors and under half the power, had 36% of the memory bandwidth and 35% of the fill rate, and cost 44% of the price, yet it actually beat the GeForce2 Ultra in some tests, especially at higher resolutions.
eastcoast_pete - Wednesday, October 14, 2020 - link
One area currently grossly underserved by both NVIDIA and AMD is entry-level dGPUs with decent ASICs on board for HEVC/H.265, VP9, and AV1 decode in 10-bit HDR/HDR+. Basically, PCIe cards with 2-4 GB of VRAM under $100 that still beat the iGPUs in Renoir and Tiger Lake (Xe). That market is up for grabs now; maybe these GPUs can fill that void?
vladx - Wednesday, October 14, 2020 - link
Video encode/decode is one of the features used by cloud GPUs, so it would most likely be able to do all that.
tkSteveFOX - Wednesday, October 14, 2020 - link
I don't think those sub-$100 GPUs can beat the integrated graphics. They are there for office stations to support 3-4 monitors, nothing more.
Mat3 - Wednesday, October 14, 2020 - link
The TBDR architecture has always allowed for efficient multi-core solutions, since it bins all the triangles before rasterization and then works tile by tile. Each core can work on different tiles. IMG has been touting this benefit for literally decades already; I'm not sure what is so different here. The Naomi 2 arcade board from over 20 years ago is a simple implementation of this.
The other concern for scaling it up to high-end desktop levels is always the same: the number of triangles that would need to be binned for a modern desktop game is much, much higher than for a mobile game.
myownfriend - Wednesday, October 14, 2020 - link
I never understood why polygon counts are considered an issue for TBRs.
I know GPUs like the RTX 3080 have peak primitive rates of about 10.2 billion per second at boost clocks, but I'm pretty sure actual game triangle counts never really get that high. If you divide that by 60 fps, then we're looking at a peak of about 170 million per frame, against a current target resolution of Ultra HD, which is 8,294,400 pixels. Even accounting for back-facing and overlapping geometry, do any games really have 20 times more primitives than rendered pixels?
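Putting those numbers together in a quick Python sketch (the peak primitive rate and the 60 fps target are the figures assumed in the comment above):

    # Peak primitive throughput vs. pixel count at Ultra HD.
    peak_prims_per_s = 10.2e9          # quoted RTX 3080 peak primitive rate
    fps              = 60
    pixels_uhd       = 3840 * 2160     # 8,294,400

    prims_per_frame = peak_prims_per_s / fps
    ratio           = prims_per_frame / pixels_uhd
    print(f"{prims_per_frame / 1e6:.0f}M primitives per frame, "
          f"about {ratio:.0f}x the pixel count")    # ~170M, ~20x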
supdawgwtfd - Thursday, October 15, 2020 - link
"into different work tiles that can then the other “slave” GPUs can pull from in order to work on"My brain exploded from trying to read this...
Seriously. Get a damn editor or even a basic proof reader!