50 Comments
andrebrait - Thursday, October 19, 2017 - link
It's interesting to notice that AMD has supported the SHA-NI instructions on their Zen architecture for a while now.
lefty2 - Thursday, October 19, 2017 - link
Also, Zen does not lower its frequency to perform 256-bit AVX2 operations. That is specific to Intel CPUs. It seems Zen has more efficient power management of its AVX units.
smilingcrow - Thursday, October 19, 2017 - link
Doesn't Zen have noticeably lower clock speeds to start with? The best thing is to look at actual benchmarks, as clock speed alone is only part of the picture.
I did read that people overclocking Coffee Lake i7 K-series chips to ~5GHz were using an AVX offset of 3, so 4.7GHz for AVX loads sounds pretty good.
Elstar - Thursday, October 19, 2017 - link
Right, clock speed isn't everything. Let's not forget about the Pentium 4 fiasco, where they prioritized high frequencies over real-world efficiency/performance.
I think the AVX offset is often misunderstood. The way to read it is that Intel feels non-AVX code is still common enough to be worth optimizing for, hence the slight "turbo boost" for running non-AVX code.
MrSpadge - Thursday, October 19, 2017 - link
No, it's a penalty for AVX2/512 code, not a boost for other workloads (which are of course still very important). The higher throughput in AVX2/512 is the reason for the higher power draw - there's no free lunch here. And Intel decided that full clocks & voltages would put undue stress on the chips and power delivery, hence the clock speed penalty for AVX2/512.
Elstar - Thursday, October 19, 2017 - link
You're missing my point. Processor designers can either limit the max frequency to the slowest/hottest part of the CPU design, or they can implement dynamic frequency logic to allow different instruction mixes to run at different frequencies. Whether one views a given dynamic frequency mode as a "penalty" or a "reward" is a matter of perspective (and a human value judgement too). The fact that Intel refers to the AVX frequency as a "penalty" says more about internal Intel politics than it does about technical design.
alistair.brogan - Thursday, October 19, 2017 - link
You can make two versions of the same Intel CPU, and if you add AVX you need an offset as high as 400 MHz to run at the same temperature and voltage, so it is called a penalty.
Point is, you need to compare last year's with this year's. You are overthinking it a little bit ;)
HStewart - Thursday, October 19, 2017 - link
A similar thing could be said about core count - it's not actually the number of cores that makes the difference but how the cores perform.
smilingcrow - Thursday, October 19, 2017 - link
It's called an AVX offset, not a non-AVX boost, but ultimately that's just semantics really.
But when you consider that everything bar AVX runs at the maximum speed, it makes sense to call it an AVX offset, as that is clear and simple to grasp.
pm9819 - Thursday, October 19, 2017 - link
While the Pentium 4 was a beast to keep cool and insanely power inefficient, it was also the CPU when it came to video editing and transcoding. And it also ran laps around AMD's CPUs in anything floating-point intensive. I can't think of any "real world" task it was deficient in.
lefty2 - Thursday, October 19, 2017 - link
It happens that in general Intel's 14nm++ can clock higher than GlobalFoundries' 14nm node, but that's completely unrelated to my point.
smilingcrow - Thursday, October 19, 2017 - link
You brought up clock speeds and now you say it's unrelated? Hmm!
lefty2: "It seems Zen has more efficient power management of its AVX units."
Based on what? I thought Zen's AVX performance wasn't one of its many strengths.
Kevin G - Thursday, October 19, 2017 - link
Zen will execute 256-bit AVX2 instructions but will break them down into two 128-bit chunks for actual execution. Zen does not have the same peak per-cycle throughput as Intel's recent Lake family because of this.
This could also be how Intel implements AVX-512 on the consumer side to keep power consumption in check: breaking 512-bit instructions down into 256-bit operations for execution. AVX-512 does require a bit more than just breaking down instructions (it adds masking registers and widens the register file to 32 vector registers, for example). Still, adding instruction-level support in this fashion would help the propagation of AVX-512 software, as more hardware could use it. The more hardware there is to use it, the greater the incentive for developers to adopt it.
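For anyone who hasn't seen the masking in action, here is a minimal sketch (my own illustration, not something from the article; it assumes a compiler and CPU with AVX-512F, e.g. gcc -O2 -mavx512f) of a per-lane masked add using the intrinsics:

```c
/* Minimal sketch of AVX-512 per-lane masking with intrinsics.
 * Assumes AVX-512F support (compile with e.g. gcc -O2 -mavx512f). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[16], b[16], out[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 100.0f; out[i] = -1.0f; }

    __m512 va  = _mm512_loadu_ps(a);
    __m512 vb  = _mm512_loadu_ps(b);
    __m512 src = _mm512_loadu_ps(out);

    /* 0x5555 selects the even lanes; odd lanes keep the value from 'src'. */
    __mmask16 even = 0x5555;
    __m512 vr = _mm512_mask_add_ps(src, even, va, vb);

    _mm512_storeu_ps(out, vr);
    for (int i = 0; i < 16; i++) printf("%.0f ", out[i]);  /* prints 100 -1 102 -1 ... */
    printf("\n");
    return 0;
}
```

The same source runs whether the hardware executes it on full 512-bit units or cracks each instruction into narrower halves internally, which is exactly why ISA-level support matters for software adoption even on lower-end parts.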
HStewart - Thursday, October 19, 2017 - link
From what I can tell from Intel's documentation, they really are 512-bit instructions, and Intel has performance tests showing them to be 2x as fast as AVX-256. They also enhanced the compiler for 512-bit support. Keep in mind these are vectors and not registers.
https://www.servethehome.com/wp-content/uploads/20...
From that chart, it looks to me like AVX-512 is 2x as fast as AVX2, which on Intel is 256-bit.
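As a rough back-of-the-envelope check (my own arithmetic, assuming a core with two 512-bit FMA units, which per the discussion further down only some SKUs have enabled), the 2x figure follows directly from the vector width for single precision:

$$\underbrace{2}_{\text{FMA units}} \times \underbrace{\tfrac{256}{32}}_{\text{lanes}} \times \underbrace{2}_{\text{FLOP/FMA}} = 32\ \tfrac{\text{FLOP}}{\text{cycle}}\ \text{(AVX2)}, \qquad 2 \times \tfrac{512}{32} \times 2 = 64\ \tfrac{\text{FLOP}}{\text{cycle}}\ \text{(AVX-512)}$$

That is the peak before any AVX clock offset is taken into account.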
DanNeely - Thursday, October 19, 2017 - link
That is the case for the server parts. Kevin G was speculating that Intel might go with a fake-it-until-you-make-it approach to keep costs and power consumption down on (as yet unreleased) consumer parts by splitting AVX-512 into two AVX-256 operations for execution.
HStewart - Thursday, October 19, 2017 - link
It appears that people are uneducated on Intel development - a lot of technology does come from server technology, as we can see with the X series - but Intel does not fake technology the way AMD did with the stated 256-bit path using two 128-bit units for AVX2 (I'm taking his word on that, so it could be wrong). But I believe Intel's technical documents clearly state 512 bits for AVX-512.
The big differences between server chips and desktop variants are:
1. They are intended for multi-CPU configurations
2. They have more advanced I/O functionality, including ECC RAM
3. They are more durable for long-term use
I have a 10-year-old dual Xeon 5160 machine and I would say the above is true - I would expect the same is true for current Xeon chips.
My guess on the i9 X series is that they are Xeon variants without the I/O enhancements and dual-processor support. This does NOT mean that two i9 X chips will function in a Xeon motherboard, or a Xeon in an i9 X motherboard.
Kevin G - Friday, October 20, 2017 - link
The instructions are 512-bit but the raw execution units perform them in two 256-bit segments. This is exactly what Intel has done in the past with SSE, when 64-bit execution units performed 128-bit operations back in the Netburst days. This would include the Xeon 5160 you mentioned.
ddriver - Friday, October 20, 2017 - link
It is important to note that even though a 256-bit operand is internally split into two 128-bit halves at the logical level, this doesn't involve two separate operations; it simply dispatches two operations that are executed simultaneously, so the throughput is not any lower. It is just a representational thing, and it was mandated by AMD's effort to create a more efficient SIMD unit, which they did.
And it is not how Intel will be supporting AVX-512 - the chips already support it, it is just disabled. Only Xeon Gold and above get the full two 512-bit units enabled; the rest get either one or zero. It seems Intel has decided to throw the dog a bone and enable one of the units on chips that previously had both disabled.
mode_13h - Tuesday, October 24, 2017 - link
The Xeon W CPUs all have both units enabled, even down to the lowly sub-$300 W-2123. Check ark.intel.com.
MrSpadge - Thursday, October 19, 2017 - link
As Zen splits those instructions up into two 128-bit parts, it doesn't get a higher throughput than with 128-bit instructions, just a small benefit from requiring fewer instructions. They're not burning more power, so they don't lower the clocks. But they also don't get the same boost in peak performance per core as Intel does.
edzieba - Thursday, October 19, 2017 - link
Clock-for-clock, a Zen core's AVX-256 throughput is half that of a Coffee Lake (or Kaby Lake) core's.
ddriver - Friday, October 20, 2017 - link
Wrong. It is split into two uops, but the two are executed simultaneously nonetheless. Well, technically not, but practically, for all intents and purposes, Zen's throughput is not any lower for most AVX-256 operations. There are a few corner cases which are not supported in hardware and are emulated, which take a hit, but those are rarely used, hence AMD's decision not to waste transistors on them for the time being.
Kevin G - Friday, October 20, 2017 - link
Only theoretical peak performance falls into that category. Real code is more than just AVX operands, which lowers the real-world benefit of having such a potentially higher throughput. See Amdahl's Law.
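As a quick illustration of that point (the numbers are made up purely to show the shape of the curve): if half of a program's runtime is in vectorizable code and the vector units get twice as fast, Amdahl's Law gives an overall speedup of only

$$S = \frac{1}{(1 - 0.5) + \frac{0.5}{2}} \approx 1.33\times$$

and even an infinitely fast vector unit could not push that program past 2x.
bcronce - Thursday, October 19, 2017 - link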
AVX can consume large amounts of power because of the nature of the processing. This is why Prime95 can make any AVX-supporting CPU run ridiculously hot. Intel had two choices: lower the clock or raise the voltage. They actually do both, depending on the SKU.
extide - Thursday, October 19, 2017 - link
That's because it only does 128 bits at a time and takes 2 clocks, whereas Intel chips do it all at once and take 1 clock. So not only do Intel's chips have a higher clock speed, but AMD takes a 50% penalty for taking 2 clocks instead of 1.
jospoortvliet - Friday, October 27, 2017 - link
Actually there are two 128-bit units, so the AMD CPUs don't do them one after another.
And in both Intel and AMD CPUs the instructions take multiple clock cycles; that's because they are big and heavy and can't be done in a single clock cycle unless you decrease the clock speed a LOT ;-)
Note that different instructions often take different numbers of clock cycles. It is certainly possible that for some, Intel takes, say, 6 and AMD 5, and for others, AMD takes 12 and Intel 10.
Santoval - Tuesday, December 12, 2017 - link
Zen (up to Epyc, since it is the same core) has no AVX-256 blocks, so there is less need to clock down. Zen can do AVX-256 by pairing two AVX-128 units, but that is not the same thing. Cannon Lake's AVX-512 units will be "true" AVX-512 ones, not 2x AVX-256 ones, so my guess is that it will need to clock down even further.
bobhumplick - Wednesday, August 22, 2018 - link
Zen runs at lower clocks to start with, and Zen does less work than Intel's AVX2. Certain operations can't be done in one clock on Zen. I run an 8700K at 5GHz with no AVX offset at ~1.35V, stable. The Intel CPU gets more AVX work done per clock than Zen, so of course it will draw more power, and throttling clocks to stay within TDP means that the Intel chip actually has a higher throughput of AVX instructions even at the lower clocks. It has nothing to do with "more efficient power management".
Elstar - Thursday, October 19, 2017 - link
I wouldn't read too much into this. Intel regularly enables or disables features in order to fulfill other goals like yield management, price point diversity, marketing, etc.
AVX-512 is actually a great example of this behavior, because Intel knows that people who need/want AVX-512 will pay dearly for it, and therefore the feature is only enabled on premium Skylake parts at the moment. Maybe Intel will still view AVX-512 as being a premium feature in the Cannon Lake timeframe, or maybe not. Either way, the die logic will almost surely be there.
shabby - Thursday, October 19, 2017 - link
Will Cannon Lake users need a new motherboard to support AVX-512?
MrSpadge - Thursday, October 19, 2017 - link
Mobo support won't change with AVX-512. Either they need a new one for Cannon Lake or not.
Flunk - Thursday, October 19, 2017 - link
Considering Intel is requiring new motherboards for Coffee Lake without any real reason, I think we can be assured they'll require new ones for Cannon Lake as well.
smilingcrow - Thursday, October 19, 2017 - link
There is a reason, but not one that everyone accepts as being valid.
Personally I don't know enough about how power delivery to chips impacts the requirement to break compatibility with a new CPU design, so I just don't know what the truth is.
Some people prefer to believe their own 'story' about the scenario even though they know no more than I do on the subject.
That just tells you something about their level of bias and intelligence rather than about the underlying issue.
But this is the age we live in, where people believe their own fake 'truths' and so are also susceptible to the fake news of others.
edzieba - Thursday, October 19, 2017 - link
Intel have stuck to two-gens-per-socket for the last decade for the consumer socket line (Socket Hx), so I expect that to continue. Whatever the successor to Coffee Lake is, it will likely use the same socket as Coffee Lake. If that's Cannon Lake, then we'd see Cannon Lake on Socket H5 ('LGA 1151-2') and Ice Lake on Socket H6.
smilingcrow - Thursday, October 19, 2017 - link
Historically that is correct, but there is strong evidence pointing to CL being a standalone platform, so it is neither forward nor backward compatible.
gchernis - Thursday, October 19, 2017 - link
Very exciting developments! (A copy-paste bug: "if the same data is accessed temporally after the line is flushed")
HStewart - Thursday, October 19, 2017 - link
First of all, these next-generation chips sound like a major CPU architecture change. My first job was almost 7 years of assembly language programming, and one thing is for sure: REP MOV-type memory moves are used a lot - so Ice Lake looks like quite a fast chip, especially if it is implemented so that existing code runs on the new microcode.
But one thing about Cannon Lake: I thought it was primarily aimed at low-power mobile chips, and also that it was planned before year end. Maybe the desktop chips are 2018.
This could mean that AVX-512 is coming to laptops, which is going to be significant - because more applications are likely to use it, since mobile is a bigger market than desktop nowadays.
29a - Thursday, October 19, 2017 - link
I would like to see benchmarks with AVX enabled vs disabled, if that is possible.
quaz0r - Thursday, October 19, 2017 - link
Sweet! It's about time locusts were able to leverage the power of AVX-512. Honestly though, "consumer" is the worst term EVER. EVER. Did I mention EVER? It maybe works to refer to people as locusts within the context of buying food perhaps, but CPUs? Why can't we ever just say "customers" or something that is at least neutral?
Elstar - Thursday, October 19, 2017 - link
Just FYI – Intel has been incrementally optimizing REP MOV since at least Nehalem. This particular optimization should make it possible to use REP MOV in cases where the amount of data copied is "short".
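To make that concrete, here is a minimal sketch (my own illustration, not Intel's or anyone's actual memcpy; it assumes x86-64 with GCC/Clang inline-asm syntax) of the primitive that the "fast short REP MOV" work targets:

```c
/* Sketch: a byte copy using REP MOVSB, the instruction whose microcoded
 * startup cost historically made it a poor choice for short copies.
 * Assumes x86-64 and GCC/Clang inline assembly. */
#include <stddef.h>

static void rep_movsb_copy(void *dst, const void *src, size_t n) {
    /* REP MOVSB copies RCX bytes from [RSI] to [RDI]; all three registers
     * are updated by the instruction, hence the "+" constraints. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
}
```

In practice you would just call memcpy and let the compiler and libc pick the strategy; the interesting part is that the microcode improvements aim to make this one-instruction form competitive for small sizes too, not just bulk copies.
James5mith - Thursday, October 19, 2017 - link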
And which CPU generation gets the onboard TB3 controller we were promised?
smilingcrow - Thursday, October 19, 2017 - link
There have been slides posted online about that, and I think it's the next one, which is not backwardly compatible with CL.
HStewart - Thursday, October 19, 2017 - link
I would think TB3 integration is part of the motherboard - and since it has functionality related to video, it is probably aimed more at mobile platforms.
Unlike AMD, desktop is a limited market for Intel - mobile is where it is at for Intel these days.
smilingcrow - Thursday, October 19, 2017 - link
They are integrating the controller, which is currently a separate chip, directly into the actual CPU.
Of course the motherboard will need to support its features, which is the case for all I/O that a CPU offers.
Your fallacious claim that the desktop is not important for Intel shows how out of touch you are.
TB3 will be integrated into desktop CPUs for sure.
moozoo - Thursday, October 19, 2017 - link
Wake me up when they add AVX512ER to a consumer CPU; until then I'll stick with GPUs.
mode_13h - Friday, October 20, 2017 - link
Do GPUs have hardware acceleration of exponential functions? Surely, they must accelerate reciprocal and probably even x^-0.5.
Regardless, if you can currently use GPUs, then any AVX-512 support is unlikely to change things for you. Xeon Phi (KNL) has it and still gets stomped on raw compute perf by big GPUs.
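For context on what AVX512ER actually provides, here is a sketch (my own, assuming a KNL-class target and GCC's -mavx512er flag; the intrinsic names are from memory of Intel's intrinsics guide, so double-check them before relying on this):

```c
/* Sketch of the AVX-512ER approximation instructions available on Xeon Phi
 * (Knights Landing). Assumes a compiler with -mavx512er support; these are
 * not present on the desktop/server Skylake parts discussed above. */
#include <immintrin.h>

__m512 approx_kernel(__m512 x) {
    __m512 recip = _mm512_rcp28_ps(x);    /* ~1/x, roughly 28-bit accurate    */
    __m512 rsqrt = _mm512_rsqrt28_ps(x);  /* ~x^-0.5, roughly 28-bit accurate */
    __m512 exp2x = _mm512_exp2a23_ps(x);  /* ~2^x, roughly 23-bit accurate    */
    return _mm512_add_ps(_mm512_add_ps(recip, rsqrt), exp2x);
}
```

GPUs expose similar fast-math paths (reciprocal, rsqrt, exp2) through their special-function units, so the two answer the same kind of need.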
peevee - Friday, October 20, 2017 - link
What a mess. Instead of a general, highly efficient vectorized command set, they have to have a separate instruction for every possible need.
mode_13h - Tuesday, October 24, 2017 - link
You're being unrealistic. The combination of possible operations and operands is dizzying. It took Intel several generations to flesh out their previous vector instructions, and those didn't even support data types like fp16.
peevee - Wednesday, October 25, 2017 - link
I am being realistic. There are fundamental architectural decisions made in the '70s (and even as early as the '40s) which need to be ditched to extract the full potential from modern manufacturing technologies, but it is possible, and much better than this constant abomination of inventing new commands.
mode_13h - Wednesday, October 25, 2017 - link
No, you're not being realistic if you think they're going to just flush x86 down the toilet and start with a clean sheet of paper. There are CPUs like that, but targeted more towards hyperscale and other specialized niches.