29 Comments
Paul Tarnowski - Thursday, September 20, 2012 - link
So it's good that the locking is finally being addressed at the CPU level, but that just means that even fewer developers will bother using fine-grained locking.

Which isn't necessarily a bad thing, because they will be able to either spend the time and money on something else, or save it, but it does mean that their software will be less efficient on older CPUs. Which in turn means that unless AMD comes up with a similar system that achieves the same effect, it will fall even further behind. In the short term, of course, this just means that Haswell might improve any multi-threaded program that has been recompiled against the updated libraries.
The one thing that does make me hesitant is that it works on only one cache line at a time, as well as all the abort conditions. That makes me think that the graph shown is a best-case scenario, and the actual improvements in real-world scenarios would be far smaller.
anubis44 - Thursday, September 20, 2012 - link
AMD and Intel have cross-licensing agreements for instructions each of them comes up with. That's why Intel can build AMD64-compatible CPUs (AMD designed the 64-bit extensions we're all using in both AMD and Intel CPUs) and AMD has SSE instructions. This is an automatic thing, so no further agreements need to be made. You can bet these instructions will show up in the next generation of AMD CPUs after Intel releases them.

Brutalizer - Saturday, September 22, 2012 - link
Sun built transactional memory years ago with its SPARC Rock CPU, to be used in the new Supernova Solaris servers. But for various reasons Rock was killed (I heard that it drew too much power, among other problems).

The good news is that the Rock research is not wasted; most of it is used in newer products. The upcoming SPARC T5, to be released this year, has transactional memory. It will probably be the first CPU sold that offers TM.
HibyPrime1 - Thursday, September 20, 2012 - link
I don't really understand why this has to be implemented in hardware.

Can't a developer just write the program assuming the threads won't interfere with each other? After doing that, the program does a simple check to see if something went wrong, and if it did, it falls back to coarse-grained locking.
I'm not sure how this is supposed to make it significantly easier for the developers. I'm sure that I'm missing something, but doesn't TSX just mean the developer doesn't have to write code that checks to see if something got broken? Seems to me that part would be the easiest part of all this locking code...
twotwotwo - Thursday, September 20, 2012 - link
The short answer is, an approach like that kind of *does* exist (search for optimistic concurrency control), but it still takes work to detect when things went wrong and be able to clean up.

In Intel's bank example, you might need some kind of indexed concurrency-proof transaction history so the bank can know that when I gave $50 to Alice and $60 to Bob, both transactions used a $100 starting balance. And the code needs to know how to undo a transaction that collided with another. To complicate things, many live systems deal with larger transactions than just two-person money transfers like Intel's example. Optimistic control can still be a step up from spinlocks (or people would never use it) but it doesn't come for free.
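To make the optimistic approach concrete, here is a minimal sketch of single-variable optimistic control using a compare-and-swap retry loop (the account type and amounts are hypothetical, not Intel's actual example):

```cpp
#include <atomic>

// Hypothetical account balance, in cents.
std::atomic<long> balance{10000};  // $100.00

// Optimistically withdraw `amount`: read, compute, and commit only if
// no other thread changed the balance in between. Retry on collision.
bool withdraw(long amount) {
    long old_bal = balance.load();
    do {
        if (old_bal < amount)
            return false;  // insufficient funds; abandon the "transaction"
        // compare_exchange_weak commits the new balance only if `balance`
        // still equals `old_bal`; on failure it reloads `old_bal` for retry.
    } while (!balance.compare_exchange_weak(old_bal, old_bal - amount));
    return true;           // committed without ever taking a lock
}
```

The retry loop is exactly the detect-and-clean-up work described above; once a transaction touches more than one word, a single compare-and-swap no longer suffices, which is where transaction histories (or hardware support) come in.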
Tuna-Fish - Thursday, September 20, 2012 - link
Transactional memory can, and has been, implemented in software. The typical examples are Clojure and Haskell. However, doing the tracking in software generally takes a lot of resources, especially because you have to deal with all kinds of race conditions. Remember that without some kind of hardware support for concurrency, every single write and read operation is independent, and something could go wrong at any point, including during the verify/restore phase.

name99 - Friday, September 21, 2012 - link
This is not a technology to make parallel programming automatic or even easier. It is a technology to make ONE PART of parallel programming, namely the locking, MORE EFFICIENT. That is all.

This has the consequence that one can write an app using fewer, coarser-grained locks and have it perform as well as if you'd used finer-grained locks, but again, that is all. In particular:
- it doesn't do the locking for you, it doesn't tell you what needs to be protected by locks, it doesn't catch stupid usage of locks, etc etc
- it doesn't help with everything else related to parallel programming, from choosing appropriate data structures to choosing appropriate algorithms.
As for doing it in SW, well, yes, at the end of the day you can do EVERYTHING in software. But Intel is in the business of moving as much as possible of what is slow in software into faster hardware. That's why we have everything from branch prediction to AES instructions to QuickSync in modern CPUs.
Finally, it's foolish to obsess too much about implementation details, like how the L1 cache is used. EVERYTHING in HW is a tradeoff and, just like you can invent some pathological code that runs slower under branch prediction, you can invent pathological code that runs slower under this implementation of the locking mechanism. As always, Intel will look at how these extensions are used in practice, and how they fail, and will modify how they are implemented as a result. This is just common sense.
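To illustrate the point about coarse locks performing like fine-grained ones, here is a minimal sketch of the lock-elision pattern using the RTM intrinsics from immintrin.h (the lock and flag names are placeholders, and real implementations handle more corner cases):

```cpp
#include <immintrin.h>  // _xbegin/_xend/_xabort; compile with -mrtm
#include <mutex>

std::mutex big_lock;         // one coarse lock for a whole table
bool big_lock_held = false;  // placeholder flag; real code inspects the lock

void update(int *table, int i, int v) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Transactional path: the lock is only READ, so threads touching
        // different entries run in parallel, as if fine-grained locks existed.
        if (big_lock_held)
            _xabort(0xff);   // someone really holds the lock; stay correct
        table[i] = v;
        _xend();             // commit the whole section atomically
    } else {
        // Fallback path: the transaction aborted, so take the lock for real.
        std::lock_guard<std::mutex> guard(big_lock);
        big_lock_held = true;
        table[i] = v;
        big_lock_held = false;
    }
}
```

Note how this matches the comment above: the hardware makes the one coarse lock cheap when there is no real contention, but it decides nothing about what needs protecting; that design work is unchanged.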
softdrinkviking - Monday, September 24, 2012 - link
Hmm. So would you say that recent generation Intel CPUs are well utilized? All of the instructions and features not only make sense, but are well utilized by a majority of developers?

epobirs - Thursday, September 27, 2012 - link
That depends on what you consider an acceptable time scale for widespread usage of a hardware feature. Everything has to start somewhere. Nobody today would bother to mention that their software takes advantage of MMX and its successors, but at one time it carried some novelty value. The real benefits came when it was so common in hardware and compilers that it no longer merited mention.

epobirs - Thursday, September 27, 2012 - link
For the traditional reason you create any dedicated function in hardware: performance. Intel is betting this is going to matter hugely to scaling up the value of multi-core processors.

Such things have a long history. MMX was the result of examining many, many pieces of software and looking for functions then done entirely in software on most systems that could be accelerated for a minimal investment in transistor real estate.
There was a period of a few years when a 3D accelerator board was a separate item from the video board. The 3dfx Voodoo series worked this way for several generations until the company faltered in trying to transition to complete video solutions on a single board. In that time 3D had gone from something exclusively of interest to gamers and some other specialty apps to a thing expected of every system to some extent. It wasn't long before integrated graphics adapters had the kind of 3D performance that formerly led people to make a costly separate purchase to obtain it.
If something is worth putting in the hardware, it will reveal itself through what is done in the software. From there it is only a question of how many transistors it takes to embody, and at what cost. If the numbers are right, into the hardware it goes.
1008anan - Thursday, September 20, 2012 - link
Good summary, Johan De Gelas. I look forward to future articles that further elaborate on how exactly Transactional Synchronization Extensions (TSX) achieve hardware-accelerated transactional memory in a scalable, generalized way.

Part of the solution, in my opinion, is for a core that can run 2 or more simultaneous threads to execute only 1 thread at a time for single-threaded, computationally heavy workloads. To give an example, a single Haswell core can execute 32 single precision (32-bit) operations per clock. At a 2 gigahertz SoC speed, a single core can execute 64 billion operations per second with single-threaded code. Unfortunately this can only help so much. To make major performance gains, an efficient, generalized, scalable way needs to be found to distribute single-threaded computational workloads to multiple different cores. Much easier said than done.
extide - Saturday, September 22, 2012 - link
32 single precision ops is wrong. How many ops they can do per clock has to do with the front end and how many ports it has, not how many bits a particular instruction is.

extide - Saturday, September 22, 2012 - link
To further that, Ivy Bridge has 6 ports and Haswell 8, but the ports don't all necessarily have the same capabilities.

twotwotwo - Thursday, September 20, 2012 - link
Very smart idea, very clever to implement it with backwards compatibility, and it's good that Intel's working out uses for die area that aren't just multiplying cores or cache.

But a sad thing is, when this feature really helps your app, like doubles or triples throughput, then it means you must be leaving speed on the table on _other_ platforms--manycore Ivy Bridge or AMD or whatever--because you didn't take fine-grained locks. If the difference matters, then for many years you'll have to go to the effort to do fine-grained locks and make it fast on Ivy Bridge too.
The other thing is, the problem of parallelizing tasks is deeper than just fine-grained locks being tricky. If you want to make, say, a Web browser faster by taking advantage of multiple cores, you still have to do deep thinking and hard work to find tasks that can be split off, deal with dependencies, etc. Smart folks are trying to do that, but they _still_ have hard work when they're on systems with transactional memory.
That may be overly pessimistic. It's likely apps today _are_ leaving performance on the table by taking coarse locks for simplicity's sake, and they'll get zippier when TSX is dropped in. Or maybe transactional memory will be everywhere, or on all the massively multicore devices anyway, faster than I think. Or, finally, maybe Intel knows TSX won't have its greatest effects for a long time but is making a plan for the very long term like only Intel can.
Paul Tarnowski - Thursday, September 20, 2012 - link
I think it's the last one more than anything else. It really seems to be about setting up architecture for the future. Right now with four and eight cores the losses aren't that high, and effects won't be seen on anything but servers. While it is a big deal, it will be even less important to the consumer market than most other architecture extensions.

USER8000 - Thursday, September 20, 2012 - link
This is an article from them nearly three years ago: http://blogs.amd.com/developer/2009/11/17/the-velo...
NeBlackCat - Thursday, September 20, 2012 - link
I stopped reading at the end of the 2nd para after the bar chart on the first page:

"The root of the locking problems is that locking is a trade-off. Suppose you have a shared data structure such as a small table or list. If the developer is time constrained, which is often the case, the easiest way to guarantee consistency is to let a certain thread lock the entire data structure (e.g. the table lock of MySQL MyISAM). The thread acquires the lock and then executes its code (the critical section). Meanwhile all other threads are blocked until the thread releases the lock on the structure (*). However, locking a piece of data is only necessary if two threads want to write to it. (**)"
(*) all other threads aren't locked, only those that also need access to the same data.
(**) locking a piece of data is necessary if even *one* thread wants to write to it (else you risk a reader reading it before the writer has finished writing; a minimal sketch of this reader/writer rule follows below)
And if the first locker is doing something that may take (in CPU terms) considerable time, likely in the database scenario, then the OS will probably schedule something else (there are always other processes and threads wanting to run) on a core/hyperthread running a blocked thread, so it won't sit idle anyway.
Unless things have changed since I was a real time software engineer, anyway
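To make the (**) point concrete, here is a minimal reader-writer sketch (the table and names are hypothetical): many readers can proceed together, but even a single writer must exclude them all, or a reader may observe a half-finished update.

```cpp
#include <shared_mutex>
#include <vector>

std::shared_mutex table_mutex;
std::vector<int> table(100);

int read_entry(std::size_t i) {
    // Any number of readers may hold the shared lock at once.
    std::shared_lock<std::shared_mutex> reader(table_mutex);
    return table[i];
}

void write_entry(std::size_t i, int v) {
    // Even ONE writer needs exclusive access; otherwise a concurrent
    // reader could see the entry mid-update.
    std::unique_lock<std::shared_mutex> writer(table_mutex);
    table[i] = v;
}
```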
Wardrop - Thursday, September 20, 2012 - link
I think you're nitpicking. We all understand the locking problem (those of us that it's relevant to, anyway). There's no point in the author going into more detail to clarify what we all should already know - he gives us enough information for us to at least know what he's talking about, and that was the point of that paragraph.

JohanAnandtech - Thursday, September 20, 2012 - link
"all other threads arent locked, only those that also need access to the same data."With a bit of goodwill, you would accept that this is implied.
" then the OS will probably schedule something else "
Right. But it is no free lunch. In the case of a spinlock, there will be several attempts to lock and then a context switch. That is thousands of cycles lost without doing any useful work. BTW if you really want to see some in-depth cases, we linked to
http://www.anandtech.com/show/5279/the-opteron-627...
http://www.anandtech.com/show/5279/the-opteron-627...
which goes in a lot more detail.
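For a sense of where those cycles go, here is a minimal spin-then-yield lock sketch (illustrative only; production spinlocks add backoff and pause instructions): the failed attempts burn cycles on the lock's cache line, and the eventual yield is where the costly context switch happens.

```cpp
#include <atomic>
#include <thread>

std::atomic_flag lock_word = ATOMIC_FLAG_INIT;

void spin_lock() {
    int attempts = 0;
    // Every failed test_and_set is wasted work: the thread makes no
    // progress, it just hammers the cache line holding the lock.
    while (lock_word.test_and_set(std::memory_order_acquire)) {
        if (++attempts > 1000) {
            // Give the core to someone else. The context switch itself
            // costs thousands of cycles, as noted above.
            std::this_thread::yield();
            attempts = 0;
        }
    }
}

void spin_unlock() {
    lock_word.clear(std::memory_order_release);
}
```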
gchernis - Thursday, September 20, 2012 - link
The OP did not explain why software transactional memory (STM) is good before diving into Intel's hardware solution. The current breed of databases uses STM, but so can other applications. The prerequisite is the assumption that there's much more reading than writing going on.

dealcorn - Thursday, September 20, 2012 - link
The graphs indicate the benefit of HLE versus no HLE gets bigger as the number of threads increases. However, they lack scaling on the horizontal axis: I have no sense of the benefit at real-world numbers like 8 threads, 12 threads and 16 threads.

Is the primary benefit of RTM greater CPU performance or programmer productivity?
GatoRat - Thursday, September 20, 2012 - link
With the right application, you get both. With the wrong application, you get an apparent increase in productivity at the cost of performance. In the end, you still need smart, experienced and expensive engineers to know when to use this technology and when not to.

GatoRat - Thursday, September 20, 2012 - link
It fails to properly explain what STM is. Unfortunately, most of the descriptions I've read make the same fundamental mistake made in this article--you don't lock a data structure or variables, you lock code. This difference is important because it means you can control it very precisely. These same articles then basically argue that STM is beneficial to lazy programmers who use locks too broadly. Yet programmers who do that aren't going to use STM, which is far more complicated and for which the benefits are dubious outside of some very narrow applications.

My own experience is that non-locking algorithms are highly problematic and rarely give a performance advantage over carefully designed locking algorithms. Moreover, in several situations a crudely designed locking algorithm can perform extremely fast due to the decreased complexity of the overall algorithm.
Bottom line is that what Intel has been doing, even with Nehalem, is fantastic, but thinking this will obviate the need for spending a lot of time and effort and brain power is delusional. On the other hand, it will help in the myriad of situations where management contradicts reality and says "it's good enough."
GatoRat - Thursday, September 20, 2012 - link
Incidentally, STM imposes a very real overhead which can easily cause worse performance than the alternative. Between the overhead and the increased complexity in code, in lightly loaded systems the performance will always be worse.

The hype is similar to that of parallel programming and will have just as dismal results. Like with parallel programming, you need a certain mindset and a problem which is conducive to being solved in that manner. I recently worked on a problem which simply had too many threads and asynchronous events due to the underlying (and well known) framework. STM would have been a disaster. I designed a pseudo-parallel architecture, but was never able to implement and test it due to the project being put on hiatus (by new management that was utterly clueless).
clarkn0va - Thursday, September 20, 2012 - link
You know these new extensions will succeed and rapidly gain market share because the marketers managed to use a capital X in the name.

glugglug - Thursday, September 20, 2012 - link
Other than real lock contention, where HLE will net a small improvement even over fine-grained locks, the situations where it won't work (i.e. running out of L1 cache lines) come into play just as much with only 1 thread running, and the code that gets hit by these will lose single-threaded performance. Worst case would be a single-threaded performance hit near 50%, but I think that would be extremely rare. Still, the amount of multithreaded gain is deceptive when you are hindering the single-threaded performance.

Also, in order to avoid that single-threaded hit you end up going back to fine-grained locks -- the coarse locks are more likely to hit the restrictions and have to re-run.
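The capacity problem described here shows up in the status code _xbegin returns; here is a sketch (assuming the RTM intrinsics from immintrin.h, with a hypothetical retry policy) of how code might stop retrying once the abort is a capacity abort rather than a transient conflict:

```cpp
#include <immintrin.h>  // _xbegin/_xend, _XABORT_CAPACITY; compile with -mrtm

// Attempt `critical_section` transactionally. Returns true if it
// committed; false means the caller should take the fallback lock.
bool try_transaction(void (*critical_section)()) {
    for (int retry = 0; retry < 3; ++retry) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            critical_section();
            _xend();             // commit
            return true;
        }
        // A capacity abort means the working set exceeded what the
        // L1-based tracking can hold -- retrying is futile, fall back.
        if (status & _XABORT_CAPACITY)
            return false;
        // Other aborts (e.g. a conflicting write) may succeed on retry.
    }
    return false;
}
```

This is also why the single-threaded hit is bounded in practice: code whose working set always overflows pays for one or two doomed attempts and then runs the plain locked path.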
JonBendtsen - Friday, September 21, 2012 - link
Where is the scaling on the graphs? Without scales the graphs are useless.

wwwcd - Friday, September 21, 2012 - link
Conclusion: what's needed is to double or quadruple the amount of first-level cache, rather than continuing to torture software with tricks to optimize the use of a capacity that has been too small since time immemorial. Of course, this should be accompanied by the necessary improvements to the design of the CPU cores and adjustments to the sizes of the other caches.

It is also necessary to increase the throughput of the caches, either by widening the bus, by increasing their operating frequency, or by both methods.
mallik79 - Monday, September 24, 2012 - link
I work on improving performance of a large commercial database. The challenge seems to be multi-CPU more than multi-core. Bottlenecks get magnified by adding CPUs instead of cores. (We have custom spinlock code; we do not use system-provided primitives.)
However many cores you add to a CPU, people still want huge multi-CPU boxes to run their databases. So does this transactional memory caching make life tougher for multi-CPU systems? Since the lock is nearer to one particular CPU, does it starve the cores of the other CPUs?
Thanks for all the in-depth reviews.