146 Comments

  • Shadowmaster625 - Friday, May 8, 2015

    This kind of provides more proof that Intel would do well to increase its SMT threads per core count.
  • nathanddrews - Friday, May 8, 2015

    I'd like to see that alongside another GHz War.
  • Brutalizer - Sunday, May 10, 2015

    This is silly. For serious enterprise use taking on some serious workload, you need large servers. Not low end 2, 4 or 8 sockets. The largest x86 servers are all 8 sockets, there are no larger servers for sale and have never been. If we go into high end servers, we will find POWER8 servers, and SPARC servers. The top SAP benchmark is 836,000 saps, and held by a large Fujitsu SPARC M10-4S server. So... no, x86 has no performance of interest. Scalability is difficult, and that is why x86 stops at 8 sockets, and only Unix/Mainframes go beyond, up to 64 sockets. Remember that I talk about business servers, running monolithic software like SAP, databases, etc. - also called Scale-up servers.

    The difference is Scale-out servers, i.e. HPC clusters, such as SGI UV2000 servers, resembling supercomputers with 100s of sockets. Supercomputers are all clusters. Scale out servers can not run business software, because the code branches too much. They can only run HPC parallel workloads, where each PC node computes in a tight for loop, and at the end summarizes the result of all nodes. The latency in far away nodes is very bad, so scale out servers can not run heavily branching code (business software such as SAP, databases, etc). All top SAP benchmarks are held by large Unix scale up servers, there are no scale out servers on the SAP benchmark list (SGI, ScaleMP).

    So, if we talk about serious business workloads, x86 will not do, because they stop at 8 sockets. Just check the SAP benchmark top - they are all more than 16 sockets, i.e. Unix servers. X86 are for low end and can never compete with Unix such as SPARC, POWER etc. Scalability is the big problem and x86 has never got past 8 sockets. Check SAP benchmark list yourselves, all x86 are 8 sockets, there are no larger.

    Sure, if you need to do HPC number crunching, then SGI UV2000 cluster is the best choice. But clusters suck at business software.
  • Kevin G - Monday, May 11, 2015

    Oh this again? The UV 2000 is a single large SMP machine as it has a unified memory space, cache coherent and a single OS/hypervisor can manage all the processors. Take a look at this video from SGI explaining it:
    https://www.youtube.com/watch?v=KI1hU5g0KRo
    And then while you're at it, watch this following video where you see first hand that Linux sees their demo hardware as a 64 socket, 512 core machine (and the limit is far higher):
    https://www.youtube.com/watch?v=lDAR7RoVHp0

    As for your claim that systems like the UV2000 cannot run scale-up applications, SGI sells a version specifically for SAP's HANA in-memory database because their hardware scales all the way up to 64 TB of memory. http://www.enterprisetech.com/2014/06/03/sgi-scale...

    As for x86 scaling to 8 sockets, that is the limit for 'glueless' topologies but you just need to add some glue to the interconnect system. Intel's QPI bus for coherent traffic is common between their latest Xeons and Itanium chips. This allowed HP to use the node controller first introduced in an Itanium system as the necessary glue logic to offer the 16 socket, x86 based Superdome X 'Dragonhawk' system. This is not the system's limit either, as the node controller and recent Xeons could scale up to 64 sockets. Similarly, when IBM was still in the x86 server game, they made their own chipsets and glue logic to offer 16 or 32 socket x86 systems for nearly a decade.

    As for SAP benchmarks, x86 is in the top 10 and that is only with an 8 socket machine. We oddly haven't yet seen results from HP's new Superdome X system which should be a candidate for the top 10 as well.

    One of these days you'll drop the FUD and realize that there are big SMP x86 offerings on the market. Given your history on the topic, I'm not going to hold my breath waiting.
  • Brutalizer - Monday, May 11, 2015

    There are no stupid people, only uninformed people. You sir, are uninformed.

    First of all, scale up is not equal to SMP servers. Sure, SGI claims UV2000 servers are SMP servers (Microsoft claims Windows is an enterprise OS) but no one uses SGI servers for business software, such as SAP or databases which run code that branches heavily. Such code penalizes scale out servers such as SGI clusters because latency is too bad in far away nodes. The only code fit for scale out clusters is code running a tight for loop on each node, doing scientific computations.

    But instead of arguing over this, disprove me. I claim no one uses SGI or ScaleMP servers for business software such as SAP or databases. Prove me wrong. Show me a SAP benchmark with an x86 server having more than 8 sockets. :)

    SAP Hana, which you mentioned, is a clustered database. It is not monolithic, and a monolithic database running on a scale up server easily beats a cluster. It is very difficult to synchronize data among nodes and guarantee data integrity (rollback etc) on a cluster. It is much easier and faster to do on a scale up server. Oracle M7 server with 32 sockets and 64 TB ram will easily beat any 64 TB SGI cluster on business software. It has 1,024 cores and 8,192 threads. SPARC M7 cpu is 3-4x faster than the SPARC M6 cpu, which is the latest generation cpu, having several records today. One M7 cpu does SQL queries at a rate of 120 GB/sec, whereas an x86 cpu does... 5GB/sec (?) sql queries.

    Oracle data warehouse, which some have tried on a scale out cluster, is not a database. It is exclusively used for data mining of static data, which makes heavy use of parallel clustered computations. No one modifies data on a DWH so you don't need to guarantee data integrity on each node, etc.

    So, please show me ANY x86 server that can compete with the top SPARC servers reaching 836,000 saps on the SAP benchmark. SAP is monolithic, which means scaling is very difficult: if you double the number of sockets, you will likely gain 20% or so. Compare the best x86 server on SAP to large Unix servers. SGI or ScaleMP or whatnot. The largest x86 server benchmarked on SAP has 8 sockets. No more. There are no scale out servers on SAP. Prove me wrong. Go ahead.

    The HP superdome server is actually a Unix architectured server, that is slowly being transformed to x86 server. Scaling on x86 is difficult, so the new and shiny x86 superdome will never scale as well as the old Unix server having 32 sockets, x86 superdome will remain at 8-16 sockets for a long time, maybe another decade or two. IBM has also tried to compile linux onto their P795 unix server, with bad results but that does not make the p795 a linux server, it is still a unix server. Good luck on getting a good SAP score with any x86 server, HP Superdome, SGI UV2000, whatever.

    Instead of arguing, silence me by proving me wrong: prove that ANY x86 server can compete with large Unix servers on SAP. Show us benchmarks that rival SPARC 836,000 saps. :) It is very uninformed to believe that x86 can get high SAP scores. As I said, you will not find any high x86 scores. Why? Answer: x86 does not scale. x86 clusters can not run SAP because of bad latency, that is why neither SGI nor ScaleMP is benching SAP.

    FUD, eh? It is you who spreads FUD, claiming that x86 can get high scores on business software. If you speak true, then you can show us any good SAP score. If you can not, you are uninformed and hopefully, you will not say so again.

    Btw, both SGI and ScaleMP claim their servers are scale out, only suited for HPC number crunching (not for business scale up workloads). I can link to these claims straight from them both.
  • patrickjp93 - Monday, May 11, 2015

    The amount of BS you post is astounding! Latencies over Infiniband are in the < 10nanosecond range these days. Rollback and sync are also easy! MPI fixed those problems a long damn time ago. And monolithic anything is far too expensive and has the same failure rates as x86 machines. Scale up is only worthwhile if you're limited to a janitor's closet for your datacenter.

    SAP is also a fundamentally broken benchmark. Linpack created an equivalent for it a long time ago, and x86 basically ties in that one.
  • patrickjp93 - Monday, May 11, 2015

    God damnit Anandtech create an edit feature!

    Furthermore, do you have any knowledge of the networking fabric Intel rolled out a few years ago? Scaling issues in x86 died back with Ivy Bridge.

    There is also no scale-up workload that cannot be refactored for scale-out with minimal loss in scaling.
  • Brutalizer - Tuesday, May 12, 2015

    @KevinG
    You are still uninformed. But let me inform you. I will not answer to each post, I will only make one post, because otherwise there will be lot of different branches.

    First of all, you post a link about the US Postal Service using SGI UV2000 to successfully run an Oracle database, which supposedly proves that I am wrong, when I claim that no one uses SGI UV2000 scale-out clusters with 100s of sockets, for business software such as SAP or databases.

    Here we go again. Well for the umpteenth time, you are wrong. The Oracle database in question is TimesTen, which is an IMDB (In Memory DataBase). It stores everything in RAM. The largest TimesTen customer has 2TB data. Which is tiny compared to a real DB:
    http://www.oracle.com/technetwork/products/timeste...

    I will quote this technical paper about In Memory DataBases below:
    http://www.google.se/url?sa=t&rct=j&q=&...

    The reason you store everything in RAM, instead of disk, is you want to optimize for analytics, Business Intelligence, etc. If you really want to store data, you use permanent storage. If you only want fast access, you use RAM. TimesTen is not used as a normal DB, it is a niche product similar to a Data WareHouse exclusively for analytics. Your link has this title: "U.S. Postal Service Using Supercomputers to Stamp Out Fraud"
    i.e. they use the UV2000 to analyse and find fraud attempts. Not to store data. Just analyse read queries, like you would a Data WareHouse.

    A normal DB alters data all the time, inducing locks on rows, etc. In Memory DataBases often don't even use locking!! They are not designed for modifying data, only reading.
    Page 44: "Some IMDBs have no locking at all"

    This lock cheating partly explains why IMDBs are so fast, they do small and simple queries. US Postal Service has a query roundtrip at 300 milliseconds. A real database takes many hours to process some queries. In fact, Merrill Lynch was so happy about the new Oracle SPARC M6 server, because it could even complete some large queries which never even finished on other servers!!!

    Being read only makes IMDBs easy to run on clusters: you don't have to care about locking, synchronizing data, data integrity, etc - which makes it easy to parallelize. Like SAP Hana, Oracle TimesTen is clustered. Both are clustered DBs and run on different partition nodes, i.e. horizontal scaling = scale-out.
    Page 21: "IMDBs Usually scale horizontally over multiple nodes"
    Page 55-60: "IMDB Partitioning & Multi Node Scalability".

    Also, TimesTen is a database on the application layer, a middleware database!! A real database acts at the back end layer. Period. In fact, IMDB often acts as cache to a real database, similar to a Data WareHouse. I would not be surprised if US Postal Service uses TimesTen as a cache to a real Oracle DB on disk. You must store the real data on disk somewhere, or get the data from disk. Probably a real Oracle DB is involved somehow.
    Page 71-72: "TimesTen Cache Connect With Oracle Database"

    So, again, you can not run business software on scale-out servers such as SGI or ScaleMP. You need to redesign and rewrite everything. Look at SAP Hana, which is a redesigned clustered RAM database. Or Oracle TimesTen. You can not just take a business software such as SAP or Oracle RDBMS Database and run it on top of an SGI UV2000 cluster. You need to redesign and reprogram everything:
    Page 96: "Despite the blazing performance numbers just seen, don’t expect miracles by blindly migrating your RDBMS to IMDB. This approach may fetch you just about 3X or lower:... Designing your app specifically for IMDB will be much more rewarding, sometimes even to the tune or 10 to 20X"

    Once again, IMDBs can not replace databases:
    Page 97: "Are IMDBs suited for OLAP? They are getting there, but apart from memory size limitations, IMDB query optimizers probably have a while to go before they take on OLAP"

    http://www.google.se/url?sa=t&rct=j&q=&...
    Page 4: "Is IMDB a replacement for Oracle? No"

    I don't know how many times I must say this? You can not just take a normal business software and run it on top of SGI UV2000. Performance would suxx big time. You need to rewrite it as a clustered version. And that is very difficult to do for transaction heavy business software. It can only be done in RAM.

    Here we even see a benchmark of Oracle RDBMS vs Oracle TimesTen. And for large workloads, the Oracle RDBMS database is faster.
    http://www.peakindicators.com/index.php/knowledge-...

    So you are wrong. Nobody uses SGI UV2000 scale out clusters to run business software. The only way to do it is to redesign and rewrite everything to a clustered version. You can never take a normal monolithic business software and run it on top of SGI UV2000. Never. Ever.

    @Patrickjp93
    No, you are wrong, latency will be bad in large scale out clusters. The problem with scaling, is that ideally, every socket needs a direct data communication channel to every other socket. Like this. Here we see 8 sockets each having a direct channel to every other socket. This is very good and gives excellent scaling:
    http://regmedia.co.uk/2012/09/03/oracle_sparc_t5_c...

    Here we have 32 sockets communicating to each other in a scale-up server. We see that at most, there is one step to reach any other socket. Which is very good and gives good scaling, making this SPARC M6 server suitable for business software. Look at the mess with all interconnects!
    http://regmedia.co.uk/2013/08/28/oracle_sparc_m6_b...

    Now let's look at the largest IBM POWER8 server E880 sporting 16 sockets. We see that only four sockets communicate directly with each other, and then you need to do another step to reach another four socket group. To reach far away sockets, you need to do several steps, on a smallish 16 socket server. This is cheating and scales badly.
    http://www.theplatform.net/wp-content/uploads/2015...

    Here is another example of a 16 socket x86 server. Bull Bullion. Look at the bad scaling. Every socket is grouped as four and four. And to reach sockets far away, you need to do several steps. This construction might be exactly like the IBM POWER8 server above and is bad engineering. Not good scaling.
    https://deinoscloud.files.wordpress.com/2012/10/bu...

    In general, if you want to connect each socket to every other, you need O(n^2) channels. This means for a SGI UV2000 with 256 sockets, you need 35.000 channels!!! Every engineer realizes that it is impossible. You can not have 35.000 data channels in a server. You need to cheat a lot to bring down the channels to a manageable number. Probably SGI has large islands of sockets, connected to islands with one fast highway, and then the data needs to go into smaller paths to reach the destination socket. And then locking signals will be sent back to the issuing socket. And forth to synch. etc etc. The latency will be VERY slow in a transaction heavy environment. Any engineer sees the difficulties with latency. You can have throughput or low latency, but not both.
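
    As a rough check on the full-mesh arithmetic above, here is a minimal Python sketch. It assumes one bidirectional channel per pair of sockets, so n sockets need n(n-1)/2 channels; the function name is made up for illustration and nothing here is taken from SGI or Oracle documentation. For 256 sockets it gives 32,640 channels, in the same ballpark as the figure quoted above.

```python
def full_mesh_links(sockets: int) -> int:
    """Channels needed so that every socket has a direct link to every other socket.

    Assumption: one bidirectional channel per pair of sockets.
    """
    return sockets * (sockets - 1) // 2


for n in (8, 16, 32, 256):
    print(f"{n:>3} sockets -> {full_mesh_links(n):>6} channels")
```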

    Do you finally understand now why large clusters can not run business software that branches all over the place??? Scalability is difficult! Latency will be extremely bad in transaction heavy software, with all synchronizing going on, all the time.

    SGI explains why their huge Altix cluster with 4.096 cores (predecessor of UV2000) is not suitable for business software, but only good for HPC calculations:
    http://www.realworldtech.com/sgi-interview/6/
    "...The success of Altix systems in the High Performance Computing market are a very positive sign for both Linux and Itanium. Clearly, the popularity of large processor count Altix systems dispels any notions of whether Linux is a scalable OS for scientific applications. Linux is quite popular for HPC and will continue to remain so in the future,
    ...
    However, scientific applications (HPC) have very different operating characteristics from commercial applications (SMP). Typically, much of the work in scientific code is done inside loops, whereas commercial applications, such as database or ERP software are far more branch intensive. This makes the memory hierarchy more important, particularly the latency to main memory. Whether Linux can scale well with a SMP workload is an open question. However, there is no doubt that with each passing month, the scalability in such environments will improve. Unfortunately, SGI has no plans to move into this SMP market, at this point in time ...."

    ScaleMP explains why their large scale-out servers with 1000s of cores are not suitable for business software:
    http://www.theregister.co.uk/2011/09/20/scalemp_su...
    "...ScaleMP cooked up a special software hypervisor layer, called vSMP, that rides atop the x64 processors, memory controllers, and I/O controllers in multiple server nodes....vSMP takes multiple physical servers and... makes them look like a giant virtual SMP server with a shared memory space. vSMP has its limits.
    ...
    The vSMP hypervisor that glues systems together is not for every workload, but on workloads where there is a lot of message passing between server nodes – financial modeling, supercomputing, data analytics, and similar parallel workloads. Shai Fultheim, the company's founder and chief executive officer, says ScaleMP has over 300 customers now. "We focused on HPC as the low-hanging fruit..."

    https://news.ycombinator.com/item?id=8175726
    ">Still don't understand why businesses buy this SPARC M7 scale-up server instead of scaling-out. Cost? Complexity?"

    >>I'm not saying that Oracle hardware or software is the solution, but "scaling-out" is incredibly difficult in transaction processing. I worked at a mid-size tech company with what I imagine was a fairly typical workload, and we spent a ton of money on database hardware because it would have been either incredibly complicated or slow to maintain data integrity across multiple machines. I imagine that situation is fairly common.

    >>Generally it's just that it's really difficult to do it right. Sometime's it's impossible. It's often loads more work (which can be hard to debug). Furthermore, it's frequently not even an advantage. Have a read of https://research.microsoft.com/pubs/163083/hotcbp1... Remember corporate workloads frequently have very different requirements than consumer."

    @KevinG
    But instead of arguing over what you wish x86 servers would look like, let's go to real hard facts and real proofs instead. Let's settle this once and for all, instead of arguing.

    Fact: x86 servers are useless for business software because of the bad performance. We have two different servers, scale-up (vertical scaling) servers (one single huge server, such as Unix/Mainframes) and scale-out (horizontal scaling) servers (i.e. clusters such as SGI UV2000 and ScaleMP - they only run HPC number crunching on each compute node).

    1) Scale-up x86 servers. The largest in production has 8-sockets (HP Superdome is actually a Unix server, and no one uses the Superdome x86 version today because it barely goes to 16-sockets with abysmal performance, compared to the old Unix 32-socket superdome). And as we all know, 8 sockets does not cut it for large workloads. You need 16 or 32 sockets or more. The largest x86 scale up server has 8 sockets => not good performance. For instance, saps score for the largest x86 scale up server is only 200-300.000 or so. Which is chicken sh-t.

    2) Scale-out x86 servers. Because latency is so bad in far away nodes, you can not use SGI UV2000 or ScaleMP clusters with 100s of sockets. I expect the largest SGI UV2000 cluster post scores of 100-150.000 saps because of bad latency, making performance grind to a halt.

    To disprove my claims and prove that I am wrong: post ANY x86 benchmark with good SAP. Can you do this? Nope. Ergo, you are wrong. x86 can not tackle large workloads. No matter how much you post SGI advertising, it will not change facts. And the fact is: NO ONE USES x86 FOR LARGE SAP INSTALLATIONS. BECAUSE YOU CAN NOT GET HIGH SAP SCORES. It is simply impossible to use x86 to get good business software performance, such as SAP, databases, etc. Scale-up won't do. Scale-out won't do.

    Prove me wrong. I am not interested in you bombarding links on how good SGI clusters are. Just disprove me on this. Just post ONE single good SAP benchmark. If you can not, I am right and you are wrong, which means you probably should sh-t up and stop FUD. There is a reason we have a highend Unix/Mainframe market, x86 are for lowend and can never compete.

    I don't expect you to ever find any good x86 links, so let the cursing and shouting begin. :-)
  • Kevin G - Tuesday, May 12, 2015

    “Here we go again. Well for the umpteenth time, you are wrong. The Oracle database in question is TimesTen, which is an IMDB (In Memory DataBase). It stores everything in RAM. The largest TimesTen customer has 2TB data. Which is tiny compared to a real DB:”
    That is very selective reading as the quote is ‘over 2 TB’ from your source. The USPS TimesTen database is a cache for a >10 TB Oracle data warehouse. Also, from the technical paper you linked below, this is not a major issue as several in-memory database applications can spool historical data to disk.

    “I will quote this technical paper about In Memory DataBases below:”
    The main theme of that paper as to why in memory databases have gained popularity is that they remove the storage subsystem bottleneck for a massive increase in speed. That is the main difference that the technical paper is presenting, not anything specific about the UV 2000.

    “A normal DB alters data all the time, inducing locks on rows, etc. In Memory DataBases often don't even use locking!! They are not designed for modifying data, only reading.
    Page 44: "Some IMDBs have no locking at all"”
    First off, analytics is a perfectly valid use of a database for businesses.
    Second, the quote from page 44 is wildly out of context to the point of willful ignorance. The key word in that sentence is ‘some’ and the rest of that page indicates various locking mechanisms used in In Memory Databases.

    I should also indicate that there are traditional databases that have been developed to not use locking either, like MonetDB. In other words, the locking distinction has no bearing on being an in-memory database or not.

    “Also, TimesTen is a database on the application layer, a middleware database!! A real database acts at the back end layer. Period. In fact, IMDB often acts as cache to a real database, similar to a Data WareHouse. I would not be surprised if US Postal Service uses TimesTen as a cache to a real Oracle DB on disk. You must store the real data on disk somewhere, or get the data from disk. Probably a real Oracle DB is involved somehow.”
    They do and back in 2010 the disk size used for the warehouse was 10TB. Not sure of the growth rate, but considering the SGI UV2000 can support up to 64 TB of memory, a single system image might be able to host the entirety of it in memory now.

    “So, again, you can not run business software on scale-out servers such as SGI or ScaleMP. You need to redesign and rewrite everything. Look at SAP Hana, which is a redesigned clustered RAM database. Or Oracle TimesTen. You can not just take a business software such as SAP or Oracle RDBMS Database and run it on top of an SGI UV2000 cluster.”
    If the software can run on ordinary x86 Linux boxes, then why couldn’t they run on the UV2000? What is the technical issue? Performance isn’t a technical issue, it’ll run, just slowly.

    “You need to redesign and reprogram everything:
    Page 96: "Despite the blazing performance numbers just seen, don’t expect miracles by blindly migrating your RDBMS to IMDB. This approach may fetch you just about 3X or lower:... Designing your app specifically for IMDB will be much more rewarding, sometimes even to the tune or 10 to 20X"”
    That is in the context of an in memory database, not an optimization specifically for the UV 2000. Claiming otherwise is just deceptive.

    “Once again, IMDBs can not replace databases:
    Page97: "Are IMDBs suited for OLAP? They are getting there, but apart from memory size limitations, IMDB query optimizers probably have a while to go before they take on OLAP"”
    The quote does not claim what you think it says. First, not all databases focus on OLAP queries. Second, the overall statement appears to be made about the general maturity of in memory database software, not that it is an impossibility.

    “Page 4: "Is IMDB a replacement for Oracle? No"”
    Wow, you didn’t even get the quote right nor complete it. “Is MCDB a replacement for Oracle? No. MCDB co-exists and enhances regular Oracle processing and has built in synchronization with Oracle.” This fully clarifies that MCDB is a cache for Oracle and it is clear on its purpose. This applies to the specific MCDB product, not in memory database technologies as a whole.

    “I don't know how many times I must say this? You can not just take a normal business software and run it on top of SGI UV2000. Performance would suxx big time. You need to rewrite it as a clustered version. And that is very difficult to do for transaction heavy business software. It can only be done in RAM.”
    I’m still waiting for a technical reason as to why. What in the UV 2000’s architecture prevents it from running ordinary x86 Linux software? I’ve linked to videos that demonstrate that the UV2000 is a single coherent system. Did you not watch them?

    “So you are wrong. Nobody uses SGI UV2000 scale out clusters to run business software. The only way to do it is to redesign and rewrite everything to a clustered version. You can never take a normal monolithic business software and run it on top of SGI UV2000. Never. Ever.”
    Citation please. This is just one of your assertions. I’d also say that SAP HANA and Oracle TimesTen qualify as normal monolithic business software.

    “Here we have 32 sockets communicating to each other in a scale-up server. We see that at most, there is one step to reach any other socket. Which is very good and gives good scaling, making this SPARC M6 server suitable for business software. Look at the mess with all interconnects!”
    Actually that is two steps between most nodes: Processor -> node controller -> processor. There are still a few single step processor -> processor hops but they’re a minority.
    Also I think I should share the link where that image came from: http://www.theregister.co.uk/2013/08/28/oracle_spa...
    That article includes the quote: “This is no different than the NUMAlink 6 interconnect from Silicon Graphics, which implements a shared memory space using Xeon E5 chips”. Ultimately UV 2000 has the same topology to scale up as this SPARC system you are citing as an example.

    “Now let's look at the largest IBM POWER8 server E880 sporting 16 sockets. We see that only four sockets communicate directly with each other, and then you need to do another step to reach another four socket group. To reach far away sockets, you need to do several steps, on a smallish 16 socket server. This is cheating and scales badly.”
    This is no different than the SPARC system you linked to earlier: two hops. At most you have processor -> processor -> processor to reach your destination. The difference is that the middle hop on the SPARC platform doesn’t contain a memory region but a lot more interconnections. The advantage of the SPARC topology is in going to large socket counts, but at the 16 socket level the IBM topology would be superior.
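
    The hop counts being argued about can be checked with a small graph sketch in Python. The wiring below is an assumption made for illustration, not a description of the actual E880 or SPARC M6 boards: 16 sockets in four groups of four, fully connected inside each group, with each socket also linked to the socket in the same position in every other group. Under that assumption the worst case is two hops, using 48 links instead of the 120 a full 16 socket mesh would need.

```python
from collections import deque
from itertools import combinations


def grouped_topology(groups=4, per_group=4):
    """Build an adjacency map for the hypothetical grouped wiring described above."""
    adj = {(g, i): set() for g in range(groups) for i in range(per_group)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    # Full mesh inside each group of sockets.
    for g in range(groups):
        for i, j in combinations(range(per_group), 2):
            link((g, i), (g, j))
    # Each socket links to the same-position socket in every other group.
    for i in range(per_group):
        for g, h in combinations(range(groups), 2):
            link((g, i), (h, i))
    return adj


def diameter(adj):
    """Worst-case hop count between any two sockets, via BFS from every socket."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


adj = grouped_topology()
link_count = sum(len(neighbours) for neighbours in adj.values()) // 2
print("links:", link_count, "worst-case hops:", diameter(adj))
```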

    “In general, if you want to connect each socket to every other, you need O(n^2) channels. This means for a SGI UV2000 with 256 sockets, you need 35.000 channels!!! Every engineer realizes that it is impossible. You can not have 35.000 data channels in a server. You need to cheat a lot to bring down the channels to a manageable number. Probably SGI has large islands of sockets, connected to islands with one fast highway, and then the data needs to go into smaller paths to reach the destination socket.”
    SGI is doing the exact same thing as Oracle’s topology with NUMALink6.

    “SGI explains why their huge Altix cluster with 4.096 cores (predecessor of UV2000) is not suitable for business software, but only good for HPC calculations:
    http://www.realworldtech.com/sgi-interview/6/ ”
    The predecessor to the UV 2000 was the UV 1000 which had a slightly different architecture to scale up but it was fully shared memory and cache coherent architecture.
    The Altix system you are citing was indeed a cluster but that was A DECADE AGO. In that time frame, SGI has developed a new architecture to scale up by using shared memory and cache coherency.

    “Fact: x86 servers are useless for business software because of the bad performance. We have two different servers, scale-up (vertical scaling) servers (one single huge server, such as Unix/Mainframes) and scale-out (horizontal scaling) servers (i.e. clusters such as SGI UV2000 and ScaleMP - they only run HPC number crunching on each compute node).”
    INCORRECT. The SGI UV2000 is a scale up system as it is one server. SAP HANA and Oracle TimesTen are not HPC workloads and I’ve given examples of where they are used.

    “1) Scale-up x86 servers. The largest in production has 8-sockets (HP Superdome is actually a Unix server, and no one uses the Superdome x86 version today because it barely goes to 16-sockets with abysmal performance, compared to the old Unix 32-socket superdome). And as we all know, 8 sockets does not cut it for large workloads. You need 16 or 32 sockets or more. The largest x86 scale up server has 8 sockets => not good performance. For instance, saps score for the largest x86 scale up server is only 200-300.000 or so. Which is chicken sh-t.”
    The best SAP Tier-2 score for x86 is actually 320880 with an 8 socket Xeon E7-8890 v3. Not bad in comparison as the best score is 6417670 for a 40 socket, 640 core SPARC box. In other words, it takes SPARC 5x the sockets and 4.5x the cores to do 2x the work.
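
    The ratios above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the x86 result is the 320880 figure from this comment, the SPARC result is taken as the roughly 836,000 SAPS figure cited earlier in the thread (assumed here to belong to the same 40 socket, 640 core system), and the Xeon E7-8890 v3 is counted at 18 cores per socket.

```python
# Figures as quoted in the thread; the SPARC score is an assumption (see the lead-in).
x86_saps, x86_sockets, x86_cores = 320_880, 8, 8 * 18
sparc_saps, sparc_sockets, sparc_cores = 836_000, 40, 640

socket_ratio = sparc_sockets / x86_sockets  # SPARC sockets per x86 socket
core_ratio = sparc_cores / x86_cores        # SPARC cores per x86 core
score_ratio = sparc_saps / x86_saps         # SPARC SAPS per x86 SAPS

print(f"sockets: {socket_ratio:.1f}x, cores: {core_ratio:.1f}x, score: {score_ratio:.1f}x")
print(f"SAPS per socket: x86 {x86_saps / x86_sockets:.0f}, "
      f"SPARC {sparc_saps / sparc_sockets:.0f}")
```
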
    “2) Scale-out x86 servers. Because latency is so bad in far away nodes, you can not use SGI UV2000 or ScaleMP clusters with 100s of sockets. I expect the largest SGI UV2000 cluster post scores of 100-150.000 saps because of bad latency, making performance grind to a halt.”

    “To disprove my claims and prove that I am wrong: post ANY x86 benchmark with good SAP. Can you do this?”
    Yes, I believe I just did:
    http://download.sap.com/download.epd?context=40E2D...
  • Brutalizer - Sunday, May 17, 2015

    @KevinG
    "...I’m still waiting for a technical reason as to why [you can not take normal business software and run ontop SGI UV2000]. What in the UV 2000’s architecture prevent it from running ordinary x86 Linux software?..."

    I told you umpteen times. The problem is scalability: code that branches heavily can not be run on SGI scale-out clusters, as explained by links from SGI and links from ScaleMP (who also sell a 100s-socket Linux scale out server). And the SGI UV2000 is just a successor in the same line of servers. Again: UV2000 can. not. run. monolithic. business. software. as. explained. by. SGI. and. ScaleMP.

    .

    "...Citation please [about "No body use SGI UV2000 scale out clusters to run business software"]. This is just one of your assertions. I’d also say that SAP HANA and Oracle TimesTn qualify as normal monolithic business software...."

    Again, SAP HANA is a clustered database, designed to run on scale-out servers. Oracle TimesTen is a niche database that is used for in-memory analytics, not used as a normal database - as explained in your own link. No one uses scale-out servers to run databases, SAP, etc. No one. Please post ONE SINGLE occurrence. You can not.

    If scale-out servers could replace an expensive Unix server, SGI and ScaleMP would brag about it all over their website. Business software is high margin and the servers are very very very expensive. The Unix server IBM P595 with 32 sockets for the old TPC-C record cost $35 million - no typo. On the other hand, a large scale-out cluster with 100s of sockets costs the same as 100 nodes. The pricing is linear because you just add another compute node, which is cheap. On scale-up servers, you need to redesign everything for the scalability problem - that is why they are extremely expensive. 32-socket scale up servers cost many times the price of 256-socket scale-out clusters.

    Banks would be very happy if they could buy a cheap 256-socket SGI UV2000 server with 64TB RAM, to replace a single 16- or 32-socket server with 8-16TB RAM that costs many times more. The Unix high end market would die in an instant if cheap 256-socket scale-out clusters could replace 16- or 32-socket scale-up servers. And facts are: NO ONE USES SCALE-OUT SERVERS FOR BUSINESS SOFTWARE! Why pay many times more if investment banks could buy cheap x86 clusters? You haven't thought of that?

    .

    "....Yes, I believe I just did: [“To disprove my claims and prove that I am wrong: post ANY x86 benchmark with good SAP. Can you do this?”]..."
    http://download.sap.com/download.epd?context=40E2D...

    This is silly. Only desktop home users would consider SAP benchmark of 200-300.000 good. I am talking about SAP benchmarks, close to a million. I am talking about GOOD performance. It is now very clear you have no clue about large servers with high performance.

    .

    "....The best SAP Tier-2 score for x86 is actually 320880 with an 8 socket Xeon E7-8890 v3. Not bad in comparison as the best score is 6417670 for a 40 socket, 640 core SPARC box. In other words, it takes SPARC 5x the sockets and 4.5x the cores to do 2x the work...."

    You are wrong again. The top record is held by a SPARC server, and the record is ~850.000 saps.
    download.sap.com/download.epd?context=40E2D9D5E00EEF7C569CD0684C0B9CF192829E2C0C533AA83C6F5D783768476B

    As I told you, scalability on business software is very difficult. Add twice the number of cores and get, say 20% increase in performance (when we talk about a very high number of sockets). If scalability was easy, we would see SGI UV2000 benchmarks all over the place: 256-sockets vs 192-sockets vs 128 sockets, etc etc etc. And ScaleMP would also have many SAP entries. The top list would exclusively be x86 architecture, instead of POWER and SPARC. But fact is, we don't see any x86 top SAP benchmarks. They are nowhere to be found.

    Let me ask you again, for the umpteenth time: CAN YOU POST ANY GOOD x86 SAP SCORE??? I talk about close to a million saps, not 200-300.000. If you can not post any such x86 scores, then I suggest you just sh-t up and stop FUD. You are getting very tiresome with your ignorance. We have talked about this many times, and still you claim that SGI UV2000 can replace Unix/Mainframe servers - well if they can, show us proof, show us links where they do that! If there is no evidence that x86 can run large databases or large SAP configurations, etc, stop FUD will you?????
  • Kevin G - Monday, May 18, 2015 - link

    @Brutalizer
    “I told you umpteen times. The problem is that scalability in code that branches heavily can not be run on SGI scale-out clusters as explained by links from SGI and links from ScaleMP (who also sells a 100s-socket Linux scale out server). And the SGI UV2000 is just a predecessor in the same line of servers. Again: UV2000 can. not. run. monolithic. business. software. as. explained. by. SGI. and. ScaleMP”
    Running branch heavy code doesn’t seem to be a problem for x86: Intel has one of the best branch predictors in the industry. I don’t see a problem there. I do think you mean something else when you say ‘branch predictor’ but I’m not going to help you figure it out.
    The UV line is a different system architecture than a cluster because it has shared memory and cache coherency between all of its sockets. It is fundamentally different than the systems SGI had a decade ago where those quotes originated from. Here is my challenge to you: can you show that the UV 2000 is not a scale up system? I’ve provided links and video where it clearly is but have you provided anything tangible to counter it? (Recycling a 10 year old link about a different SGI system to claim otherwise continues to be just deceptive.)
    As mentioned by me and others before, ScaleMP has nothing to do with the UV 2000. It is irrelevant to that system. Every time you bring it up it is shifting the discussion to something that does not matter in the context of the UV 2000.

    “Again, SAP HANA is a clustered database, designed to run on scale-out servers. Oracle TimesTen is a niche database that is used for in-memory analytics, not used as a normal database - as explained in your own link. No one uses scale-out servers to run databases, SAP, etc. No one. Please post ONE SINGLE occurrence. You can not.”
    The key thing you’re missing is that UV 2000 is a scale up server, not scale out. You have yet to demonstrate otherwise.
    There are plenty of examples and I’ve provided some here. The US Post Office has a SGI UV2000 to do just that. Software like HANA and TimesTen are database products and are used by businesses for actual work and some businesses will do that work on x86 chips. Your continued denial of this is further demonstrated by your continual shifting of goal posts of what a database now is.
    Enterprise databases are designed to run as a cluster anyway due to the necessity to fail over in the event of a hardware issue. That is how high availability is obtained: if the master server dies then a slave takes over transparently to outside requests. It is common place to order large database servers in sets of two or three for this very reason. Of course I’ve pointed this out to you before: http://www.anandtech.com/comments/7757/quad-ivy-br...

    “On scale-up servers, you need to redesign everything for the scalability problem - that is why they are extremely expensive. 32-socket scale up servers cost many times the price of 256-socket scale-out clusters.”
    I agree and that is why the UV 2000 has the NUMALink6 chips inside: that is the coherent fabric that enables the x86 chips to scale beyond 8 sockets. That is SGI’s solution to the scalability problem. Fundamentally this is the same topology Oracle uses with their Bixby interconnect to scale up their SPARC servers. This is also why adding additional sockets to the UV2000 is not linear: more NUMALink6 node controllers are necessary as socket count goes up. It is a scale up server.
    “Banks would be very happy if they could buy a cheap 256-socket SGI UV2000 server with 64TB RAM, to replace a single 16- or 32-socket server with 8-16TB RAM that costs many times more. The Unix high end market would die in an instant if cheap 256-socket scale-out clusters could replace 16- or 32-socket scale-up servers. And facts are: NO ONE USE SCALE-OUT SERVERS FOR BUSINESS SOFTWARE! Why pay many times more if investment banks could buy cheap x86 clusters? You havent thought of that?”

    Choosing an enterprise system is not solely about performance, hardware platform and cost. While those three variables do weigh on a purchase decision, there are other variables that matter more. Case in point: RAS. Systems like the P595 have far more reliability features than the UV2000. Features like lock step can matter to large institutions where any downtime is unacceptable to the potential tune of tens of millions of dollars lost per minute for say an exchange.
    Similarly there are institutions who have developed their own software on older Unix operating systems. Porting old code from HPUX, OpenVMS, etc. takes time, effort to validate and money to hire the skill to do the port. In many cases, it has been simpler to pay extra for the hardware premium and continue using the legacy code.
    For entities like the US Post Office, they could actually tolerate far more downtime as their application load is not as time sensitive nor carries the same financial burden for the downtime. A SGI UV system would be fine in this scenario.

    “You are wrong again. The top record is held by a SPARC server, and the record is ~850.000 saps.
    download.sap.com/download.epd?context=40E2D9D5E00EEF7C569CD0684C0B9CF192829E2C0C533AA83C6F5D783768476B”
    OK, 5 times the number of sockets and 4.5 times the number of cores for 3x the work. Still not very impressive in terms of scaling. My point still stands.

    “Let me ask you again, for the umpteenth time: CAN YOU POST ANY GOOD x86 SAP SCORE??? I talk about close to a million saps, not 200-300.000. If you can not post any such x86 scores, then I suggest you just sh-t up and stop FUD. You are getting very tiresome with your ignorance. We have talked about this many times, and still you claim that SGI UV2000 can replace Unix/Mainframe servers - well if they can, show us proof, show us links where they do that! If there is no evidence that x86 can run large databases or large SAP configurations, etc, stop FUD will you?????”
    A score of 320K is actually pretty good for the socket and core count. It is also in the top 10. There are numerous SAP scores below that mark from various Unix vendors from the likes of Oracle/Sun, IBM and HP. I’ve also shown links where HANA and TimesTen are used on UV2000 systems with examples like the US Post Office using them. Instead, you’re ignoring my links and shifting the goal posts further.
  • Brutalizer - Wednesday, May 20, 2015 - link

    @KevinG

    You are confusing things. First of all, normal databases are not scale-out, they are scale-up. Unless you design the database as a clustered scale-out (SAP Hana) to run across many nodes (which is very difficult to do, it is hard to guarantee data integrity in a transaction heavy environment with synchronizing data, roll back, etc etc, see my links). Sure, databases can run in a High Availability configuration but that is not scale-out. Having one database mirror everything across two nodes does not make it scale-out, it is still scale-up. Scale-out is distributing the workload across many nodes. You should go and study the subject before posting inaccurate stuff.
    http://global.sap.com/corporate-en/news.epx?PressI...
    "With the scale out of SAP HANA, we see almost linear improvement in performance with additional computing nodes," said Dr. Vishal Sikka, member of the SAP Executive Board, Technology & Innovation. ...The business value of SAP HANA was demonstrated recently through scalability testing performed by SAP on SAP HANA with a 16-node server cluster."
    Do you finally understand that HANA is clustered? I don't know how many times I must explain this.

    .

    "...Software like HANA and TimesTen are database products and are used by businesses for actual work and some businesses will do that work on x86 chips. Your continued denial of this is further demonstrated by your continual shifting of goal posts of what a database now is...."

    What denial? I have explained many times that SAP Hana is clustered. And running a database as mirror for High Availability, is not scale-out. It is still scale-up. And I have explained that Oracle TimesTen is a "database" used for queries, not used as a normal database as explained in your own link. Read it again. It says that TimesTen is only used for fraud detection, not used as a normal database storing and altering information, locking rows, etc etc. Read your link again. It is only used for querying data, just like a Data Warehouse.

    In Memory Databases often don't even have locking of rows, as I showed in links. That means they are not meant for normal database use. It is stupid to claim that a "database" that has no locking, can replace a real database. No, such "databases" are mainly used to query information, not for altering data. What is so difficult to understand here?

    .

    "...OK, 5 times the number of sockets, 4.5 times the number of cores for 3x times the work. Still not very impressive in terms of scaling. My point still stands..."

    What is your point? That there are no suitable x86 servers out there for extreme SAP workloads? That, if you want high SAP score, you need to go to a 16- or 32-socket Unix server? That x86 servers will not do? SAP and business software is very hard to scale, as SGI explained to you, as the code branches too much. So it is outright stupid when you claim that SPARC has bad scaling, as it scales up to close to a million saps. How much does x86 servers scale? Up to a few 100.000 saps? Geee, you know, x86 scales much better than SPARC. What are you thinking? Are you thinking? How can anything that scales up to 32-sockets and close to a million saps not be impressive in comparison to x86 and much less saps? Seriously? O_o

    Seriously, what is it that you don't understand? Can you show us a good scaling x86 benchmark that rivals, or outperforms large SPARC servers? Can you just post one single x86 benchmark close to the top SPARC benchmarks? Use scale-out or scale-up x86 servers, just post one single x86 benchmark. Where are the SGI UV2000 sap benchmarks, as you claim SGI are scale-up, which means they can replace any 16- or 32-socket Unix server? Surely a 256-socket SGI UV2000 must be faster than 32-socket Unix server on sap - according to your FUD? Why are there no top x86 sap benchmarks? Why are you ducking this question - why dont you post any links and prove that SGI UV2000 can replace Unix servers? Why are you FUDing?
  • Kevin G - Wednesday, May 20, 2015 - link

    @Brutalizer
    “You are confusing things. First of all, normal databases are not scale-out, they are scale-up. Unless you design the database as a clustered scale-out (SAP Hana) to run across many nodes (which is very difficult to do, it is hard to guarantee data integrity in a transaction heavy environment with synchronizing data, roll back, etc etc, see my links). Sure, databases can run in a High Availability configuration but that is not scale-out. If you have one database mirroring everything in two nodes does not mean it is scale-out, it is still scale-up. Scale-out is distributing the workload across many nodes. You should go and study the subject before posting inaccurate stuff.”
    SAP HANA is primarily scale up but it can scale out if necessary. https://blogs.saphana.com/2014/12/10/sap-hana-scal...

    In fact, that link has this quote from SAP themselves under the heading ‘How does Scale-up work?’: “SGI have their SGI UV300H appliance, available in building blocks of 4-sockets with up to 8 building blocks to 32 sockets and 8TB for analytics, or 24TB for Business Suite. They use a proprietary connector called NUMAlink, which allows all CPUs to be a single hop from each other.”

    If you read further, you’ll see that SAP recommends scaling up before scaling out with HANA. That article, if you bother to read it, also indicates how they resolved the coherency problem so that it can be both scale up and scale out.

    Though if you really wanted an example of a clustered RDBMS, then the main Oracle database itself can run across a cluster via Oracle RAC. That is about as ‘normal’ a database, to use your own word, as you can get. http://www.oracle.com/us/products/database/options...

    IBM’s DB2 can also run in a clustered mode where all nodes are active in transaction processing.

    “I have explained many times that SAP Hana is clustered. And running a database as mirror for High Availability, is not scale-out. It is still scale-up. And I have explained that Oracle TimesTen is a "database" used for queries, not used as a normal database as explained in your own link.”
    You are shifting further and further away from the point, which was that large x86-based systems are used for actual production work using databases. This is also highlighted by your continued insistence that SAP HANA and TimesTen are not 'normal' databases and thus don’t count for some arbitrary reason.

    “In Memory Databases often don't even have locking of rows, as I showed in links. That means they are not meant for normal database use. It is stupid to claim that a "database" that has no locking, can replace a real database. No, such "databases" are mainly used to query information, not for altering data. What is so difficult to understand here?”

    I think you need to re-read that link you provided as it clearly had the word *some* in the quote you provided about locking. If you were to actually comprehend that paper, you’d realize that locking and in-memory databases are totally independent concepts.
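
    To make that concrete, here is a toy sketch in Python of a store that lives entirely in memory and still takes a lock per row for every update. The class and function names are made up for illustration, and this is in no way how HANA or TimesTen are implemented; it only shows that "in memory" and "has row locks" can coexist in the same design.

```python
import threading


class InMemoryTable:
    """Toy in-memory table with one lock per row (illustrative only)."""

    def __init__(self):
        self._rows = {}                       # key -> value, held purely in RAM
        self._locks = {}                      # key -> lock guarding that row
        self._table_lock = threading.Lock()   # guards creation of rows and locks

    def insert(self, key, value):
        with self._table_lock:
            self._rows[key] = value
            self._locks[key] = threading.Lock()

    def read(self, key):
        with self._locks[key]:
            return self._rows[key]

    def update(self, key, fn):
        # Row-level lock: the read-modify-write below is atomic for this row,
        # while updates to other rows can proceed independently.
        with self._locks[key]:
            self._rows[key] = fn(self._rows[key])


def run_demo(workers=8, increments=1000):
    """Have several threads increment the same row; return the final value."""
    table = InMemoryTable()
    table.insert("account-1", 0)

    def work():
        for _ in range(increments):
            table.update("account-1", lambda balance: balance + 1)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table.read("account-1")
```

    Calling run_demo(workers, increments) should always return workers * increments, because no increment can be lost while the row lock is held.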

    “What is your point? That there are no suitable x86 servers out there for extreme SAP workloads? That, if you want high SAP score, you need to go to a 16- or 32-socket Unix server? [ … ] So it is outright stupid when you claim that SPARC has bad scaling, as it scales up to close to a million saps. How much does x86 servers scale? Up to a few 100.000 saps? Geee, you know, x86 scales much better than SPARC. What are you thinking? Are you thinking? How can anything that scales up to 32-sockets and close to a million saps not be impressive in comparison to x86 and much less saps? Seriously? O_o”
    Yes, being able to do a third of the work with one fifth of the resources is indeed better scaling and that’s only with an 8 socket system. While it is true that scaling is not going to be linear as socket count increases, it would be easy to predict that a 32 socket x86 system would take the top spot. As for the Fujitsu system itself, a score of ~320K is in the top 10 where the fastest system doesn’t break a million. It also outperforms many (but obviously not all) 16 and 32 socket Unix servers. Thus I would call it suitable for ‘extreme’ workloads.
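
    For anyone who wants to check the "a third of the work with one fifth of the resources" arithmetic, here is a quick sketch in Python. It uses only figures quoted in this thread: 320880 saps on 8 sockets for the x86 box and the rounded ~850.000 saps on 40 sockets for the SPARC record. The variable names are mine, and it says nothing about what a hypothetical 32 socket x86 system would actually score.

```python
# Figures as quoted in this thread (the SPARC number is the rounded ~850.000).
x86 = {"saps": 320_880, "sockets": 8}
sparc = {"saps": 850_000, "sockets": 40}


def per_socket(system):
    """Average saps delivered per socket for one benchmark result."""
    return system["saps"] / system["sockets"]


work_ratio = x86["saps"] / sparc["saps"]            # share of the SPARC score
socket_ratio = x86["sockets"] / sparc["sockets"]    # share of the SPARC sockets
per_socket_advantage = per_socket(x86) / per_socket(sparc)

print(f"x86 does {work_ratio:.0%} of the work with {socket_ratio:.0%} of the sockets")
print(f"saps per socket: x86 {per_socket(x86):.0f}, SPARC {per_socket(sparc):.0f}")
print(f"per-socket advantage: {per_socket_advantage:.2f}x")
```

    Note that this is average throughput per socket at two very different system sizes, so it is a rough comparison and not a scaling curve.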

    “That x86 servers will not do? SAP and business software is very hard to scale, as SGI explained to you, as the code branches too much.”
    To quote the Princess Bride “You keep using that word. I do not think it means what you think it means.” Define code branches in context of increasing socket count for scaling.
  • Brutalizer - Thursday, May 21, 2015 - link

    Seriously, I dont get it, what is so difficult to understand? Have you read your links?

    For SGI, they have UV2000 scale-out 256-socket servers and the SGI UV300H which only goes to 32-sockets. I suspect UV300H is a scale-up server, trying to compete with POWER and SPARC. This means that SGI UV2000 can not replace a UV300H. This is the reason SGI manufactures UV300H, instead of offering a small 32-socket UV2000 server. But now SGI sells UV2000 and UV300H, targeting different markets. The performance of UV300H sucks big time compared to POWER and SPARC, because otherwise we would see benchmarks all over the place as SGI claimed the crown. But SGI has no top records at all. And we all know that x86 does scale-up bad, which means performance sucks on SGI UV300H.

    .

    SAP Hana is designed to run on a cluster, and therefore it is a scale-out system. If you choose to run Hana on a single node (and scale-up by adding cpus and RAM) does not make Hana non clustered. If you claim that "Hana is not a clustered system, because you can scale-up on a single node" then you have not really understood much about scale-out or scale-up.

    SAP says in your own link, that you should keep adding cpus and RAM to your single node as far as possible, and when you hit the limit, you need to switch to a cluster. And for x86, the normal limit, is 8-socket scale-up servers. (The 32-socket SGI UV300H has so bad performance that no one uses it - there are no SAP or other ERP benchmarks anywhere. It is worthless vs Unix).

    And in your own link:
    https://blogs.saphana.com/2014/12/10/sap-hana-scal...
    SAP says that Hana clustered version is only used for reading data, for gathering analytics. This scales exceptionally well and can be run on large clusters. SAP Hana also has a non clustered version for writing and storing business data which stops at 32 sockets because of complex synching.

    There are mainly two different use cases for SAP Hana: "SAP BW and Analytics", and "SAP Business Suite". As I explained earlier talking about Oracle TimesTen, Business Warehouse and Analytics is used for reading data: gathering data, to make light queries, not to store and alter data. The data is fixed, so no locking of rows is necessary, no complex synchronization between nodes. That is why Data Warehouse and other Business Intelligence stuff runs great on clusters.

    OTOH, Business Suite is another thing, it is used to record business work which alters data, it is not used as an analysis tool. Business Suite needs complex locking, which makes it unsuitable to run on clusters. It is in effect, a normal database used to write data. So this requires a scale-up server, it can not run on scale-out clusters because of complex locking, synching, code branches everywhere, etc. Now let me quote your own link from SAP:
    https://blogs.saphana.com/2014/12/10/sap-hana-scal...

    "...Should you scale-up, or out Business Suite?: The answer for the SAP Business Suite is simple right now: you HAVE to scale-up. This advice might change in future, but even an 8-socket 6TB system will fit 95% of SAP customers, and the biggest Business Suite installations in the world can fit in a SGI 32-socket with 24TB..."

    Note that SAP does not say you can run Business Suite on a 256-socket UV2000. They explicitly say SGI 32-socket servers is the limit. Why? Do you really believe a UV2000 can replace a scale-up server??? Even SAP confirms it can not be done!!

    "...Should you scale-up, or out BW and Analytics?...My advice is to scale-up first before considering scale-out....If, given all of this, you need BW or Analytics greater than 2TB, then you should scale-out. BW scale-out works extremely well, and scales exceptionally well – better than 16-socket or 32-socket scale-up systems even....Don’t consider any of the > 8-socket systems for BW or Analytics, because the NUMA overhead of those sockets is already in effect at 8-sockets (you lose 10-12% of power, or thereabouts). With 16- and 32-sockets, this is amplified slightly, and whilst this is acceptable for Business Suite, but not necessary for BW..."

    SAP says that for analytics, avoid 16- or 32-sockets because you will be latency punished. This punishment is not necessary for analytics, because you don't need complex synching when only reading data in a cluster. Reading data "scales exceptionally well". SAP also says that for Business Suite, when you write data, the latency punishment is unavoidable and you must accept it, which means you may as well go for one 16- or 32-socket server. Diminishing returns. Scaling is difficult.

    There you have it. SAP Hana clustered version is only used for reading data, for analytics. This scales exceptionally well. SAP Hana also has a non clustered version for writing and storing business data which stops at 32 sockets. Now, can you stop your FUD? Stop saying that scale-out servers can replace scale-up servers. If you claim they can, where are your proofs? Links? Nowhere. "Trust me on this, I will not prove this, you have to trust me". That is pure FUD.

    .

    Again, SAP Hana and TimesTen are only used for gathering and reading data. OTOH, normal databases are used for altering data, which needs heavy locking algorithms.
    "...You are shifting further and further away from the point which was that large x86 based are used for actual production work using databases. This is also highlighted by your continued insistence that SAP HANA and TimesTen are not 'normal' databases and thus don’t count for some arbitrary reason..."

    .

    "...If you were to actually comprehend that paper, you’d realize that locking and in-memory databases are totally independent concepts...."

    Again, in memory databases are mainly used for reading data. This means that they dont need locking of rows or other synchronizing features which are required for editing data. You only need to lock when you need to alter data, so that no one else changes the same data as you. Hence, normal databases always have elaborate locking algorithms, which is why you can only run databases on scale-up servers. In memory databases often dont even have locks - which means they are designed to only read data, which makes them excellent at clusters. Why is this concept so difficult to understand? Have you studied comp sci at all at the university?

    .

    "...Yes, being able to do a third of the work with one fifth of the resources is indeed better scaling and that’s only with an 8 socket system. While true that scaling is not going to be linear as socket count increases, it would be easy to predict that a 32 socket x86 system would take the top spot. As for the Fujitsu system itself, a score of ~320K is in the top 10 were the fastest system doesn’t break a million. It also out performs many (but obviously not all) 16 and 32 socket Unix servers. Thus I would call it suitable for ‘extreme’ workloads...."

    This must be the weirdest thing I have heard for a while. On the internet, there are strange people. Look, scaling IS difficult. That means as you add more resources, the benefit will be smaller and smaller. That is why scaling is difficult. At some point, you can not just add more and more resources. If you were a programmer you would have known. But you are not, everybody can tell from your flawed logic. And you have not studied comp sci either. Sigh. You are making it hard for us with your absolute ignorance. I need to teach you what you should have learned at uni.

    Fujitsu SPARC 40-socket M10-4S gets 844,000 saps. It uses a variant of the same highly performant SPARC 16-core cpu as in the K supercomputer, no. 5 in the top500.

    Fujitsu SPARC 32-socket M10-4S gets 836.000 saps. When you add 8 sockets to a 32-socket server, you gain 1000 saps per added cpu.

    Fujitsu SPARC 16-socket M10-4S gets 448.000 saps. When you add another 16 sockets you gain 24.000 sap per socket.

    See? You go from 16-sockets to 32-sockets and gain 24.000 sap per cpu. Then you go from 32-sockets to 40-sockets and gain 1,000 sap per cpu. Performance has dropped 96% per cpu!!!
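
    If you want to check the numbers yourself, here is a small sketch in Python. It uses only the three M10-4S scores listed above; the function and variable names are just for illustration.

```python
# (sockets, saps) for the three Fujitsu SPARC M10-4S results listed above.
results = [(16, 448_000), (32, 836_000), (40, 844_000)]


def marginal_gains(points):
    """Saps gained per added socket between each pair of consecutive results."""
    gains = []
    for (sockets_a, saps_a), (sockets_b, saps_b) in zip(points, points[1:]):
        gains.append((saps_b - saps_a) / (sockets_b - sockets_a))
    return gains


gains = marginal_gains(results)
# Relative drop in the per-socket gain from the first step to the second.
drop = 1 - gains[1] / gains[0]

print(f"16 -> 32 sockets: {gains[0]:.0f} saps per added socket")
print(f"32 -> 40 sockets: {gains[1]:.0f} saps per added socket")
print(f"drop in marginal gain: {drop:.0%}")
```

    The first step works out to 24250 saps per added socket, the second to 1000, which is where the roughly 96% drop comes from.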

    And if you went from 40 sockets to 48 sockets I guess you gain 100 saps per cpu. And if you go up to 64 sockets I guess you gain 10 saps per cpu. That is the reason there are no 64-socket M10-4S benchmarks, because the score would roughly be the same as a 40-socket server. SCALING IS DIFFICULT.

    Your fantasies about the x86 sap numbers above are just ridiculous. As ridiculous as your claim "SPARC does not scale" - well, it scales up to 40-sockets. And x86 scales up to 8-sockets. Who has bad scaling? x86 or SPARC? I dont get it, how can you even think so flawed?

    .

    And your claim that "8-socket E7 server is in the top 10 of SAP benchmarks therefore 300.000 sap is a top notch score" is also ridiculous. There are only SPARC and POWER, there are no other good scaling servers out there. After a few SPARC and POWER, you only have x86, which comes in at the very bottom, far far away from the top Unix servers. That does not make x86 a top notch score.

    Again, post an x86 sap benchmark close to a million saps. Do it, or stop FUDing. x86 has no use in the high end scale-up server market as we can see from your lack of links.

    POST ONE SINGLE X86 BENCHMARK RIVALING OR OUTPERFORMING SPARC!!! Do it. Just one single link. Prove your claim, that SGI UV2000 with 256 sockets can replace a POWER or SPARC server in business software. Prove it. If you can not, you are lying. Are you a liar? And FUDer?
  • Kevin G - Saturday, May 23, 2015 - link

    @Brutalizer
    “Seriously, I dont get it, what is so difficult to understand? Have you read your links?”
    I have and I understand them rather well. In fact, I think I’ve helped you understand several of my points due to your changing positions as I’ll highlight below.

    “For SGI, they have UV2000 scale-out 256-socket servers and the SGI UV300H which only goes to 32-sockets. I suspect UV300H is a scale-up server, trying to compete with POWER and SPARC. This means that SGI UV2000 can not replace a UV300H. This is the reason SGI manufactures UV300H, instead of offering a small 32-socket UV2000 server. But now SGI sells UV2000 and UV300H, targeting different markets. The performance of UV300H sucks big time compared to POWER and SPARC, because otherwise we would see benchmarks all over the place as SGI claimed the crown. But SGI has no top records at all. And we all know that x86 does scale-up bad, which means performance sucks on SGI UV300H.”
    The UV 2000 can replace the UV300H if necessary as they’re both scale up. The reason you don’t see SAP HANA benchmarks with the 32 socket UV300H is simple: while it is expected to pass, the UV300H is still going through validation. The UV 2000 has no such certification and that’s a relatively big deal for support from SAP customers (it can be used for nonproduction tasks per SAP’s guidelines). The UV 3000 coming later this year may obtain it but they’d only go through the process if a customer actually asks for it from SGI as there is an associated cost to doing the tests.

    “SAP Hana is designed to run on a cluster, and therefore it is a scale-out system. If you choose to run Hana on a single node (and scale-up by adding cpus and RAM) does not make Hana non clustered. If you claim that "Hana is not a clustered system, because you can scale-up on a single node" then you have not really understood much about scale-out or scale-up.”
    I’ll take SAP’s own words from my previous link over your assertions. HANA is primarily scale up with the option to scale out for certain workloads. It was designed from the ground up to do *both* per SAP so that it is flexible based upon the workload you need it to run. There is nothing wrong with using the best tool for the job.

    “OTOH, Business Suite is another thing, it is used to record business work which alters data, it is not used as a analysis tool. Business Suite needs complex locking, which makes it unsuitable to run on clusters. It is in effect, a normal database used to write data. So this requires a scale-up server, it can not run on scale-out clusters because of complex locking, synching, code branches everywhere, etc. “
    Correct and that is what HANA does. The key point here is that x86 systems like the UV300H can be used for the business suite despite your continued claims that x86 does not scale up to perform such workloads.

    “Note that SAP does not say you can run Business Suite on a 256-socket UV2000. They explicitly say SGI 32-socket servers is the limit. Why? Do you really believe a UV2000 can replace a scale-up server??? Even SAP confirms it can not be done!!”
    The key point here is that you are *finally* admitting that the UV 300H is indeed a scale up server. UV 2000 and UV 300H do share a common topology for interconnect. The difference between the NUMALink6 in the UV 2000 and the NUMALink7 in the UV 300H is mainly the latency involved with a slight increase in bandwidth between nodes. The UV300H only has one central NUMALink7 chip to ensure that latency is consistent between nodes but this limits scalability to 32 sockets. The UV 2000 uses several NUMALink6 nodes to scale up to 256 sockets but access latencies vary between the source and destination sockets due to the additional hops between nodes. The future UV 3000 will be using the NUMALink7 chip to lower latencies but they will not be uniform as it will inherit the same node topology as the UV2000. Essentially if one system is scale up, so is the other. So yes, the UV 2000 is a scale up server and can replace the UV 300H if additional memory or processors are necessary. Sure the additional latencies between sockets will hurt the performance gains but the point is that you can get to that level by scaling upward with an x86 platform.

    http://www.theplatform.net/2015/05/01/sgi-awaits-u...
    http://www.enterprisetech.com/2014/03/12/sgi-revea...

    Also I think it would be fair to repost this link that you originally provided: http://www.theregister.co.uk/2013/08/28/oracle_spa...
    This link discusses the Bixby interconnect on SPARC M5 system and includes this quote for comparison about the topology: “This is no different than the NUMAlink 6 interconnect from Silicon Graphics, which implements a shared memory space using Xeon E5 chips”

    “There you have it. SAP Hana clustered version is only used for reading data, for analytics. This scales exceptionally well. SAP Hana also has a non clustered version for writing and storing business data which stops at 32 sockets. Now, can you stop your FUD? Stop say that scale-out servers can replace scale-up servers. If you claim they can, where are your proofs? Links? Nowhere. "Trust me on this, I will not prove this, you have to trust me". That is pure FUD.”
    Progress! Initially you were previously claiming that there were no x86 servers that scaled past 8 sockets that were used for business applications. Now you have finally accepted that there is a 32 socket x86 scale up server with the UV 300H as compared to your earlier FUD remarks here: http://www.anandtech.com/comments/9193/the-xeon-e7...

    “Again, in memory databases are mainly used for reading data. This means that they dont need locking of rows or other synchronizing features which are required for editing data. You only need to lock when you need to alter data, so that no one else changes the same data as you. Hence, normal databases always have elaborate locking algorithms, which is why you can only run databases on scale-up servers. In memory databases often dont even have locks - which means they are designed to only read data, which makes them excellent at clusters. Why is this concept so difficult to understand? Have you studied comp sci at all at the university?”
    Oh I have and one of my favorite courses was logic. My point here is that being an in-memory database and requiring locks are NOT mutually exclusive ideas as you’re attempting to argue. Furthermore, locking is not a strict necessity for write heavy transaction processing as long as there is a method in place to maintain data concurrency. Now for some examples.

    ENEA AB is an in-memory database with a traditional locking mechanism for transactions: http://www.enea.com/Corporate/Press/Press-releases...

    MonetDB doesn’t use a locking mechanism for write concurrency and started life out as a traditional disk based DB. (It has since evolved to add in-memory support.)

    Microsoft’s Hekaton does fit the description of an in-memory database without a locking mechanism but is targeted at the OLTP market. There are other methods of maintaining data concurrency outside of locking and with memory being orders of magnitude faster than disk, more of these techniques are becoming viable. http://research.microsoft.com/en-us/news/features/...
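
    To illustrate what "maintaining concurrency without holding a lock" can look like, here is a minimal sketch of optimistic, version-checked writes. It is not how Hekaton or any other product named above is implemented; the names `VersionedCell`, `try_commit` and `add_with_retry` are hypothetical, and the simulation is single-threaded so the conflicting writer is injected by hand. In a real engine the check-and-write step in `try_commit` would have to be atomic (for example via a hardware compare-and-swap), which this sketch only assumes.

```python
class VersionedCell:
    """A value plus a version counter (hypothetical, for illustration only)."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        # A writer remembers the version it saw when it read the value.
        return self.value, self.version

    def try_commit(self, new_value, expected_version):
        # Validation step: succeed only if nobody committed since our read.
        # Assumed to be atomic in a real system; trivially so here because
        # the simulation runs on a single thread.
        if self.version != expected_version:
            return False
        self.value = new_value
        self.version += 1
        return True


def add_with_retry(cell, amount, interfere=None):
    """Add `amount` to the cell, retrying on conflict. Returns attempt count."""
    attempts = 0
    while True:
        attempts += 1
        value, version = cell.read()
        if interfere is not None and attempts == 1:
            # Simulate another writer committing between our read and commit.
            interfere(cell)
        if cell.try_commit(value + amount, version):
            return attempts


def other_writer(cell):
    # The competing transaction: adds 5 using the same read/validate protocol.
    value, version = cell.read()
    cell.try_commit(value + 5, version)


cell = VersionedCell(100)
attempts = add_with_retry(cell, 10, interfere=other_writer)
print(attempts, cell.value, cell.version)
```

    Neither writer ever blocks the other: the loser of the race notices the version change and redoes its work, so both updates land. The trade-off is that under heavy write contention the retries themselves become the cost.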

    “Fujitsu SPARC 40-socket M10-4S gets 844,000 saps. It uses a variant of the same highly performant SPARC 16-core cpu as in the K supercomputer, no 5 in top500.
    Fujitsu SPARC 32-socket M10-4S gets 836.000 saps. When you add 8 sockets to a 32-socket server, you gain 1000 sap per each cpu.
    Fujitsu SPARC 16-socket M10-4S gets 448.000 saps. When you add another 16 sockets you gain 24.000 sap per socket.”
    Actually if you looked at the details between these submissions, you should be able to spot why the 40 socket and 32 socket systems have very similar scores. Look at the clock speeds: the 40 socket system is running at 3 GHz while the 32 socket system is running at 3.7 GHz.
    The 32 socket system is also using a different version of the Oracle database than the 40 and 16 socket systems which could also impact results, especially depending on how well tuned each version was for the test.

    Thus using these systems as a means to determine performance scaling by adding additional sockets to a design is inherently flawed.
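
    To put numbers on this, here is a small sketch that works through the three M10-4S scores quoted in this thread. The socket counts and saps figures are the ones quoted above; the 3.7 GHz clock for the 16 socket system is taken from the reply further down in this thread. Dividing by clock speed assumes throughput scales linearly with frequency, which is a crude assumption made only for illustration, and the names `Result`, `saps_per_socket`, `saps_per_socket_per_ghz` and `marginal_saps` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    sockets: int
    clock_ghz: float
    saps: int


# Figures as quoted in this thread, not re-checked against the SAP site.
RESULTS = [
    Result(sockets=16, clock_ghz=3.7, saps=448_000),
    Result(sockets=32, clock_ghz=3.7, saps=836_000),
    Result(sockets=40, clock_ghz=3.0, saps=844_000),
]


def saps_per_socket(r):
    return r.saps / r.sockets


def saps_per_socket_per_ghz(r):
    # Crude normalization: assumes performance is proportional to clock speed.
    return saps_per_socket(r) / r.clock_ghz


def marginal_saps(smaller, larger):
    # Extra saps per added socket when going from one result to the next.
    return (larger.saps - smaller.saps) / (larger.sockets - smaller.sockets)


for r in RESULTS:
    print(r.sockets, "sockets:",
          round(saps_per_socket(r)), "saps/socket,",
          round(saps_per_socket_per_ghz(r)), "saps/socket/GHz")

print("16 -> 32 sockets:", marginal_saps(RESULTS[0], RESULTS[1]), "saps per added socket")
print("32 -> 40 sockets:", marginal_saps(RESULTS[1], RESULTS[2]), "saps per added socket")
```

    The raw marginal figure between the 32 and 40 socket results mixes two changes at once (more sockets, lower clock), which is why the per-GHz column is shown next to it. Even that column cannot account for the different database versions.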

    “Your fantasies about the x86 sap numbers above are just ridiculous. As ridiculous as your claim "SPARC does not scale" - well, it scales up to 40-sockets. And x86 scales up to 8-sockets. Who has bad scaling? x86 or SPARC? I dont get it, how can you even think so flawed?”
    Well considering you finally accepted that a 32 socket x86 scale up system exists earlier in this post, I believe you need to revise that statement. Also the SGI UV 2000 is a scale up server that goes to 256 sockets.

    “And your claim that "8-socket E7 server is in the top 10 of SAP benchmarks therefore 300.000 sap is a top notch score" is also ridiculous. There are only SPARC and POWER, there are no other good scaling servers out there. After a few SPARC and POWER, you only have x86, which comes in at the very bottom, far far away from the top Unix servers. That does not make x86 a top notch score.”
    Go here:
    http://global.sap.com/solutions/benchmark/sd2tier....
    Sort by SAP score. Count down. As of 5/22/2015, the 9th highest score out of *789* submissions is this:
    http://download.sap.com/download.epd?context=40E2D...
    Then compare it to the 10th system on that very same list, a 12 socket POWER system:
    http://download.sap.com/download.epd?context=40E2D...
    Again, I would say x86 is competitive as an 8 socket machine is in the top 10 contrary to your claims that it could not compete.
    You want a faster system, there are higher socket count x86 systems from SGI and HP.
  • Brutalizer - Sunday, May 24, 2015 - link

    @KevinG
    You claim you have studied "logic" at university, well if you have, it is obvious you did not complete the course because you have not understood anything of logic.

    For instance, you claim "x86 that goes maximum to 8-sockets and ~300.000 saps has better scaling than SPARC that scales to 40 sockets and getting close to a million saps". This logic of yours is just wrong. Scalability is about how successful a system is in tackling larger and larger workloads. From wikipedia article on "Scalability":
    "...Scalability is the ability of a system... to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth..."

    QUESTION_A) If x86 stops at 8-sockets and ~300.000 saps vs SPARC stops at 40-sockets and ~850.000 saps - who is most capable of tackling larger and larger workloads? And still you insist in numerous posts that x86 is more scalable?? Que?

    Have you problems understanding basic concepts? "There are no stupid people, only uninformed" - well, if you are informed by wikipedia and you still reject basic concepts, then you _are_ stupid. And above all, anyone versed in logic would change their mind after being proven wrong. As you dont change your mind after proven wrong, it is clear you have not understood any logic, which means you were a drop out on logic classes. The most probable is you never studied at uni because of your flawed reasoning ability. This is obvious, so dont lie about you studied any logic classes at uni. I have double Master's, one in math (including logic) and one in theoretical comp sci, so I can tell you know nothing about these subjects because you have proved numerous times you dont even understand basic concepts.

    .

    Another example of you not understanding basic concepts is when you discuss why customers choose extremely expensive IBM P595 Unix 32-sockets server costing $35 million, instead of a very cheap SGI cluster with 256-socket (which do have much higher performance). Your explanation why customers choose IBM P595? Because P595 has better RAS!!! That is so silly I havent bothered to correct this among all your silly misconceptions. It is well known that it is much cheaper and gives better RAS to scale-out than scale-up:
    http://natishalom.typepad.com/nati_shaloms_blog/20...
    "....Continuous Availability/Redundancy: You should assume that it is inevitable that servers will crash, and therefore having one big system is going to lead to a single point of failure. In addition, the recovery process is going to be fairly long which could lead to a extended down-time..."

    Have you heard about Google? They have loads of money, and still choose millions of very cheap x86 servers because if any crashes, they just failover to another server. This gives much better uptime to have many cheap servers in a cluster, than one single large server that might crash. So, all customers prioritizing uptime would choose a cheap large x86 cluster, instead of a single very very expensive server. So this misconception is also wrong, customers dont choose expensive 32-socket Unix servers because of better uptime. Any cluster beats a single server when talking about uptime. So, now knowing this, answer me again on QUESTION_B) Why have not the market for high end Unix servers died immediately, if a cheap x86 scale-out cluster can replace any expensive Unix server? Why do customers choose to spend $millions on a single Unix server, instead of $100.000s on a x86 cluster? Unix servers are slow compared to a large x86 cluster, that is a fact.

    .

    I have shown that you gain less and less saps for every cpu you add, for 16-sockets SPARC M10-4S server you gain 24.000 saps for every cpu. And after 32-socket you gain only 1000 saps for every cpu you add up to 40-sockets M10-4S server.

    To this you answer that the SPARC cpus are differently clocked, and the servers use different versions of the database - therefore I am wrong when I show that adding more and more cpus gives diminishing returns.

    Ok, I understand what you are doing. For instance, the 16-socket M10-4S has 28.000 saps per socket, and the 32-socket M10-4S has 26.000 saps per socket - with exactly the same 3.7GHz cpu. Sure, the 32 socket used a newer and faster version of the database - and still gained less for every socket as it had more sockets i.e. scalability is difficult when you have a large number of sockets. Scaling small number of sockets is never a problem, you almost always get linear performance in the beginning, and then it drops off later.

    But, it does not matter much. If I showed you identical databases, you would say "but that scalability problem is restricted to SPARC, therefore you are wrong when you claim that adding cpus gives diminishing performance". And if I showed you that SPARC scales best on the market, you would say "then the scalability problem is in Solaris so you are wrong". And if I showed you that Solaris also scales best on the market you would make up some other excuse. It does not matter how many links I show you, your tactic is very ugly. Maybe the cpus dont come from the same wafer so the benchmarks are not comparable. And if I can prove they do, maybe the wafer differ a bit in different places. etc etc etc.

    The point in having benchmarks is to compare between different systems, even though not everything is equal to 100%. Instead you extrapolate and try to draw conclusions. If you reject that premise and require that everything must be 100% equal, I can never prove anything with benchmarks or links. I am surprised you dont reject SPARC having close to a million saps because those saps are not 100% equal to x86 saps. The hard disks are different, they did not share the same disks. The current is different, etc.

    Anyway, your ugly tactic forces me to prove my claim about "law of diminishing returns" by showing you links instead. I actually try to avoid mathematics and comp sci, because many might not understand it, but your ugly tactic forces me to do it. Here you see that there actually is something called law of diminishing returns:
    http://en.wikipedia.org/wiki/Amdahl%27s_law#Relati...

    "...Each new processor you add to the system will add less usable power than the previous one. Each time you double the number of processors the speedup ratio will diminish, as the total throughput heads toward the limit of 1 / (1 - P)...."

    So there you have it. You are wrong when you believe that scaling always occurs linearly.
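
    For reference, the formula behind the quoted passage is Amdahl's law, S(N) = 1 / ((1 - P) + P / N), where P is the fraction of the work that can run in parallel and N is the number of processors. Below is a minimal sketch of it. The value P = 0.95 is an arbitrary assumption picked for illustration, not something measured from any SAP run, and the socket counts are just the ones that come up in this thread.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup over one processor according to Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


P = 0.95  # assumed parallel fraction, for illustration only

previous = None
for n in (8, 16, 32, 64, 256):
    s = amdahl_speedup(P, n)
    if previous is None:
        print(f"{n:>3} processors: speedup {s:.2f}")
    else:
        # Each doubling multiplies the speedup by a smaller factor.
        print(f"{n:>3} processors: speedup {s:.2f} (x{s / previous:.2f} vs previous row)")
    previous = s

# The ceiling the quote refers to: 1 / (1 - P).
print("limit:", 1.0 / (1.0 - P))
```

    Note what the model does and does not say: the gain per added processor shrinks, and the total is capped at 1 / (1 - P), but how quickly a real system approaches that cap depends entirely on P for the workload in question.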

    You need to give an example where your claim is correct: show us a SAP server that runs on 8-sockets and when you go to 32-sockets the performance has increased 4x. You will not find any such benchmarks, because your misconception is not true. There is something called law of diminishing returns. Which you totally have missed, obviously.

    .

    "...The UV 2000 can replace the UV300H if necessary as they’re both scale up...."

    It seems you claim that UV300H is basically a stripped down UV2000. Well, if the UV2000 really can replace the UV300H, why dont SGI just sell a smaller configuration, say a 32-socket UV2000 with only one NUMAlink6? This is QUESTION_C)
    The most probable explanation is they are different servers targeting different workloads. UV2000 is scale-out and UV300H is scale-up. It does not make sense to manufacture different lines of servers, if they have the same use case. Have you never thought about that?

    As you strongly believe, and claim that UV2000 can replace UV300H, you should prove this. Show us links where anyone has done so. You do realize you can not just spew out things without backing claims up? That would be the very definition of FUD: "trust me on this, I will not prove this, but you will have to believe me. Yes, SGI UV2000 can replace a large Unix server, I will not show any links on this, you have to trust me" - that is actually pure FUD.

    .

    "...The reason you don’t see SAP HANA benchmarks with the 32 socket UV300H is simple: while it is expected to pass the UV300H is still going through validation....
    ...The UV 3000 coming later this year may obtain it but they’d only go through the process if a customer actually asks for it from SGI as there is an associated cost to doing the tests...."

    QUESTION_D) Why have we never ever seen any SGI server on the sap benchmark list?
    SGI claims they have the fastest servers on the market, and they have been selling 100s-socket servers for decades (SGI Altix) but still no SGI server has ever made it into the SAP top benchmark list. Why dont any customer run SAP on cheap SGI servers, and never have? This is related to QUESTION_B). Is it because SAP is a scale-up system and SGI only does scale-out clusters?

    Show us a x86 benchmark close to a million saps. If you can not, stop saying that x86 is able to tackle the largest sap workloads - because it can not. If it can, you can show us benchmarks.

    .

    "...Correct and that is what HANA does. The key point here is that x86 systems like the UV300H can be used for the business suite despite your continued claims that x86 does [not] scale up to perform such workloads..."

    But if I am wrong, then you can prove it, right? Show us ONE SINGLE LINK that supports your claim. If you are right, then there will be lot of links on the vast internet showing that customers replace large Unix servers with x86 clusters such as SGI UV2000 and saving lot of money and gaining performance in the process.

    Show us ONE SINGLE link where a cheap SGI UV2000 or Altix or whatever SGI scale-out server replaces one very expensive high end Unix on SAP. During all these decades, one single customer must have existed that want to save millions by choosing cheap x86 servers? But there are no one. Why?

    QUESTION_E) Why are there no such links anywhere on the whole internet?
    Maybe it is not true that SGI UV2000 servers can do scale-up?

    .

    Regarding the SPARC interconnect Bixby: “This is no different than the NUMAlink 6 interconnect from Silicon Graphics, which implements a shared memory space using Xeon E5 chips”

    There is a big difference! The Bixby stops at 32 sockets. Oracle that charges millions for large servers and tries to win business benchmarks such as database, SAP, etc - knows that if Oracle goes to 256 socket SPARC servers with Bixby, then business performance would suck. It would be a cluster. Unix and Mainframes have for decades scaled very well, and they have always stopped at 32-sockets or so. They could of course have gone higher, but Big Iron targets business workloads, not HPC number crunching clustering. That is the reason you always see Big Iron on the top business benchmarks, and never see any HPC cluster. Bixby goes up to 96-sockets but Oracle will never release such a server, performance would be bad.

    QUESTION_F) Small supercomputers (just a cluster of a 100 of nodes) are quite cheap in comparison to a single expensive 32-socket Unix server. Why do we never see supercomputers in top benchmarks for SAP and other business enterprise benchmarks? The business systems are very very expensive, and which HPC company would not want to earn easy peasy $millions by selling dozens of small HPC clusters to SAP installations? I know people that are SAP consultants, and large SAP installations can easily cost more than $100 millions. And also talking about extremely expensive database servers, why dont we see any small supercomputers for database workloads running Oracle? Why dont small supercomputers replace very expensive Unix servers in the enterprise business workloads?

    Why has not a single customer on the whole internet ever replaced business enterprise servers with a cheap small supercomputer? Why are there no such links anywhere?

    .

    "...Progress! Initially you were previously claiming that there were no x86 servers that scaled past 8 sockets that were used for business applications. Now you have finally accepted that there is a 32 socket x86 scale up server with the UV 300H as compared to your earlier FUD remarks here..."

    Wrong again. I have never changed my mind. There are NO x86 servers that scales past 8-sockets used for business applications today. Sure, 16-sockets x86 servers are for sale, but no one use them. Earlier I showed links to Bull Bullion 16-socket x86 server. And I have known about 32-socket SGI x86 servers earlier.

    The problem is that all x86 servers larger than 8-sockets - are not deployed anywhere. No customer are using them. There are no business benchmarks. I have also known about HP Big Tux experiment, it is a 64-socket Integrity (similar to Superdome) Unix server running Linux. But performance was so terribly bad, that cpu utilization under full load, under 100% full burn load - the cpu utilization was ~40% on 64-socket Unix server running Linux. So, HP never sold the Big Tux because it sucked big time. Just because there exists 16-sockets x86 servers such as Bull Bullion - does not mean anybody are using them. Because they suck big time. SGI have had large x86 servers for decades, and nobody have ever replaced large Unix servers with them. Nobody. Ever.

    The largest in use for business workloads are 8-sockets. So, no, I have not changed my mind. The largest x86 servers used for business workloads are 8-sockets. Can you show us a single link with a customer using a 16-socket or 32-socket x86 server used to replace a high end Unix server? No you can not. Search the whole internet, there are no such links.

    .

    "...My point here is that being an in-memory database and requiring locks are NOT mutually exclusive ideas as you’re attempting to argue. Furthermore, locking is not a strict necessity for write heavy transaction processing as long as there is a method in place to maintain data concurrency...."

    Jesus. "Locking is no requirement as long as there is another method to maintain data integrity"? How do you think these methods maintain data integrity? By locking of course! Deep down in these methods, they must make sure that only one alters the data. Two cpus altering the same data will result in data loss. And the only way to ensure they do not alter the same data, is by some sort of synchronization mechanism: one alters, and the other waits - that is, one locks the data before altering data. Making a locking signal is the only way to signal other cpus that certain data is untouchable at the moment! So, the other cpus reads the signal and waits. And if you have 100s of cpus, all trying to heavily alter data, chances increase some data is locked, so many cpus need to wait => performance drops. The more cpus, the higher the chance of waiting because someone else is locking the data. The more cpus you add in a transaction heavy environment -> performance drops because of locking. This is called "Law of diminishing returns". How on earth can you say that: locks are not necessary in a transaction heavy environment??? What do you know about parallel computing and comp sci??? Jesus you really do not know much about programming or comp sci. O_o

    Regarding in memory databases, some of them dont even have locking mechanisms, that is, no method of guaranteeing data integrity. The majority (all?) of them have only very crude basic locking mechanisms. On such "databases" you can not guarantee data integrity which makes them useless in an environment that alters data heavily - i.e. normal database usage. That is the reason in memory databases such as Oracle TimesTen are only used for reading data for analytics. Not normal database usage. A normal database stores and edits data. These can not alter data.

    I ask again:
    “There you have it. SAP Hana clustered version is only used for reading data, for analytics. This scales exceptionally well. SAP Hana also has a non clustered version for writing and storing business data which stops at 32 sockets. Now, can you stop your FUD? Stop say that scale-out servers can replace scale-up servers. If you claim they can, where are your proofs? Links? Nowhere."

    So now you can read from SAP web site in your own link, that SAP Hana cluster which is in memory database is only used for data analytics read usage. And for the business suite, you need a scale-up server. So you are wrong again.

    .

    As you display such large ignorance, here is the basics about scale-up vs scale-out that you should read before trying to discuss this. Otherwise everything will be wrong and people will mistake you for being a stupid person, when instead you are just uninformed and have not studied the subject. Here they say that maintaining state in transaction heavy environments is fiendishly complex and can only be done by scaling up.
    http://www.servercentral.com/scalability-101/

    "....Not every application or system is designed to scale out. Issues of data storage, synchronization, and inter-application communication are critical to resolve.

    To scale out, each server in the pool needs to be interchangeable. Another way of putting this is servers need to be “stateless”, meaning no unique data is kept on the server. For instance, an application server may be involved in a transaction, but once that transaction is complete the details are logged elsewhere – typically to a single database server.

    For servers that must maintain state—database servers, for instance—scaling out requires they keep this state in sync amongst themselves. This can be straightforward or fiendishly complex depending on the nature of the state and software involved. For this reason, some systems may still need to scale up despite the benefits of scaling out

    However, scaling up to ever-larger hardware poses some serious problems.

    The larger a server gets, the more expensive it becomes. It’s more difficult to design a ten-core processor than a dual-core, and it’s more difficult to create a four-processor server than one with a single CPU. As such, the cost for a given amount of processing power tends to increase as the size of the server increases. Eventually, as you reach into the largest servers, the number of vendors decreases and you can be more locked into specific platforms.

    As the size of the server increases, you’re placing more and more computing resources into a single basket. What happens if that server fails? If it’s redundant, that’s another large node you have to keep available as insurance. The larger the server, the more you’re exposed for failure.

    Most systems only scale up so far before diminishing returns set in. One process may have to wait on another, or has a series of tasks to process in sequence. The more programs and threads that run sequentially rather than in parallel, the less likely you’ll be to take advantage of the additional processor power and memory provided by scaling up." - Law of diminishing returns!!!!

    .

    I am not going to let you get away this time. We have had this discussion before, and every time you come in and spew your FUD without any links backing up your claims. I have not bothered correcting your misconceptions before but I am tired of your FUD, in every discussion. We WILL settle this, once and for all. You WILL back up your claims with links, or stop FUD. And in the future, you will know better than FUDing. Period.

    Show us one single customer that has replaced a single Unix high end server with a SGI scale-out server, such as UV2000 or Altix or ScaleMP or whatever. Just one single link.
  • Kevin G - Monday, May 25, 2015 - link

    @Brutalizer
    “For instance, you claim "x86 that goes maximum to 8-sockets and ~300.000 saps has better scaling than SPARC that scales to 40 sockets and getting close to a million saps". This logic of yours is just wrong. Scalability is about how successful a system is in tackling larger and larger workloads."
    And if you can do the same work with fewer resources, scalability is also better as the overhead is less. Say you want to increase performance to a given level: x86 systems would require fewer additional sockets to do it and would thus incur less of the overhead that reduces scalability.

    “QUESTION_A) If x86 stops at 8-sockets and ~300.000 saps vs SPARC stops at 40-sockets and ~850.000 saps - who is most capable of tackling larger and larger workloads? And still you insist in numerous posts that x86 is more scalable?? Que?”
    Actually the limits you state are flawed on both the x86 *AND* SPARC sides. Oracle can go to a 96 socket version of the M6 if they want. Even though you refuse to acknowledge it, the UV 2000 is a scale up server that goes to 256 sockets. Last I checked, 256 > 96 in terms of scalability. However socket count is only one aspect of performance but when the Xeon E5v2’s and E7v3’s are outrunning the SPARC chips in the M6 on a per core and per socket basis, it would be logical to conclude that a UV 2000 system would be faster and it wouldn’t even need all 256 sockets to do so.

    “Another example of you not understanding basic concepts is when you discuss why customers choose extremely expensive IBM P595 Unix 32-sockets server costing $35 million, instead of a very cheap SGI cluster with 256-socket (which do have much higher performance). Your explanation why customers choose IBM P595? Because P595 has better RAS!!! That is so silly I havent bothered to correct this among all your silly misconceptions. It is well known that it is much cheaper and gives better RAS to scale-out than scale-up:”
    My previous statements on this matter are perfectly in-line with the quote you’ve provided. So I’ll just repeat myself “That is how high availability is obtained: if the master server dies then a slave takes over transparently to outside requests. It is common place to order large database servers in sets of two or three for this very reason.” You scoffed at that statement then, but now you are effectively using your own points against yourself. http://anandtech.com/comments/9193/the-xeon-e78800...

    “Have you heard about Google? They have loads of money, and still choose millions of very cheap x86 servers because if any crashes, they just failover to another server. This gives much better uptime to have many cheap servers in a cluster, than one single large server that might crash. So, all customers prioritizing uptime would choose a cheap large x86 cluster, instead of a single very very expensive server. So this misconception is also wrong, customers dont choose expensive 32-socket Unix servers because of better uptime. Any cluster beats a single server when talking about uptime. So, now knowing this, answer me again on QUESTION_B) Why have not the market for high end Unix servers died immediately, if a cheap x86 scale-out cluster can replace any expensive Unix server? Why do customers choose to spend $millions on a single Unix server, instead of $100.000s on a x86 cluster? Unix servers are slow compared to a large x86 cluster, that is a fact.”
    No surprise that you have forsaken your database arguments here as that would explain the differences between what Google is doing with their massive x86 clusters and the large 32 socket servers: the workloads are radically different. For Google search, the clusters of x86 systems are front end web servers and back end application servers that crawl through Google’s web cache to provide search results. Concurrency here isn’t an issue as the clusters are purely reading from the web index so scaling out works exceptionally well for this workload. However there are some workloads that are best scaled up due to the need to maintain concurrency like OLTP databases which we were previously discussing. So we should probably get back on topic instead of shifting further and further away.
    I’ve also given another reason before why large Unix systems will continue to exist today: “Similarly there are institutions who have developed their own software on older Unix operating systems. Porting old code from HPUX, OpenVMS, etc. takes time, effort to validate and money to hire the skill to do the port. In many cases, it has been simpler to pay extra for the hardware premium and continue using the legacy code.” http://anandtech.com/comments/9193/the-xeon-e78800...

    “I have shown that you gain less and less saps for every cpu you add, for 16-sockets SPARC M10-4S server you gain 24.000 saps for every cpu. And after 32-socket you gain only 1000 saps for every cpu you add up to 40-sockets M10-4S server.”
    All you have shown is a horribly flawed analysis since the comparison was done between systems with different processor clock speeds. Attempting to determine the scaling of additional sockets between a system with 3.0 GHz processors and another with 3.7 GHz processors isn’t going to work as the performance per socket is inherently different due to the clock speeds involved. This should be obvious.

    “The point in having benchmarks is to compare between different systems, even though not everything is equal to 100%. Instead you extrapolate and try to draw conclusions. If you reject that premise and require that everything must be 100% equal, I can never prove anything with benchmarks or links. I am surprised you dont reject SPARC having close to a million saps because those saps are not 100% equal to x86 saps. The hard disks are different, they did not share the same disks. The current is different, etc.”
    I reject your scaling comparison because to determine scaling by adding additional sockets, the only variable you want to change is just the number of sockets in a system.
    For SAP scores as a whole, the comparison is raw performance, not specifically how that performance is obtained. Using different platforms here is fine as long as they perform the benchmark and are validated.

    “Anyway, your ugly tactic forces me to prove my claim about "law of diminishing returns" by show you links instead. I actually try to avoid mathematics and comp sci, because many might not understand it, but your ugly tactic forces me to do it. Here you see that there actually is something called law of diminshing returns [. . .] So there you have it. You are wrong when you believe that scaling always occurs linearly.”
    Citation please where I stated that performance scales linearly. In fact, I can quote myself twice in this discussion where I’ve stated that performance *does not* scale linearly. “While true that scaling is not going to be linear as socket count increases,” from http://anandtech.com/comments/9193/the-xeon-e78800... and “This is also why adding additional sockets to the UV2000 is not linear: more NUMALink6 node controllers are necessary as socket count goes up.” from http://anandtech.com/comments/9193/the-xeon-e78800...
    So please stop attributing to me the idea that performance scales linearly, as it is just dishonest on your part.
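
    For readers following along, one common textbook model of sub-linear scaling is Amdahl's law. Neither of us cited it above, so take this as an illustrative sketch only. The 5% serial fraction is a made-up parameter, not something measured on any SAP system.

```python
def amdahl_speedup(sockets: int, serial_fraction: float) -> float:
    """Speedup over one socket when a fixed fraction of the work cannot be parallelised."""
    if sockets < 1:
        raise ValueError("sockets must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / sockets)


SERIAL_FRACTION = 0.05  # hypothetical value chosen only for illustration

for n in (8, 16, 32, 40):
    print(n, "sockets ->", round(amdahl_speedup(n, SERIAL_FRACTION), 2), "x")
```

    Under this model each added socket contributes less than the previous one, but the total still grows. That is what "not linear" means here, and it says nothing by itself about whether a system is scale up or scale out.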

    “It seems you claim that UV300H is basically a stripped down UV2000. Well, if the UV2000 really can replace the UV300H, why dont SGI just sell a smaller configuration, say a 32-socket UV2000 with only one NUMAlink6?”
    They do: it is the 16 socket version of the UV 2000. I believe you have missed the ‘up to’ part of scaling up to 256 sockets. The UV 300H came about mainly for market segmentation, since reducing the number of hops between sockets lowers latency. By having a system with a more uniform latency, the scaling is better as socket count increases.

    “This is QUESTION_C) The most probable explanation is they are different servers targeting different workloads. UV2000 is scale-out and UV300H is scale-up. It does not make sense to manufacture different lines of servers, if they have the same use case. Have never thought about that?”
    Or the answer could simply be that the UV 2000 is a scale up server. Have you thought about that? I’ve given evidence before that the UV 2000 is a scale up server. I’ll repost again:
    https://www.youtube.com/watch?v=lDAR7RoVHp0 <- shows how many processors and sockets are in a single system running a single instance of Linux
    https://www.youtube.com/watch?v=KI1hU5g0KRo <- explains the topology and how it is able to have cache coherency and shared memory making it a single large SMP system.
    So far you have not provided anything that indicates that UV 2000 is a scale-out server as you claim.

    “QUESTION_D) Why have we never ever seen any SGI server on the sap benchmark list?
    SGI claims they have the fastest servers on the market, and they have been selling 100s-socket server for decades (SGI Altix) but still no SGI server has never ever made it into the SAP top benchmark list. Why dont any customer run SAP on cheap SGI servers, and never have? This is related to QUESTION_B). Is it because SAP is a scale-up system and SGI only does scale-out clusters?”
    SGI has only been offering large scale up x86 servers for a few years starting with the UV 1000 in 2010. The main reason is that such a configuration would not be supported by SAP in production. In fact, when it comes to Xeons, only the E7 and older 7000 series get certified in production. The UV 2000 interestingly enough uses Xeon E5v2 chips.

    “Show us a x86 benchmark close to a million saps. If you can not, stop saying that x86 is able to tackle the largest sap workloads - because it can not. If it can, you can show us benchmarks.”
    This is again a shifting of the goal posts as initially you wanted to see an x86 system that simply could compete. So initially I provided a link to the Fujitsu system in the top 10. I would consider an 8 socket x86 system in the top 10 out of 789 systems that ran that benchmark as competitive. If you wanted to, there are x86 systems that scale up further for additional performance.

    “But if I am wrong, then you can prove it, right? Show us ONE SINGLE LINK that supports your claim. If you are right, then there will be lot of links on the vast internet showing that customers replace large Unix servers with x86 clusters such as SGI UV2000 and saving lot of money and gaining performance in the process.”
    I’ll just repost this then: http://www.theplatform.net/2015/05/01/sgi-awaits-u...
    The big quote from the article: “CEO Jorge Titinger said that through the end of that quarter SGI had deployed UV 300H machines at thirteen customers and system integrators, and one was a US Federal agency that was testing a move to convert a 60 TB Oracle database system to a single instance of UV running HANA.”

    “There is a big difference! The Bixby stops at 32 sockets. Oracle that charges millions for large servers and tries to win business benchmarks such as database, SAP, etc - knows that if Oracle goes to 256 socket SPARC servers with Bixby, then business performance would suck. It would be a cluster.[. . .] Bixby goes up to 96-sockets but Oracle will never release such a server, performance would be bad.”
    So using the Bixby interconnect to 32 sockets is not a cluster but then scaling to 96 sockets with Bixby is a cluster? Sure, performance gains here would be far from linear due to additional latency overhead between sockets, but the system would still be scale up as memory would continue to be shared and cache coherency maintained. Those are the two big factors which determine if a system is a scale up SMP device or a cluster of smaller systems. Your distinction between what is a cluster and what is scale up is apparently arbitrary as you’ve provided absolutely no technical reason.

    “QUESTION_F) [. . .]Why have never ever a single customer on the whole internet, replaced business enterprise servers with a cheap small supercomputer? Why are there no such links nowhere?”
    People like using the best tool for the job. Enterprise workloads benefit heavily from a shared memory space and cache coherent architectures for concurrency. This is why scale-up servers are generally preferred for business workloads (though there are exceptions).

    “Wrong again. I have never changed my mind. There are NO x86 servers that scales past 8-sockets used for business applications today. Sure, 16-sockets x86 servers are for sale, but no one use them. Earlier I showed links to Bull Bullion 16-socket x86 server. And I have known about 32-socket SGI x86 servers earlier.”
    Indeed you have. “The largest x86 servers are all 8 sockets, there are no larger servers for sale and have never been.” was said here http://anandtech.com/comments/9193/the-xeon-e78800... and now you’re claiming you’ve known about 16 socket x86 servers from Bull and earlier 32 socket x86 SGI servers. These statements made by you are clearly contradictory.

    “The largest in use for business workloads are 8-sockets. So, no, I have not changed my mind. The largest x86 servers used for business workloads are 8-sockets. Can you show us a single link with a customer using a 16-socket or 32-socket x86 server used to replace a high end Unix server? No you can not. Search the whole internet, there are no such links.”
    At launch, HP’s announcement of the x86 based SuperDome X included a customer name, Cerner. ( http://www8.hp.com/us/en/hp-news/press-release.htm... )
    A quick bit of searching turned up that they used HPUX and Itanium systems before.
    ( http://www.cerner.com/About_Cerner/Partnerships/HP... )

    “Jesus. "Locking is no requirement as long as there is another method to maintain data integrity"? How do you think these methods maintains data integrity? By locking of course! Deep down in these methods, they must make sure that only one alters the data.”
    The key point, which you quote, is to maintain concurrency. Locking is just a means to achieve that goal but there are others. So instead of resorting to personal attacks as you have done, I’ll post some links to some non-locking concurrency techniques:
    http://en.wikipedia.org/wiki/Multiversion_concurre...
    http://en.wikipedia.org/wiki/Optimistic_concurrenc...
    http://en.wikipedia.org/wiki/Timestamp-based_concu...
    These other techniques are used in production systems. In particular, Microsoft’s Hekaton as part of SQL Server 2014 uses multi version concurrency control, which I previously provided a link to.
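
    As a rough illustration of the optimistic approach from the second link, here is a minimal single-threaded Python sketch. The `Record`, `begin`, `commit` and `add_with_retry` names are hypothetical and not taken from any real database. In a real multi-threaded engine the validate-and-write step inside `commit` would itself have to be atomic (for example via a compare-and-swap or a very short critical section); this sketch sidesteps that by interleaving two transactions by hand.

```python
from dataclasses import dataclass


@dataclass
class Record:
    value: int
    version: int = 0  # bumped on every successful commit


class ConflictError(Exception):
    """Raised when another transaction committed since we read the record."""


def begin(record: Record) -> tuple[int, int]:
    """Read phase: take a snapshot of the value and the version it had."""
    return record.value, record.version


def commit(record: Record, read_version: int, new_value: int) -> None:
    """Validate phase: write only if nobody committed after our read."""
    if record.version != read_version:
        raise ConflictError("record changed since it was read")
    record.value = new_value
    record.version += 1


def add_with_retry(record: Record, amount: int, max_attempts: int = 5) -> int:
    """Restart the whole transaction on conflict; returns the attempts used."""
    for attempt in range(1, max_attempts + 1):
        value, version = begin(record)
        try:
            commit(record, version, value + amount)
            return attempt
        except ConflictError:
            continue
    raise ConflictError("gave up after repeated conflicts")


# Two transactions read the same snapshot; T2 commits first, so T1 must restart.
account = Record(value=100)
t1_value, t1_version = begin(account)
t2_value, t2_version = begin(account)
commit(account, t2_version, t2_value + 10)
try:
    commit(account, t1_version, t1_value + 5)
    t1_conflicted = False
except ConflictError:
    t1_conflicted = True
attempts = add_with_retry(account, 5)  # T1 restarted from a fresh read
print(account, t1_conflicted, attempts)
```

    No transaction holds a lock while it works on its snapshot. The cost is paid at commit time, where a stale transaction is rejected and restarted.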

    “So now you can read from SAP web site in your own link, that SAP Hana cluster which is in memory database is only used for data analytics read usage. And for the business suite, you need a scale-up server. So you are wrong again.”
    How so? The UV 300H works fine by your admission for the HANA business suite as it is a scale up server. We are in agreement on this!

    “I am not going to let you get away this time. We have had this discussion before, and everytime you come in an spew your FUD without any links backing up your claims. I have not bothered correcting your misconceptions before but I am tired of your FUD, in every discussion. We WILL settle this, once and for all. You WILL back up your claims with links, or stop FUD. And in the future, you will know better than FUDing. Period.”
    Fine, I accept your surrender.

    “Show us one single customer that has replaced a single Unix high end server with a SGI scale-out server, such as UV2000 or Altix or ScaleMP or whatever. Just one single link.”

    The Institute of Statistical Mathematics in Japan replaced their Fujitsu M9000 running Solaris with UV 2000’s running Linux:
    http://virtualization.sys-con.com/node/2874817
    http://www.ism.ac.jp/computer_system/eng/sc/super-...
    http://www.ism.ac.jp/computer_system/eng/sc/super....
    *Note that the Institute of Statistical Mathematics also keeps a separate HPC cluster. This used to be a Fujitsu PRIMERGY x86 cluster but has been replaced by an SGI ICE X cluster also using x86 chips.

    Hamilton Sundstrand did their migration in two steps. The first was to migrate from Unix to Linux ( http://www.prnewswire.com/news-releases/hamilton-s... ) and then later migrated to a UV 1000 system ( http://pages.mscsoftware.com/rs/mscsoftware/images... )
  • Brutalizer - Monday, May 25, 2015 - link

    @FUDer KevinG

    "...Say if you want to increase performance to a given level, x86 systems would require fewer additional sockets to do it and thus lower overhead that reduces scalability...."

    And how do you know that x86 requires fewer sockets than SPARC? I have posted links about the "Law of diminishing returns" and shown that sometimes you can only scale up; scale-out will not do. SAP is a business system, i.e. a scale-up system, meaning scaling gets more difficult the more sockets you add - and if you claim that SAP gives linear scalability on x86, but not on SPARC - you need to show us links and back up your claim. Otherwise it is just pure disinformation:
    http://en.wikipedia.org/wiki/Fear,_uncertainty_and...
    "FUD is generally a strategic attempt to influence perception by disseminating negative and dubious or false information...The term FUD originated to describe disinformation tactics in the computer hardware industry".

    There you have it. If you spread pure disinformation (all your false statements) you are FUDing. So, I expect you to either confess that you FUD and stop spreading disinformation, or back up all your claims (which you can not because they are not true). Admit it, that you are a Troll and FUDer running SGI's errands, because nothing you ever say can be proven, because everything is lies.

    .

    "...Last I checked, 256 > 96 in terms of scalability. However socket count is only one aspect of performance but when the Xeon E5v2’s and E7v3’s are out running the SPARC chips in the M6 on a per core and per socket basis, it would be logical to conclude that an UV 2000 system would be faster and it wouldn’t even need all 256 sockets to do so...."

    Why is it "logical"? You have numerous times proven your logic is wrong. On exactly what grounds do you base your weird claim? So, explain your reasoning or show us links to why your claim is true, why you believe that a 256 socket SGI would easily outperform 32-socket Unix servers:
    I quote myself: "Note that SAP does not say you can run Business Suite on a 256-socket UV2000. They explicitly say SGI 32-socket servers is the limit. Why? Do you really believe a UV2000 can replace a scale-up server??? Even SAP confirms it can not be done!!"

    And again, can you answer QUESTION_A)? Why do you claim that x86 going to 8-sockets and 300.000 saps can tackle larger workloads (i.e. scales better) than SPARC with 40-sockets and 850.000 saps? Why are you ducking the question?
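
    For reference, the arithmetic behind "a third of the work with one fifth of the resources" can be checked from the rounded figures quoted in this thread. This is a minimal Python sketch; the numbers are the thread's own approximations (300.000 saps on 8 sockets, 850.000 saps on 40 sockets) and have not been re-checked against the SAP benchmark list.

```python
# Rounded figures as quoted in this thread, not official benchmark entries.
X86_SAPS, X86_SOCKETS = 300_000, 8
SPARC_SAPS, SPARC_SOCKETS = 850_000, 40

x86_per_socket = X86_SAPS / X86_SOCKETS        # throughput per x86 socket
sparc_per_socket = SPARC_SAPS / SPARC_SOCKETS  # throughput per SPARC socket
socket_ratio = X86_SOCKETS / SPARC_SOCKETS     # share of the sockets used
saps_ratio = X86_SAPS / SPARC_SAPS             # share of the score achieved

print("saps per socket:", x86_per_socket, "vs", sparc_per_socket)
print("socket ratio:", socket_ratio, "saps ratio:", round(saps_ratio, 2))
```

    Note what this does and does not show. It compares throughput per socket between two different CPU designs at one configuration each. It is not a measurement of how either platform behaves as sockets are added.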

    .

    "...My previous statements on this matter are perfectly in-line with the quote you’ve provided. So I’ll just repeat myself “That is how high availability is obtained: if the master server dies then a slave takes over transparently to outside requests. It is common place to order large database servers in sets of two or three for this very reason.” And then you scoffed at that statement but now you are effectively using your own points against yourself now..."

    Jesus. You are in a twisted maze, and can't get out. Look. I asked you why companies pay $35 million for a high end Unix IBM P595 with 32-sockets, when they can get a cheap 256-socket SGI server for much less money. To that you answered "because IBM P595 has better RAS". Now I explained that clusters have better RAS than a single point of failure server - so your RAS argument is wrong when you say that companies choose to spend more money on slower Unix servers than a cluster. To this you answer: "My previous statements on this matter are perfectly in-line with the quote you’ve provided. So I’ll just repeat myself “That is how high availability is obtained...now you are effectively using your own points against yourself now".

    I did not ask about High Availability. Do you know what we are discussing at all? I asked why do companies pay $35 million for a 32-socket Unix IBM P595 server, when they can get a much cheaper 256-socket SGI server? And it is not about RAS as clusters have better RAS!!! So explain again why companies choose to pay many times more for a Unix server that has much lower socket count and lower performance. This is QUESTION_G)

    .

    I asked you "...So, now knowing this, answer me again on QUESTION_B) Why have not the market for high end Unix servers died immediately, if a cheap x86 scale-out cluster can replace any expensive Unix server?"

    To this you answered:
    “...Similarly there are institutions who have developed their own software on older Unix operating systems. Porting old code from HPUX, OpenVMS, etc. takes time..."

    It seems that you claim that vendor lock-in causes companies to continue to buy expensive Unix servers, instead of choosing cheap Linux servers. Well, I got news for you, mr FUDer: if you have Unix code, then you can very easily recompile it for Linux on x86. So there is no vendor lock-in, you are not forced to continue buying expensive Unix servers because you can not migrate off to Linux.

    Sure, if you have IBM Mainframes or OpenVMS then you are locked in, and there are huge costs to migrate; essentially you have to rewrite large portions of the code as they are very different from POSIX Unix. But now we are talking about going from Unix to Linux. That step is very small and people routinely recompile code among Linux, FreeBSD, Solaris, etc. There is no need to pay $35 million for a single Unix server because of vendor lock-in.

    So you are wrong again. Companies choose to pay $millions for Unix servers, not because of RAS, and not because of vendor lockin. So why do they do it? Can you answer this question? It is still QUESTION_B). Is it because large Unix servers are all scale-up whereas large x86 servers are all scale-out?

    .

    "....However there are some workloads that are best scale up due to the need to maintain concurrency like OLTP databases..."

    Progress! You DO admit that there exist workloads that are scale-up!!! Do you also admit that scale-out servers can not handle scale-up workloads???? O_o

    .

    "...All you have shown is a horribly flawed analysis since the comparison was done between systems with different processor clock speeds. Attempting to determine the scaling of additional sockets between a system with 3.0 Ghz processors and another 3.7 Ghz processor isn’t going to work as the performance per socket is inherently different due to the clock speeds involved. This should be obvious....I reject your scaling comparison because to determine scaling by adding additional sockets, the only variable you want to change is just the number of sockets in a system...."

    I also showed you that the same 3.7GHz SPARC cpu on the same server achieves higher saps with 32-sockets, and achieves lower saps with 40-sockets. This was an effort to explain to you that as you add more sockets, SAP performance drops off sharply. The 40-socket benchmark uses the latest 12c version of the database vs 11g, which is marginally faster, but still performance dropped:
    http://shallahamer-orapub.blogspot.se/2013/08/is-o...

    .

    "...Citation please where I stated that performance scales linearly. In fact, I can quote myself twice in this discussion where I’ve stated that performance *does not* scale linearly...."

    So, why do you believe that 32-socket x86 would easily be faster than Unix servers? Your analysis is... non rigorous. If you assume linear scaling, then yes. But as I showed you with SPARC 3.7GHz cpus, scaling is sub-linear. Performance is not linear, so how do you know how well SAP scales? All accounts show x86 scales awfully (there would be SAP benchmarks all over the place if x86 routinely scaled to a million saps). You write:
    "...Yes, being able to do a third of the work with one fifth of the resources is indeed better scaling and that’s only with an 8 socket system. While true that scaling is not going to be linear as socket count increases, it would be easy to predict that a 32 socket x86 system would take the top spot...."

    .

    ME:“It seems you claim that UV300H is basically a stripped down UV2000. Well, if the UV2000 really can replace the UV300H, why dont SGI just sell a smaller configuration, say a 32-socket UV2000 with only one NUMAlink6?"
    YOU:"...They do. it is the 16 socket version of the UV 2000."

    Oh, so the UV300 is the same as a smaller UV2000? Do you have links confirming this or did you just make it up?

    Here SGI says they are targeting different workloads, so it sounds like they are different servers. Otherwise SGI would have said that UV2000 can handle all workloads so UV2000 is future proof:
    https://www.sgi.com/products/servers/uv/
    "SGI UV 2000...systems are designed for compute-intensive, fast algorithm workloads such as CAE, CFD, and scientific simulations...SGI UV 300...servers are designed for data-intensive, I/O heavy workloads"

    I quote from your own link:
    http://www.theplatform.net/2015/05/01/sgi-awaits-u...
    "....This UV 300 does not have some of the issues that SGI’s larger-scale NUMA UV 2000 machines have, which make them difficult to program even if they do scale further....This is not the first time that SGI, in several of its incarnations, has tried to move from HPC into the enterprise space....Now you understand why SGI is so focused on SAP HANA for its UV 300 systems."

    .

    "...Or the answer could simply be that the UV 2000 is a scale up server...."
    In the SGI link above, SGI explicitly says that UV2000 is for HPC number crunching workloads, i.e. scale-out workloads. Nowhere does SGI say that it is good for business workloads, such as SAP or databases. SGI does not say UV2000 is a scale up server.

    "...https://www.youtube.com/watch?v=lDAR7RoVHp0 <- shows how many processors and sockets that are on a single system running a single instance of Linux..."

    It is totally irrelevant if it runs a single instance of Linux. As I have explained earlier, ScaleMP also runs a single instance of Linux. But ScaleMP is a software hypervisor that tricks the Linux kernel into believing the cluster is a single scale-up server. Some developer said "It looks like single image but latency was awful to far away nodes". I can show you the link. What IS relevant is what kind of workloads people run on the server. That is the only relevant thing. Not how SGI marketing describes the server. Microsoft marketing claims Windows is an enterprise OS, but no one would run a stock exchange on Windows. The only relevant thing is, if Windows is an enterprise OS, how many run large demanding systems such as stock exchanges on Windows? No one. How many run enterprise business systems on SGI UV2000? No one.

    "...So far you have no provided anything that indicates that UV 2000 is a scale-out server as you claim..."

    Que? I have talked all the time about how SGI says that UV2000 is only used for HPC number crunching workloads, and never business workloads! HPC number crunching and data analytics => scale out. Business Enterprise systems => scale-up. Scientific simulations are run on clusters. Business workloads can not run on clusters, you need a scale-up server for that.

    OTOH, you have NEVER shown any links that prove UV2000 is suitable for scale-up workloads. Show us SAP benchmarks, I have asked numerous times in every post. You have not. No customer has ever run SAP on large SGI servers. No one. Never. Ever. For decades. If UV2000 is a scale up server, someone would have run SAP. But no one. Why? Have you ever thought about this? Maybe it is not suitable for SAP?

    .

    QUESTION_D) Why have we never ever seen any SGI server on the sap benchmark list?
    "...SGI has only been offering large scale up x86 servers for a few years starting with the UV 1000 in 2010. The main reason is that such a configuration would not be supported by SAP in production...."

    But this is wrong again, SAP explicitly says 32-sockets is the limit. SAP says you can't run on UV2000. I quote myself: "Note that SAP does not say you can run Business Suite on a 256-socket UV2000. They explicitly say SGI 32-socket servers is the limit. Why? Do you really believe a UV2000 can replace a scale-up server??? Even SAP confirms it can not be done!!"

    Also, SGI has certified SAP for the UV300H server which only scales to 16-sockets. So, SGI has certified servers. But not certified UV1000 nor UV2000 servers. SGI "UV" real name is SGI "Altix UV", and Altix servers have been on the market for long time. The main difference seems to be they used older NUMAlink3(?), and today it is NUMAlink6.
    http://www.sgi.com/solutions/sap_hana/

    .

    ME:“Show us a x86 benchmark close to a million saps. If you can not, stop saying that x86 is able to tackle the largest sap workloads - because it can not. If it can, you can show us benchmarks.”
    YOU:"This is again a shifting of the goal posts as initially you wanted to see an x86 system that simply could compete. So initially I provided a link to the Fujitsu system in the top 10. I would consider an 8 socket x86 system in the top 10 out of 789 systems that ran that benchmark as competitive. If wanted to, there are x86 systems that scale up further for additional performance."

    No, this is wrong again. I wanted you to post x86 sap benchmarks with high scores that could compete at the very top. You posted small 8-socket servers and said they could compete with the very fastest servers. Well, to most people 300.000 saps can not compete with 850.000 saps. So you have not done what I asked you. So, I ask you again for the umpteenth time; show us a x86 sap benchmark that can scale and compete with the fastest Unix servers. I have never "shifted goal posts" with this. My point is; x86 can not scale up to handle the largest workloads. And you claim this is wrong, that x86 can do that. In that case; show us sap benchmarks proving your claim. That has been the whole point all the time: asking you to prove your claim.

    So again, show us SAP benchmarks that can compete with the largest Unix servers. And no, 300.000 saps can not compete with 850.000 saps. Who are you trying to fool? So, prove your claim. Show us a single benchmark.

    .

    Me:“But if I am wrong, then you can prove it, right? Show us ONE SINGLE LINK that supports your claim. If you are right, then there will be lot of links on the vast internet showing that customers replace large Unix servers with x86 clusters such as SGI UV2000 and saving lot of money and gaining performance in the process.”
    YOU:"I’ll just repost this then: http://www.theplatform.net/2015/05/01/sgi-awaits-u...
    The big quote from the article: “CEO Jorge Titinger said that through the end of that quarter SGI had deployed UV 300H machines at thirteen customers"

    Que? Who are you trying to fool? I talked about UV2000, not UV300H!!! Show us ONE link where customers replaced high end Unix servers with a UV2000 scale-out server. UV300H stops at 32-sockets, so they are likely scale-up. And there has never been a question whether a scale-up server can run SAP. The question is whether UV2000 is suitable for scale-up workloads, and can replace Unix servers on e.g. SAP workloads. They can not. But you vehemently claim the opposite. So, show us ONE single link where customers run SAP on UV2000.

    I quote myself: "Note that SAP does not say you can run Business Suite on a 256-socket UV2000. They explicitly say SGI 32-socket servers is the limit. Why? Do you really believe a UV2000 can replace a scale-up server??? Even SAP confirms it can not be done!!"

    Why are you not giving satisfactory answers to my questions? I ask about UV2000, you show UV300H. I ask about one single link with UV2000 running large SAP, you never ever post such links. Why? If it is so easy to replace high end Unix servers with UV2000 - show us one single example. If you can not, who are you trying to fool? Is it even possible, or impossible? Has someone ever done it?? Nope. So... are you FUDing?

    I quote from your own link:
    http://www.theplatform.net/2015/05/01/sgi-awaits-u...
    "....This UV 300 does not have some of the issues that SGI’s larger-scale NUMA UV 2000 machines have, which make them difficult to program even if they do scale further....This is not the first time that SGI, in several of its incarnations, has tried to move from HPC into the enterprise space....Now you understand why SGI is so focused on SAP HANA for its UV 300 systems."

    It says, just like SAP says, that UV2000 is not suitable for SAP workloads. How can you keep on claiming that UV2000 is suitable for SAP workloads, when SGI and SAP say the opposite?

    .

    "...So using the Bixby interconnect to 32 socket is not a cluster but then scaling to 96 sockets with Bixby is a cluster?"

    Of course a 32-socket Unix server is a NUMA server, they all are! The difference is that if you have very few memory hops between far away nodes as Bixby has, it can run scale-up workloads fine. But if you have too many hops between far away nodes as scale-out clusters have, performance will suck and you can not run scale-up workloads. Oracle can scale Bixby much higher I guess, but performance would be bad and no one would use it to run scale-up workloads.

    Oracle is at war with IBM and both try vigorously to snatch top benchmarks, and do whatever they can. If Oracle 96-socket Bixby could win all benchmarks, Oracle would have. But scaling drops off sharply, that is the reason Oracle stops at 32-sockets. In earlier road maps, Oracle talked about 64-socket SPARC servers, but does not sell any such today. I guess scaling would not be optimal as SPARC M7 servers with 32-sockets already have great scalability with 1024 cores and 8.192 threads and 64 TB RAM. And M7 servers are targeting databases and other business workloads. That is what Oracle does. So M7 servers are scale-up. Not scale-out.

    .

    "...People like using the best tool for the job. Enterprise workloads benefit heavily from a shared memory space and cache coherent architectures for concurrency. This is why scale-up servers are generally preferred for business workloads (though there are exceptions)...."

    Wtf? So you agree that there is a difference between scale-up and scale-out workloads? Do you also agree that a scale-out server can not replace a scale-up server?

    .

    "...Indeed you have. “The largest x86 servers are all 8 sockets, there are no larger servers for sale and have never been.” was said here http://anandtech.com/comments/9193/the-xeon-e78800... and now you’re claiming you’ve known about 16 socket x86 servers from Bull and earlier 32 socket x86 SGI servers. These statements made by you are clearly contradictory...."

    As I explained, no one puts larger servers than 8-sockets in production, because they scale too badly (where are all the benchmarks???). I also talked about business enterprise workloads. We all know that SGI has manufactured larger servers than 8-sockets, for decades, targeting HPC number crunching. This discussion is about monolithic business software with code branching all over the place - that is the reason syncing and locking is hard on larger servers. If you have many cpus, syncing will be a big problem. This problem rules out 256-socket servers for business workloads.

    So again; no one uses larger than 8-socket x86 servers for business enterprise systems, because x86 scales too badly. Where are all the customers doing that? Where are all the benchmarks outclassing Unix high end servers?

    .

    "...At the launch HP’s x86 based SuperDome X included a customer name, Cerner. ( http://www8.hp.com/us/en/hp-news/press-release.htm... )
    A quick bit of searching turned up that they used HPUX and Itanium systems before.
    ( http://www.cerner.com/About_Cerner/Partnerships/HP... )..."

    And where does it say that Cerner used this 16-socket x86 server to replace a high end Unix server? Is it your own conclusion, or do you have links?

    .

    "...The key point, which you quote, is to maintain concurrency. Locking is just a means to achieve that goal but there are others. So instead of resorting to personal attacks as you have done, I’ll post some links to some non-locking concurrency techniques:..."

    Wow! Excellent that you for once post RELEVANT links backing up your claims. We are not used to that as you frequently post claims that are never backed up ("show us x86 sap scores competing with high end Unix servers").

    You show some links, but my point is that when you finally save the data, you need to use some synchronizing mechanism, some kind of lock. The links you show all delay the writing of data, but in the end, there is some kind of synchronizing mechanism.

    This is not used in heavy transaction environments so it is not interesting.
    http://en.wikipedia.org/wiki/Optimistic_concurrenc...
    "...OCC is generally used in environments with low data contention....However, if contention for data resources is frequent, the cost of repeatedly restarting transactions hurts performance significantly..."

    This has a lock deep down:
    http://en.wikipedia.org/wiki/Timestamp-based_concu...
    "....Even though this technique is a non-locking one,...the act of recording each timestamp against the Object requires an extremely short duration lock..."

    This must also synchronize writes, so there is a mechanism to hinder writes, i.e. a lock:
    http://en.wikipedia.org/wiki/Multiversion_concurre...
    "...Which is to say a write cannot complete if there are outstanding transactions with an earlier timestamp....Every object would also have a read timestamp, and if a transaction Ti wanted to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted..."

    If you had been studying comp sci, you would know that there is no way to guarantee data integrity without some kind of mechanism to stop multiple writes. You MUST synchronize multiple writes. There are ways to delay writes, but you must ultimately write all your edits - it must occur. Then you MUST stop things from writing simultaneously. There is no way to get around the final writing down to disk. And that final writing can get messy unless you synchronize. So all methods have some kind of mechanism to avoid simultaneous writes. I dont even have to read the links, this is common comp sci knowledge. It can not be done.
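
    The write rule quoted above from the multiversion link (abort Ti when TS(Ti) < RTS(P)) can be sketched in a few lines of Python. This is a simplified single-version, single-threaded illustration, not how any particular database implements it: the `VersionedObject`, `read` and `write` names are hypothetical, and real multiversion engines also keep older versions around for readers.

```python
from dataclasses import dataclass


@dataclass
class VersionedObject:
    value: str
    read_ts: int = 0   # RTS(P): newest timestamp of any transaction that read P
    write_ts: int = 0  # timestamp of the transaction that last wrote P


class Abort(Exception):
    """The transaction must be aborted and restarted with a new timestamp."""


def read(obj: VersionedObject, ts: int) -> str:
    """Record that a transaction with timestamp ts has read the object."""
    obj.read_ts = max(obj.read_ts, ts)
    return obj.value


def write(obj: VersionedObject, ts: int, value: str) -> None:
    """Apply the quoted rule: a writer older than the newest reader is aborted."""
    if ts < obj.read_ts:
        raise Abort(f"TS={ts} is earlier than RTS={obj.read_ts}")
    obj.value = value
    obj.write_ts = ts


p = VersionedObject(value="v0")
read(p, ts=20)                      # a transaction with timestamp 20 reads P
try:
    write(p, ts=10, value="late")   # an older transaction now tries to write
    aborted = False
except Abort:
    aborted = True
write(p, ts=30, value="v1")         # a newer transaction is allowed to write
print(p, aborted)
```

    The sketch only shows the ordering check. Whether the check-and-update itself needs a short lock or an atomic instruction in a concurrent implementation is exactly the point under dispute in this thread, and the sketch does not settle it.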

    .

    ME:“So now you can read from SAP web site in your own link, that SAP Hana cluster which is in memory database is only used for data analytics read usage. And for the business suite, you need a scale-up server. So you are wrong again.”
    YOU:"How so? The UV 300H works fine by your admission for the HANA business suite as it is a scale up server. We are in agreement on this!"

    Let me explain again. The SAP Hana in-memory version can run on scale-out systems as it only does reads for data analytics, whereas the business suite can only run on scale-up systems as it needs to write data. This is what SAP says.

    .

    ME:“Show us one single customer that has replaced a single Unix high end server with a SGI scale-out server, such as UV2000 or Altix or ScaleMP or whatever. Just one single link.”
    YOU: "...The Institute of Statistical Mathematics in Japan replaced their Fujitsu M9000 ..."

    Jesus. This whole discussion has been about whether scale-out systems can replace scale-up systems. You know that. Of course you can replace a SPARC M9000 (once held world record on floating point calculation) used for HPC workloads, with a scale-out server. That has never been the question whether a scale-out server can replace scale-up doing HPC computations.

    The question is, show us a single link where some customer replaced Unix high end server running scale-up business enterprise software (SAP, databases,etc) with a SGI UV2000 server. You can not.

    I don't know how many times I have asked this. And you have never posted any such links. Of course you have googled like mad, but never found any such links. Have you ever thought about why there are no such links out there on the whole internet? Maybe SGI UV2000 is not suitable for scale-up workloads? You even post links saying exactly this:
    http://www.theplatform.net/2015/05/01/sgi-awaits-u...
    "....This UV 300 does not have some of the issues that SGI’s larger-scale NUMA UV 2000 machines have, which make them difficult to program even if they do scale further....This is not the first time that SGI, in several of its incarnations, has tried to move from HPC into the enterprise space....Now you understand why SGI is so focused on SAP HANA for its UV 300 systems."

    Why does SGI not move into the lucrative enterprise space with UV2000? Why must SGI manufacture the 16-socket UV300H instead of the 256-socket UV2000? Why?
  • Brutalizer - Monday, May 25, 2015 - link

    (PS. I just emailed an SGI sales rep for more information/links on how to replace high end 32-socket Unix servers with the 256-socket SGI UV2000 server on enterprise business workloads such as SAP and Oracle databases. I hope they will get in touch soon and I will repost all information/links here. Maybe you could do the same if you don't trust my replies: ask for links on how to replace Unix servers with UV2000 - I suspect SGI will say it can not be done. :)
  • Kevin G - Monday, May 25, 2015 - link

    @FBrutalizer
    “And how do you know that x86 requires fewer sockets than SPARC?”
    Even with diminishing returns, there are still returns. In other benchmarks regarding 16 socket x86 systems, performance isn’t double that of an 8 socket system but it is still a very respectable gain. Going to 32 sockets with the recent Xeon E7v3 chips should be able to capture the top spot.

    “If you spread pure disinformation (all your false statements) you are FUDing. So, I expect you to either confess that you FUD and stop spreading disinformation, or backup all your claims.”
    How about a change of pace and you start backing up your claims?

    “Look. I asked you why companies pay $35 millions for a high end Unix IBM P595 with 32-sockets, when they can get a cheap 256-socket SGI server for much less money. To that you answered "because IBM P595 has better RAS”
    I’ve stated that you choose the best tool for the job. Some workloads you can scale out where a cluster of systems would be most appropriate. Some workloads it is best to scale up. If it can only scale up, you want to get a reliable system, especially for critical systems that cannot have any down time.
    You should also stop bringing up the P595 for this comparison: it hasn’t been sold by IBM for years now.

    “It seems that you claim that vendor lockin causes companies to continue buy expensive Unix servers, instead of choosing cheap Linux servers. Well, I got news for you, mr FUDer: if you have Unix code, then you can very easy recompile it for Linux on x86. So there is no vendor lockin, you are not forced to continue buying expensive Unix servers because you can not migrate off to Linux.”
    Unix systems do offer proprietary libraries and features that Linux does not offer. If a developer’s code sticks to POSIX and ANSI C, then it is very portable but the more you dig into the unique features of a particular flavor of Unix, the harder it is to switch to another. Certainly there is overlap between Unix and Linux but there is plenty unique to each OS.

    “I also showed you that the same 3.7GHz SPARC cpu on the same server, achieves higher saps with 32-sockets, and achieves lower saps with 40-sockets. This was an effort to explain to you that as you add more sockets, SAP performance drops off sharply. The 40-socket benchmark uses the 12c latest version of the database vs 11g, which is marginally faster, but still performance dropped”
    No, you didn’t even notice the clock speed differences and thought that they were the same as per your flawed analysis on the matter. It is a pretty big factor for performance and you seemingly have overlooked it. Thus your entire analysis on how those SPARC systems scaled with additional sockets is inherently flawed.

    >>“…Citation please where I stated that performance scales linearly.”
    “So, why do believe that 32-socket x86 would easily be faster than Unix servers?”
    That is not a citation. Another failed claim on your part.

    “Oh, so the UV300 is the same as a smaller UV2000? Do you have links confirming this or did you just made it up?”
    No, in fact I pointed out that the UV 300 uses the NUMALink7 interconnect whereas the UV 2000 uses NUMALink6 ( http://anandtech.com/comments/9193/the-xeon-e78800... ). I also provided some links regarding this which you apparently did not read but I’ll repost anyway:
    http://www.theplatform.net/2015/05/01/sgi-awaits-u...
    http://www.enterprisetech.com/2014/03/12/sgi-revea...

    "SGI does not say UV2000 is a scale up server.”
    You apparently did not watch this where they explain it is a giant SMP machine that goes up to 256 sockets: https://www.youtube.com/watch?v=KI1hU5g0KRo

    “It is totally irrelevant if it runs a single instance of Linux. As I have explained earlier, ScaleMP also runs a single instance of Linux.”
    Except that ScaleMP is not used by the UV2000 so discussing it here is irrelevant as I’ve pointed out before.

    “Que? I have talked all the time about how SGI says that UV2000 is only used for HPC number crunching work loads, an never business workloads!”
    That’s nice. HPC workloads can still be performed on large scale-up system just fine. So what specifically in the design of the UV 2000 makes it scale out as you claim? Just because you say so doesn’t make it true. Give me something about its actual design that counters SGI own documentation about it being one large SMP device. Actually back up this claim.
    Also you haven’t posted a link quoting SGI that the UV 2000 is only good for HPC. In fact, the link you continually post about this is a decade old and in the context of older cluster offerings SGI promoted, long before SGI introduced the UV 2000.

    “But this is wrong again, SAP explicitly says 32-sockets is the limit.”
    SAP doesn’t actually say that 32 sockets is a hard limit. Rather it is the largest number of sockets for a system that will be validated (and it is SAP here that is expecting the 32 socket UV 300 to be validated). Please in the link below quote where SAP says that HANA strictly cannot go past 32 sockets:
    https://blogs.saphana.com/2014/12/10/sap-hana-scal...

    “The main difference seems to be they used older NUMAlink3(?), and today it is NUMAlink6.”
    Again, the UV 2000 is using NUMALink6 as many of my links have pointed out. The UV 300 is using NUMALink7, which uses the same topology but increases bandwidth slightly while cutting latency.

    >>”This is again a shifting of the goal posts as initially you wanted to see an x86 system that simply could compete. So initially I provided a link to the Fujitsu system in the top 10. I would consider an 8 socket x86 system in the top 10 out of 789 systems that ran that benchmark as competitive. If wanted to, there are x86 systems that scale up further for additional performance."
    “No, this is wrong again. I wanted you to post x86 sap benchmarks with high scores that could compete at the very top.”
    And I did, a score in the top 10. I did not misquote you and I provided exactly what you initially asked for. Now you continue to attempt to shift the goal posts.

    “Why are you not giving satisfactory answers to my questions?”
    Because when I answer them you shift the goal posts and resort to some personal attacks.

    “It says, just like SAP says, that UV2000 is not suitable for SAP workloads. How can you keep on claiming that UV2000 is suitable for SAP workloads, when SGI and SAP says the opposite?”
    That quote indicates that it does scale further. The difficulty being mentioned is a point that has been brought up before: the additional latencies due to more interconnections as the system scales up.

    >>"...So using the Bixby interconnect to 32 socket is not a cluster but then scaling to 96 sockets with Bixby is a cluster?"
    “Of course a 32-socket Unix server is a NUMA server, they all are! The difference is that if you have very few memory hops between far away nodes as Bixby has, it can run scale-up workloads fine. But if you have too many hops between far away nodes as scale-out clusters have, performance will suck and you can not run scale-up workloads. Oracle can scale Bixby much higher I guess, but performance would be bad and no one would use it to run scale-up workloads.”
    Except you are not answering the question I asked. Simply put, what actually would make the 96 socket version a cluster whereas the 32 socket version is not? The answer is that they are both scale up and not clusters. The idea that performance would suck wouldn’t inherently change it from a scale-up system to a cluster as you’re alluding to. It would still be a scale up system, just like how the UV 2000 would be a scale up system.

    >>”…Indeed you have. “The largest x86 servers are all 8 sockets, there are no larger servers for sale and have never been.” was said here http://anandtech.com/comments/9193/the-xeon-e78800... and now you’re claiming you’ve known about 16 socket x86 servers from Bull and earlier 32 socket x86 SGI servers. These statements made by you are clearly contradictory...."
    “As I explained, no one puts larger servers than 8-sockets in production, because they scale too bad (where are all benchmarks???). I also talked about business enterprise workloads.”
    And this does not explain the contradiction in your statements. Rather this is more goal post shifting.

    "And where does it say that Cerner used this 16-socket x86 server to replace a high end Unix server? Is it your own conclusion, or do you have links?”
    Yes as I’ve stated, it wouldn’t make sense to use anything less than the 16 socket version. They could have purchased an 8 socket server from various other vendors like IBM/Lenovo or HP’s own DL980. Regardless of socket count, it is replacing a Unix system as HPUX is being phased out. If you doubt it, I’d just email the gentlemen that HP quoted and ask.

    “You show some links, but my point is that when you finally save the data, you need to use some synchronizing mechanism, some kind of lock. The links you show all delay the writing of data, but in the end, there is some kind of synchronizing mechanism.”
    Congratulations on actually reading my links! This must be the first time. However, it matters not as the actual point has been lost entirely again. There indeed has to be a mechanism in place for concurrency but I’ll repeat that my point is that it does not have to be a lock as there are alternatives. Even after reading my links on the subject, you simply just don’t get it that there are alternatives to locking.
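
    For readers following along, optimistic concurrency in its simplest form looks roughly like the sketch below. It is a toy, single-threaded simulation with made-up names (VersionedRecord, optimistic_update), not code from any real database. It assumes the final compare-and-set is one indivisible step; real systems get that guarantee from an atomic hardware instruction or a very short critical section, so the sketch shows writers retrying instead of waiting rather than settling what counts as a "lock".

```python
class VersionedRecord:
    """Hypothetical record carrying a version counter for validation."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def compare_and_set(self, expected_version, new_value):
        # ASSUMPTION: in a real system this check-and-update is atomic.
        if self.version != expected_version:
            return False  # someone else committed first
        self.value = new_value
        self.version += 1
        return True


def optimistic_update(record, compute, max_retries=10):
    """Read, compute without holding anything, then validate at commit time."""
    for attempt in range(1, max_retries + 1):
        seen_version = record.version
        seen_value = record.value
        new_value = compute(seen_value)
        if record.compare_and_set(seen_version, new_value):
            return attempt  # number of tries it took
    raise RuntimeError("too much contention, giving up")


balance = VersionedRecord(100)
interfered = []


def add_ten(value):
    # Simulate a competing writer committing while we are still computing,
    # but only during the first attempt.
    if not interfered:
        interfered.append(True)
        balance.compare_and_set(balance.version, value + 1)
    return value + 10


attempts = optimistic_update(balance, add_ten)
print(attempts, balance.value, balance.version)
```

    The first attempt is thrown away because the version changed underneath it, and the second attempt commits. Under heavy contention those discarded attempts are the cost that the Wikipedia quote on OCC warns about.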

    “If you had been studying comp sci, you would know that there is no way to guarantee data integrity without some kind of mechanism to stop multiple writes. You MUST synchronize multiple writes. There are ways to delay writes, but you must ultimately write all your edits - it must occur. Then you MUST stop things from writing simultaneously. There is no way to get around the final writing down to disk. And that final writing can get messy unless you synchronize. So all methods have some kind of mechanism to avoid simultaneous writes. I don't even have to read the links, this is common comp sci knowledge. It can not be done.”
    Apparently it does work and it is used in production enterprise databases as I’ve given examples of it with appropriate links. If you claim otherwise, how about a formal proof as to why it cannot work as you claim? How about backing one of your claims up for once?

    "Jesus. This whole discussion has been about whether scale-out systems can replace scale-up systems. You know that. Of course you can replace a SPARC M9000 (once held world record on floating point calculation) used for HPC workloads, with a scale-out server. That has never been the question whether a scale-out server can replace scale-up doing HPC computations.”
    Actually I’ve been trying to get you to realize that the UV 2000 is a scale up SMP system. Just like you’ve been missing my main point, you have overlooked that the institute was using it as a scale up system due to the need for a large shared memory space for their workloads. It was replaced by a UV 2000 due to its large memory capacity as a scale up server. Again, this answer fits your requirements of “Show us one single customer that has replaced a single Unix high end server with a SGI scale-out server, such as UV2000 or Altix or ScaleMP or whatever.” and again you are shifting the goal posts.
    And as a bonus, I posted a second set of links which you’ve ignored so I’ll repost them. Hamilton Sundstrand did their migration in two steps. The first was to migrate from Unix to Linux ( http://www.prnewswire.com/news-releases/hamilton-s... ) and then later migrated to a UV 1000 system ( http://pages.mscsoftware.com/rs/mscsoftware/images... ).
  • Brutalizer - Tuesday, May 26, 2015 - link

    @FUDer KevinG

    ME: And how do you know that x86 requires fewer sockets than SPARC?
    YOU: "...Even with diminishing returns, there are still returns. In other benchmarks regarding 16 socket x86 systems, performance isn’t double that of an 8 socket system but it still a very respectable gain. Going to 32 sockets with the recent Xeon E7v3 chips should be able to capture the top spot...."

    I must again remind myself that "there are no stupid people, only uninformed people". Your ignorance makes it hard to have a discussion with you, because there is so much stuff you have no clue of, you lack basic math knowledge, your logic is totally wrong, your comp sci knowledge is abysmal, and still you make up a lot of stuff without backing things up. How do you explain to a fourth grader that his understanding of complexity theory is wrong, when he lacks basic knowledge? You explain again and again, but he does not get it. How could he get it???

    Look, let me teach you. Benchmarks between a small number of sockets are not conclusive when you go to a higher number of sockets. It is easy to get good scaling from 1 to 2 sockets, but to go from 16 to 32 is another thing. Everything is different, locking is much worse because race conditions are more frequent, etc etc. Heck, even you write that scaling is difficult as you go to a high number of sockets, and still you claim that x86 would scale much better than SPARC, you claim x86 scales close to linear? On what grounds?

    Your grounds are that you have seen OTHER benchmarks. And what other benchmarks do you refer to? Did you look at scale-out benchmarks? Has it occurred to you that scale-out benchmarks can not be compared to scale-up benchmarks? So, what other benchmarks do you refer to, show us the links where 16-socket x86 systems get good scaling from 8-sockets. I would not be surprised if it were Java SPECjbb2005, LINPACK or SPECint2006 or some other clustered benchmark you refer to. That would be: not clever if you looked at clustered benchmarks and drew conclusions about scale-up benchmarks. I hope even you understand what is wrong with your argument? Scale-out clustered benchmarks always scale well because workload is distributed, so you can get good scaling. But scale-up SAP scaling is another thing.

    Show us the business enterprise scale-up benchmark, where they go from 8-socket x86 server up to 16-socket and get good scaling. This is going to be fun; I am going to slaughter all your scale-out 16-socket x86 benchmarks (you will only find scale-out clustered benchmarks which makes you laughable as you compare with scale-up :). Your "analysis" is a bit... non rigorous. :-)

    .

    "...How about a change of pace and you start backing up your claims?..."

    What claims should I back up? This whole thread started by me, claiming that it is impossible to get high business enterprise SAP scores on x86, because x86 scale-up servers stop at 8-sockets and scale-out servers such as SGI UV2000 can not handle scale-up workloads. You claim this is wrong, you claim that SGI UV2000 can handle business enterprise workloads such as SAP. To this I have asked you to post only one single SAP link where an x86 server can compete with the top Unix servers. You have never posted such a link. At the top are only Unix servers, at the very bottom are x86, far away. The performance difference is huge.

    Instead you claim that x86 320.000 saps can compete with SPARC 850.000 saps - well the x86 gets 37% of the SPARC score. Why do you claim it? Because the x86 score is in the top 10 list!!! That is so wrong logically. There are no other good performing servers than SPARC, it is alone far ahead at the top. The competition is left far behind, because they scale so bad. To this you say; well, x86 is among the rest in the bottom, so therefore it is competitive with SPARC.

    There are no other good business SAP servers today than SPARC and POWER8. Itanium is dead. Your only choice to get extremely high SAP scores is to go to SPARC. x86 can not do that. But you claim that UV2000 can do that. Well, in that case you should show us links where UV2000 does that, back up your claims and stop FUDing. Otherwise you make up things, and call them facts, when they are made up. Made up false claims that SGI UV2000 can achieve extreme SAP scores is called "negative or dubious information designed to spread confusion" - in other words: FUD.

    You are doing the very definition of FUD. You are claiming that UV2000 can do things, it can not. And you can not prove it. In other words, everything is made up. And that, is FUD. So, show us links, or admit you are FUDing.

    How about this scenario: "Did you know that SPARC M6 with 32-sockets can outclass SGI UV2000 with 256-sockets on HPC computations? Yes it can, SPARC is several times faster than x86! I am not going to show you benchmarks on this, you have to trust me when I say that SPARC M6 is much faster than UV2000." - is this FUD or what???

    Or this scenario: "Did you know that SGI UV2000 is quite unstable and crashes all the time? And no, I am not going to post links proving this, you have to trust me when I say this" - is this FUD or what???

    How about this familiar KevinG scenario: "Did you know that SGI UV2000 is faster at SAP than high end Unix servers? No, I am not going to show you benchmarks, you have to trust me on this" - is this FUD or what???

    Hey FUDer, can you back up your claims? Show us a SAP ranked benchmark with an x86 server. I don't know how many times I need to ask you to do this? Is this... the tenth time? Or 15th? Google a bit more, and hope you will find someone that uses SGI or ScaleMP for SAP benchmarks - but you will not find any because it is impossible. :)

    .

    "....Unix systems do offer proprietary libraries and features that Linux does not offer. If a developer’s code sticks to POSIX and ANSI C, then it is very portable but the more you dig into the unique features of a particular flavor of Unix, the harder it is to switch to another. Certainly there is overlap between Unix and Linux but there is plenty unique to each OS...."

    Sigh. So much... ignorance. Large enterprise systems such as SAP, Oracle database, etc - are written to be portable between different architectures and OSes; Linux, Solaris, IBM AIX, HP-UX, etc. So you are wrong again: the reason companies continue to shell out $millions for Unix servers is not because of vendor lockin. And it is not because of RAS. Your explanations are all flawed.

    So, answer me again: why has not the high end Unix market died in an instant if x86 scale-out servers such as SGI UV2000 can replace Unix servers at SAP, Databases, etc? Why are you ducking this question?

    .

    ME: “...I also showed you that the same 3.7GHz SPARC cpu on the same server, achieves higher saps with 32-sockets, and achieves lower saps with 40-sockets....”
    YOU: "...No, you didn’t even notice the clock speed differences and thought that they were the same as per your flawed analysis on the matter. It is a pretty big factor for performance and you seemingly have overlooked it. Thus your entire analysis on how those SPARC systems scaled with additional sockets is inherently flawed...."

    Wrong again. I quote myself: "I also showed you that the same 3.7GHz SPARC cpu on the same server, achieves higher saps with 32-sockets, and achieves lower saps with 40-sockets."

    And here is a quote from an earlier post where I do exactly this: compare the same 3.7GHz cpu: "....Ok, I understand what you are doing. For instance, the 16-socket M10-4S has 28.000 saps per socket, and the 32-socket M10-4S has 26.000 saps per socket - with exactly the same 3.7GHz cpu. Sure, the 32 socket used a newer and faster version of the database - and still gained less for every socket as it had more sockets...."
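
    The arithmetic behind those per-socket figures can be checked with a short sketch. The inputs are only the rounded per-socket numbers quoted in this comment (28.000 and 26.000 saps), not official benchmark results, and the function name scaling_summary is made up for the example. As noted above, the two runs used different database versions, so treat the output as a rough indication only.

```python
def scaling_summary(sockets_a, saps_per_socket_a, sockets_b, saps_per_socket_b):
    """Compare two configurations using their per-socket throughput."""
    total_a = sockets_a * saps_per_socket_a
    total_b = sockets_b * saps_per_socket_b
    speedup = total_b / total_a      # measured gain from the extra sockets
    ideal = sockets_b / sockets_a    # gain if scaling were perfectly linear
    efficiency = speedup / ideal     # fraction of the ideal gain achieved
    return total_a, total_b, speedup, efficiency


# Rounded figures quoted in the comment above (assumed, not official results).
total_16, total_32, speedup, efficiency = scaling_summary(16, 28_000, 32, 26_000)

print(f"16 sockets: ~{total_16:,} saps, 32 sockets: ~{total_32:,} saps")
print(f"speedup {speedup:.2f}x against an ideal 2.00x, efficiency {efficiency:.1%}")
```

    With these inputs, doubling the socket count gives roughly 1.86x rather than 2x the throughput, which is the per-socket drop described above.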

    And also, earlier, I noticed the clock speed differences, but it was exactly the same cpu model. I thought you would accept an extrapolation. Which you did not. So I compare same cpu model and same 3.7GHz clock speed, and I show that 40-socket gains less saps than 32-socket do - and this is an "inherently flawed" comparison? Why?

    Do you think I should accept your "analysis" where you look at scale-out benchmarks to conclude x86 scale-up scalability? Do you see the big glaring error in your "analysis"?

    .

    ME:“So, why do believe that 32-socket x86 would easily be faster than Unix servers?”
    YOU: "That is not a citation. Another failed claim on your part."

    Explain again what makes you believe a 32-socket x86 server would scale better than Unix servers. Is it because you looked at scale-out clustered x86 benchmarks and concluded about x86 scalability for SAP? And therefore you believe x86 would scale much better?

    .

    ME:“Oh, so the UV300 is the same as a smaller UV2000? Do you have links confirming this or did you just made it up?”
    YOU: "No, in fact I pointed out that the UV 300 uses NUMALink7 interconnect where as the UV 2000 uses NUMALink6..."

    Can you quote the links and the places where SGI says that UV300 is just a 16-socket UV2000 server?

    .

    ME:"SGI does not say UV2000 is a scale up server.”
    YOU: "You apparently did not watch this where they explain it is a giant SMP machine that goes up to 256 sockets:"

    But still UV2000 is exclusively viewed as a scale-out HPC server by SGI (see my quote from your own link below where SGI talks about getting into the enterprise market with HPC servers), and never used for scale-up workloads. So what does it matter what SGI marketing labels the server? Can you show us one single customer that uses UV2000 as a scale-up server? Nope. Why?

    .

    ME: “It is totally irrelevant if it runs a single instance of Linux. As I have explained earlier, ScaleMP also runs a single instance of Linux.”
    YOU:"...Except that ScaleMP is not used by the UV2000 so discussing it here is irrelevant.."

    It is relevant. You claim that because UV2000 runs a single image kernel with shared memory, UV2000 is a scale-up server. That criterion is wrong as I explained. I can explain again: ScaleMP is a true clustered scale-out server, which they themselves explain, and ScaleMP runs a single image kernel and shared memory. Hence, your argument is wrong when you claim that UV2000 is a scale-up server because it runs a single image kernel. ScaleMP, which is a true scale-out server, does the same. So your argument is invalid, by this explanation which is very relevant.

    .

    YOU:"...So what specifically in the design of the UV 2000 makes it scale out as you claim? Just because you say so doesn’t make it true. Give me something about its actual design that counters SGI own documentation about it being one large SMP device. Actually back up this claim...."

    All customers are using SGI UV2000 for scale-out HPC computations. No one has ever used it for scale-up workloads. Not a single customer. There are no examples, no records, no links, no scale up benchmarks, no nothing. No one uses UV2000 for scale up workloads such as big databases - why??
    http://www.zdnet.com/article/scale-up-and-scale-ou...
    "...Databases are a good example of the kind of application that people run on big SMP boxes because of cache coherency and other issues..."

    Sure, let SGI marketing call UV2000 a SMP server, they can say it is a carrot if they like - but still no one is going to eat it, nor use it to run scale-up business workloads. There are no records, nowhere. Show us one single link where customers use UV2000 for enterprise workloads. OTOH, can you show us links where customers use UV2000 for scale-out HPC computations? Yes you can - the internet is swarming with such links! Where are the scale-up links? Nowhere. Why?

    See links immediately below here:

    YOU:"...Also you haven’t posted a link quoting SGI that the UV 2000 is only good for HPC. In fact, the link you continually post about this is a decade old and in the context of older cluster offers SGI promoted, long before SGI introduced the UV 2000...."

    Que? I have posted several links on this! For instance, here is another, where I quote from your own link:
    http://www.enterprisetech.com/2014/03/12/sgi-revea...
    "...What I am trying to understand is how you, SGI, is going to be deploying technologies that it has developed for supercomputing in the business environment. I know about the deal you have done with SAP on a future HANA system, but this question goes beyond in-memory databases. I just want to get my brain wrapped around the shape of the high-end enterprise market you are chasing..."

    "...Obviously, Linux can span an entire UV 2000 system because it does so for HPC workloads, but I am not sure how far a commercial database made for Linux can span...."

    "...So in a [SGI UV2000] system that we, SGI, build, we can build it for high-performance computing or data-intensive computing. They are basically the same structure at a baseline..."

    "...IBM and Oracle have become solution stack players and SAP doesn’t have a high-end vendor to compete with those two. That’s where we, SGI, see ourselves getting traction with HPC servers into this enterprise space...."

    "...The goal with NUMAlink 7 is...reducing latency for remote memory. Even with coherent shared memory, it is still NUMA and it still takes a bit more time to access the remote memory..."

    In all other links I have posted, SGI says the same thing as here "SGI is not getting into the enterprise market segment yet", etc. In this link SGI says UV2000 systems are for computing, not for business enterprise. SGI says they are getting traction with HPC servers, into enterprise. SGI talks about difficulties to get into the enterprise space with HPC servers. They do not mention any scale-up servers ready to get into enterprise.

    So here you have it again. SGI exclusively talks about getting HPC computation servers into enterprise. Read your link again and you will see that SGI only talks about HPC servers.

    .

    "...SAP doesn’t actually say that 32 sockets is a hard limit. Rather it is the most number of sockets for a system that will be validated (and it is SAP here that is expecting the 32 socket UV 300 to be validated). Please in the link below quote where SAP say that HANA strictly cannot go past 32 sockets..."

    In the link they only talk about 32-sockets, they explicitly mention 32-socket SGI UV300H and don't mention UV2000 with 256 sockets. They say that bigger scale-up servers than 32-sockets will come later. I quote from the link:
    "....The answer for the SAP Business Suite is simple right now: you have to scale-up. This advice might change in future, but even an 8-socket 6TB system will fit 95% of SAP customers, and the biggest Business Suite installations in the world can fit in a SGI 32-socket with 24TB..."

    "...HP ConvergedSystem 900... is available with up to 16 sockets and 4TB for analytics, or 12TB for Business Suite. The HP CS900 uses their Superdome 2 architecture

    SGI have their SGI UV300H appliance... 32 sockets and 8TB for analytics, or 24TB for Business Suite.

    Bear in mind that bigger scale-up systems will come, as newer generations of Intel CPUs come around. The refresh cycle is roughly every 3-4 years, with the last refresh happening in 2013...."

    .

    ME:“No, this is wrong again. I wanted you to post x86 sap benchmarks with high scores that could compete at the very top.”
    YOU: "And I did, a score in the top 10. I did not misquote you and I provided exactly what you initially asked for. Now you continue to attempt to shift the goal posts."

    No, not a score in top 10. Why do you believe I mean top 10? I meant at the very top. I want you to show links where x86 beat the Unix servers in SAP benchmarks. Go ahead. This whole thread started by me posting that x86 can never challenge high end Unix servers in SAP, that you need to go to Unix if you want the best performance. Scale-up x86 wont do, and scale-out x86 wont do. I quote myself:
    "...So, if we talk about serious business workloads, x86 will not do, because they stop at 8 sockets. Just check the SAP benchmark top - they are all more than 16 sockets, Ie Unix servers. X86 are for low end and can never compete with Unix such as SPARC, POWER etc. scalability is the big problem and x86 has never got passed 8 sockets. Check SAP benchmark list yourselves, all x86 are 8 sockets, there are no larger...."

    So, now I ask you again: show us a x86 sap benchmark that can compete at the very top. Not at the very bottom with 37% of the top performance - that is laughable.
    QUESTION_H) Can x86 reach close to a million saps at all? Is it even possible with any x86 server? Answer this question with links to benchmarks. And no "yes they can, trust me on this, I am not going to prove this" - doesn't count as it is pure FUD. So answer this question.

    .

    ME:"It says, just like SAP says, that UV2000 is not suitable for SAP workloads. How can you keep on claiming that UV2000 is suitable for SAP workloads, when SGI and SAP says the opposite?”
    YOU:"That quote indicates that it does scale further. The difficulty being mentioned is a point that been brought up before: the additional latencies due to more interconnections as the system scales up."

    The question is not whether you can scale further; the question is whether it scales well enough for actual use. And your link says that UV2000 has scaling issues for enterprise usage, and that it is an HPC server. Here you have it again, how many links on this do you want?

    ...This UV 300 does not have some of the issues that SGI’s larger-scale NUMA UV 2000 machines have, which make them difficult to program even if they do scale further....This is not the first time that SGI, in several of its incarnations, has tried to move from HPC into the enterprise space....

    .

    "...Except you are not answering the question I asked. Simply put, what actually would make the 96 Bixby socket version a cluster where as the 32 socket version is not?..."

    I answered and said that both of them are NUMA servers. I think I need to explain more to you, as you don't know much about these things. Larger servers are all NUMA (i.e. tightly coupled clusters), meaning they have bad latency to far away nodes. Latency differs between close nodes and far away nodes - i.e. NUMA. True SMP servers have the same latency no matter which cpu you reach. If you can keep the NUMA server small and with good engineering, you can still get a decent latency making it suitable for scale-up enterprise business workloads where the code branches heavily. As SGI explained in an earlier link, enterprise workloads branch heavily in the source code making that type of code less suitable for scale-out servers - this is common knowledge. I know you have read this link earlier.

    Obviously 32-socket SPARC Bixby has low enough latency to be good for enterprise business usage, as Oracle dabbles in that market segment. But as we have not seen 96-socket Bixby servers yet, I suspect that latency to far away cpus differs too much, making performance less than optimal. Otherwise Oracle would have sold 96-socket servers, if performance were good enough. But they don't.

    .

    ME: “As I explained, no one puts larger servers than 8-sockets in production, because they scale too bad (where are all benchmarks???). I also talked about business enterprise workloads.”
    YOU: "And this does not explain the contradiction in your statements. Rather this is more goal post shifting."

    It is not goal shifting. All the time we have talked about large scale-up servers for enterprise usage, and if I forget to explicitly use all those words in every sentence, you call it "goal shifting". It is you who pretends not to understand. If I say "x86 can not cope with large server workloads" you will immediately talk about HPC computations, ignoring the fact that this whole thread is about enterprise workloads.

    You, on the other hand, are deliberately spreading a lot of negative or false information - i.e. FUD. You know that no one has ever used UV2000 for SAP usage, there are no records on the whole internet - but still you write it. That is actually false information.

    .

    ME: "And where does it say that Cerner used this 16-socket x86 server to replace a high end Unix server? Is it your own conclusion, or do you have links?”
    YOU: "Yes as I’ve stated, it wouldn’t make sense to use anything less than the 16 socket version. They could have purchased an 8 socket server from various other vendors like IBM/Lenovo or HP’s own DL980. Regardless of socket count, it is replacing a Unix system as HPUX is being phased out. If you doubt it, I’d just email the gentlemen that HP quoted and ask."

    So, basically you are claiming that your own "conclusions" are facts? And if I doubt your conclusions, I should find it out myself by trying to get NDA-protected information from some guy at large HP? Are you serious?

    Wow. Now we see this again "trust me on this, I will not prove it to you. If you want to find out, you can find it out yourself. I don't know how you should find this guy, but it is your task to prove my made up claim". Wow, you are really good at FUDing. So, I guess this "fact" is also something you are not going to back up? Just store it among the rest of the "facts" that never get proven? Lot of FUD here...

    .

    "....Congratulations on actually reading my links! This must be the first time. However, it matters not as the actual point has been lost entirely again. There indeed has to be a mechanism in place for concurrency but I’ll repeat that my point is that it does not have to be a lock as there are alternatives. Even after reading my links on the subject, you simply just don’t get it that there are alternatives to locking...."

    Of course I read it because I know comp sci very well, and I _know_ your claim is impossible, it was just a matter of finding the text to quote. I knew the text would be there, so I just had to find it.

    Can you explain again why there are alternatives to locking? You linked to three "non-locking" methods - but I showed that they all do have some kind of lock deep down, they must have some mechanism to synch other threads so they don't simultaneously overwrite data. If you want to guarantee data integrity you MUST have some kind of way of stopping others from writing data. So if you claim it is possible to not do this, it is revolutionary and I think you should inform the whole comp sci community. Which you obviously are not a part of.

    .

    "...Apparently it does work and it is used in production enterprise databases as I’ve given examples of it with appropriate links. If you claim other wise, how about a formal proof as to why it cannot as you claim? How about backing one of your claims up for once?..."

    I backed up my claims by quoting text from your own links, that they all have some kind of locking mechanism deep down. I've also explained why it can not be done. If several threads write the same data, you need some way of synching the writing, to stop others from overwriting. It can not be done in another way. This is common sense, and in parallel computations you talk about race conditions, mutex, etc etc. It would be a great breakthrough if you could write the same data simultaneously and at the same time guarantee data integrity. But that can not be done. This is common sense.

    .

    "....Actually I’ve been trying to get you to realize that the UV 2000 is a scale up SMP system...."

    It is very easy to make me realize UV2000 is a scale up system - just prove it. I am a mathematician and if I see a proof, I immediately change my mind. Why would I not? If I believe something false, I must change my mind. So, show us some links proving that UV2000 is used for enterprise business workloads such as SAP or databases, etc. If you can show such links, I will immediately say I was wrong and that UV2000 is a good all round server suitable for enterprise usage as well. But, the thing is - there are no such links! No one does it! What does that tell you?

    Unless you can prove it, you can not change my mind. It would be impossible.
    Look: "SPARC M6 32-sockets are much faster than UV2000 with 256-sockets on HPC computations. I am not going to prove it, but Oracle marketing says SPARC is the fastest cpu in the world, so it must be true. I just want to make you realize that SPARC M6 is much faster than UV2000. But I will not prove it by links nor benchmarks".

    Would you change your mind on this claim? Would you believe SPARC M6 is much faster than UV2000? Just because I say so? Nope you would not. But, if I could show you benchmarks where SPARC M6 was in fact, much faster than UV2000? Would you change your mind then? Yes you would!

    In effect you are FUDing. And unless you post links and prove your claims, I am not going to change my mind. I hope you realize that. The only way to convince a mathematician is to prove it. Show us links and benchmarks. Credible I must add. One blog where some random guy writes something does not count. Show us official and validated benchmarks.

    .

    "...Just like you’ve been missing my main point, you have overlook where the institute was using it was a scale up system due the need for a large shared memory space for their workloads. It was replaced by a UV 2000 due to its large memory capacity as a scale up server. Again, this answer fit your requirements of “Show us one single customer that has replaced a single Unix high end server with a SGI scale-out server, such as UV2000 or Altix or ScaleMP or whatever.” and again you are shifting the goal posts...."

    Jesus. It is YOU that constantly shifts the goal posts. You KNOW that this whole discussion is about x86 and scale up business enterprise workloads. Nothing else. And if I don't specify "business enterprise workloads" in every sentence, you immediately jump on that and shift to talking about HPC calculations or whatever I did not specify. You KNOW we talk only about scale-up workloads. Math institutes doing computations is NOT business enterprise, it is all about HPC. You know that. And because I thought you were clever enough to know we both talked about enterprise business workloads, I did not specify that in every sentence - and immediately you shifted goal posts at once, taking the chance to talk about math institutes doing HPC calculations. And at the same time accuse ME of shifting goal posts??? Wtf??? Impertinent indeed.

    So, obviously you shift goal posts and you FUD. A lot. What will you try next? How about the truth? Show us links where one single customer replaced Unix high end servers with a large scale-out server on BUSINESS ENTERPRISE WORKLOADS SUCH AS SAP OR DATABASES??? (I did not forget to specify this time)

  • Kevin G - Wednesday, May 27, 2015 - link

    @Brutalizer
    “Look, let me teach you. Benchmarks between a small number of sockets are not conclusive when you go to a higher number of sockets. It is easy to get good scaling from 1 to 2 sockets, but to go from 16 to 32 is another thing. “
    Please quote me where I explicitly claim otherwise. I have stated that scaling is non-linear as socket count increases. We’re actually in agreement on this point, yet you continue to insist otherwise. Also if you feel the need to actually demonstrate this idea again, be more careful as your last attempt had some serious issues.

    “Your grounds are that you have seen OTHER benchmarks. And what other benchmarks do you refer to? Did you look at scale-out benchmarks? Has it occured to you that scale-out benchmarks can not be compared to scale-up benchmarks? So, what other benchmarks do you refer to, show us the links where 16-socket x86 systems get good scaling from 8-sockets. I would not be surprised if it were Java SPECjbb2005, LINPACK or SPECint2006 or some other clustered benchmark you refer to. ”
    Those are perfectly valid benchmarks as well to determine scaling. Remember, a scale up system can still run scale out software as a single node just fine. A basic principle still has to be maintained to isolate scaling: similar system specifications with just varying socket count to scale up. For example, SPECint2006 can be run on a SPARC M10 with 4 sockets as well as with 8 or 16 sockets, etc. It’d just be a generic test of integer performance as additional sockets are added, which can be used to determine how well the system scales with that workload. Also due to the overhead of adding another socket, performance scaling will be less than linear.
    While you can say that SPECint2006 is not a business workload, which is correct, you cannot deny its utility to determine system scaling. Taking SPECint2006 scaling as the optimistic case, as you claim it is, the result would then serve as an upper bound for other benchmarks (i.e. system scaling cannot go beyond this factor). It can also be used to indicate where diminishing returns, if any, can be found as socket count goes up. If diminishing returns are severe with an optimistic scaling benchmark, then they should appear sooner with a more rigorous test. This would put an upper limit on how many additional sockets would be worthwhile to include in a system.
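    To make the scaling argument concrete, here is a minimal Python sketch of how speedup and scaling efficiency could be computed from benchmark scores at different socket counts. The scores are made-up numbers purely for illustration, not real SPECint2006 or SAP results, and the function name is hypothetical.

```python
def scaling_report(scores):
    """Return (sockets, speedup, efficiency) tuples.

    scores maps socket count -> benchmark score. Speedup is measured
    against the smallest configuration; efficiency is the speedup
    divided by the ideal (linear) speedup.
    """
    sockets = sorted(scores)
    base = sockets[0]
    report = []
    for n in sockets:
        speedup = scores[n] / scores[base]
        ideal = n / base
        report.append((n, speedup, speedup / ideal))
    return report


# Hypothetical scores, chosen only to show less-than-linear scaling.
example_scores = {4: 1000.0, 8: 1900.0, 16: 3500.0, 32: 6000.0}

for n, speedup, efficiency in scaling_report(example_scores):
    print(f"{n:>3} sockets: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

    If efficiency already drops noticeably on an optimistic benchmark, that would suggest a more demanding workload hits diminishing returns at an even lower socket count.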

    “What claims should I backup?”
    How about that the UV 2000 is a cluster. You have yet to demonstrate that point while I’ve been able to provide evidence that it is a scale up server.

    “This whole thread started by me, claiming that it is impossible to get high business enterprise SAP scores on x86”
    Incorrect. A top 10 score using only eight sockets on an x86 system for SAP has been validated. Apparently the impossible has been done.

    “Sigh. So much... ignorance. Large enterprise systems such as SAP, Oracle database, etc “
    The context was with regards to custom applications that companies themselves would write. The portability of the code was for the business to determine, not a 3rd party vendor. Legacy support and unique features of Unix are just some of the reasons why people will continue to use those systems even in the face of faster hardware. Hence another point you don’t understand.
    Even with your context of 3rd party vendors, businesses fail or they’re bought out by another company where products only move to legacy and are no longer updated. Not all 3rd party software gets ported between all the various flavors of Unix and Linux. Case in point: HPUX is tied to Itanium and thus a dead platform. Thus any HPUX exclusive 3rd party software is effectively dead as well.

    “Wrong again. I quote myself: "I also showed you that the same 3.7GHz SPARC cpu on the same server, achieves higher saps with 32-sockets, and achieves lower saps with 40-sockets."”
    And now you are even missing the point that you yourself were trying to make, which was how scaling from 32 sockets to 40 sockets was poor, when in fact that comparison was invalid due to the differences in clock speed.

    “And also, earlier, I noticed the clock speed differences, but it was exactly the same cpu model. I thought you would accept an extrapolation.”
    I rejected it appropriately as you never pointed out the clock speed differences in your analysis, hence your conclusions were flawed. Also I think it is fair to reject extrapolation as you’ve also rejected my extrapolations elsewhere even when I indicated them as such. Fair game.

    “Explain again what makes you believe a 32-socket x86 server would scale better than Unix servers. Is it because you looked at scale-out clustered x86 benchmarks and concluded about x86 scalability for SAP?”
    Again, this is not a citation.

    “Can you quote the links and the where SGI say that UV300 is just a 16-socket UV2000 server?”
    Or, you know, you could have just read what I stated and realized that I’m saying that the UV300 is not a scaled down UV2000. (The UV 300 is a scaled down version of the UV 3000 that is coming later this year to replace the UV 2000.) Rather a 16 socket UV 2000 would have the same attribute of uniform latency, as all the sockets would be the same distance from each other in terms of latency. Again, this is yet another point you’ve missed.

    “But still UV2000 is exclusively viewed as a scale-out HPC server by SGI (see my quote from your own link below where SGI talks about getting into the enterprise market with HPC servers), and never used for scale-up workloads. So what does it matter what SGI marketing label the server? “
    No, SGI states that it is a scale up server plus provides the technical documentation to back up that claim. The idea that they’re trying to get into the enterprise market should be further confirmation that it is a scale-up server that can run business workloads. Do you actually read what you’re quoting?

    “Can you show us one single customer that use UV2000 as a scale-up server?”
    I have before: the US Post Office.

    “All customers are using SGI UV2000 for scale-out HPC computations. No one has ever used it for scale-up workloads. Not a single customer. There are no examples, no records, no links, no scale up benchmarks, no nothing. No one use UV2000 for scale up workloads such as big databases - why??”
    I’ve given you a big example before: the US Post Office. Sticking your head in the sand is not a technical reason for the UV 2000 being a cluster as you claim. Seriously, back up your claim that the UV 2000 is a cluster.

    “Sure, let SGI marketing call UV2000 a SMP server, they can say it is a carrot if they like - but still no one is going to eat it, nor use it to run scale-up business workloads.”
    Or you could read the technical data on the UV 2000 and realize that it has a shared memory architecture with cache coherency, two attributes that define a modern scale up SMP system. And again, I’ll reiterate that the US Post Office is indeed using these systems for scale-up business workloads.

    “Que? I have posted several links on this!”
    Really? The only one I’ve seen from you on this matter is a decade old and not in the context of SGI’s modern line up. The rest are just links I’ve presented that you quote out of context or you just do not understand what is being discussed.

    “For instance, here is another, where I quote from your own link:”
    Excellent! You’re finally able to accept that these systems can be used for databases and business workloads as the quote indicates that is what SGI is doing. Otherwise I find it rather strange that you’d quote things that run counter to your claims.

    ARTICLE: "...Obviously, Linux can span an entire UV 2000 system because it does so for HPC workloads, but I am not sure how far a commercial database made for Linux can span....”
    Ah! This is actually interesting as it is in the context of the maximum number of threads a database can actually use. For example, MS SQL Server prior to 2014 could only scale to a maximum of 80 concurrent threads per database. Thus for previous versions of MS SQL Server, any core count past 80 would simply go to waste due to software limitations. As such, there may be similar limitations in other commercial databases that would be exposed on the UV 2000 that wouldn’t apply elsewhere. Thus the scaling limitation being discussed here is with the database software, not the hardware, so you missed the point of this discussion.

    ARTICLE:"...So in a [SGI UV2000] system that we, SGI, build, we can build it for high-performance computing or data-intensive computing. They are basically the same structure at a baseline..."
    And you missed the part of the quote indicating ‘or data-intensive computing’, which is a key part of the point being quoted. Please actually read what you are posting.

    ARTICLE: "...IBM and Oracle have become solution stack players and SAP doesn’t have a high-end vendor to compete with those two. That’s where we, SGI, see ourselves getting traction with HPC servers into this enterprise space...."
    This would indicate that the UV 2000 and UV 300 are suitable for enterprise workloads which runs counter to your various claims here.

    ARTICLE: "...The goal with NUMAlink 7 is...reducing latency for remote memory. Even with coherent shared memory, it is still NUMA and it still takes a bit more time to access the remote memory..."
    Coherency and shared memory are two trademarks of a large scale-up server. Quoting this actually hurts your arguments about the UV 2000 being a cluster as I presume you’ve also read the parts leading up to this quote and the segment you cut. The idea that accessing remote memory adds additional latency is a point I’ve made elsewhere in our discussion and it is one of the reasons why scaling up is nonlinear. Thus I can only conclude that your quoting of this is to support my argument. Thank you!

    “So here you have it again. SGI exclusively talks about getting HPC computation servers into enterprise. Read your link again and you will see that SGI only talks about HPC servers.”
    And yet you missed the part where they were talking about those systems being used for enterprise workloads. Again, thank you for agreeing with me!

    “In the link they only talk about 32-sockets, they explicitly mention 32-socket SGI UV3000H and dont mention UV2000 with 256 sockets. They say that bigger scale-up servers than 32-sockets will come later.”
    Which would ultimately mean that 32 sockets is not a hard limit for SAP HANA as you’ve claimed. I’m glad you’ve changed your mind on this point and agree with me on it.

    “No, not a score in the top 10. Why do you believe I meant top 10? I meant at the very top. I want you to show links where x86 beat the Unix servers in SAP benchmarks. Go ahead. This whole thread started with me posting that x86 can never challenge high end Unix servers in SAP, that you need to go to Unix if you want the best performance. Scale-up x86 won’t do, and scale-out x86 won’t do.”
    Except a score in the top 10 does mean that they are competitive, which is what you originally asked for. The top 10 score was an 8 socket offering, counter to your claims that all the top scores were 16 sockets or more. (And it isn’t the only 8 socket system in the top 10 either, IBM has a POWER8 system ranked 7th.)
    Also if you really looked, there are 16 socket x86 scores from several years ago. At the time of their submission they were rather good but newer systems have displaced them over time. The main reason the x86 market went back to 8 sockets was that Intel reined in chipset support with the Nehalem generation (the 16 socket x86 systems used 3rd party chipsets to achieve that number). This was pure market segmentation as Intel still had hopes for the Itanium line at the time. Thankfully the last two generations of Itanium chips have used QPI so that the glue logic developed for them can be repurposed for today’s Xeons. This is why we’re seeing x86 systems with more than 8 sockets reappear today.
    http://download.sap.com/download.epd?context=40E2D...
    http://download.sap.com/download.epd?context=9B280...

    “I answered and said that both of them are NUMA servers. I think I need to explain more to you, as you don’t know much about these things. Larger servers are all NUMA (i.e. tightly coupled clusters), meaning they have bad latency to far away nodes. Latency differs between close nodes and far away nodes - i.e. NUMA. True SMP servers have the same latency no matter which cpu you reach.”
    So by your definition above, all the large 32 socket systems are then clusters because they don’t offer uniform latency. For example, the Fujitsu SPARC M10-4S needs additional interconnect chips to scale past 16 sockets and thus latency on opposite sides of this interconnect is not uniform. IBM’s P795 uses a two tier topology with distinct MCM and remote regions for latency. IBM’s older P595 used two different ring buses for an interconnect where latency even on a single ring was not uniform. A majority of 16 socket systems are also clusters by your definition as there are distinct local and remote latency regions. By your definition, only select 8 socket systems and most 4 and 2 socket systems are SMP devices, as processors at this scale can provide a single direct link between all other sockets.
    Or it could be that your definition of what a true SMP server is, is incorrect, as systems like the SPARC M10-4S, IBM P795, IBM P595, SGI UV 300 and SGI UV 2000 are all large SMP systems. Rather, the defining traits are straightforward: a single logical system with shared memory and cache coherency between all cores and sockets. Having equal latency between sockets, while ideal for performance, is not a necessary component of the definition.
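    The distinction between uniform and non-uniform socket-to-socket latency can be pictured with a small Python sketch. The latency matrices below are hypothetical values in arbitrary relative units, not measurements of any real system, and the function names are made up for illustration.

```python
def remote_latencies(latency):
    """Collect the distinct socket-to-socket latencies, ignoring local access."""
    n = len(latency)
    return {latency[i][j] for i in range(n) for j in range(n) if i != j}


def is_uniform(latency):
    """True when every remote socket is the same distance away."""
    return len(remote_latencies(latency)) <= 1


# Hypothetical 4-socket system where every socket links directly to every other.
fully_connected = [
    [10, 20, 20, 20],
    [20, 10, 20, 20],
    [20, 20, 10, 20],
    [20, 20, 20, 10],
]

# Hypothetical 4-socket system built from two pairs joined by an extra hop.
two_tier = [
    [10, 20, 40, 40],
    [20, 10, 40, 40],
    [40, 40, 10, 20],
    [40, 40, 20, 10],
]

print("fully connected uniform:", is_uniform(fully_connected))
print("two tier uniform:", is_uniform(two_tier))
```

    In this toy model both systems are a single shared address space; only the latency pattern differs, which is the property the two definitions above disagree about.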

    “If you can keep the NUMA server small and with good engineering, you can still get a decent latency making it suitable for scale-up enterprise business workloads where the code branches heavily. As SGI explained in an earlier link, enterprise workloads branch heavily in the source code making that type of code less suitable for scale-out servers - this is common knowledge. I know you have read this link earlier.”
    Again, define branch heavy in this context. I’ve asked for this before without answer. I believe you mean something else entirely.

    “It is not goal shifting. All the time we have talked about large scale-up servers for enterprise usage, and if I forget to explicitly use all those words in every sentence, you call it "goal shifting".”
    Since that is pretty much the definition of goal shifting, thank you for admitting to it. In other news, you still have not explained the contradiction in your previous statements.

    “So, basically you are claiming that your own "conclusions" are facts? And if I doubt your conclusions, I should find it out myself by trying to get NDA-protected information from some guy at large HP? Are you serious?”
    Apparently it isn’t much of an NDA if it is part of a press release. Go ask the actual customer HP quoted as they’re already indicating that they are using a Superdome X system.

    “Can you explain again why there are alternatives to locking?”
    There are alternative methods for maintaining concurrency that do not use locking. Locking is just one of several techniques for maintaining concurrency. There is no inherent reason to believe that there should only be one solution to provide concurrency.

    “You linked to three "non-locking" methods - but I showed that they all do have some kind of lock deep down, they must have some mechanism to synch other threads so they don’t simultaneously overwrite data. If you want to guarantee data integrity you MUST have some kind of way of stopping others from writing data.”
    You didn’t actually demonstrate that locking was used for OCC or MVCC. Rather you’ve argued that since concurrency is maintained, it has to have locking, even though you didn’t demonstrate where the locking is used in these techniques. Of course since they functionally replace locking for concurrency control, you won’t find it. Also, skip the personal attacks; I’ve shown where these techniques are used in enterprise production databases.

    “I backed up my claims by quoting text from your own links, that they all have some kind of locking mechanism deep down. I’ve also explained why it can not be done. If several threads write the same data, you need some way of synching the writing, to stop others from overwriting. It can not be done in another way. This is common sense, and in parallel computations you talk about race conditions, mutex, etc etc. It would be a great breakthrough if you could write the same data simultaneously and at the same time guarantee data integrity. But that can not be done. This is common sense.”
    This is the problem here: I’m not arguing about the end goal of concurrency. Rather, it is on how concurrency is obtained that you’re missing the point entirely. There are other ways of doing it than a lock. It can be done and I’ve shown that they’re used in production grade software.
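    For readers unfamiliar with optimistic concurrency control (OCC), here is a minimal, single-threaded Python sketch of the general idea: a writer reads a version number, computes its update without holding anything, and the commit only succeeds if the version is unchanged. The class and function names are hypothetical and this is not how any particular database implements it. One loudly stated assumption: the validate-and-write step in `commit` must be atomic in a real engine (for example via a hardware compare-and-swap); here it is atomic only because the demo runs in one thread.

```python
class VersionedStore:
    """Toy key-value store where every value carries a version number."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Missing keys are reported as version 0 with no value.
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        # Validate-and-write. ASSUMPTION: a real engine must make this step
        # atomic; in this single-threaded demo nothing can interleave here.
        current_version, _ = self.read(key)
        if current_version != expected_version:
            return False  # someone else committed first: validation fails
        self._data[key] = (current_version + 1, new_value)
        return True


def add_with_retry(store, key, amount, max_retries=10):
    """Read, compute, try to commit; on conflict, re-read and try again."""
    for attempt in range(1, max_retries + 1):
        version, value = store.read(key)
        if store.commit(key, version, (value or 0) + amount):
            return attempt
    raise RuntimeError("too many conflicts")


store = VersionedStore()
store.commit("balance", 0, 100)                # initial write
v_a, val_a = store.read("balance")             # transaction A reads
add_with_retry(store, "balance", 50)           # transaction B commits first
a_ok = store.commit("balance", v_a, val_a + 10)  # A's stale commit is rejected
a_attempts = add_with_retry(store, "balance", 10)  # A retries from a fresh read
print(a_ok, a_attempts, store.read("balance"))
```

    Note that no writer is ever blocked while it works; a conflicting writer simply fails validation and retries. Whether the short atomic validation step counts as "a lock deep down" is the terminology question the two sides of this thread are arguing over.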

    “It is very easy to make me realize UV2000 is a scale up system - just prove it. I am a mathematician and if I see a proof, I immediately change my mind. Why would I not? If I believe something false, I must change my mind. So, show us some links proving that UV2000 is used for enterprise business workloads such as SAP or databases, etc. If you can show such links, I will immediately say I was wrong and that UV2000 is a good all round server suitable for enterprise usage as well. But, the thing is - there are no such links! No one does it! What does that tell you?”
    Oh I have before, the US Post Office has a UV 2000 for database work. Of course you then move the goal posts to where SAP HANA was no longer a real database.

    “Jesus. It is YOU that constantly shifts the goal posts. You KNOW that this whole discussion is about x86 and scale up business enterprise workloads. Nothing else. And if I don’t specify "business enterprise workloads" in every sentence, you immediately jump on that and shift to talking about HPC calculations or whatever I did not specify.”
    Again, that is pretty much the definition of shifting the goal posts and I thank you again for admitting to it.

    “ You KNOW we talk only about scale-up workloads. Math institutes doing computations is NOT business enterprise, it is all about HPC. You know that.”
    Actually, what I pointed out was a key attribute of those large scale up machines: a single large memory space. That is why the institute purchased the M9000 as well as the UV 2000. If they just wanted an HPC system, they’d get a cluster, which they did separately alongside each of these units. In other words, they bought *both* a scale up and a scale out system at the same time. In 2009 the scale up server selected was an M9000 and in 2013 their scale up server was a UV 2000. It fits your initial request for a UV 2000 replacing a large scale up Unix machine.
  • Brutalizer - Sunday, May 31, 2015 - link

    @FUDer KevinG

    I caught the flu, but now I feel better.

    .

    ME:"It is easy to get good scaling from 1 to 2 sockets, but to go from 16 to 32 is another thing. “
    YOU: "Please quote me where I explicitly claim otherwise."

    Well, you say that because x86 benchmarks scale well going from 8-sockets to 16 sockets, you expect x86 to scale well for 32-sockets too, on SAP. Does this not mean you expect x86 to scale close to linearly?

    .

    ME: "I would not be surprised if it were Java SPECjbb2005, LINPACK or SPECint2006 or some other clustered benchmark you refer to.”
    YOU: "Those are perfectly valid benchmarks as well to determine scaling."

    Hmmm.... actually, this is really uneducated. Are you trolling or do you really not know the difference? All these clustered benchmarks are designed for clustered scale-out servers. For instance, LINPACK is typically run on supercomputers, big scale-out servers with 100,000s of cpus. There is no way these cluster benchmarks can assess the scalability of SAP and other business workloads on 16- or 32-socket scale-up servers. Another example is SETI@home which can run on millions of cpus, but that does not mean SAP or databases could also run on millions of cpus. I hope you realize you can not use scale-out benchmarks to draw conclusions for scale-up servers? Basic comp sci knowledge says there is a big difference between scale-up and scale-out. Did you not know, or are you just pretending to not know? Trolling? Seriously?
    http://en.wikipedia.org/wiki/Embarrassingly_parall...
    "...In parallel computing, an embarrassingly parallel workload... is one for which little or no effort is required to separate the problem into a number of parallel tasks..."

    BTW, have you heard about P-complete problems? Or NC-complete problems? Do you know something about parallel computations? You are not going to answer this question as well, right?
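    The gap between embarrassingly parallel workloads and workloads with a serial portion can be illustrated with Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cpus. The sketch below is only a rough model: the parallel fractions are made-up examples, and the formula ignores interconnect latency entirely.

```python
def amdahl_speedup(parallel_fraction, n_cpus):
    """Ideal speedup under Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


# Hypothetical parallel fractions: 1.00 is embarrassingly parallel,
# the others leave 1% and 10% of the work serial.
for p in (1.0, 0.99, 0.90):
    row = ", ".join(
        f"{n} cpus: {amdahl_speedup(p, n):.1f}x" for n in (8, 32, 256)
    )
    print(f"parallel fraction {p:.2f} -> {row}")
```

    Even a 10% serial portion caps the speedup below 10x no matter how many cpus are added, which is why results from embarrassingly parallel benchmarks do not automatically carry over to other workloads.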

    Where are the benchmarks on x86 servers going from 8-sockets up to 16-sockets that you have used to draw conclusions about x86 scalability? I have asked you about these benchmarks. Can you post them and back up your claims and prove you speak the truth, or is this also more of your lies, i.e. FUD?
    http://en.wikipedia.org/wiki/Fear,_uncertainty_and...
    "...FUD is generally a strategic attempt to influence perception by disseminating...false information..."

    .

    ME: “What claims should I backup?”
    YOU: "How about that the UV 2000 is a cluster. You have yet to demonstrate that point while I’ve been able to provide evidence that it is a scale up server."

    I showed you several links from SGI, where they talk about trying to get into the scale-up enterprise market, coming from the HPC market. Nowhere does SGI say they have a scale-up server. SGI always talks about their HPC servers, trying to break into the enterprise market. You have seen several such links, you have even posted such links yourself. If SGI had good scale-up servers that easily bested Unix high end servers, SGI would not talk about their HPC servers. Instead SGI talks about their UV300H 16-socket server trying to get a piece of the enterprise market. Why does SGI not use their UV2000 server if UV2000 is a scale-up server?

    And where are the UV2000 enterprise benchmarks? Where are the SAP benchmarks?

    .

    ME: “This whole thread started by me, claiming that it is impossible to get high business enterprise SAP scores on x86”
    YOU: "Incorrect. A top 10 score using only eight sockets on an x86 system for SAP has been validated. Apparently the impossible has been done."

    Que? That is not what I asked! Are you trying to shift goal posts again? I quote myself again in my first post, nowhere do I ask about top 10 results:
    "So, if we talk about serious business workloads, x86 will not do, because they stop at 8 sockets. Just check the SAP benchmark top - they are all more than 16 sockets, Ie Unix servers. X86 are for low end and can never compete with Unix such as SPARC, POWER etc. scalability is the big problem and x86 has never got passed 8 sockets. Check SAP benchmark list yourselves, all x86 are 8 sockets, there are no larger."

    So, again, post a SAP benchmark competing with the largest Unix servers, with close to a million saps. Go ahead, we are all waiting. Or is it impossible for x86 to achieve close to a million saps? There is no way, no matter how hard you try? You must go to Unix? Or, can you post an x86 benchmark doing that? Well, x86 is worthless for SAP as it does not scale beyond 8 sockets on SAP and therefore can not handle extreme workloads.

    .

    ME:“Sigh. So much... ignorance. Large enterprise systems such as SAP, Oracle database, etc “
    YOU: "The context was with regards to custom applications that companies themselves would write. The portability of the code was the businesses to determine, not a 3rd party vendor. "

    What ramblings. I asked you why the high end Unix market has not died in an instant, if x86 can replace them (which they can not, it is impossible to reach close to a million saps with any x86 server, SGI or not), to which you replied something like "it is because of vendor lock-in that companies continue to buy expensive Unix servers instead of cheap x86 servers". And then I explained you are wrong because Unix code is portable, which makes it easy to recompile among Linux, FreeBSD, Solaris, AIX,..., - just look at SAP, Oracle, etc, they are all available under multiple Unixes, including Linux. To this you replied with some incomprehensible ramblings? And you claim you have studied logic? I ask one question, you duck it (where are all the links) or answer another question which I did not ask. Again, can you explain why the Unix high end market has not been replaced by x86 servers? It is not about RAS, and it is not about vendor lock-in. So, why do companies pay $millions for one paltry 32-socket Unix server, when they can get a cheap 256-socket SGI server?

    .

    ME: “Wrong again. I quote myself: "I also showed you that the same 3.7GHz SPARC cpu on the same server, achieves higher saps with 32-sockets, and achieves lower saps with 40-sockets."
    YOU: "And now you are even missing the points that you yourself were trying to make which how scaling from 32 sockets to 40 sockets was poor when in fact that comparison was invalid due to the differences in clock speed."

    Que? I accepted your rejection of my initial analysis where I compared 3GHz cpu vs 3.7GHz of the same cpu model on the same server. And I made another analysis where I compared 3.7GHz vs 3.7GHz on the same server and showed that performance dropped with 40-sockets compared to 32-sockets, on a cpu per cpu basis. Explain how I was "missing the points that you yourself were trying to make"?

    .

    "...Also I think it is fair to reject extrapolation as you’ve also rejected my extrapolations elsewhere even as indicate as such..."

    But your extrapolations are just.. quite stupid. For instance, using scale-out LINPACK benchmarks to assess scalability on SAP benchmarks? You are comparing apples to oranges. I compared the same cpu model, with the same GHz, on the same server - which you rejected.

    .

    ME:“Explain again what makes you believe a 32-socket x86 server would scale better than Unix servers. Is it because you looked at scale-out clustered x86 benchmarks and concluded about x86 scalability for SAP?”
    YOU: "Again, this is not a citation."

    No, but it is sheer stupidity. Can you explain again what makes you believe that? Or are you going to duck that question again? Or shift goal posts?

    .

    ME:“Can you quote the links and the where SGI say that UV300 is just a 16-socket UV2000 server?”
    YOU: "Or you know you could have just read what I stated and realize that I’m saying that the UV300 is not a scaled down UV2000. (The UV 300 is a scaled down version of the UV 3000 that is coming later this year to replace the UV 2000.) Rather a 16 socket UV 2000 would have same attribute of having a uniform latency as all the sockets would be the same distance from each other in terms of latency. Again, this is yet another point you’ve missed."

    What point have I missed? You claim that UV300 is just a 16-socket version of the UV2000. And I asked for links proving your claim. Instead of showing us links, you ramble something and conclude with "you missed the point"? What point? You missed the entire question! Show us links proving your claim. And stop about talking about that, I have missed some incomprehensible point you made. Instead, show us the links you refer to. Or is this also lies, aka FUD?

    "...FUD is generally a strategic attempt to influence perception by disseminating...false information..."

    .

    "...No, SGI states UV2000 it is a scale up server plus provides the technical documentation to backup that claim. The idea that they’re trying to get into the enterprise market should be further confirmation that it is a scale-up servers that can run business workloads. Do you actually read what you’re quoting?..."

    Great! Then we can finally settle this question! Show us scale-up benchmarks done with the UV2000 server, for instance SAP or large databases or other large business workloads. Or do they not exist? You do know that North Korea claims they are democratic and just, etc - do you believe that, or do you look at the results? Where are the UV2000 results running enterprise workloads? Why does SGI tout UV300H for the enterprise market instead of UV2000? SGI does not mention UV2000 running enterprise workloads, they only talk about UV300H. But I might have missed those links, show us them. Or is it more of the same old lies, i.e. the links do not exist, it is only FUD?

    .

    ME: “Can you show us one single customer that use UV2000 as a scale-up server?”
    YOU: "I have before: the US Post Office."

    I have seen your link where USPS used a UV2000 for fraud detection, not as a database storing data. Read your link again: "U.S. Postal Service Using Supercomputers to Stamp Out Fraud"

    Analytics is not scale-up, it is scale-out. I have explained this in some detail and posted links from e.g. SAP and links talking about in memory databases which are exclusively used for analytics. Do you really believe anyone stores persistent data in RAM? No, RAM based databases are only used for analytics, as explained by SAP, etc.

    You have not showed a scale-up usage of the UV2000 server, for instance, running SAP or Oracle databases for storing data. Can you post such a link? Any link at all?

    .

    "...I’ve given you a big example before: the US Post Office. Sticking your head in the sand is not a technical reason for the UV 2000 being a cluster as you claim. Seriously, back up your claim that the UV 2000 is a cluster..."

    I have shown numerous links saying that UV2000 is used for HPC clustered workloads, and SGI talks about the HPC market segment, etc. There do not exist any enterprise UV2000 benchmarks, such as SAP. Not a single customer in decades has ever used a large HPC cluster from SGI for SAP. No one. Never. Ever. On the whole internet. Why is that, do you think? If you claim SGI's large servers are faster and cheaper and can replace high end Unix 32-socket servers - why has no one ever done that? Do they not want to save $millions? Do they not want much higher performance? Why?

    .

    ME:“Sure, let SGI marketing call UV2000 a SMP server, they can say it is a carrot if they like - but still no one is going to eat it, nor use it to run scale-up business workloads.”
    YOU; "Or you could read the technical data on the UV 2000 and realize that it has a shared memory architecture with cache coherency, two attributes that define a modern scale up SMP system. And again, I’ll reiterate that the US Post Office is indeed using these systems for scale-up business workloads."

    Well, no one uses SGI UV2000 for enterprise business workloads. The US Post Office is using it for fraud detection, that is, analysis in an in-memory database. Not storing data. You store data on disks, not in memory.

    .

    ME: “Que? I have posted several links on this!”
    YOU; "Really? The only one I’ve seen from you on this matter is a decade old and not in the context of SGI’s modern line up."

    Que? That SGI link explains the main difference between HPC workloads and enterprise business workloads. It was valid back then and it is valid today: the link says that HPC workloads run in a tight for loop, crunching data, and there is not much data passed between the cpus. Enterprise code branches all over the place, so there is much communication among the cpus, making it hard for scale-out servers. This has always been true and is still true today. And in the link, SGI said that their large Altix UV1000 servers are not suitable for enterprise workloads.

    In your links you posted, SGI talks about trying to break into the enterprise market with the help of the UV300H server. SGI does not talk about the UV2000 server for breaking into the enterprise market.

    I quote SGI from one of your own link discussing SGI trying to break into enterprise:
    "...So in a [SGI UV2000] system that we, SGI, build, we can build it for High-Performance Computing or data-intensive computing. They are basically the same structure at a baseline..."

    SGI explicitly says that UV2000 is for HPC in one way or the other. I have posted numerous such links and quoted SGI numerous times, often from your own links! How can you say you have never seen any quotes???

    .

    ME: “For instance, here is another, where I quote from your own link:”
    YOU: "Excellent! You’re finally able to accept that these systems can be used for databases and business workloads as the quote indicates that is what SGI is doing. Otherwise I find it rather strange that you’d quote things that run counter to your claims."

    Que? Do you think we are stupid, or who are you trying to fool?

    ARTICLE: "...Obviously, Linux can span an entire UV 2000 system because it does so for HPC workloads, but I am not sure how far a commercial database made for Linux can span....”
    YOU: "Ah! This actually interesting as it is in the context of the maximum number of threads a database can actually use. For example, MS SQL Server prior to 2014 could only scale to a maximum of 80 concurrent threads per database. Thus for previous versions of MS SQL Server, any core count past 80 would simply go to waste due to software limitations. As such, there may be similar limitations in other commercial databases that would be exposed on the UV 2000 that wouldn’t apply else where. Thus the scaling limitation being discussed here is with the database software, not the hardware so you missed the point of this discussion."

    How the h-ck did you draw this weird conclusion? By pure il-logic? The guy in the article says that it is well known that HPC can span the entire UV2000 server but it is not known how far databases span the UV2000 server. And from this talk about the UV2000 server hardware, you conclude he talks about software limitations? Que? What have you been smoking? Nowhere does he talk about limitations on the database, the question is how well UV2000 scales on databases. And that is the big question. You have not missed the point, you have missed everything. As you can tell, I am not a native english speaker, but your english reading comprehension is beyond repair. Did you drop out of college as well? Sixth form? How could you even finish something with such a bad reading comprehension? You must have failed everything in school? How do you think a teacher would grade an essay of yours? Shake their head in disbelief.

    .

    ARTICLE:"...So in a [SGI UV2000] system that we, SGI, build, we can build it for high-performance computing or data-intensive computing. They are basically the same structure at a baseline..."
    YOU:"And you missed the part of the quote indicating ‘or data-intensive computing’ which is a key part of the point being quoted that you’ve missed. Please actually read what you are posting please."

    Que? Seriously? Do you know how to shorten High Performance Computing? By "HPC". SGI explicitly says they build the UV2000 for HPC or Data-Intensive Computing, both are scale-out workloads and runs on clusters. In your quote SGI explicitly says that UV2000 are used for clustered scale-out workloads, i.e. HPC and DIC. So you are smoked.
    http://en.wikipedia.org/wiki/Data-intensive_comput...
    Data-intensive processing requirements normally scale linearly according to the size of the data and are very amenable to straightforward parallelization....Data-intensive computing platforms typically use a parallel computing approach combining multiple processors and disks in large commodity computing CLUSTERS

    .

    ARTICLE: "...IBM and Oracle have become solution stack players and SAP doesn’t have a high-end vendor to compete with those two. That’s where we, SGI, see ourselves getting traction with HPC servers into this enterprise space...."
    YOU: "This would indicate that the UV 2000 and UV 300 are suitable for enterprise workloads which runs counter to your various claims here."

    BEEP! Wrong. No it does not "indicate" that. SGI talks about using UV300 to get into the enterprise market. They only mention UV2000 when talking about HPC or DIC, both clustered scale-out workloads. You quoted that above.

    .

    ARTICLE: "...The goal with NUMAlink 7 is...reducing latency for remote memory. Even with coherent shared memory, it is still NUMA and it still takes a bit more time to access the remote memory..."
    YOU: "Coherency and shared memory are two trademarks of a large scale-up server. Quoting this actually hurts your arguments about the UV 2000 being a cluster as I presume you’ve also read the parts leading up to this quote and the segment you cut. The idea that accessing remote memory adds additional latency is a point I’ve made else where in our discussion and it is one of the reasons why scaling up is nonlinear. Thus I can only conclude that your quoting of this is to support my argument. Thank you!"

    Well, your conclusion is wrong. SGI talks about UV2000 built for HPC or DIC. Not enterprise. So you have missed the whole article, you did not only miss the point. You missed everything. Nowhere does SGI say that UV2000 is for enterprise. Instead UV300H is for enterprise. You are making things up. Or can you quote where SGI says UV2000 is for enterprise, such as SAP or databases?

    .

    ME:“So here you have it again. SGI exclusively talks about getting HPC computation servers into enterprise. Read your link again and you will see that SGI only talks about HPC servers.”
    YOU: "And yet you missed the part where they were talking about those systems being used for enterprise workloads."

    Que? Nowhere do SGI say so. Go ahead and quote the article where SGI say so. I can not decide if you are Trolling or if you are a bit dumb, judging from your interpretation of the above links?

    .

    ME:“In the link they only talk about 32-sockets, they explicitly mention 32-socket SGI UV300H and dont mention UV2000 with 256 sockets. They say that bigger scale-up servers than 32-sockets will come later.”
    YOU: "Which would ultimately means that 32 sockets is not a hard limit for SAP HANA as you’ve claimed. I’m glad you’ve changed your mind on this point and agree with me on it."

    Duh, you missed the point. SAP does say that there are no larger scale-up x86 servers than 32 sockets. SAP does not say that the 256-socket UV2000 is usable for this scenario. So, here again we see that UV2000 is not suitable for Hana, but instead UV300H is mentioned. So, why do they not talk about the 256-socket UV2000, instead of only saying that 32 sockets are the largest scale-up servers? So, you are wrong and have been wrong all the time. SGI and SAP support me. Why? Because it is actually the other way around, I support them. I would never lie (like you do), I only reiterate what I read. If SGI and SAP said UV2000 was good for SAP, I would write that instead, yes it is true. I dont lie. I know it is hard for you to believe (liars believe everyone lies). But some people dont like lies, mathematicians like the truth.

    And Hana is distributed so it scales to a large amount of cpus, no one has denied that.

    .

    "....Except a score in the SAP top 10 does mean that they are competitive as what you originally asked for. The top 10 score was an 8 socket offering, counter to your claims that all the top scores were this 16 sockets or more...."

    No, I did not "originally" ask for top 10. Stop shifting goal posts all the time or abusing the truth, aka lying. I asked for the very top, to compete with high end Unix servers. And I dont see how the best x86 server achieving ~35% of the top Unix server is competing. There is no competition. So, again, show me an x86 server that can compete with the best Unix server in SAP benchmarks. You can not, because there is no x86 server that can tackle the largest SAP workloads.

    "...Also if you really looked, there are 16 socket x86 score from several years ago. At the time of there submission they were rather good but newer systems have displaced them over time...."

    "Rather good"? You are lying so much that you believe yourself. Those worthless 16-socket x86 servers get 54,000 saps at best, which is really bad. And the best SPARC server from the same year, the M9000, gets 197,000 saps, almost 4x more. Try to fool someone else with "x86 can compete at the very top with high end Unix servers".
    download.sap.com/download.epd?context=40E2D9D5E00EEF7C9AF1134809FF8557055EFBE3810C5CE80E06D1AE6A251B04

    Do you know anything about the best 16-socket server, the IBM X3950M2? The IBM server is built from four individual 4-socket units, connected together with a single cable into a 16-socket configuration. This single cable between the four nodes makes scalability awfully bad and makes performance more like that of a cluster. I dont know if anyone ever used it in the largest 16-socket configuration as a scale-up server. I know that several customers used it as a multi-node scale-out server, but I never heard about any scale-up customer. Maybe because IBM X3950 M2 only gets ~3,400 saps per cpu in the 16-socket configuration. Whereas two entries below we have a 4-socket x86 server with the same cpu, and the same GHz, and it gets 4,400 saps per cpu. So the 4-socket version is roughly 30% faster per cpu, just by using fewer cpus. So SAP scaling drops off really fast, especially on x86.
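    The per-socket arithmetic behind that comparison can be checked with a few lines of Python. This is just a sketch using the figures quoted in this comment (54,000 saps on 16 sockets, 4,400 saps for the 4-socket box), and it assumes the 4,400 figure is per socket, as the comparison implies. The helper name is made up for the example.

```python
def saps_per_socket(total_saps, sockets):
    """Average SAPS delivered by each socket in a configuration."""
    return total_saps / sockets


# 16-socket x86 result quoted above: 54,000 saps in total.
x86_16_socket = saps_per_socket(54_000, 16)

# 4-socket x86 box with the same cpu: 4,400 saps, assumed to be per socket.
x86_4_socket = 4_400

# How much of the small box's per-socket throughput survives at 16 sockets.
scaling_efficiency = x86_16_socket / x86_4_socket

# Per-socket advantage of the 4-socket box over the 16-socket one.
small_box_advantage = x86_4_socket / x86_16_socket - 1

print(f"16-socket: {x86_16_socket:.0f} saps per socket")
print(f"scaling efficiency: {scaling_efficiency:.1%}")
print(f"4-socket per-socket advantage: {small_box_advantage:.1%}")
```

    The same helper works for any other entry in the list, as long as total saps and socket count come from the same benchmark result.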

    The SPARC T5440 from the same year has 4 sockets, and gets 26,000 saps. Half the score of the 16-socket IBM X3950M2. But I would never conclude that an 8-socket T5440 would get 52,000 saps, as you would wrongly conclude. I know SAP scaling drops off really fast.

    This only proves my point, there is no way x86 can compete with high end Unix servers, at any given point in time: even if you go to 16-sockets, x86 had nothing to come up with because scaling via cables is too bad. What a silly idea.

    .

    "...So by your definition above, all the large 32 socket systems are then clusters because they don’t offer uniform latency..."

    Correct. If we are going to be strict, yes. NUMA systems, by definition, have different latency to different cpus, and ALL high end Unix servers are NUMA. But they are well designed, with low worst-case latency, so they can in fact run enterprise systems, as can be seen in all SAP and Oracle benchmarks all over the internet.

    "...Rather the defining traits are rather straightforward: a single logical system with shared memory and cache coherency between all cores and sockets. Having equal latency between sockets, while ideal for performance, is not a necessary component of the definition...."

    No, because the defining trait is: what are the usage scenarios for the servers? All high end Unix servers you mentioned are used for enterprise usage. Just look at the sap benchmarks, databases, etc. Whereas no one uses the SGI UV2000 for enterprise usage. You can google all you want, but no one has ever used SGI's large HPC servers for enterprise usage. It does not matter how much SGI calls it a carrot, it is still not a carrot. Microsoft calls Windows an Enterprise OS, but still no stock exchange in the world uses Windows. They all run Linux or Unix.

    Instead of reading the marketing material; ask yourself what are the servers used for in production? Do you believe MS marketing too?

    .

    "...Again, define branch heavy in this context. I’ve asked for this before without answer. I believe you mean something else entirely..."

    I mean exactly what SGI explained in my other link. But you know nothing about programming, so I understand you have difficulties with this concept. But this is really basic to a programmer. I am not going to teach you to program, though.

    .

    ME;“It is not goal shifting. All the time we have talked about large scale-up servers for enterprise usage, and if I forget to explicitly use all those words in every sentence, you call it "goal shifting".”
    YOU; Since that is pretty much the definition of goal shifting, thank you for admitting to it. In other news, you still have not explained the contradiction in your previous statements.

    No, it is not the "definition" of goal shifting. Normal people, as they discuss something at length, do not forget what they talked about. If I ask 5 times in every post for one single link where SGI UV2000 replaced a scale-up server on enterprise workloads, and once just write "can you show us a single link where a SGI UV2000 replaced a scale-up server" - it is desperate to show a link where UV2000 replaces scale-out workloads. That is pure goal shifting from your side, and at the same time you accuse me of goal shifting? It reeks of desperation. Or plain dumbness. Or both.

    You have showed us a link on something I did not ask of (why do you duck my questions all the time?), can you show us a link where a SGI UV2000 replaced a scale-up server, on enterprise business workloads?

    .

    "...Apparently it isn’t much of a NDA if it is part of a press release. Go ask the actual customer HP quoted as they’re already indicating that they are using a Superdome X system...Regardless of socket count, it is replacing a Unix system as HPUX is being phased out. If you doubt it, I’d just email the gentlemen that HP quoted and ask."

    Are you stupid? It is you that are stating something dubious (or false). You prove your claim.

    .

    "You don’t actually demonstrate that locking was used for OCC or MVCC. Rather you’ve argued that since concurrency is maintained, it has to have locking even though you didn’t demonstrate where the locking is used in these techniques. Of course since they functionally replace locking for concurrency control, you won’t find it."

    Are you obtuse? I quoted that they do lock. Didnt you read my quotes? Read them again. Or did you not understand the comp sci lingo?

    .

    "...Also skip the personal attacks shown where these techniques are used in enterprise production databases..."

    How about you skip the FUD? You write stuff all the time that can not be proven. That is false information.
    "...FUD is generally a strategic attempt to influence perception by disseminating...false information..."

    .

    "...Oh I have before, the US Post Office has a UV 2000 for database work. Of course you then move the goal posts to where SAP HANA was no longer a real database...."

    No, this is wrong. USPS uses the UV2000 for analytics, not database work. The ram database is used for "fraud detection" as quoted from your link. A real database is used to store persistent data on disks, not in RAM.

    .

    ME: "And if I dont specify "business enterprise workloads" in every sentence, you immediately jumps on that and shift to talking about HPC calculations or whatever I did not specify.”
    YOU: "Again, that is pretty much the definition of shifting the goals post and I thank you again for admitting to it."

    Que? I have asked you to post links to where a UV2000 replaced a scale-up server on "business enterprise workloads" many times, and once I did not type it out, because I forgot and also you know what we are discussing. And as soon as I make a typo, you jump on that typo instead. Instead of showing a link on what I have asked for probably 30 times now, you have ducked all those requests, and instead you show us a link on something I have never once asked about, the one time I made a typo? That is the very definition of the goal shifting you are doing.
    http://en.wikipedia.org/wiki/Moving_the_goalposts
    "...The term is often used ...by arbitrarily making additional demands just as the initial ones are about to be met...."

    Anyway, we all know that you do move the goal posts, we have seen it all the time. And kudos to you for posting a very fine link where a scale-out UV2000 cluster replaced another server on scale-out computations. I dont really know why you posted such a link, as I have never asked about it earlier. But never mind. Can we go back to my original question again, without you trying to duck the question for the 31st time, or posting something irrelevant? Here it comes, again:
    -If you claim that the SGI UV2000 is a scale-up server, then you can surely show us several links where UV2000 replaces scale-up servers on scale-up business enterprise workloads? SGI has explicitly said they have tried to get into the enterprise market for many years now, so there surely must exist several customers who replaced high end Unix servers with UV2000, on enterprise business workloads, right?

    Or, are you going to ask me to prove this as well, as you did with Superdome X too?

    .

    "...Actually what I pointed out as a key attribute of those large scale up machines: a single large memory space. That is why the institute purchased the M9000 as well the UV 2000. If they just wanted an HPC system, they’d get a cluster which they did separately alongside each of these units. In other words, they bought *both* a scale up and a scale out system at the same time. In 2009 the scale up server selected was a M9000 and in 2013 their scale up server was a UV 2000. It fits your initial request for UV 2000 replacing a large scale up Unix machine..."

    No, this is incorrect again. Back in time, the SPARC M9000 had the world record in floating point calculations, so it was the fastest in the world. And let me tell you a secret; a mathematical institute does not run large scale business enterprise workloads needing the largest server money could buy, it runs HPC mathematical calculations. I know, I come from such a mathematical institute and have programmed a Cray supercomputer with MPI libraries and also tried OpenMP libraries - both are used exclusively for HPC computations.

    Why would a mathematical institute run... say, a large SAP configuration? Large SAP installations can cost more than $100 million, where would a poor research institute get all the money from, and why would a math institute do all that business? This has never occurred to you, right? But, let me tell you, it is a secret, so dont tell anyone else. Not many people know that math institutes do not run large businesses. I understand you are confused and you did not know this. Have you ever set foot in a comp sci or math faculty? You dont know much about math, that is for sure, and you claim you have "studied logic" at a math institute? Yeah right.

    .

    How about you stop FUDing? The very definition of FUD:
    "...FUD is generally a strategic attempt to influence perception by disseminating...false information..."

    Do you consider this FUD? Can you answer me? Can you stop ducking my questions?
    "SPARC M6-32 is much faster than SGI UV2000 on HPC calculations. I am not going to show you benchmarks nor links. You have to trust me".

    How is this different from:
    "SGI UV2000 is a scale-up server and can replace high end Unix servers on enterprise business workloads such as SAP. I am not going to show you benchmarks on enterprise business workloads nor links. You have to trust me."

    It is very easy to silence me and make me change my mind. Just show me some benchmarks, for instance, SAP or database benchmarks. Or show links to customers that have replaced modern high end Unix servers with SGI UV2000 on business enterprise workloads. Do that, and I will stop believing UV2000 is not suitable for scale-up workloads.
  • Brutalizer - Monday, June 1, 2015 - link

    Yes! I got an email reply from a SGI "SAP PreSales Alliance Manager" from Germany, about using UV2000 for the enterprise market. This is our email exchange:

    ME:
    >>Hello, I am a consultant examining the possibilities to run SAP and
    >>Oracle databases using the largest 256-socket SGI UV2000.
    >>Our aim is to replacing expensive Unix servers, in favour of cheaper
    >>UV2000 doing enterprise business work. 256-socket beats 32-socket Unix servers anyday.
    >>Do you have any more information on this? I need to study this
    >>more, on how to replace Unix servers with UV2000 on enterprise
    >>workloads before we can reach a conclusion. Any links/documentation would
    >>be appreciated.

    SGI:
    >>Hello Mr YYY,
    >>
    >>I'm happy to discuss with you regarding your request.
    >>We have an successor of the above mentioned UV2000. It's the SGI UV300H for
    >>HANA. But it is also certified for any other SAP workload.
    >>Would be good to discuss in more detail what is your need.
    >>
    >>When are you available for a call?
    >>
    >>Thanks you for reaching out to SGI
    >>XXX

    ME:
    >>Dear XXX
    >>
    >>There are no concrete plans on migrating off Unix for the client, but
    >>I am just brain storming. I myself, am interested in the SGI UV2000 as
    >>a possible replacement for enterprise workloads. So I take this as an
    >>opportunity to learn more on the subject. I reckon the UV300H only has
    >>16-sockets? Surely UV2000 must beat 16-sockets in terms of performance
    >>as it is a 256-socket SMP server! Do you have any documentation on this?
    >>Customer success stories? Use cases? I want to study more about
    >>possibilities to use large SGI servers to replace Unix servers for
    >>enterprise workloads before I say something to my manager. I need
    >>to be able to talk about advantages/disadvantages before talking to
    >>anyone, know the numbers. Please show me some links or documentation
    >>on UV2000 for enterprise business workloads, and I will read through them all.

    SGI:
    >Hi YYY
    >the UV300H has a maximum of 32 Intel Xeon E7 sockets starting with 4 sockets
    >and scales up to 24TB main memory. It's an SMP system with an All-to-All topology.
    >It scales in increments of 4 sockets and 3TB.
    >It is certified for SAP and SAP HANA.
    >http://www.sgi.com/solutions/sap_hana/
    >
    >The UV2000 is based on Intel Xeon E5 which scales up to 256 sockets
    >and 64TB main memory within an Hypercube or Fat-Tree topology. It
    >starts with 4 sockets and scales in 2 socket increments.
    >It is certified for SAP but not for SAP HANA.
    >http://www.sgi.com/products/servers/uv/uv_2000_20....
    >
    >Both can run SUSE SLES and Red Hat RHEL.
    >
    >In respect to use cases and customer success story for the
    >UV2000 in the enterprise market (I assume I mean SAP and Oracle)
    >we have only limited stuff here. Because we target the UV2000 only
    >for the HPC market, eg. Life Science.
    >Look for USPS on http://www.sgi.com/company_info/customers/
    >
    >For a more generic view on our Big Data & Data Analytics business
    >see https://www.sgi.com/solutions/data_analytics/

    ME:
    >So you target UV2000 for the HPC market, but it should not
    >matter, right? It should be able to replace Unix servers on
    >the enterprise market, because it is so much more powerful
    >and faster. And besides, Linux is better than Unix. So I
    >would like to investigate more how to use UV2000 for the
    >enterprise market. How can I continue with this study? Dont
    >you have any customer success stories at all? You must have.
    >Can you show me some links? Or, can you forward my email to some
    >senior person that has been with SGI for a long time?

    And that is where we finished for today. I will keep you updated. I also
    looked at the USPS link he talks about mentioning "eg. Life Science", i.e.
    the US Postal Service that you FUDer KevinG mentioned. On SGI web site
    it says:
    "We use the SGI platform to take the USPS beyond what they were able to achieve
    with traditional approaches." and there is a video. If you look at the video
    it says at the top in the video window:
    "Learn How SGI and FedCentric Deliver Real-Time Analytics for USPS"
    Nowhere do they mention databases, SGI UV2000 is just used as an analytics tool.
    Which is in line with the SGI links about how UV2000 is great for HPC and DIC,
    High Performance Computing and Data Intensive Computing.

    So you are wrong again. The US Postal Service is using UV2000 for real time analytics,
    not as a database storing persistent data. I don't know how many times I need
    to repeat this.
  • Kevin G - Tuesday, June 2, 2015 - link

    @Brutalizer
    "Well, you say that because x86 benchmarks scales well going from 8-sockets to 16 sockets, you expect x86 to scale well for 32-sockets too, on SAP. Does this not mean you expect x86 scales close to linear?"
    That is not a quote where I make such claims. I asked for a quote and you have not provided one.

    "Hmmm.... actually, this is really uneducated. Are you trolling or do you really not know the difference? All these clustered benchmarks are designed for clustered scale-out servers. For instance, LINPACK is typically run on supercomputers, big scale-out servers with 100.000s of cpus."
    Or perhaps you could actually read what I indicated. As nicely parallel benchmarks, they can indeed be used to determine scaling, as they give a nice *upper bound* on performance gains for business workloads. If there is a performance drop on a single node instance of LINPACK as socket count increases, I'd expect to see a similar or larger drop when running business workloads.

    "BTW, have you heard about P-complete problems? Or NC-complete problems? Do you know something about parallel computations? You are not going to answer this question as well, right?"
    Sure I'll skip that question as it has no direct relevancy and I'd rather like to ignore indirect tangents.

    "Where are the benchmarks on x86 servers going from 8-sockets up to 16-sockets, you have used to conclude about x86 scalability? I have asked you about these benchmarks. Can you post them and backup your claims and prove you speak true or is this also more of your lies, i.e. FUD?"
    As mentioned below, there were indeed older 16 socket x86 servers in the SAP benchmark database despite your claims here that they don't exist. This would be yet another contradiction you've presented. There are also various SPEC scores among others.

    "I showed you several links from SGI, where they talk about trying to going into scale-up enterprise market, coming from the HPC market. Nowhere do SGI say they have a scale-up server."
    Except that SGI does indicate that the UV 2000 is a large SMP machine, a scale up system. 'SGI UV 2000 scales up to 256 CPU sockets and 64TB of shared memory as a single system.' from the following link: https://www.sgi.com/products/servers/uv/
    I've also previously posted other links where SGI makes this claim, which you continually ignore.

    "SGI always talk about their HPC servers, trying to break into the enterprise market. You have seen several such links, you have even posted such links yourself. If SGI had good scale-up servers that easily bested Unix high end servers, SGI would not talk abou their HPC servers."
    SGI would still talk about their HPC servers as they offer other systems dedicated to HPC workloads. SGI sells more than just the scale up UV 2000 you know.

    "Instead SGI talk about their UV300H 16-socket server trying to get a piece of the enterprise market. Why does not SGI use their UV2000 server if UV2000 is a scale-up server?"
    The UV300H goes up to 32 sockets, 480 cores, 960 threads and 48 TB of memory. For the vast majority of scale up workloads, that is more than enough in a single coherent system. The reason SGI is focusing on the UV300 for the enterprise is its lower and more consistent latency, which provides better scaling up to 32 sockets.

    "Que? That is not what I asked! Are you trying to shift goal posts again? I quote myself again in my first post, nowhere do I ask about top 10 results:
    "So, if we talk about serious business workloads, x86 will not do, because they stop at 8 sockets. Just check the SAP benchmark top - they are all more than 16 sockets, Ie Unix servers. X86 are for low end and can never compete with Unix such as SPARC, POWER etc. scalability is the big problem and x86 has never got passed 8 sockets. Check SAP benchmark list yourselves, all x86 are 8 sockets, there are no larger."
    Except I've pointed out x86 results in the top 10, other Unix systems with 8 sockets in the top 10 and previous 16 socket x86 results. So essentially your initial claims here are incorrect.

    "What ramblings. I asked you about why high end Unix market has not died an instant, if x86 can replace them [...] to which you replied something like "it is because of vendor lockin companies continue to buy expensive Unix servers instead of cheap x86 servers". And then I explained you are wrong because Unix code is portable which makes it is easy to recompile among Linux, FreeBSD, Solaris, AIX,..., - just look at SAP, Oracle, etc they are all available under multiple Unixes, including Linux. To this you replied some incomprehensible ramblings? And you claim you have studied logic? I ask one question, you duck it (where are all links) or answer to another question which I did not ask. Again, can you explain why Unix high end market has not been replaced by x86 servers? It is not about RAS, and it is not about vendor lockin. So, why do companies pay $millions for one paltry 32-socket Unix server, when they can get a cheap 256-socket SGI server?"
    You forgot to include the bit where I explicitly indicate that my statements were in the context of companies developing their own software and instead you bring up 3rd party applications.
    Custom developed applications are just one of several reasons why the Unix market continues to endure. Superior RAS is indeed another reason to continue to buy Unix hardware. Vendor lock-in due to specific features in a particular flavor of Unix is another. They all contribute to today's continued, but shrinking, need for Unix servers. System architects should choose the best tool for the job and there is a clear minority of cases where that means Unix.

    "Que? I accepted your rejection of my initial analysis where I compared 3GHz cpu vs 3.7GHz of the same cpu model on the same server. And I made another analysis where I compared 3.7GHz vs 3.7GHz on the same server and showed that performance dropped with 40-sockets compared to 32-sockets, on a cpu per cpu basis. Explain how I was "missing the points that you yourself were trying to make"?"
    How could you determine scaling from 32 to 40 sockets when the two 3.7 GHz systems were 16 and 32 socket? Sure, you could determine scaling from 16 to 32 socket as long as you put a nice asterisk noting the DB difference in the configuration. However, you continue to indicate that you determined 32 to 40 socket scaling, which I have only seen presented in a flawed manner. You are welcome to try again.

    "No, but it is sheer stupidity. Can you explain again what makes you believe that? Or are you going to duck that question again? Or shift goal posts?"
    The goal post here was a citation on this point and this again is not a citation.

    "What point have I missed? You claim that UV300 is just a 16-socket version of the UV2000."
    Citation please where I make that explicit claim.

    "Analytics is not scale-up, it is scale-out. I have explained this in some detail and posted links from e.g. SAP and links talking about in memory databases which are exclusively used for analytics. Do you really believe anyone stores persistent data in RAM? No, RAM based databases are only used for analytics, as explained by SAP, etc."
    Amazing that you just flat out ignore Oracle, IBM and Microsoft having in-memory features for their OLTP databases. In-memory databases are used for more than just analytics despite your 'exclusive' claim.

    "You have not showed a scale-up usage of the UV2000 server, for instance, running SAP or Oracle databases for storing data. Can you post such a link? Any link at all?"
    Oracle TimesTen is indeed a database and that is what the US Post Office uses on their UV 2000. I believe that I've posted this before. As far as data storage goes, half a billion records are written to that database daily as part of the post office's fraud detection system. So instead of accepting this and admitting that you're wrong, you now have to move the goal posts by claiming that analytics and in-memory databases somehow don't apply.

    "Why is that do you think? If you claim SGI's large servers are faster and cheaper and can replace high end Unix 32-socket servers - why have no one ever done that? Dont they want to save $millions? Dont they want much higher performance? Why?"
    Cost is something that works against the UV 2000, though not hardware cost. Enterprise software licensing is done on a per core and/or a per socket basis. Thus going all the way up to 256 sockets would incur an astronomical licensing fee. For example, Oracle 12c Enterprise edition would have a base starting price of ~$48.6 million USD to span all the cores in a 256 socket, 2048 core UV 2000 before any additional options. A 40 socket, 640 core Fujitsu M10-4S would have a base starting price of ~$15.2 million USD for Oracle before additional options are added. Any saving in hardware costs would be eaten up by the software licensing fees.
    http://www.oracle.com/us/corporate/pricing/technol...
    http://www.oracle.com/us/corporate/contracts/proce...
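    The arithmetic behind these two figures can be reproduced with a small sketch. The per-processor list price of $47,500 and the core factor of 0.5 for both CPU families are assumptions chosen because they match the totals quoted above; they are not taken from the truncated links, and `base_license_cost` is a hypothetical helper name.

```python
# Assumed inputs (not confirmed by the linked documents):
LIST_PRICE_PER_PROCESSOR_USD = 47_500  # assumed Enterprise Edition list price
CORE_FACTOR = 0.5                      # assumed core factor for both systems


def base_license_cost(cores, price=LIST_PRICE_PER_PROCESSOR_USD,
                      core_factor=CORE_FACTOR):
    """Base license cost: cores * core factor * price per processor license."""
    return cores * core_factor * price


# Core counts are the ones given in the comment above.
uv2000_cost = base_license_cost(2048)  # 256 sockets, 2048 cores
m10_4s_cost = base_license_cost(640)   # 40 sockets, 640 cores

print(f"UV 2000: ${uv2000_cost / 1e6:.2f} million")
print(f"M10-4S:  ${m10_4s_cost / 1e6:.2f} million")
```

    Under those assumptions the totals come out to roughly $48.6 million and $15.2 million, so the gap is driven entirely by core count.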

    "Well, no one use SGI UV2000 for enterprise business workloads. US Post Office are using it for fraud detection, that is analysis in memory database. Not storing data. You store data on disks, not in memory."
    And here you go again slandering in-memory databases as not being real databases or usable for actual business workloads. I'd also say that fraud detection is a business workload for the US Post Office.

    "Que? That SGI link explains the main difference between HPC workloads and enterprise business workloads. It was valid back then and it is valid today: the link says that HPC workloads runs in a tight for loop, crunching data, there is not much data passed between the cpus. And Enterprise code branches all over the place, so there is much communication among the cpus making it hard for scale-out servers. This is something that has always been true and true today. And in the link, SGI said that their large Altix UV1000 server are not suitable for enterprise workloads."
    http://www.realworldtech.com/sgi-interview/6/ is the link you keep posting about this and it should be pointed out that it predates the launch of the UV 1000 by a full six years.
    And again you go on about code branching affecting socket scaling. Again, could you actually define this in the context you are using it?

    "In your links you posted, SGI talks about trying to break into the enterprise market with the help of the UV300H server. SGI does not talk about the UV2000 server for breaking into the enterprise market."
    I'll quote this from the following link: 'The SGI deal with SAP also highlights the fact that the system maker is continuing to expand its software partnerships with the key players in the enterprise software space as it seeks to push systems like its “UltraViolet” UV 2000s beyond their more traditional supercomputing customer base. SGI has been a Microsoft partner for several years, peddling the combination of Windows Server and SQL Server on the UV 2000 systems, and is also an Oracle partner. It is noteworthy that Oracle’s eponymous database was at the heart of the $16.7 million fraud detection and mail sorting system at the United States Postal Service.' from http://www.enterprisetech.com/2014/01/14/sgi-paint...
    It is clear that SGI did want the UV 2000 to enter the enterprise market and with success at the US Post Office, I'd say they have a toe hold into it.

    "Que? Do you think we are stupid or who are trying to fool?"
    I'm quoting this only because it amuses me. 'We' now?

    "The guy in the article says that it is well known that HPC can span the entire UV2000 server but it is not known how far databases span the UV2000 server. And from this talk about the UV2000 server hardware, you conclude he talks about software limitations?"
    Or I actually read the article and understood the context of those statements you were attempting to quote mine. Software does have limitations in how far it can scale up and that was what was being discussed. If you disagree, please widen that quote and cite specifically where they would be discussing hardware in this specific context.

    "Que? Seriously? Do you know how to shorten High Performance Computing? By "HPC". SGI explicitly says they build the UV2000 for HPC or Data-Intensive Computing, both are scale-out workloads and runs on clusters. In your quote SGI explicitly says that UV2000 are used for clustered scale-out workloads, i.e. HPC and DIC. So you are smoked."
    Quote mining Wikipedia now? The reason for using the UV2000 for DIC workloads is that with 64 TB of memory, many workloads can be run in-memory on a single node. Thus the storage IO bottleneck and the networking overhead of a cluster are removed if you can do everything in-memory on a single system. If the data size is less than the 64 TB mark, then a cluster would be unnecessary. Concurrency can be handled in an efficient manner with all the data residing in memory.

    "BEEP! Wrong. No it does not "indicate" that. SGI talks about using UV300 to get into the enterprise market. They only mention UV2000 when talking about HPC or DIC, both clustered scale-out workloads. You quoted that above."
    There are plenty of enterprise workloads that fall into the DIC category, like very large OLTP databases or analytics. Then again you are arbitrarily narrowing what defines enterprise workloads and what defines scale up so at some point nothing will be left.

    "Well, your conclusion is wrong. SGI talks about UV2000 built for HPC or DIC. Not enterprise. So you have missed the whole article, you did not only miss the point. You missed everything. Nowhere do SGI say that UV2000 is for enterprise. Isntead UV300H is for enterprise. You are making things up. Or can you quote where SGI says UV2000 is for enterprise, such as SAP or databases?"
    Except you are missing the massive overlap between DIC and enterprise workloads. "Second, selling UV big memory machines to large enterprises for in-memory and other kinds of processing will no doubt give SGI higher margins than trying to close the next big 10,000-lot server deal at one of the hyperscale datacenter operators." Note that at the time of that article the UV 300 had yet to be announced. http://www.enterprisetech.com/2014/03/12/sgi-revea...

    "Duh, you missed the point. SAP does say that there are no larger scale-up x86 servers than 32-sockets."
    And I was asking you to demonstrate that 32 sockets was a hard limit for SAP HANA, which you have not done. Instead you have posted a hardware answer to a question about software scaling. If HP were to release a 64 socket SuperDome X, would HANA run on it? The article indicates that yes, it would.

    ""Rather good"? You are lying so much that you believe yourself. Those worthless 16-socket x86 servers gets 54.000 saps as best which is really bad. And the best SPARC server from the same year, the M9000 gets 197.000 saps, almost 4x more. Try to fool someone else with "x86 can compete at the very top with high end Unix servers"."
    It is rather good considering that those older x86 systems only had 16 sockets vs. the 64 sockets of the Fujitsu system you cite. That situation mirrors where we are now, with an 8 socket x86 system doing a third of the work with one fifth the number of sockets of the record holder.
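    The per-socket comparison can be sketched using only the figures quoted in this exchange: 54,000 SAPS on 16 sockets for the older x86 systems and 197,000 SAPS on 64 sockets for the M9000. `saps_per_socket` is a hypothetical helper, and per-socket SAPS is only a rough measure, since it ignores core counts, clock speeds and database differences between submissions.

```python
from fractions import Fraction


def saps_per_socket(saps, sockets):
    """Throughput divided by socket count (a crude normalization)."""
    return saps / sockets


# Older generation, figures as quoted in the thread.
old_x86 = saps_per_socket(54_000, 16)
m9000 = saps_per_socket(197_000, 64)
print(f"16-socket x86: {old_x86:.0f} SAPS per socket")
print(f"64-socket M9000: {m9000:.0f} SAPS per socket")

# Current generation: "a third of the work with one fifth the sockets"
# implies this per-socket ratio relative to the record holder.
relative_per_socket = Fraction(1, 3) / Fraction(1, 5)
print(f"8-socket x86 vs record holder, per socket: {float(relative_per_socket):.2f}x")
```

    On these numbers the x86 systems are ahead per socket in both generations, while the larger Unix systems are ahead in total throughput.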

    "Do you know anything about the best 16-socket server, the IBM X3950M2? The IBM server is built from four individual 4-socket units, connected together with a single cable into a 16-socket configuration. This single cable between the four nodes, makes scalability awfully bad and makes performance more resemble a cluster. I dont know if anyone ever used it in the largest 16-socket configuration as a scale-up server."
    So? Using multiple chassis is an IBM feature they also use in various POWER systems (p770 is an example). This alone does not make it a cluster, as cache coherency is maintained and memory is shared. While a minor point, IBM uses six cables to put four chassis together.
    Cache coherency is maintained as you would expect from a scale up server. Certainly scaling to 16 sockets with the older Intel FSB topology would be painful so the drop in performance per socket was expected. Intel has since replaced this topology with QPI for point-to-point links, with the addition of integrated memory controllers for NUMA. For x86 systems in this era, Opterons scaled better as AMD had already implemented a point-to-point and NUMA topology but only went to 8 sockets.

    "Correct. If we are going to be strict, yes. NUMA systems by definition, have different latency to different cpus. and ALL high end Unix servers are NUMA. But they are well designed, with low worst-case latency, so they can in fact run enterprise systems, as can be seen in all SAP and oracle benchmarks all over the internet."
    Then by your own definition, why are you citing 16 and 32 socket 'clusters' to run enterprise workloads? I heard from this guy Brutalizer that you can't run enterprise workloads on clusters. He said it on the internet so it must be true!

    "No, because the defining trait is: what are the usage scenarios for the servers? All high end Unix servers you mentioned, are used for enterprise usage."
    That is the thing, you are ignoring the hardware entirely when discussing system topology. Your analysis is flawed because of this. On the x86 side of things, you can load Linux and an OLTP database like Oracle onto a netbook if you wanted. That would not mean that the netbook is production worthy.
    Shared memory and cache coherency is what enables SMP scaling, features that the UV 2000 has all the way to 256 sockets and 64 TB of memory.

    "I mean exactly what SGI explained in my other link. But you know nothing about programming, so I understand you have difficulties with this concept. But this is really basic to a programmer. I am not going to teach you to program, though."
    That is not a definition. Please explain your definition of code branches in the context of increasing socket count. I'll humor a rehash of the definition of branching as it is indeed basic but I don't recall branching being a factor in scaling socket count. So please enlighten me.

    "Are you stupid? It is you that are stating something dubious (or false). You prove your claim."
    OK, so I asked and indeed they have several SuperDome X systems. They all have 240 cores but on select systems they are not all enabled. I didn't follow up on this point but it appears this is to keep software licensing costs down. They also have varying amounts of memory ranging from 1 TB to 4 TB. (And Kent didn't answer himself as he was out of office last week but I pursued the person indicated in his automatic reply who did get me an answer.)

    "Are you obtuse? I quoted that they do lock. Didnt you read my quotes? Read them again. Or did you not understand the comp sci lingo?"
    I read them and I stated exactly what I saw: you did not demonstrate that locking was used for OCC or MVCC. If you feel like reasserting your claim, please do so but this time with actual evidence.

    "No, this is wrong. USP use the UV2000 for analytics, not database work. The ram database is used for "fraud detection" as quoted from your link. A real database is used to store persistent data on disks, not in RAM."
    In fact this is further admission of moving the goal posts on your part as you are now attacking the idea of in-memory databases as not being real databases now. OLTP and OLAP workloads can be performed by an in-memory database perfectly fine.

    "If you claim that the SGI UV2000 is a scale-up server, then you can surely show us several links where UV2000 replaces scale-up servers on scale-up business enterprise workloads? SGI has explictly said they have tried to get into the enterprise market for many years now, so there surely must exist several customers who replaced high end Unix servers with UV2000, on enterprise business workloads, right?"
    And again I will cite the US Post Office's usage of a UV 2000 as fitting the enterprise workload requirement.

    "...Actually what I pointed out as a key attribute of those large scale up machines: a single large memory space. That is why the institute purchased the M9000 as well the UV 2000. If they just wanted an HPC system, they’d get a cluster which they did separately alongside each of these units. In other words, they bought *both* a scale up and a scale out system at the same time. In 2009 the scale up server selected was a M9000 and in 2013 their scale up server was a UV 2000. It fits your initial request for UV 2000 replacing a large scale up Unix machine..."

    "No, this is incorrect again. Back in time, the SPARC M9000 had the world record in floating point calculations, so it was the fastest in the world. And let me tell you a secret; a mathematical institute do not run large scale business enterprise workloads needing the largest server money could buy, they run HPC mathematical calculations."
    Again, you are ignoring the reason why they purchased both the M9000 and UV 2000: get a large memory space that only a large scale up server can provide. And again you also ignore that for HPC workloads, they also bought a cluster alongside these systems.

    "Why would a mathematical institute run... say, a large SAP configuration? Large SAP installations can cost more than $100 million, where would a poor research institute get all the money from, and why would a math institute do all that business? This has never occured to you, right? But, let me tell, it is a secret, so dont tell anyone else. Not many people knows math institutes do not run large businesses. I understand you are confused and you did not know this. Have you ever set your foot on a comp sci or math faculty? You dont know much about math that is sure, and you claim you have "studied logic" at a math institute? Yeah right."
    I'd actually say the reason why they wouldn't run SAP is due to the software licensing costs so open source solutions with no direct licensing fees are heavily favored. They do deal with big data problems similar to what businesses are trying to solve right now.
    And my logic course was at an ordinary university, I made no claim that it was a math institute.
  • Brutalizer - Thursday, June 4, 2015 - link



    Ok, I understand what you are doing. I have some really relevant questions (for instance, if you claim that UV2000 can be used for scale-up business workloads, where are the links?) and you either don't answer them or answer a totally different question. This makes the relevant questions disappear in the wall of text, which makes it easy for you to avoid the questions that really pin-point the issue. So in this post I have just three pin-point questions for you, which means you can not duck them any longer.

    .

    Q1) Would you consider this as spreading dubious or false information? If yes, how is this different from your claim that SGI UV2000 can handle scale-up business workloads better than high end Unix servers (without any links)?
    "SPARC M6-32 server is much faster than UV2000 in HPC calculations. I have no benchmarks to show, nor links to show. I emailed Oracle and they confirmed this. Which means this is a proven fact, and no speculation"

    .

    Q2) Show us a link on a x86 server that achieves close to a million saps. You claim that x86 can tackle larger workloads (i.e. scales better) than Unix servers, so it should be easy for you to show links with x86 getting higher saps than Unix. Or is this claim also FUD?

    .

    Q3) Show us an example of UV2000 used for large enterprise scale-up workloads, such as SAP or Oracle database (not data warehouse analytics, but a real database storing persistent data, which means it has to maintain data integrity, which is extremely difficult in HPC clusters)

    .

    These are the three questions I asked numerous times earlier, and you have always ducked them in one way or another. Are you going to duck them again? Doesn't it make your view point a bit hollow?

    .

    .

    Regarding UV2000 used by US Postal Service. SGI says in several links UV2000 is only used for real time analytics. "Learn How SGI and FedCentric Deliver Real-Time Analytics for USPS":
    https://www.youtube.com/watch?v=tiLrkqatr2A&li...

    This means that UV2000 compares each mail with a RAM database to see if this mail is fraudulent. In addition to the TimesTen memory database, there is also an Oracle 10G data warehouse as a backend, for "long term storage" of data.
    http://www.datacenterknowledge.com/archives/2013/0...
    "[TimesTen] coupled with a transactional [Oracle] database it will perform billions of mail piece scans in a standard 15 hour processing day. Real-time scanning performs complex algorithms, powered by a SGI Altix 4700 system. Oracle Data Warehouse keeps fraud results stored in 1.6 terabytes of TimesTen cache, in order to compare and then push back into the Oracle warehouse for long term storage and analysis."

    So here you have it again, USPS only does analytics on the TimesTen. TimesTen does not store persistent data (which means TimesTen doesn't have to maintain data integrity).
    https://news.ycombinator.com/item?id=8175726
    "....I'm not saying that Oracle hardware or software is the solution, but "scaling-out" is incredibly difficult in transaction processing. I worked at a mid-size tech company with what I imagine was a fairly typical workload, and we spent a ton of money on database hardware because it would have been either incredibly complicated or slow to maintain data integrity across multiple machines...."

    .

    Also, there are numerous articles where SGI says they try to get into the Enterprise market, but they can not do that with UV2000. The SGI representative mailed me just now:

    >find case studies on
    >http://www.sgi.com/company_info/resources/case_stu...
    >customer testimonials on
    >http://www.sgi.com/company_info/customers/testimon...
    >customer success stories on YouTube
    >https://www.youtube.com/playlist?list=PLT0g4VdghLM...
    >
    >SGI UV 2000
    >http://www.sgi.com/products/servers/uv/uv_2000_20....
    >Ist his sufficient? We don't have more information which is open for the public.
    >
    >the only "enterprise" customer use case I have already sent to you. The USPS case which runs greatly on Oracle.
    >It was never our target to use UV2000 for enterprise. And for SAP HANA it runs very bad.
    >I will ask our Chief Engineer if he can help you.

    All these 46 "case studies" and "testimonials" on the SGI website are exclusively talking about HPC scenarios. Not a single use case on the SGI website is about Enterprise workloads. If you claim that UV2000 is good for Enterprise workloads, there should be at least a few customers doing enterprise workloads, right? But no one is. Why?

    .

    Apparently you have not really understood this thing about NUMA servers being clusters, as you say "Then by your own definition, why are you citing 16 and 32 socket 'clusters' to run enterprise workloads? I heard from this guy Brutalizer that you can't run enterprise workloads on clusters."

    So let me teach you as you have shown very little knowledge about parallel computations. All large servers are NUMA, which means some nodes far away have bad latency, i.e. they are some sort of a cluster. They are not uniform memory SMP servers. All large NUMA servers have differing latency. If you keep a server small, say 32-sockets, then worst case latency is not too bad, which makes them suitable for scale-up workloads. Here we see that every SPARC M6 cpu is always connected to every other in a 32-socket "all-to-all topology". In the worst case, there is one hop to reach far away nodes:
    http://regmedia.co.uk/2013/08/28/oracle_sparc_m6_b...
    "The SPARC M6-32 machine is a NUMA nest of [smaller] SMP servers. And to get around the obvious delays from hopping, Oracle has overprovisioned the Bixby switches so they have lots of bandwidth."

    Here we see the largest IBM POWER8 server, it has 16-sockets and all are always connected to each other. See how many steps in worst case in comparison to the 32-socket SPARC server:
    http://2eof2j3oc7is20vt9q3g7tlo5xe.wpengine.netdna...

    However, if you go into the 100s of sockets realm (UV2000) then you can not use the design principles that smaller 32-socket servers do, where all cpus are always connected to each other. Instead, the cpus in a cluster are NOT always connected to each other. For 256 sockets you would need roughly 32,640 data channels (256 * 255 / 2), which is not practical. Instead you typically use lots of switches in a Fat Tree configuration, just like SGI does in UV2000:
    clusterdesign.org/fat-trees/fat_tree_varying_ports/
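    The channel count for a fully connected design follows from the handshake formula n * (n - 1) / 2. A minimal sketch, assuming one direct link per pair of sockets (`full_mesh_links` is a hypothetical helper name):

```python
def full_mesh_links(sockets):
    """Number of direct point-to-point links when every socket pairs with every other."""
    return sockets * (sockets - 1) // 2


# Socket counts discussed in this thread.
for sockets in (16, 32, 256):
    print(f"{sockets:>3} sockets -> {full_mesh_links(sockets):>6} direct links")
```

    The link count grows quadratically with socket count, which is why a direct all-to-all layout that is feasible at 16 or 32 sockets gives way to switched topologies at 256.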

    As soon as a cpu needs to communicate with another, all the involved switches create and destroy connections. For the best case the latency is good, but worst case latency is much worse because of all the switches. A switch does not have enough connections for all cpus to be connected to each other all the time (if it did, you would not need switches).

    As soon as you leave this all-to-all topology with low number of sockets, and go into heavily switched architecture that all large 100s of sockets HPC clusters use, worst case latency suffers and code that branches heavily will be severely penalized (just as SGI explains in that link I posted). This is why a switched architecture can not handle scale-up workloads, because worst case latency is too bad.

    Also, the SGI UV2000 limits the bandwidth of the NUMAlink6 to 6.7 GB/sec - which does not cut it for scale-up business workloads. The scale-up Oracle M6-32 has many Terabytes of bandwidth, because SPARC M6 is an all-to-all topology, not a switched slow cluster.
    http://www.enterprisetech.com/2013/09/22/oracle-li...

    .

    I have said this many times in one way or another, to no avail. You just don't get it. But here is a two month old link, where the CTO at SGI says the same thing: that UV2000 is not suitable for enterprise workloads. I've posted numerous links from SGI where they say that UV2000 is not suitable for enterprise workloads. How many more SGI links do you want?
    http://www.theplatform.net/2015/03/05/balancing-sc...

    "...to better address the needs of these commercial customers, SGI had to back off on the scalability of the top-end UV 2000 systems, which implement...NUMA, and create [UV300H] that looks a bit more like a classic symmetric multiprocessing (SMP) of days gone by.

    NUMA...assumes that processors have their own main memory assigned to them and linked directly to their sockets and that a high-speed interconnect of some kind glues multiple processor/memory complexes together into a single system with variable latencies between local and remote memory. This is true of all NUMA systems, including an entry two-socket server all the way up to a machine like the SGI UV 2000...

    In the case of a very extendable system like the UV 2000, the NUMA memory latencies fall into low, medium, and high bands, Eng Lim Goh, chief technology officer at SGI, explains...and that means customers have to be very aware of data placement in the distributed memory of the system if they hope to get good performance (i.e. UV2000 is not a true uniform SMP server no matter what SGI marketing says).

    “With the UV 300, we changed to an all-to-all topology,” explains Goh. “This was based on usability feedback from commercial customers because unlike HPC customers, they do not want to spend too much time worrying about where data is coming from. With the UV 300, all groups of processors of four talking to any other groups of processors of four in the system will have the same latency because all of the groups are fully connected.” (i.e. it looks like a normal Unix server where all cpus are connected, no switches involved)

    SGI does not publicly divulge what the memory latencies are in the systems, but what Goh can say is that the memory access in between the nodes in the UV 300 is now uniform, unlike that in the UV 2000, but the latencies are a bit lower than the lowest, most local memory in the UV 2000 machine.

    “It is not that the UV 300 is better than the UV 2000,” says Goh. “It is just that we are trading off scalability in the UV 2000 for the usability in the UV 300. When you do all-to-all topology in a UV 300, you give up something. That is why the UV 300 will max out at 32 sockets – no more. If the UV 300 has to go beyond 32 sockets – for instance, if SAP HANA makes us go there – we will have to end up with NUMA again because we don’t have enough ports on the processors to talk to all the sockets at the same time.”

    .

    (BTW, while it is true that Oracle charges more the more cpus your server has, the normal procedure is to limit the Oracle database to run on only a few cpus via virtualization to keep the cost down. Also, when benchmarking there are no such price limitations; vendors just want to grab the top spot and go to great effort to do that. But there are no top benchmarks from UV2000, they have not even tried)
  • Kevin G - Friday, June 5, 2015 - link

    @Brutalizer
    “Ok, I understand what you are doing. I have some really relevant questions […] and you either dont answer them or answer a totally another question. This makes the relevant questions disappear in the wall of text, which makes it easy for you to avoid these questions that really pin-point the issue. So in this post I have just three pin-point questions for you, which means you can not duck them any longer.”

    Sure I can. :)
    Your projection is strong as you cut out a lot of the discussion to avoid my points and answering my questions so I would consider cropping your request out of my reply. It would simply be fair play. Though I will give these a shot as I have asked several questions you have also dodged in return.

    QA) “That x86 servers will not do? SAP and business software is very hard to scale, as SGI explained to you, as the code branches too much.” I’ve asked for clarification on this several times before without a direct answer. What is the definition of code branching in this context and why does it impact scalability?

    QB) You have asserted that OCC and MVCC techniques use locking to maintain concurrency when they were actually designed to be alternatives to locking for that same purpose. Please demonstrate that OCC and MVCC do indeed use locking as you claim.
    http://en.wikipedia.org/wiki/Optimistic_concurrenc...
    http://en.wikipedia.org/wiki/Multiversion_concurre...

    QC) Why does the Unix market continue to exist today? Why is the Unix system market shrinking? You indicated that it is not because of exclusive features/software, vendor lock-in, cost of porting custom software or RAS support in hardware/software. Performance is being rivaled by x86 systems as they’re generally faster (per SAP) and cheaper than Unix systems of similar socket count in business workloads.

    “Q1) Would you consider this as spreading dubious or false information? If yes, how is this different from your claim that SGI UV2000 can handle scale-up business workloads better than high end Unix servers (without any links)? "SPARC M6-32 server is much faster than UV2000 in HPC calculations. I have no benchmarks to show, nor links to show. I emailed Oracle and they confirmed this. Which means this is a proven fact, and no speculation"”
    Three points about this. First is the difference in claims. It is an *ability* of the UV 2000 to run scale up business workloads. The metric for this is rather binary: can it do so? Yes or no? There is no other point of comparison or metric to determine that. You’ve indicated via your SGI contact above that the UV 2000 is certified for Oracle and SAP (though not SAP HANA). I’ve posted links about SGI attempting to get it into the enterprise market as well as the USPS example. Thus the UV 2000 appears to have the ability to run enterprise workloads. We have both provided evidence to support this claim.
    Secondly about the M6-32 hypothetical, this is a direct comparison. Performance here can be measured and claims can be easily falsified based upon that measurement. Data not provided can be looked up from independent sources to challenge the assertion directly. And lastly, yes, your hypothetical would fall under the pretense of dubious information as no evidence has been provided.

    “Q2) Show us a link on a x86 server that achieves close to a million saps. You claim that x86 can tackle larger workloads (i.e. scales better) than Unix servers, so it should be easy for you to show links with x86 getting higher saps than Unix. Or is this claim also FUD?”
    First off, there is actually no SAP benchmark result of a million or relatively close (+/- 10%) and thus the request cannot be fulfilled by any platform. The x86 platform has a score in the top 10 out of 792 results posted as of today. This placement means that it is faster than 783 other submissions, including some (but not all) modern Unix systems (submissions less than 3 years old) with a subset of those having more than 8 sockets. These statements can be verified as fact by sorting by SAP score after going to http://global.sap.com/solutions/benchmark/sd2tier....
    The top x86 score for reference is http://download.sap.com/download.epd?context=40E2D...

    “Q3) Show us an example of UV2000 used for large enterprise scale-up workloads, such as SAP or Oracle database (not dataware house analytics, but a real database storing persistent data which means they have to maintain data integrity which is extremely difficult in HPC clusters)”
    You’ve shifted the goal posts so many times that I think it is now worth documenting every time they have moved to put this question into its proper context. Your initial claim ( http://anandtech.com/comments/9193/the-xeon-e78800... ) was “...no one use SGI servers for business software, such as SAP or databases which run code that branches heavily.” To which the USPS example was a counter. You then assert ( http://anandtech.com/comments/9193/the-xeon-e78800... ) that since TimesTen is an in memory DB, it cannot be used as a real database, because a real database has to write to disk. Oddly, you have since provided a quote below that runs contrary to this claim. When we get to http://anandtech.com/comments/9193/the-xeon-e78800... the excuse about TimesTen is that it is only for analytics and thus not real database workloads (even though again the quote you provide below would satisfy the requirements you present in this post). When we get to http://anandtech.com/comments/9193/the-xeon-e78800... you re-assert the claim: “In Memory Databases often don't even have locking of rows, as I showed in links. That means they are not meant for normal database use. It is stupid to claim that a "database" that has no locking, can replace a real database.” So at this point a real database has to write to disk, can’t be used for analytics and has to use locking for concurrency. When we get to http://anandtech.com/comments/9193/the-xeon-e78800... a few posts later, apparently fraud detection is not an acceptable business case for the USPS: “No, this is wrong. USP use the UV2000 for analytics, not database work. The ram database is used for "fraud detection" as quoted from your link. A real database is used to store persistent data on disks, not in RAM.”

    So yes, I answered this question before you shifted the goal posts four different times. At this point, to fit your shifted criteria, all the advantages of running a large database on the UV 2000 have been removed. The main feature of the UV 2000 isn’t the large socket count but the massive cache coherent shared memory capacity. 64 TB is large enough to hold the working set for many business applications, even in this era of big data. If that strength cannot be utilized, then your only other reason to get a UV 2000 would be a computationally bound, scale up problem for enterprise work (but can’t be analytics either!). You have successfully moved the goal posts to the point that there is no answer. To use an analogy, it is like trying to find a car that can go 300 km/hr and get 3 L/100 km fuel efficiency but also has no wheels.

    “This means that UV2000 compare each mail with a RAM database to see if this mail is fraudulent. In addition to the TimesTen memory database, there is also an Oracle 10G datawarehouse as a backend, for "long term storage" of data.
    http://www.datacenterknowledge.com/archives/2013/0...
    "[TimesTen] coupled with a transactional [Oracle] database it will perform billions of mail piece scans in a standard 15 hour processing day. Real-time scanning performs complex algorithms, powered by a SGI Altix 4700 system. Oracle Data Warehouse keeps fraud results stored in 1.6 terabytes of TimesTen cache, in order to compare and then push back into the Oracle warehouse for long term storage and analysis."”
    Excellent! Now you finally understand that the TimesTen database is what data is actually written to directly. The initial scan goes directly to the TimesTen database for the comparison with all other recent scans. This is the core function of their fraud detection system. The other key point in the quote you cite is transactional. This contradicts your earlier statements (http://anandtech.com/comments/9193/the-xeon-e78800... ): “… A real database acts at the back end layer. Period. In fact, IMDB often acts as cache to a real database, similar to a Data WareHouse. I would not be surprised if US Postal Service use TimesTen as a cache to a real Oracle DB on disk. You must store the real data on disk somewhere, or get the data from disk….”
    Also things have changed a bit since that article was published. USPS has upgraded to a UV 1000 and then to a UV 2000 so it is two generations behind. The main thing that has changed is the amount of memory in the systems. At the time of the Itanium based Altix 4700, the data warehouse was 10 TB in size ( http://www.oracle.com/technetwork/products/timeste... ). I could not find any reference to the current size of the data warehouse or the rate that it increases, but their current UV 2000 with 32 TB of memory would at least be able to host their entire data warehouse from 2010 in-memory three times over with room to spare.

    “So here you have it again, USPS does only do analytics on the TimesTen. TimesTen do not store persistent data (which means TimesTen doesn't have to maintain data integrity).”
    TimesTen does indeed support concurrency and it even uses traditional locking to do it. https://docs.oracle.com/cd/E13085_01/timesten.1121...

    “Apparently you have not really understood this thing about NUMA servers are clusters as you say "Then by your own definition, why are you citing 16 and 32 socket 'clusters' to run enterprise workloads? I heard from this guy Brutalizer that you can't run enterprise workloads on clusters."”
    Oh, I have understood. I was just pointing out your contradictions here. The irony is delicious.

    “So let me teach you as you have shown very little knowledge about parallel computations. All large servers are NUMA, which means some nodes far away have bad latency, i.e. some sort of a cluster. They are not uniform memory SMP servers. All large NUMA servers have differing latency. If you keep a server small, say 32-sockets, then worst case latency is not too bad which makes them suitable for scale-up workloads.”
    The differing point between a cluster and a scale up server you appear to be arguing for is simply the latency itself. While having lower latency is ideal for performance, latency can be permitted to increase until cache coherency is lost. The UV 2000 supports cache coherency up to 256 sockets and 64 TB of memory.
    A cluster, on the other hand, does not directly share memory: each node is fully independent and linked via a discrete networking layer. This adds complexity for programming, as coherency has to be added via software because there is no hardware support for it over a network. The network itself is a far higher latency connection between nodes than what CPUs use directly in hardware (socket to socket latencies are measured in nanoseconds whereas Ethernet is measured in microseconds), and it adds time to handle a software layer that doesn’t exist between CPUs in a scale up system.

    “Here we see that each SPARC M6 cpu are always connected to each other in 32-socket fashion "all-to-all topology". In worst case, there is one hop to reach far away nodes:”
    That graphic is actually a mix of one, two and three hop connections. Going from socket 8 to 15 is a single hop as those sockets are directly connected with nothing in between them. Socket 8 to 1 takes two hops as a Bixby interconnect chip is between them. Socket 8 to 23 requires three hops: first to a Bixby chip (BX0, BX2, BX4, BX6, BX8 or BX10), then to another socket (either 16, 17, 18, or 19) and then finally to socket 23. For a 32 socket topology, this is actually pretty good, just not what you are claiming.

    “Here we see the largest IBM POWER8 server, it has 16-sockets and all are always connected to each other. See how many steps in worst case in comparison to the 32-socket SPARC server:”
    The picture you cite is a bit confusing as it follows the cable path (IBM puts two links through a single external cable). This page contains a better diagram for the raw logical topology:
    http://www.enterprisetech.com/2014/07/28/ibm-forgi...
    It is rather clear that the topology is a mix of single and two hop paths: each socket is connected directly to every other socket in the same drawer and to one socket in every external drawer. Thus the worst case is two hops, as a transfer may need to move both within a drawer and then externally.
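    A minimal sketch of why that wiring gives a two-hop worst case. It assumes 4 drawers of 4 sockets each, and that the one cross-drawer link from each socket goes to the same-position socket in every other drawer; both the drawer layout and the pairing rule are assumptions for illustration, not taken from IBM's documentation. A breadth-first search then measures the longest shortest path:

```python
from collections import deque
from itertools import product

DRAWERS = 4             # assumed drawer count (16 sockets total)
SOCKETS_PER_DRAWER = 4  # assumed sockets per drawer


def build_links():
    """Return adjacency sets for the assumed drawer topology."""
    nodes = list(product(range(DRAWERS), range(SOCKETS_PER_DRAWER)))
    links = {node: set() for node in nodes}
    for (d1, s1), (d2, s2) in product(nodes, repeat=2):
        if (d1, s1) == (d2, s2):
            continue
        same_drawer = d1 == d2
        same_position = s1 == s2  # assumed cross-drawer pairing rule
        if same_drawer or same_position:
            links[(d1, s1)].add((d2, s2))
    return links


def worst_case_hops(links):
    """Longest shortest path between any two sockets (graph diameter)."""
    worst = 0
    for start in links:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in links[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        worst = max(worst, max(dist.values()))
    return worst


links = build_links()
print(worst_case_hops(links))  # 2 under these assumptions
```

    Under these assumptions each socket needs only 6 links (3 inside the drawer, 3 outside) instead of the 15 a true all-to-all layout of 16 sockets would require.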

    “However, if you go into the 100s of sockets realm (UV2000) then you can not use design principles like smaller 32-socket servers do, where all cpus are always connected to each other. Instead, the cpus in a cluster are NOT always connected to each other. For 256 sockets you would need 35.000 data channels, that is not possible. Instead you typically use lots of switches in a Fat Tree configuration, just like SGI does in UV2000”
    First off, your examples so far don’t actually show all the sockets connected directly to each other.
    Secondly, your math is off: connecting 256 sockets in a full mesh topology does not require 35,000 links. The correct answer is 32,640 (256 × 255 / 2). Still a lot, but an odd mistake for someone who claims to have a master's in mathematics.
    Lastly, the UV 2000 can be configured into a hypercube as your own SGI contact indicated. Worst case scenario is 5 hops on a 256 socket system. Those extra hops do add latency but not enough to break cache coherency between all the sockets.
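    The link count above is just the number of unordered socket pairs, n(n-1)/2. A quick sketch to check it (the function name is made up for illustration):

```python
def full_mesh_links(sockets: int) -> int:
    """Point-to-point links needed to connect every socket pair directly."""
    return sockets * (sockets - 1) // 2


# Print the socket count and the full-mesh link count for a few sizes.
for n in (16, 32, 256):
    print(n, full_mesh_links(n))
```

    For 256 sockets this gives 32,640 links, and every socket would need 255 ports. The count grows quadratically, which is why systems of that size use switches instead of direct links.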

    “As soon as a cpu needs to communicate to another, all the involved switches creates and destroys connections. For best case the latency is good, worst case latency is much worse though, because of all switches. A switch does not have enough of connections for all cpus to be connected to each other all the time (then you would not need switches).”
    Correct, and this is what happens with both the Bixby and NUMALink6 interconnects. You’ve provided this link before and it indicates that Bixby is a switch. “The Bixby coherence-switch chips hold L3 cache directories for all of processors in a given system, and a processor doing a memory request has to use the CLs to find the proper processor SMP group in the system, and then the processor socket in the SMP group that has the memory it needs.” http://www.theregister.co.uk/2013/08/28/oracle_spa...

    “As soon as you leave this all-to-all topology with low number of sockets, and go into heavily switched architecture that all large 100s of sockets HPC clusters use, worst case latency suffers and code that branches heavily will be severely penalized (just as SGI explains in that link I posted). This is why a switched architecture can not handle scale-up workloads, because worst case latency is too bad.”
    Cache coherent switching between processor sockets does indeed add latency, which in turn lowers performance. Both NUMALink6 and Bixby handle this well and are capable of maintaining coherency even with additional switching tiers adding latency.

    “Also, the SGI UV2000 limits the bandwidth of the NUMAlink6 to 6.7 GB/sec - which does not cut it for scale-up business workloads. The scale-up Oracle M6-32 has many Terabytes of bandwidth, because SPARC M6 is an all-to-all topology, not a switched slow cluster.”
    That is 6.7 GB per link on the UV 2000. That 3 TB/s of bandwidth being quoted in the article is the aggregate of the links not an individual link. The Bixby chips are switches and provide 12 GB per link according to http://www.theregister.co.uk/2013/08/28/oracle_spa...

    “I said this many times in one way or another, to no avail. You just dont get it. But here is two month old link, where the CTO at SGI says the same thing; that UV2000 is not suitable for enterprise workloads. Ive posted numerous links from SGI where they say that UV2000 is not suitable for enterprise workloads. How many more SGI links do you want?”
    And this has been covered before. The reason to scale back the socket count from the UV 2000 to the UV 300 is to provide a more uniform latency. The main reasons to pick the UV 2000 over the UV 300 are if you genuinely need the extra performance provided by the extra sockets and/or more than the 48 TB of memory which the UV 300 can be configured with. That is a very narrow gap for the UV 2000 in the enterprise today, though before the UV 300 launch, the UV 2000 filled that same role. And to answer your question, I wouldn’t mind a few more links, as they tend to run counter to the claims you’re trying to make.

    “NUMA...assumes that processors have their own main memory assigned to them and linked directly to their sockets and that a high-speed interconnect of some kind glues multiple processor/memory complexes together into a single system with variable latencies between local and remote memory. This is true of all NUMA systems, including an entry two-socket server all the way up to a machine like the SGI UV 2000...”
    You should have finished the last sentence in that quote: “…which has as many as 256 sockets and which is by far the most scalable shared memory machine on the market today.” I can see why you removed this part, as it counters several points you are trying to make in our discussion.

    “In the case of a very extendable system like the UV 2000, the NUMA memory latencies fall into low, medium, and high bands, Eng Lim Goh, chief technology officer at SGI, explains...and that means customers have to be very aware of data placement in the distributed memory of the system if they hope to get good performance (i.e. UV2000 is not a true uniform SMP server no matter what SGI marketing says).”
    Data placement is nothing new to NUMA: performing the calculations on the cores closest to where the data resides reduces the number of remote memory accesses. This applies to all NUMA systems. It is more important on the UV 2000 as the worst case scenario is 5 hops at 256 sockets vs. 3 hops for your M6-32 example.

    “With the UV 300, we changed to an all-to-all topology,” explains Goh. “This was based on usability feedback from commercial customers because unlike HPC customers, they do not want to spend too much time worrying about where data is coming from. With the UV 300, all groups of processors of four talking to any other groups of processors of four in the system will have the same latency because all of the groups are fully connected.” (i.e. it looks like a normal Unix server where all cpus are connected, no switches involved)
    If you actually understood that quote, there are still switches involved. Just a single tier of them instead of the 3 tiered design in the 256 socket UV 2000.

    “(BTW, while it is true that Oracle charges more the more cpus your servers have, the normal procedure is to limit the Oracle database to run on only a few cpus via virtualization to keep the cost down. Also, when benchmarking there are no such price limitations, they do their best to only want to grab the top spot and goes to great effort to do that. But there are no top benchmarks from UV2000, they have not even tried)”
    However, Oracle’s fees and similar licensing models for enterprise software are a major reason why UV 2000 installations for such software are quite rare. It is not because the software couldn’t run on the UV 2000 (you did cut the discussion about software scalability, which in fact could be a real limit). Regardless, software licensing costs are a major reason to avoid such a massive system. It would generally be wiser to get a system with a lower socket/core count, close to the number of cores that you can afford to license. This is why Oracle has high hopes for the UV 300 as the ceiling for software licensing fees is far lower than a UV 2000 with similar memory capacity up to 48 TB.
  • Brutalizer - Thursday, June 11, 2015 - link

    I'm on vacation and only have an iPad, so I can not write long posts. I'll be home 15 July. However, I notice you ducked my questions Q1-3) again, by shifting goal posts. For instance, you say that x86 scales better on SAP, and provide no proof of that; instead you duck the question by saying x86 is in the top ten. So where is the proof x86 tackles larger SAP workloads? Nowhere. Does it tackle larger SAP workloads? According to you: yes. Show us proof, or is x86 unable to tackle larger SAP workloads, i.e. FUD as usual?
    .

    You also say that x86 can handle business scale-up workloads better than Unix: your proof of this? No links/benchmarks. Instead you explain to all readers here that: "x86 appears to have the ability to run enterprise workloads". Ergo, x86 handles enterprise workloads better than Unix. Et voila. This is so unacademic I don't know where to start. You convince no one with that explanation. Really stupid, this. Can you explain AGAIN to all of us here?

    Here in London a sign in a store says: "to buy some wares you need to be 25 years". Imagine FUDer KevinG's reply:
    -I am older than 25, I emailed them at HP, and they concur I am 25 years. You have to trust me, I am not showing you any ID card or any other verification of this claim.
    What do you think the policeman would say? Let you go? Is this argument of yours credible? Have you verified you are 25 years? No. No proof. Just FUD.

    And regarding the SGI link, he says that the SGI UV2000 is not suitable for enterprise, so he talks about the UV300H being better. You don't believe the CTO of SGI. Of course, with logic such as yours, you don't know what a credible explanation is. That is the reason you believe you have proved UV2000 handles workloads better (it appears to do so) and that is why you don't believe the SGI CTO when he recommends UV300H instead. Your logic is beyond repair; you don't know what an academic debate is, with logically sound arguments and counter arguments. You have no clue of academic debate. That is clear. You convince neither the police nor a researcher.

    You are welcome to answer questions 1-3) again until I get home.
  • Brutalizer - Thursday, June 11, 2015 - link

    I mean, I'll be home 15 June. Btw, the SGI sales rep has not answered me any more. He asked why I use a weird email address, and got suspicious. "What is this all about??? Why are you asking these questions??"
  • Kevin G - Thursday, June 11, 2015 - link

    @Brutalizer
    "Im on vacation and only have an ipad, so I can not write long posts. ill be home 15 july. "

    So? I'm composing this while waiting at an airport using an iPhone. I look forward to your response on my three questions and how you misinterpreted the socket topologies in the links you provided. You wouldn't want to dodge that part now, would you?

    "However, i notcie you did duck my questions Q1 -3) again, by you shift goal posts. For instance, you say that x86 scales better on sap, and provide no proof of that, instead you duck the question by saying x86 has top ten.So where is the proof x86 tackles larger sap workloads?"

    You seem to be missing the point that a top ten SAP score does indeed indicate that it can handle large workloads. The question Q1 you presented was about capability and differences in the claims. The fact that x86 ranks in SAP's top 10 should be sufficient evidence on this point. The UV 2000 is certified by Oracle to run their database, which would also be evidence that the system is capable of running database workloads. The answer to Q1 was accurate despite your attempts to shift the question again here.

    "You also say that x86 can handle businees scale-up workloads better than unix: your proof of this? No links/benchmarks. Instead you explain to all readers here that: "x86 appears to have the ability to run enterprise workloads". Ergo, x86 handles enterprise workloads better than unix. Et voila. This so un academic I dont know where to start. [...]"

    Your initial request about this, before you shifted the goal posts, was to find a customer that replaced a Unix system with a 16 socket x86 system for business workloads. So I found such an installation at Cerner per your request. This should be proof enough that x86-based systems can replace large Unix systems, because businesses are doing exactly that.

    "And regarding the sgi link, he says that sgi uv2000 is not suitable to enterprise, so he talks about uv300h being better. You dont believe the cto of sgi. Of course, with logic such as yours, you dont know what a credible explanation is. That is the reason you believe you have proved uv2000 handles better workloads (it appears to do so) and that is why you dont believe cto sgi when he recommends uv300h instead. "

    I believe that I stated myself that the UV 300 would be a better system in most situations today as it supports up to 48 TB of memory, 3/4 the amount that the UV 2000 supports. It does so at a lower socket count, which saves money on software licenses, and the socket topology offers better latency for better performance up to 32 sockets. As such you, me, and SGI are in lockstep with the idea that the UV 300 would be the better choice in the majority of use-cases. I also listed the two niche exceptions: the need for more than 48 TB of memory and/or more performance than what the UV 300 offers in 32 sockets.

    The link you presented ( http://www.theplatform.net/2015/03/05/balancing-sc... ) does not actually say that the UV 2000 cannot run business workloads, rather that the UV 300 would be better at them (and the better part we agree on, as well as why the UV 300 would be better). You just can't seem to realize that the UV 2000 is also capable of running the exact same software.

    Now go back a year, to when the UV 300 wasn't on the market. You have interviews like this one, where the historical context has to be the UV 2000, as the UV 300 wasn't on the market yet: http://www.enterprisetech.com/2014/03/12/sgi-revea...
  • Brutalizer - Monday, June 22, 2015 - link

    @Troll KevinG

    I claim that SGI UV2000 is only used as a HPC cluster, and I claim scale-out servers such as UV2000 can not replace large Unix scale-up servers on business enterprise workloads. Here are my arguments:
    1) I have posted several SGI links where SGI CTO and other SGI people, say UV2000 is exclusively for HPC market, and UV300H is for enterprise market.
    2) SGI has ~50 customer UV2000 use cases on their web site and all are about HPC, such as datawarehouse analytics (US Postal Service). Nowhere on the website does SGI talk about scale-up workloads with UV2000.
    3) There are no scale-up benchmarks on the whole wide internet, on large UV2000 servers, no SAP benchmarks, no database benchmarks, no nothing.
    4) There are no articles or anonymous forum posts on the whole wide internet about any company using UV2000 for scale-up workloads.
    5) SGI sales people said UV2000 is only for HPC workloads, in email exchange with me.
    6) SAP says in numerous links that the largest scale-up servers are 32 sockets; SAP never mentions 256 sockets. SAP talks about the UV300H server, and never mentions UV2000. Remember that SAP has close ties to SGI.

    And you claim the opposite, that the SGI UV2000 can in fact replace and outperform large Unix servers on scale-up workloads, and tackle larger workloads. On what grounds? Here they are:
    1) It appears to you that UV2000 is able to actually start up and successfully finish scale-up workloads.
    2) You have examined scale-out clustered HPC workloads on x86 servers from another vendor than SGI.
    3) US Postal Service use an SGI UV2000 for in-memory datawarehouse analytics (they also use a real database on disk for storing data).

    From these three points you have concluded that UV2000 outperforms and tackles larger scale-up workloads than Unix servers.

    .

    Are you for real?? What kind of person am I "discussing" with? You are not serious, you are Trolling! Your Troll arguments are more in the line of a child "did too, no did not, did too, no did not!" - instead of presenting serious technical arguments to discuss. I have met eighth graders more serious than you.

    You say it appears to you that SGI UV2000 can start up and run scale-up workloads, and therefore UV2000 outperforms Unix on scale-up. Well, appearance is not a fact, not hard numbers as benchmarks are; it is just your subjective opinion. Something might appear different to me than to you, depending on who looks at it. You have pulled this opinion out of a hat, and now present this opinion as a fact. This logic is so faulty that I don't know what to say. There is no way any university would graduate you with this FUBAR logic; much of what you conclude is plain wrong or mere opinion.

    I asked you many times to prove and support your claims by showing benchmarks, otherwise people might believe you are FUDing and lying. But you never backup your claims with benchmarks. For instance, I have counted how many times I have asked you this single question:
    -post SAP benchmarks of SGI UV2000 or any other x86 server, beating the best Unix servers.

    I have asked you this single question 41 times. Yes, 41 times. And every time you have ducked it; you have never posted such a benchmark. But still you claim it is true, with no proof. In addition I have asked other questions that you also ducked. I don't think that if I ask you another 41 times, you will suddenly post withheld SAP benchmarks of an x86 server attaining close to a million saps, beating the largest Unix servers. Because there is no such powerful x86 scale-up server. They don't exist, no matter how much you say so. There is no proof of their existence, no links, no benchmarks, no nothing. They are a fairy tale, living only in your imagination.

    The case is clear, I can ask you 10000 times, and you can not prove your false claims about x86 outperforms Unix servers on SAP. Because they are false. In other words, you are FUDing (spreading false information).

    If what you say were true, that x86 tackles larger SAP workloads than Unix (844,000 saps), then you could prove it by posting x86 benchmarks of 900,000 saps or more. That would be within +-10% of a million saps. But you can not post such x86 benchmarks. I can ask you another 41 times, but you will not post any x86 benchmarks beating Unix servers on SAP. They. Don't. Exist.

    You have ducked that question 41 times. Instead you reiterate "x86 is in the top 10, and therefore x86 tackles larger SAP workloads than Unix". Well, the best servers are Unix, and there are only a few large Unix servers on the market (SPARC and POWER). These few Unix servers can only grab so many SAP entries.

    Of course, Fujitsu could SAP benchmark 32 sockets, 31 sockets, 30 sockets, 29 sockets, etc and grab all 10 top spots. And there would be no way for x86 to stop SPARC grabbing all top 10 spots. Unix can grab top 10 at anytime. x86 can not stop Unix doing that because x86 can not compete with the largest Unix servers in terms of scale-up performance.

    Just because x86 has a spot in top 10, does not mean x86 is fast enough to beat the largest Unix servers. It only means that the Unix vendors have chosen not to benchmark all different cpu configurations. It does not mean x86 is faster. Your logic is plain wrong. As usual.

    .

    To round this off, you have asked me several times what it means that "enterprise business systems have code that branches heavily". SGI explained that in a link I showed you; that as scale-up workloads have heavily branching code SGI will not be able to go into the scale-up enterprise market with their large switch based clusters. This SGI link you rejected because it was several years old. And after that, you kept asking me what does it mean that code branches heavily. This is hilarious. You dont understand SGI's explanation, but you reject it - without understanding what SGI said.

    But fear not, I will finally teach you what heavily branching code is. The reason I did not answer earlier is because there are so much to type, an essay. Besides I doubt you will understand, no matter how much I explain as this is quite technical, and you are clearly not technically inclined. You have time and again proved you have very strong opinions of things you dont understand. To a normal person that would be very strange. I would never tell a quantum physicist he is wrong, as I dont know much about quantum physics. But lack of knowledge of quantum physics would not stop you for telling a physicist is wrong, based on your opinions, judging from this discussion. Your lack of comp sci or comp arch knowledge is huge as you have shown many times.

    Anyway, here goes. "What does SGI mean when they say that heavily branching code is a problem for scaling up?". Read and learn.

    .

    For a server to be fast, it needs to have fast RAM, i.e. low latency so it can feed the fast cpus. Fast RAM is very expensive, so typically servers have loads of slow RAM. This slow RAM makes it difficult to feed data to the fast cpus, and most of the time a cpu has to wait for data to process. Intel studies show that x86 server cpus, under maximum load, wait for data more than 50% of the time. Let me repeat this: under full load, an x86 server cpu idles >50% of the time as it waits for data to process.

    So you have also small (~ 10MB or so) but fast caches that can keep a small portion of the data easy accessible to the CPU, so CPU does not need to wait too long time to get data. If the data is small enough to fit in the cpu cache, all is well and the cpu can process the data quickly. Otherwise, the cpu needs to wait for data to arrive from slow RAM.

    C++ game programmers say that typical numbers are 40x slower performance when reaching out to slow RAM. This is on a small PC. I expect the numbers be worse on a large scale-up server with 32 sockets.
    http://norvig.com/21-days.html#answers

    This is also why C++ game programmers avoid the use of virtual functions (because they thrash the cache, and performance degrades heavily). Virtual pointers hop around a lot in memory, so the data can not be consecutive, so the cpu always needs to jump around in slow RAM. Instead, game programmers try to fit all data into small vectors that can fit into the cpu cache, so the cpu can just step around in the vector, without ever leaving the vector. All data is located in the vector. This is an ideal situation. If you have loads of data, you can not fit it all into a small cpu cache.

    Enterprise business workloads serve many users at the same time, accessing and altering databases, etc. HPC servers only serve one single user, a scientist who chooses what HPC number crunching workload should be run for the next 24 hours.

    Say that you serve 153,000 SAP users (the top SAP spot with 844,000 saps serves 153,000 users) at the same time, and all SAP users are doing different things. All these 153,000 SAP users are accessing loads of different data structs, which means all the different user data can not fit into a small cpu cache. Instead, all the user data is strewn everywhere in RAM. The server needs to reach out to slow RAM all the time, as every SAP user does different things with their particular data set. One is doing accounting, someone else is doing sales, one is reaching the database, etc etc.

    This means SAP servers serving large number of users, goes out to slow RAM all the time. The workload of all users will never fit into a small cpu cache. Therefore performance degrades 40x or so. This means if you have a 3.6 GHz cpu, it corresponds to a 90 MHz cpu. Here is a developer "Steve Thomas" talking about this in the comments:
    http://www.enterprisetech.com/2013/09/22/oracle-li...
    "...My benchmarks show me that random access to RAM when you go past the cache size is about 20-30ns on my test system. Maybe faster on a better processor? Sure. Call it 15ns? 15ns = a 66Mhz processor. Remember those?..."

    Steve Thomas says that the performance of a 3GHz cpu deteriorates down to that of a 66MHz cpu when reaching out to slow RAM. Why does it happen? Well, if there is too much user data to fit into the cpu cache, performance will be that of a 66MHz cpu.

    BUT!!! this can also happen if the source code branches too heavily. In step one, the cpu reads byte 0xaafd3 and in the next step, the cpu reads a byte far away, and in the third step the cpu will read another byte very far away. Because the cpu jumps around too much, the source code can not be read in advance into the small cpu cache. Ideally all source code will lie consecutively in the cpu cache (like the data vector I talked about above). In that case, the cpu will read the first byte in the cache and process it, and read the second byte in the cache and process it, etc. There will be no wait, everything is in the cache. If the code branches too heavily, it means the cpu will need to jump around in slow RAM everywhere. This means the cpu cache is not used as intended. In other words, performance degrades 40x. This means a 3.6GHz cpu equals a 90MHz cpu.

    Business enterprise systems have source code that branches heavily. One user does accounting, at the same time another user does some sales stuff, and a third user reaches the database, etc. This means the cpu will serve the accounting user and fill the cpu cache with accounting algorithms, and then the cpu will serve the sales person and will empty the cache and fill it with sales source code, etc. And on and on it goes; the cpu will empty and fill the cache all the time with different source code, once for every user, so the cpu cache is thrashed. And as we have 153,000 users, the cache will be filled and emptied all the time. Not much work will be done in the cache; instead we will work almost exclusively in slow RAM. This makes a fast cpu the equivalent of a 66MHz cpu. Ergo, heavily branching code SCALES BADLY. Just as SGI explained.

    OTOH, HPC workloads are number crunching. Typically, you have a very large grid, X and Y and Z coordinates. And each cpu handles say, 100 grid points each. This means a cpu will solve Navier Stokes incompressible differential equations again and again on the same small grid. Everything will fit into the cache, which will never be emptied. The same equation can be run on every point in the grid. It just repeats itself, so it can be optimized. Therefore the cpu can run at full 3.6GHz speed, because everything is fit into the cpu cache. The cache is never emptied. The cache always contains 100 grid points and the equation, which will be applied over and over again. There is not much communication going on between the cpus.

    OTOH, enterprise business users never repeat anything; they tend to do different things all the time, calling accounting, sales, book keeping, database, etc functionality. So they are using different functions at the same time, so there is a lot of communication between the cpus, so the performance will degrade to that of a 90MHz cpu. Their workflow can not be optimized as it never repeats. 153,000 users will never do the same thing, you can never optimize their workflow.

    Now, the SGI UV2000 has 256 sockets and five(?) layers of many NUMAlink switches. BTW, NUMA means the latency differs between cpus far away and close. This means any NUMA machine is not a true SMP server. So it does not matter how much SGI says UV2000 is a SMP server, it is not SMP because latency differs. SMP server has the same latency to every cpu. So, this is just SGI marketing, as I told you. Google on NUMA and see that latency differs, i.e. any server using NUMA is not SMP. All large Unix servers are NUMA, but they are small and tight - only 32 sockets - so they keep latency low so they can run scale-up workloads. If you keep UV2000 down to a small configuration with few sockets it would be able to run scale-up workloads. But not when you use many cpus because latency grows with the number of switches. If you google, there are actually a 8-socket database benchmark with the UV2000, so there are actually scale-up benchmarks with UV2000, they do exist. SGI have actually benchmarked UV2000 for scale-up, but stopped at 8-sockets. Why not continue benchmarking with 32 sockets, 64 and 256 sockets? Well...

    These five layers of switches add additional latency. And as a 15 ns memory latency is only able to fully feed a 66 MHz cpu, what do you think the latency of five layered switches gives? If every switch layer takes 15 ns to connect, and you have five layers, the worst case latency will be 5 x 15 = 75 ns. This corresponds to a 13 MHz cpu. Yes, a 13 MHz server.
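    For readers who want to check the arithmetic behind these figures, here is a minimal sketch. It only reproduces the conversion used in this post, under the post's own simplifying assumption that the cpu completes one operation per full latency period. The 15 ns per-layer figure is the hypothetical value from the text, not a measured number, and the function name is made up for illustration.

```python
def equivalent_mhz(latency_ns: float) -> float:
    """Clock rate (in MHz) whose cycle time equals the given latency."""
    # A period of 1 ns corresponds to 1 GHz, i.e. 1000 MHz.
    return 1000.0 / latency_ns

single_layer = equivalent_mhz(15)       # one 15 ns wait per operation
five_layers = equivalent_mhz(5 * 15)    # five assumed 15 ns layers = 75 ns
slowdown_40x = 3600 / 40                # a 3.6 GHz cpu slowed down 40x, in MHz

print(f"{single_layer:.1f} MHz, {five_layers:.1f} MHz, {slowdown_40x:.0f} MHz")
# prints: 66.7 MHz, 13.3 MHz, 90 MHz
```

    Note that this treats every single operation as a cache miss, which is the worst case rather than a typical one.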

    If the code branches heavily, we will exclusively jump around in slow RAM all the time and can never work in the fast cpu cache, and the five layers of switches need to create and destroy connections all the time, yielding a 13 MHz UV2000 server. Now that performance is not suitable for any workload, scale-up or scale-out. Heavily branching code is a PITA, as SGI explained. If you have a larger UV2000 cluster with 2048 cpus you need maybe 10 layers of switches or more, so performance for scale-up workloads will degrade even more. Maybe down to 1MHz cpu. You know the Commodore C64? It had a 1MHz cpu.

    (Remember, the UV2000 are used for HPC workloads, that means all the data and source code fits in the cache so the cpu never needs to go out to slow RAM, and can run full speed all the time. So HPC performance will be excellent. If the HPC code is written so to reach slow RAM all the time, performance will degrade heavily down to 13MHz or so. This is why UV2000 are only used for scale-out analytics by USP, and not scale-up workloads)

    In short; if code branches heavily, you need to go out to slow RAM and you get a 66 MHz server. If worst case latency of the five layers are in total 15ns, then the UV2000 is equivalent of 66MHz cpus. If worst case latency is 5 x 15 = 75ns, then you have a 13MHz UV2000 server. SAP performacne will be bad in either case. Maybe this is why SGI does not reveal the real latency numbers of SGI UV2000, because then it would be apparent even to non technical people like you, that UV2000 is very very slow when running code that branch heavily. And that is why you will never find UV2000 replacing large Unix servers on scale-up workloads. Just google a bit, and you will find out that SGI does not reveal the UV2000 latency. And this is why SGI does not try to get into enterprise market with UV2000, but instead use the UV300H which has few sockets and therefore low latency.

    And how much did I need to type to explain what SGI means with "heavy branching code" to you? This much! There is so much you don't know or fail to understand, no matter how much I explain and show links, so this is a huge waste of time. After a couple of years, I would have walked you through a complete B Sc comp sci curriculum. But I don't have the time to school you on this. I would have to write this much on every question you have, because there are such large gaps in your knowledge everywhere. But I will not do that. So I leave you now, to your ignorance. They say that "there are no stupid people, only uninformed". But I don't know in this case. I explain and explain, and show several links to SGI and whatnot, and still you fail to understand despite all the information you receive from SGI and me.

    I hope at least you have a better understanding now, why a large server with many sockets can not run scale-up workloads, Unix or x86 or whatever. Latency will be too slow, as I have explained all the time, so you will end up with a 66 MHz server. Large scale-up server performance are not about cpu performance, but about I/O. It is very difficult to make a good scale-up server, cpus need to be fast, but I/O need to be evenly good in every aspect in the server, and RAM. You seem to believe that as UV2000 has 256 sockets, it must be faster than 32-socket Unix servers in every aspect. Well, you forgot to consider I/O. To run scale-up workloads, you need to keep the number of sockets low, say 32-sockets. And use superior engineering with a all-to-all topology. Switches will never do.

    And one last 42nd question: as you claim that x86 beats largest Unix servers on SAP benchmarks, show us a x86 benchmark. You can't. So you are just some Troll spreading FUD (false information).

    (Lastly, it says on the SGI web page that the UV300H goes only to 24TB RAM, and besides, Xeon can address a maximum of 1.5TB RAM per cpu in 8-socket configurations. It is a built in hard limit in Xeon. I expect performance of the UV300H to deteriorate fast when using more than 6-8TB RAM because of scaling inefficiencies of the x86 architecture. Old mature IBM AIX, with 32-socket servers for decades, had severe problems scaling to 8TB RAM, and needed to be rewritten just a few years ago, Solaris as well recently. Windows and Linux also need to be rewritten if they go into 8TB RAM territory)

    BTW, the SGI sales person have stopped emailing me.
  • Kevin G - Wednesday, June 24, 2015 - link

    @Brutalizer
    I think I’ll start by reposting two questions that you have dodged:

    QB) You have asserted that OCC and MVCC techniques use locking to maintain concurrency when they were actually designed to be alternatives to locking for that same purpose. Please demonstrate that OCC and MVCC do indeed use locking as you claim.

    QC) Why does the Unix market continue to exist today? Why is the Unix system market shrinking? You indicated that it is not because of exclusive features/software, vendor lock-in, cost of porting custom software or RAS support in hardware/software. Performance is being rivaled by x86 systems as they’re generally faster per socket and per core (SAP) and cheaper than Unix systems of similar socket count in business workloads.
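    To make question QB concrete, here is a minimal sketch of the optimistic idea: read without blocking anyone, then validate at commit time and retry on conflict. It is a single-threaded illustration with a scripted interleaving, not how any particular database implements OCC, and all names in it are hypothetical. In a real multi-threaded engine the validate-and-write step itself has to be made atomic somehow (for example with a compare-and-swap), which this sketch does not model.

```python
class VersionedCell:
    """A value plus a version counter that is bumped on every commit."""
    def __init__(self, value):
        self.value = value
        self.version = 0

def begin(cell):
    # Read phase: remember the value and the version it was read at.
    return {"read_version": cell.version, "value": cell.value}

def try_commit(cell, txn, new_value):
    # Validation phase: commit only if nobody committed since our read.
    if cell.version != txn["read_version"]:
        return False              # conflict detected, caller must retry
    cell.value = new_value
    cell.version += 1
    return True

balance = VersionedCell(100)
t1 = begin(balance)
t2 = begin(balance)                                  # both read version 0
ok1 = try_commit(balance, t1, t1["value"] + 10)      # first committer wins
ok2 = try_commit(balance, t2, t2["value"] - 30)      # stale read, rejected
t2 = begin(balance)                                  # retry with a fresh read
ok2_retry = try_commit(balance, t2, t2["value"] - 30)
```

    Neither transaction ever waits for the other here; the loser simply notices the version changed and redoes its work.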

    “I claim that SGI UV2000 is only used as a HPC cluster, and I claim scale-out servers such as UV2000 can not replace large Unix scale-up servers on business enterprise workloads. Here are my arguments:
    1) I have posted several SGI links where SGI CTO and other SGI people, say UV2000 is exclusively for HPC market, and UV300H is for enterprise market.”
    First off, this actually isn’t a technical reason why the UV 2000 couldn’t be used for scale-up workloads.

    Secondly, you have not demonstrated that the UV 2000 is a cluster as you have claimed. There is no networking software stack or a required software stack to run a distributed workload on the system. The programmers on a UV 2000 can see all memory and processor cores available. The unified memory and numerous processor sockets are all linked via hardware in a cache coherent manner.

    And I have posted links where SGI was looking to move the UV 2000 into the enterprise market prior to the UV 300. I’ve also explained why SGI isn’t doing that now with the UV 2000 as the UV 300 has a more uniform latency and nearly the same memory capacity. In most enterprise use-cases, the UV 300 would be more ideal.

    “2) SGI has ~50 customers UV2000 use cases on their web site and all are about HPC, such as datawarehouse analytics (US Postal Service). Nowhere on the website do SGI talk about scale-up workloads with UV2000.”
    Again, this actually isn’t a technical reason why the UV 2000 couldn’t be used for scale-up workloads.

    The UV 2000 is certified to run Oracle’s database software so it would appear that both SGI and Oracle deem it capable. SAP (with the exception of HANA) is certified to run on the UV 2000 too. (You are also the one that presented this information regarding this.) Similarly it is certified to run MS SQL Server. Those are enterprise, scale up databases.

    And the US Postal Service example is a good example of an enterprise workload. Every piece of mail gets scanned in: that is hundreds of millions of pieces of mail every day. These new records are compared to the last several days worth of records to check for fraudulent postage. Postage is time sensitive so extremely old postage gets handled separately as there is likely an issue with delivery (ie flagged as an exception). This logic enables the UV 2000 to do everything in memory due to the system’s massive memory capacity. Data is eventually archived to disk but the working set remains in-memory for performance reasons.

    “3) There are no scale-up benchmarks on the whole wide internet, on large UV2000 servers, no SAP benchmarks, no database benchmarks, no nothing.”
    Repeating, this actually isn’t a technical reason why the UV 2000 could not be used for scale-up workloads.

    “4) There are no articles or anonymous forum posts on the whole wide internet about any company using UV2000 for scale-up workloads.”
    Once again, this actually isn’t a technical reason why the UV 2000 couldn’t be used for scale-up workloads.

    If you were to actually look, there are a couple of forum posts regarding UV 2000’s in the enterprise market. I never considered these forum posts to be exceedingly reliable as there is no means of validating the claims, so I have never presented them. I have found references to PayPal owning a UV 2000 for fraud detection but I couldn’t find specific details on their implementation.

    “5) SGI sales people said UV2000 is only for HPC workloads, in email exchange with me.”
    This is appropriate considering point 1 above. Today the UV 2000 occupies a much smaller niche following the launch of the UV 300. The conversation would have been different a year ago before the UV 300 launched.

    “6) SAP says in numerous links, that the largest scale-up servers are 32 sockets, SAP never mention 256-socket. SAP talks about the UV300H server, and never mention UV2000. Remember that SAP has close ties to SGI.”
    Actually your SGI contact indicated that it was certified for SAP with the exception of HANA. As I’ve pointed out, HANA only gets certified on Xeon E7 based platforms for production. The UV 2000 uses Xeon E5 class chips.

    “And you claim the opposite, that the SGI UV2000 can in fact replace and outperforms large Unix servers on scale-up workloads, and tackle larger workloads. On what grounds? Here they are:”
    Actually the main thing I’m attempting to get you to understand is that the UV 2000 is simply a very big scale up server. My main pool of evidence is SGI documentation on the architecture and how it is similar to other well documented large scale up machines. And to counter this you had to state that the 32 socket version of the SPARC M6-32 is a cluster.

    I’ve also indicated that companies should use the best tool for the job. There still exists a valid niche for Unix based systems, though you have mocked them. In fact, you skipped over the question (QC above) of why the Unix market exists today when you ignore the reasons I’ve previously cited. And yes, there are reasons to select a Unix server over a UV 2000 based upon the task at hand but performance is not one of those reasons.

    “The case is clear, I can ask you 10000 times, and you can not prove your false claims about x86 outperforms Unix servers on SAP. Because they are false. In other words, you are FUDing (spreading false information).”
    Really? What part of the following paragraph I posted previously to answer your original question is actually false:
    First off, there is actually no SAP benchmark result of a million or relatively close (+/- 10%) and thus cannot be fulfilled by any platform. The x86 platform has a score in the top 10 out of 792 results posted as of [June 05]. This placement means that it is faster than 783 other submissions, including some (but not all) modern Unix systems (submissions less than 3 years old) with a subset of those having more than 8 sockets. These statements can be verified as fact by sorting by SAP score after going to http://global.sap.com/solutions/benchmark/sd2tier....
    The top x86 score for reference is http://download.sap.com/download.epd?context=40E2D...

    As for false information, here are a few bits of misinformation you have recently spread:
    *That the aggregate cross sectional bandwidth of all the M6-32 interconnects is comparable to the uplink bandwidth of a single NUMAlink6 chip found in the UV 2000.
    *Modern Unix systems like the SPARC M6-32 and the POWER8 based E880 only need a single hop to go between sockets when traffic clearly needs up to 3 or 2 hops on each system respectively.
    *That a switched fabric for interprocessor communication cannot be used for scaleup workloads despite your example using such a switch for scale up workloads (SPARC M6-32).
    *OLTP cannot be done in-memory as a ‘real’ database needs to store data on disk despite there being commercial in-memory databases optimized for OLTP workloads.
    *That the OCC or MVCC techniques for concurrency use a traditional locking mechanism.
    *That code branching directly affects system scalability when in fact you meant the random access latency across all a system’s unified memory (see below).
    *It takes over 35,000 connections to form a mesh topology across 256 sockets when the correct answer is 32,640.
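
    The 32,640 figure in the last item is the number of point-to-point links in a full mesh, n(n-1)/2 for n sockets. A quick sketch of that calculation (the function name is just for illustration):

```python
def full_mesh_links(sockets: int) -> int:
    """Number of direct links needed so every socket connects to every other."""
    # Each of the n sockets links to the other n - 1; every link is shared
    # by two sockets, so divide by two.
    return sockets * (sockets - 1) // 2

links_256 = full_mesh_links(256)   # the 256 socket case discussed above
links_32 = full_mesh_links(32)     # a 32 socket system, for comparison

print(links_256, links_32)
# prints: 32640 496
```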

    “To round this off, you have asked me several times what it means that "enterprise business systems have code that branches heavily". […]
    Anyway, here goes. "What does SGI mean when they say that heavily branching code is a problem for scaling up?". Read and learn.[…]”
    This is exactly what I thought: you are using the wrong terminology for something that does genuinely exist. You have described the performance impact of random memory access latency across the entire unified memory space. Strictly speaking, branching inside of code isn’t necessary to create a memory access pattern so varied that prefetching and caching are not effective. A linked list (https://en.wikipedia.org/wiki/Linked_list) would be a simple example, as traversing it doesn’t require any actual code branches to perform. It does require jumping around in memory, as elements in the list don’t necessarily have to be neighbors in memory. On a modern processor, the prefetching logic would load the data into cache in an attempt to speed up access before it is formally requested by the running code. The result is that the request is served from cache. And yes, if the data isn’t in the cache and a processor has to wait on the memory access, performance does indeed drop rapidly. In the context of increasing socket count, the latency for remote memory access increases, especially if multiple hops between sockets are required. I’ll reiterate that my issue here is not the actual ideas you’ve presented but rather the terms you were using to describe them.
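
    As a rough illustration of that distinction, the sketch below replays two access orders through a toy LRU cache. The loop is identical in both runs; only the order of the addresses differs, standing in for contiguous array elements versus list nodes scattered through memory. The cache size and line size are arbitrary assumptions chosen for the example, and the model ignores prefetching entirely.

```python
import random
from collections import OrderedDict

LINE_SIZE = 8      # elements per cache line (assumed for the example)
CACHE_LINES = 64   # lines the toy cache can hold (assumed for the example)

def hit_rate(order):
    """Replay the accesses through a tiny LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for index in order:
        line = index // LINE_SIZE
        if line in cache:
            hits += 1
            cache.move_to_end(line)          # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > CACHE_LINES:
                cache.popitem(last=False)    # evict the least recently used line
    return hits / len(order)

N = 1 << 16
sequential = list(range(N))                  # neighbours in memory
scattered = sequential[:]
random.Random(0).shuffle(scattered)          # same elements, scattered order

seq_rate = hit_rate(sequential)
rand_rate = hit_rate(scattered)
```

    With these assumptions the sequential walk hits on seven of every eight accesses, while the scattered walk hits only rarely, even though the code executed is the same.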

    “BUT!!! this can also happen if the source code can branche too heavily. In step one, the cpu reads byte 0xaafd3 and in the next step, the cpu reads byte far away, and in the third step the cpu will read another byte very far away.”
    This is the core problem I’ve had with the terminology you have been using. The code for this does not actually need a branch instruction to do what you are describing. This can be accomplished by several sequential load statements to non-adjacent memory regions. And the opposite can also happen: several nested branches where the code and/or referenced data all reside in the same memory page. This is where your usage of the term code branch leads to confusion.

    “Now, the SGI UV2000 has 256 sockets and five(?) layers of many NUMAlink switches. “
    It is three layers but five hops in the worst case scenario. Remember you need a hop to enter and exit the topology.

    “BTW, NUMA means the latency differs between cpus far away and close. This means any NUMA machine is not a true SMP server. So it does not matter how much SGI says UV2000 is a SMP server, it is not SMP because latency differs. SMP server has the same latency to every cpu. So, this is just SGI marketing, as I told you. Google on NUMA and see that latency differs, i.e. any server using NUMA is not SMP. All large Unix servers are NUMA, but they are small and tight - only 32 sockets - so they keep latency low so they can run scale-up workloads.”
    You are flat out contradicting yourself here. If NUMA suffices to run scale up workloads like a classical SMP design as you claim, then the UV 2000 is a scale up server. Having lower latency and fewer sockets to traverse for remote memory access does indeed help performance, but as long as cache coherency can be maintained, the system can be seen as one logical device.

    “ If you keep UV2000 down to a small configuration with few sockets it would be able to run scale-up workloads.”
    Progress! You have finally admitted that the UV 2000 is a scale up system. Victory is mine!

    “But not when you use many cpus because latency grows with the number of switches. If you google, there are actually a 8-socket database benchmark with the UV2000, so there are actually scale-up benchmarks with UV2000, they do exist. SGI have actually benchmarked UV2000 for scale-up, but stopped at 8-sockets. Why not continue benchmarking with 32 sockets, 64 and 256 sockets? Well...”
    Link please. If you have found them, then why have you been complaining that they don’t exist earlier?

    “These five layers of switches adds additional latency. And as 15ns cpu latency cache is only able to fully feed a 66 MHz cpu, what do you think the latency of five layered switches gives? If every switch layer takes 15 ns to connect, and you have five layers, the worst case latency will be 5 x 15 = 75 ns. This corresponds to a 13 MHz cpu. Yes, 13 MHz server.”
    The idea that additional layers add latency is correct but your example figures here are way off, mainly because you are forgetting to include the actual memory access time. Recent single socket systems are around ~75 ns and up (http://anandtech.com/show/9185/intel-xeon-d-review... ). Thus a single socket system has local memory latency on the same level you are describing from just moving across the interconnect. Xeon E7 and POWER8 will have radically higher local memory access times even in a single socket configuration due to the presence of a memory buffer. Remote memory access latencies are several hundred nanoseconds.

    You are also underestimating the effect of caching and prefetching in modern architectures. Cache hit rates have increased over the past 15 years thanks to improved prefetchers, the splitting of L2 into L2 + L3 for multicore systems and larger last level caches. I highly recommend reading the following paper, as it paints a less dire scenario than you are describing and uses real measured data, unlike the figures your random forum commenter provided: http://www.pandis.net/resources/cidr07hardavellas....

    “In short; if code branches heavily, you need to go out to slow RAM and you get a 66 MHz server. If worst case latency of the five layers are in total 15ns, then the UV2000 is equivalent of 66MHz cpus. If worst case latency is 5 x 15 = 75ns, then you have a 13MHz UV2000 server. SAP performacne will be bad in either case. Maybe this is why SGI does not reveal the real latency numbers of SGI UV2000, because then it would be apparent even to non technical people like you, that UV2000 is very very slow when running code that branch heavily. And that is why you will never find UV2000 replacing large Unix servers on scale-up workloads. Just google a bit, and you will find out that SGI does not reveal the UV2000 latency. And this is why SGI does not try to get into enterprise market with UV2000, but instead use the UV300H which has few sockets and therefore low latency.”
    Apparently you never did the searching as I found a paper that actually measured the UV 2000 latencies rather quickly. Note that this is a 64 socket configuration UV 2000 where the maximum number of hops necessary is four (in, two NUMAlink6 switches, out), not five on a 256 socket model. The interesting thing is that despite the latencies presented, they were still able to achieve good scaling with their software. Also noteworthy is that even without the optimized software, the UV 2000 was faster than the other two tested systems even if the other two systems were using optimized software. Of course, with the NUMA optimized software, the UV 2000 was radically faster. Bonus: the workloads tested included an in-memory database.
    http://www.adms-conf.org/2014/adms14_kissinger.pdf

    For comparison, a 32 socket SPARC M6 needs 150 ns to cross just the Bixby chips. Note that this figure does not include the latency of the actual memory access itself nor the additional latency if a local socket to socket hop is also necessary. Source: (http://www.enterprisetech.com/2014/10/06/ibm-takes... )

    While a rough estimate, it would appear that the worst case latency on a UV 2000 is 2.5 to 3 times higher than the worst case latency on a 32 socket SPARC M6 (~100 ns for socket-to-socket hop, 150 ns across the Bixby interconnect and ~120 ns for the actual memory access). This is acceptable in the context that the UV 2000 has sixteen times as many sockets.
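
    As a rough sanity check on the estimate above, the sum can be sketched in a few lines of Python. The three M6 components are the figures quoted in the paragraph, the 2.5x and 3x multipliers are the same rough estimate rather than measured values, and every name below is made up for illustration.

```python
# Rough worst-case latency estimate, using only the figures quoted above.
# All names are illustrative; none come from a vendor tool or datasheet.

M6_SOCKET_HOP_NS = 100   # ~100 ns local socket-to-socket hop
M6_BIXBY_NS = 150        # 150 ns across the Bixby interconnect
M6_MEMORY_NS = 120       # ~120 ns for the memory access itself

# Worst case on the 32 socket M6: every component is on the path.
m6_worst_ns = M6_SOCKET_HOP_NS + M6_BIXBY_NS + M6_MEMORY_NS

# The UV 2000 worst case is estimated at 2.5x to 3x the M6 figure.
uv2000_low_ns = 2.5 * m6_worst_ns
uv2000_high_ns = 3.0 * m6_worst_ns

print(f"M6-32 worst case:   ~{m6_worst_ns} ns")
print(f"UV 2000 worst case: ~{uv2000_low_ns:.0f} to {uv2000_high_ns:.0f} ns")
```

    This only adds up the quoted numbers; it says nothing about how often the worst case is actually hit, which is what the caching argument earlier is about.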

    (Lastly, it says on the SGI web page that the UV300H goes only to 24TB RAM, and besides, Xeon can maximum address 1.5TB RAM per cpu in 8-socket configurations. It is a built in hard limit in Xeon. I expect performance of UV300H to deteriorate fast when using more than 6-8TB RAM because of scaling inefficiencies of x86 architecture.
    1.5 TB per socket * 32 sockets = 48 TB
    You need 64 GB DIMMs to do it per http://www.theplatform.net/2015/05/01/sgi-awaits-u...

    You are providing no basis for the performance deterioration, just an assertion.

    “Old mature IBM AIX with 32-socket servers for decades, had severe problems scaling to 8TB RAM, and needed to be rewritten just a few years ago, Solaris as well recently. Windows and Linux needs also to be rewritten if they go into 8TB RAM territory)”
    The only recent changes I know of with regards to memory addressing have been operating system support for larger page sizes. It is inefficient to use small page sizes for such large amounts of memory due to the sheer number of pages involved. Linux has already been adapted to the large 1 GB page sizes offered by modern x86 systems.
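
    To put a number on "the sheer number of pages involved", here is a small sketch. The 48 TB total is the figure worked out above; the 4 KiB, 2 MiB and 1 GiB page sizes are the ones I know x86-64 to offer and are my assumption here, not something taken from the linked articles. Binary units are used throughout for simplicity.

```python
# Count how many pages it takes to map a given amount of RAM.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB


def pages_needed(ram_bytes: int, page_bytes: int) -> int:
    """Number of pages of size page_bytes needed to cover ram_bytes, rounded up."""
    return -(-ram_bytes // page_bytes)


ram = 48 * TIB  # 1.5 TB per socket * 32 sockets, treated as binary units

for label, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
    print(f"{label} pages: {pages_needed(ram, size):,}")
```

    Going from the smallest to the largest page size shrinks the page count by a factor of 262,144, which is the whole argument for large pages on machines this size.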
  • Kevin G - Friday, June 26, 2015 - link

    A quick correction:
    "While a rough estimate, it would appear that the worst case latency on a UV 2000 is 2.5 to 3 times higher than the worst case latency on a 32 socket SPARC M6 (~100 ns for socket-to-socket hop, 150 ns across the Bixby interconnect and ~120 ns for the actual memory access). This is acceptable in the context that the UV 2000 has sixteen times as many sockets."

    The UV 2000 has eight times as many sockets as the M6-32. If Oracle were to formally release the 96 socket version of the M6, it'd actually be roughly 2.7 times as many sockets.
  • Kevin G - Monday, August 24, 2015 - link

    Well HP has submitted a result for the 16 socket Super Dome X:
    http://download.sap.com/download.epd?context=40E2D...

    A score of 459,580 is ~77% faster than a similarly configured eight socket Xeon E7 v2 getting 259,680:
    http://download.sap.com/download.epd?context=40E2D...
    Main difference between the systems would be the amount of RAM at 4 TB for the Super Dome X vs. 1 TB for the Fujitsu system.
    Overall, the Super Dome X is pretty much where I'd predicted it would be. Scaling isn't linear but a 77% gain is still good for doubling the socket count. Going to 32 sockets with the E7 v2 should net a score around ~800,000 which would be the 3rd fastest on the chart. All HP would need to do is migrate to the E7 v3 chips (which are socket compatible with the E7 v2) and at 32 sockets they could take the top spot with a score just shy of a million.
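
    For anyone who wants to check the arithmetic, a quick sketch follows. The two scores are the ones linked above; the projection to 32 sockets assumes the next doubling scales by the same factor as the 8 to 16 socket step, which is my assumption and not a published result.

```python
# Scaling check using the two published scores quoted above.
score_16s = 459_580  # 16 socket Super Dome X
score_8s = 259_680   # 8 socket Xeon E7 v2 system

# Speedup obtained by doubling the socket count from 8 to 16.
doubling_factor = score_16s / score_8s
gain_pct = (doubling_factor - 1) * 100

# Assumption: going from 16 to 32 sockets scales by the same factor.
projected_32s = score_16s * doubling_factor

print(f"8 -> 16 sockets: {doubling_factor:.2f}x ({gain_pct:.0f}% gain)")
print(f"Projected 32 socket score: ~{projected_32s:,.0f}")
```

    The projection comes out a little above 800,000, so the ~800,000 estimate is consistent with the 77% figure as long as scaling does not fall off further at 32 sockets.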
  • kgardas - Monday, May 18, 2015 - link

    "The best SAP Tier-2 score for x86 is actually 320880 with an 8 socket Xeon E7-8890 v3. Not bad in comparison as the best score is 6417670 for a 40 socket, 640 core SPARC box. In other words, it takes SPARC 5x the sockets and 4.5x the cores to do 2x the work." -- Kevin G. This is not that fair, you are comparing 2 years old SPARC box with just released Xeon E7v3! Anyway still more than one year old SPARC is able to achieve nearly the same number with 32 sockets! The question here is really how it happens that neither IBM with Power nor Intel with Xeon is able to achieve such high number with whatever resources they throw at it.
    Speaking about Xeon versus Power8: Power8 with 8 sockets gets to 436100 SAPS, while an 8 socket Xeon E7v3 gets just to 320880. Here it looks like Power8 is a really speedy CPU, kudos to IBM!
  • Kevin G - Monday, May 18, 2015 - link

    @kgardas
    I consider it a totally fair comparison as Brutalizer was asking for *any* good x86 score as he is in total denial that the x86 platform can be used for such tasks. So I provided one. That's a top 10 ranking one and my note of the top SPARC benchmark requiring 5x the sockets for 2x the work is accurate.
  • kgardas - Tuesday, May 19, 2015 - link

    @Kevin G: this 5x sockets to perform 2x work is not fair. As I told you, in a system more than one year old it went to 4x sockets.
    Anyway, if you analyse the SPARC64 pipeline, then it's clear that you need twice the number of SPARC64 CPUs to perform the same work as Intel. This is a well known weakness of SPARC implementation(s) unfortunately...
  • Kevin G - Tuesday, May 19, 2015 - link

    @kgardas
    I don't think that is inherent to SPARC. It is just that Sun/Oracle never pursued ultra high single threaded performance like Intel and IBM. Rather they focused on throughput by increasing thread count. With this philosophy, the T series was a surprising success for its targeted workloads.

    The one SPARC core that was interesting never saw the light of day: Rock. The idea of out of order instruction retirement seems like a natural evolution from out of order execution. This design could have been the single threaded performance champion that SPARC needed. Delays and trouble validating it ensured that it never made it past prototype silicon. I see the concept of OoO instruction retirement as a feature Intel or IBM will incorporate if they can get a license or the Sun/Oracle patents expire.
  • kgardas - Tuesday, May 19, 2015 - link

    @Kevin G: Yes, true, SPARC is more of a multi-threaded CPU, but honestly so is POWER these days. Look at the pipelines. SPARC-Tx/Mx -- just two integer execution units. SPARC64-X, 4 integer units and max 4 instructions per cycle. Look at POWER8, *just* two integer execution units! They are talking about 8 or so execution units, but just two are integer. My bet is they are not shared between threads, which means POWER8 is brutally a throughput chip. So the only speed demon in the single-threaded domain remains Intel...
    Rock? I would like to have seen it in reality, but Oracle killed it as the first thing after purchasing Sun. Honestly, it was also very revolutionary, so I'd bet that Sun engineers were kind of unable to handle it, i.e. all of Sun's chips were in-order designs and now they not only came with an OoO chip, but one that revolutionary. So from this point of view the modest OoO design of the S3 core looks like a very conservative engineering approach.
  • Kevin G - Tuesday, May 19, 2015 - link

    @kgardas
    POWER8 can cheat as the load/store and dedicated load units can be assigned simple integer operations to execute. Only complex integer operations (divide etc.) are generally sent to the pure integer units. That's a total of 6 units for simple integer operations.

    The tests here do show that single threaded performance on the POWER8 isn't bad but not up to the same level as Haswell. Once you add SMT though, the POWER8 can pull ahead by a good margin. With the ability to issue 10 instructions and dispatch 8 per cycle, writing good code and finding the proper compiler to utilize everything is a challenge.

    Rock would have been interesting but by the time it would have reached the market it would have been laughable against Nehalem and crushed against POWER7 a year later. They did have test silicon of the design but it was perpetually stuck in validation for years. Adding OoO execution, OoO retirement, and transactional memory (TSX in Intel speak) would have been a nightmare. Though if Sun got it to work and shipped systems on time in 2006, the high end market place would be very different than it is today.
  • kgardas - Thursday, May 21, 2015 - link

    @Kevin G: thanks for the correction about load/store units doing simple integer operations. I agree with your testing that single-threaded POWER8 is not up to the speed of Haswell. In fact my testing shows it's on the same level as POWER7.
    So with POWER8 doing 6 integer ops per cycle, it's more powerful than SPARC64 X, which is doing 4, or than the SPARC S3 core, which is doing just 2. It also explains well the spec rate difference between the M10-4 and the POWER8 machine. Good! Things start to be more clear now...
  • patrickjp93 - Saturday, May 16, 2015 - link

    No, just no. Intel solved the cluster latency problem long ago with Infiniband revisions. 4 nanoseconds to have a 10-removed node tell the head node something or vice versa, and no one builds a hypercube or star topology that's any worse than 10-removed.
  • Brutalizer - Sunday, May 17, 2015 - link

    @patrickjp93,
    If Intel solved the cluster latency problem, then why are not SGI UV2000 and ScaleMP clusters used to run monolithic business software? Question: Why are there no SGI UV2000 top records in SAP?
    Answer: Because they can not run monolithic software that branches too much. That is why there are no good x86 benchmarks.
  • misiu_mp - Tuesday, June 2, 2015 - link

    Just to point out, 10ns at the speed of light in vacuum is 3m, and signalling is slower than that because the fibre medium (glass) limits the speed of light to about 60% of c, and on top of that come electronic latencies. So maybe you can get 10ns latency over 1-1.5m maximum. That's not a large cluster.
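
    The distance figures can be checked with a few lines of Python. The 60% velocity factor is the rough value from the comment above, not a measured property of any particular cable, and the function name is just for illustration. Electronic latencies are not modelled at all.

```python
# How far can a signal travel in a given time budget?
C_M_PER_S = 299_792_458  # speed of light in vacuum, exact by definition


def distance_m(latency_ns: float, velocity_factor: float = 1.0) -> float:
    """One-way distance in metres covered in latency_ns at velocity_factor * c."""
    return C_M_PER_S * velocity_factor * latency_ns * 1e-9


vacuum = distance_m(10)       # 10 ns in vacuum
fibre = distance_m(10, 0.6)   # 10 ns at roughly 60% of c

print(f"10 ns in vacuum: {vacuum:.2f} m")
print(f"10 ns in fibre:  {fibre:.2f} m")
```

    The fibre figure is an upper bound for a one-way trip; any switch or NIC latency eats into the same 10 ns budget, which is where the 1-1.5m estimate comes from.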
  • Kevin G - Monday, May 11, 2015 - link

    I am not uninformed. I would say that you're being willfully ignorant. In fact, you ignored my previous links about this very topic when I could actually find examples for you. ( For the curious outsider: http://www.anandtech.com/comments/7757/quad-ivy-br... )

    So again I will cite the US Post Office using SGI machines to run Oracle Times Ten databases:
    http://www.intelfreepress.com/news/usps-supercompu...

    As for the UV 2000 not being a scale up server, did you not watch the videos I posted? You can see Linux tools in the videos clearly that indicate that it was produced on a 64 socket system. If that is not a scale up server, why are the Linux tools reporting it as such? If the UV 2000 series isn't good for SAP, then why is HANA being tuned to run on it by both SGI and SAP?

    HP's Superdome X shares a strong relationship with the Itanium based Superdome 2 machine: they use the same chipset to scale past 8 sockets. This is because the recent Itaniums and Xeons both use QPI as an interconnect bus. So if the Superdome X is a cluster, then so are HP's Itanium based offerings using that same chipset. Speaking of, that chipset does go up to 64 sockets and there is the potential to go that far (source: http://www.enterprisetech.com/2014/12/02/hps-itani... ). It won't be a decade if HP already has a working chipset that they've shipped in other machines. :)

    Speaking of the Superdome X, it is fast and can outrun the SPARC M10-4S at the 16 socket level by a factor of 2.38. Even with perfect scaling, the SPARC system would need more than 32 sockets to compete. Oh wait, if we go by your claim above that "you double the number of sockets, you will likely gain 20% or so" then the SPARC system would need to scale to 512 sockets to be competitive with the Superdome X. (Source: http://h20195.www2.hp.com/V2/getpdf.aspx/4AA5-6149... )
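
    The 512 socket figure follows from a short calculation, sketched below. The 2.38x gap and the 20% gain per doubling are the numbers quoted in the paragraph above; the helper name is made up, and the model (a fixed gain for every doubling) is simply the quoted claim taken at face value.

```python
import math


def sockets_needed(base_sockets: int, target_speedup: float, gain_per_doubling: float):
    """Return (doublings required, resulting socket count).

    Assumes each doubling of the socket count multiplies throughput by
    (1 + gain_per_doubling), and that sockets only come in powers of two
    times the base configuration.
    """
    doublings = math.log(target_speedup) / math.log(1 + gain_per_doubling)
    return doublings, base_sockets * 2 ** math.ceil(doublings)


# Perfect (linear) scaling: sockets grow in proportion to the gap.
perfect_sockets = 16 * 2.38

# The quoted claim: only about 20% more performance per doubling.
doublings, sockets = sockets_needed(16, 2.38, 0.20)

print(f"Perfect scaling: {perfect_sockets:.1f} sockets")
print(f"20% per doubling: {doublings:.2f} doublings -> {sockets} sockets")
```

    Just under five doublings are needed, so rounding up to a whole doubling from 16 sockets lands on 512.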

    And if you're dead set on an >8 socket SAP benchmark using x86 processors, here is one, though a bit dated:
    https://www.vmware.com/files/pdf/partners/ibm/ibm-...
  • 68k - Tuesday, May 12, 2015 - link

    You know, those >8 socket systems are flying off the shelf faster than anyone can produce them. That is why Intel only got, according to the article, 92-94% of the >4-sockets market... It seems pretty safe to state that >8 socket is an extreme niche market, which is probably why it is hard to find any benchmarks on such systems.

    The price point of really big scale-up servers is also an extreme incentive to think very hard about how one can design software to not require a single system. Some problems absolutely need scale-up and, as you pointed out, there are such x86 systems available and have been for quite some time.

    Anyone know what the ratio between 2-socket and >4-socket servers in terms of market share (number of deployed systems) looks like?
  • 68k - Tuesday, May 12, 2015 - link

    No edit... Pretend that '>' means "greater or equal" in the post above.
  • Arkive - Tuesday, May 12, 2015 - link

    You guys are obviously not idiots, just enormously stubborn. Why don't you take the wealth of time you spend fighting on the internet and do something productive instead?
  • kgardas - Wednesday, May 13, 2015 - link

    Kevin, I'll not argue with you about SGI UV. It's in fact a very nice machine and it looks like it is on par latency-wise with the Sun/Oracle Bixby interconnect. Anyway, what I would like to note is about your Superdome X comparison to SPARC M10-4S. Unfortunately IMHO this is purely a CPU benchmark. It's multi-JVM, so if you use one JVM per processor, pin that JVM to this processor and limit its memory to the (max) size of memory available to the processor, then basically you have a kind of scale-out cluster inside one machine. This is IMHO what they are benchmarking. What it shows is just that the current SPARC64 is really not up to the performance level of the latest Xeon. A pity, but it is a fact. Anyway, my point is, for a memory scalability benchmark you should use something different than a multi-JVM bench. I would vote for stream here; although it's bandwidth oriented, it still provides at least some picture: https://www.cs.virginia.edu/stream/top20/Bandwidth... -- no Superdome there, and HP submitted some Superdome results in the past. Perhaps it's not a memory scalability hero these days?
  • ats - Tuesday, May 12, 2015 - link

    So that must be why SAP did this for SGI: https://www.sgi.com/company_info/newsroom/awards/s...

    You don't normally recognize partners for innovation for your products unless they are filling a need. AKA SGI actually sells their UV300H appliance.

    And SAP HANA like ALL DBs can be used both SSI or clustered.

    And the SAP SD 2-Tier benchmark is not at all monolithic. The whole 2-tier thing should have been a hint. And SAP SD 3-Tier is also not monolithic.

    Scaling for x86 is neither easier nor harder than for any other architecture for large scale coherent systems. If you think it is, it's because you don't know jack. I've designed CPUs/Systems that can scale up to 256 CPUs in a coherent image, fyi. Also it should probably be noted that the large scale systems are pretty much never used as monolithic systems, and almost always used via partitioning or VMs.
  • Brutalizer - Tuesday, May 12, 2015 - link

    Again, Hana is a clustered RAM database. And as I have shown above with the Oracle TimesTen RAM database, these are totally different from a normal database. In Memory DataBases can never replace a normal database, as IMDB are optimized for reading data (analysis), not modifying data.

    Regarding SGI UV300H, it is a 16 socket server, i.e. scale-up server. It is not a huge scale-out cluster. And therefore UV300H might be good for business software, but I don't know the performance of SGI's first(?) scale-up server. Anyway, 16 socket servers are different from SGI UV2000 scale out clusters. And UV2000 can not be used for business software. As evidenced by non existing SAP benchmarks.
  • ats - Wednesday, May 13, 2015 - link

    No, you haven't shown anything. You quote some random whitepaper on the internet like it is gospel and ignore the fact that in memory dbs are used daily as the primary in OLTP, OLAP, BI, etc workloads.

    And you don't understand that a significant number of the IMDBs are actually designed directly for the OLTP market which is precisely the DB workload that is modifying the most data and is the most complex and demanding with regard to locks and updates.

    There is no architectural difference between the UV300 and the UV2k except a slightly faster interconnect. And just an fyi, UV300 is like SGI's 30th scale up server. After all, they've been making scale up servers for longer than Sun/Oracle.
  • questionlp - Monday, May 11, 2015 - link

    HP Superdome X is a 16-socket x86 server that will probably end up replacing the Itanium-based Superdome if HP can scale the S/X to 32 sockets.
  • Brutalizer - Monday, May 11, 2015 - link

    HP will face great difficulties if they try to mod and go beyond 8 sockets on the old Superdome. Heck, even 8 sockets have scaling difficulties on x86.
  • Kevin G - Monday, May 11, 2015 - link

    Except that you can buy a 16 socket Superdome X *today*.

    http://h20195.www2.hp.com/V2/getpdf.aspx/4AA5-6149...

    The interconnect they're using for the Superdome X is from the old Poulson Itaniums that use QPI which can scale to 64 sockets.
  • rbanffy - Wednesday, May 13, 2015 - link

    You talk "serious business workloads". Of course, there are organizations that use technology that does not scale horizontally, where adding more machines to share the workload does not work because the workload was not designed to be shared. For those, there are solutions that offer progressively less performance per dollar for levels of single-box performance that are unattainable on high-end x86 machines, but that is just because those organizations are limited by the technology they chose.

    There is nothing in SAP (except its design) or (non-rel) databases that preclude horizontal scaling. It's just that the software was designed in an age when horizontal scaling was not in fashion (even though VAXes have been doing clustering since I was a young boy) and now it's too late to rebuild it from scratch.
  • mapesdhs - Friday, May 8, 2015 - link

    Good point, I wonder why they've left it at only 2/core for so long...
  • name99 - Friday, May 8, 2015 - link

    It's not easy to ramp up the number of threads. In particular POWER8 uses something I've never seen any other CPU do --- they have a second tier register file (basically an L2 for registers) and the system dynamically moves data between the two register files as appropriate.

    It's also much easier for POWER8 to decode 8 instructions per cycle (and to do the multiple branch prediction per cycle to make that happen). Intel could maybe do that if they reverted to a trace cache, but the target codes for this type of CPU are characterized by very large I-footprints and not much tight looping, so trace caches, loop caches, micro-op caches are not that much help. Intel might have to do something like a dual-ported I-cache, and running two fetch streams into two independent sets of 4-wide decoders.
  • xdrol - Saturday, May 9, 2015 - link

    Another register file is just a drop in the ocean. The real problem is the increasing L1/2/.. cache pressure, which can only be mitigated by increasing cache size, which in turn will make your cache access slower, even when you use only one of the SMT threads.

    Also, you need to have enough unused execution capacity (pipeline ports) for another hardware thread to be useful; the 2 threads in Haswell can already saturate the 7 execution ports with quite high probability, so the extra thread can only run at the expense of the other, and due to the cache effects, it's probably faster to just get the 2 tasks executed sequentially (within the same thread). This question could be revisited if the processor had 14 execution ports, 2x issue, 2x cache, 2x everything, so it could have 4T/1C, but then it's not really different from 2 normal size cores with 4T..
  • iAPX - Friday, May 8, 2015 - link

    It's because this is the same architecture (mainly) that is used on desktop, laptops, and now even mobility!

    With this market share, I won't be surprised that Intel decided to create a new architecture (x86-64 based) for future server chips, much more specialized, dropping AVX for cloud servers, having 4+ threads per core with simpler decoder and a lot of integer and load/store units!

    That might be complemented by a socketable Xeon Phi for floating-point compute intensive tasks and workstations, but that's unclear even though Intel announced it long ago! ;)
  • DanNeely - Friday, May 8, 2015 - link

    Intel's 94% market share is still only ~184k systems. That's tiny compared to the mainstream x86 market; and doesn't give a lot of (budgetary) room to make radical changes to CPU vs just scaling shared designs to a huger layout.
  • theeldest - Friday, May 8, 2015 - link

    184k for 4S systems. The number of 2S systems *greatly* outnumbers the 184k.
  • Samus - Sunday, May 10, 2015 - link

    by 100 orders of magnitude, easily.

    2S systems are everywhere these days, I picked up a Lenovo 2S Xeon system for $600 NEW (driveless, 4GB RAM) from CDW.

    4S, on the other hand, is considerably more rare and starts at many thousands, even with 1 CPU included.
  • erple2 - Sunday, May 10, 2015 - link

    Well, maybe 2 orders of magnitude. 100 orders of magnitude would imply, based on the 184k 4S systems, more 2S systems than atoms in the universe. Ok, I made that up, I don't know how many atoms are in the universe, but 10^100 is a really big number. Well, 10^105, if we assume 184k 4S systems.

    I think you meant 2 orders of magnitude.
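
    For the record, the difference between the two readings is easy to show. The 184k figure is the one quoted earlier in the thread; everything else is plain arithmetic.

```python
import math

systems_4s = 184_000  # ~184k 4S systems, as quoted earlier in the thread

# "2 orders of magnitude" means a factor of 10**2.
two_orders = systems_4s * 10**2

# "100 orders of magnitude" means a factor of 10**100.
hundred_orders = systems_4s * 10**100

print(f"2 orders of magnitude:   {two_orders:,} systems")
print(f"100 orders of magnitude: about 10^{math.floor(math.log10(hundred_orders))} systems")
```

    Python integers are arbitrary precision, so the second product is computed exactly before the logarithm is taken.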
  • mapesdhs - Sunday, May 10, 2015 - link

    Yeah, that made me smile too, but we know what he meant. ;)
  • evolucion8 - Monday, May 11, 2015 - link

    That would be right if Intel cores were wide enough, which they aren't compared to IBM's. For example, according to this review, enabling two-way SMT boosted performance by 45%, and adding two more threads added 30% more performance. On the other hand, enabling two-way SMT on the latest i7 architecture gains at most 30% in the best-case scenario.
  • chris471 - Friday, May 8, 2015 - link

    Great article, and I'm looking forward to see more Power systems.

    I would have loved to see additional benchmarks with gcc flags -march=native -Ofast. Should not change stream triad results, but I think 7zip might profit more on Power than on Xeon. Most software is not affected by the implied -ffast-math.
  • close - Friday, May 8, 2015 - link

    It reminds me of the time when Apple gave up on PowerPC in mobiles because the new G5s were absolute power guzzlers and made space heaters jealous. And then gave up completely and switched to Intel because the 2 dual core PowerPC 970MP CPUs at 2.5GHz managed to pull 250W of power and needed liquid cooling to be manageable.

    IBM is learning nothing from past mistakes. They couldn't adapt to what the market wanted and the more nimble competition was delivering 25-30 years ago when fighting Microsoft, it already lost business to Intel (which is actually only nimble by comparison), and it's still doing business and building hardware like we're back in the '70s mainframe age.
  • name99 - Friday, May 8, 2015 - link

    You are assuming that the markets IBM sells into care about the things you appear to care about (in particular CPU performance per watt). This is a VERY dubious assumption.
    The HPC users MAY care (but I'd need to see evidence of that). For the business users, the cost of the software running on these systems dwarfs the lifetime cost of their electricity.
  • SuperVeloce - Saturday, May 9, 2015 - link

    They surely care. Why wouldn't they. A whole server rack or many of them in fact do use quite a bit of power. And cooling the server room is very expensive.
  • DanNeely - Saturday, May 9, 2015 - link

    The workloads that you'd be buying racks of servers for are better handled with individually less expensive systems. These 4/8-way leviathans are for the one or two core business functions that only scale up, not out; so the typical customer would only be buying a handful of these at most.

    The other half is that even a thousand or two thousand/year in increased operating costs for the server is not only dwarfed by the price of the server; but by the price of software that makes the server look cheap. The best server for those applications isn't the server that costs the least to run. It's not the server that has the cheapest hardware price either. It's the one that lets you get away with the cheapest licensing fee for the application you're running.

    One extreme example from the better part of a decade ago was that prior to being acquired by Oracle, Sun was making extremely wide processors that were very competitive on a per socket basis but used a huge number of really slow cores/threads to get their throughput. At that time Oracle licensed its DB on a per core (per thread?) basis, not per socket. As a result, an $80-100k HP/IBM server was a cheaper way to run a massive Oracle database than a $30k Sun box even if your workload was such that the cheap Sun hardware performed equally well; because Oracle's licensing ate several times the difference in hardware prices.
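
    A toy calculation shows how per-core licensing can swamp the hardware price. Only the $30k and roughly $90k hardware prices come from the example above; the core counts and the per-core license price below are hypothetical numbers picked purely for illustration, not actual Oracle, Sun, HP or IBM figures.

```python
def total_cost(hardware_usd: int, cores: int, license_per_core_usd: int) -> int:
    """Hardware price plus a per-core software license."""
    return hardware_usd + cores * license_per_core_usd


LICENSE_PER_CORE = 10_000  # hypothetical per-core license price

# Hypothetical core counts: many slow cores vs. fewer fast ones.
sun_total = total_cost(30_000, cores=64, license_per_core_usd=LICENSE_PER_CORE)
other_total = total_cost(90_000, cores=16, license_per_core_usd=LICENSE_PER_CORE)

hardware_gap = 90_000 - 30_000
license_gap = (64 - 16) * LICENSE_PER_CORE

print(f"Cheap box, many cores:    ${sun_total:,}")
print(f"Pricey box, fewer cores:  ${other_total:,}")
print(f"License gap is {license_gap / hardware_gap:.0f}x the hardware gap")
```

    With these made-up inputs the licensing difference is several times the hardware difference, which is the shape of the argument; the real ratio depends entirely on actual core counts and list prices.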
  • KateH - Saturday, May 9, 2015 - link

    I think the Intel transition was almost-entirely dictated by the lack of mobile options for PowerPC. 125W each for 970MP's sounds like a lot, but keep in mind that the Mac Pro has been using a pair of 100-130W Xeons since the beginning in 2008. Workstations and HPC are much, much less constrained by TDP. The direction that Power and SPARC has been taking for the past decade of cramming loads of SMT-enabled, high-clocked cores into a single chip somewhat negates the power concerns- if a Power8 is pulling a couple hundred watts for a 12C/96T chip, that's probably going to be worth it for the users that need that much grunt. Even Intel's E7-8890V3 is a 165W chip!
  • melgross - Saturday, May 9, 2015 - link

    Actually, the G5 was moving faster than Netburst was. In a bit over a year, it would have caught up, then moved past. Intel's unexpected move to the older "M" series for the Yonah series surprised everyone (particularly AMD), and allowed Apple to make that move. It never would have happened with Netburst.

    Apple switched for two reasons. One was that IBM failed to deliver a mobile G5 chip right at the time when laptop sales were increasing faster than desktop sales, and Apple was forced into using two G4s instead, which wasn't a good alternative. IBM delivered the chip after Apple switched over, but it was too late.

    The second reason was that Apple wanted better Windows compatibility, which could only occur using x86 chips.
  • Kevin G - Saturday, May 9, 2015 - link

    IBM did fail to make a G5 chip for laptops which significantly hurt Apple. Though Apple did have a plan B: PowerPC chips from PA-Semi. Also Apple never shipped a laptop with two G4 chips.

    And Apple didn't care about Windows software compatibility. Apple did care about hardware support as many chips couldn't be used in big endian mode or it made writing firmware for those chips complicated.

    And the real second reason why Apple ditched PowerPC was chipsets. The PCIe based G5's actually had a chipset that was more expensive than the CPUs that were used. It was composed of a DDR2/Hypertransport north bridge, two memory buffers, a hypertransport PCIe bridge chip from Broadcom/Serverworks and a south bridge chip to handle SATA/USB IO, a Firewire 800 chip, and a pair of Broadcom ethernet chips. The dual core 2.5 GHz PowerPC 970MP at the time was going for between $200 and $250 a piece. Not only was the hardware complex for the motherboards but so was the software side. PowerPC 970's cannot boot themselves as they need a service processor to initialize the FSB. The PowerPC 970 chipsets Apple used have an embedded PowerPC 400 series chip in them that'll initialize and calibrate the PowerPC's high speed FSB before handing off the rest of the boot process.
  • SnowCat00 - Friday, May 8, 2015 - link

    I would question how accurate that chart is...
    Mainframe sales are up: http://www.businessinsider.com/mainframe-saves-ibm...

    Also as someone who works with mainframes, if one wanted to they could consolidate an entire data center to one big z13.
  • ats - Friday, May 8, 2015 - link

    Um, I'm not sure you quite comprehend the scale of some of the datacenters out here. While Z13 is very nice, it's hardly a replacement for 10 racks of 8 socket Xeons.
  • usernametaken76 - Friday, May 8, 2015 - link

    That depends entirely on what those 10 racks worth of systems are doing and what type of applications they are running and at what utilization.

    Mainframes are built to run up to 100% utilization. Real world x86 systems at or above 80% are either rendering video, doing HPC or they have process control issues.

    Real world Enterprise applications running in a virtualized environment is a more appropriate comparison. Everywhere I look it's VMWare at the moment.

    Compare a PowerVM DLPAR to a VMWare VM running Linux x64 for a more fair, real world comparison.
  • melgross - Saturday, May 9, 2015 - link

    It isn't the same thing. Mainframes excel in I/O, which often trumps pure processing power. It's a very different environment.
  • ats - Saturday, May 9, 2015 - link

    Um, the days of mainframes having any real advantage in I/O are long gone, fyi.
  • Kevin G - Saturday, May 9, 2015 - link

    Sort of. Mainframes still farm off most IO commands to dedicated coprocessors so that they don't eat away CPU cycles running actual applications.

    Mainframes also have dedicated hardware for encryption and compression. This is becoming more common in the x86 world on a drive basis but the mainframe implements this at a system level so that any drive's data can be encrypted and compressed.

    It is also because of these coprocessors that IBM's mainframe virtualization is so robust: even the hypervisor itself can be virtualized on top of another hypervisor without any slow down in IO or reduction in functionality.
  • PowerTrumps - Saturday, May 9, 2015 - link

    Ok, yes a data center like Verizon or ATT might not "qualify" but the point is accurate. I work with IBM's Power servers and have absolutely consolidated 5 racks of x86 into a single Power server - it was 54 Intel 2S & 4S servers into a single 64c Power7. Part of this is due to the "performance" of Power but most of the credit goes to the efficiency of the Power Hypervisor. PHYP can provide a QoS to each workload while weaving a greater amount of workloads onto fewer Power servers/cores than what the benchmarks imply.
  • newtrekemotion - Friday, May 8, 2015 - link

    I wouldn't discount Oracle so quickly. The T5 was a pretty big step forward from the T4 and the new M7 chip sounds like it could be quite the competitor with 2 TB of memory per socket and 32 cores, especially for highly threaded loads since an octo-socket system would have 2048 threads and support 16 TB of memory. Hopefully this can bring some more competition to the market, though with only Oracle and Fujitsu (maybe?) selling systems it won't have quite the impact that multiple POWER8 vendors could bring. Love them, hate them, or anywhere in between it seems Oracle is not ready to give up in this arena and it looks like they are putting more effort in than Sun was (or are at least executing on effort more than Sun did).

    Something else to note here is the process advantage that Intel has over everyone else. I might have missed it in the article, but especially for performance/watt this is important.

    In all I think the statement at the beginning of the article that this area is getting more exciting is very true. Just seems like it might be a 3 way race instead of a 2. The recent AMD announcement that they wanted to focus on HPC is interesting too though of the 4 (Intel, IBM, Oracle and AMD) they have the furthest to go and the fewest resources to do it with. The next few years are going to be very interesting and hopefully someone, or a combination can push Intel and drive the whole market forward.
  • JohanAnandtech - Friday, May 8, 2015 - link

    I was writing from a "who will be able to convert Intel Xeon people" point of view. As I wrote in the Xeon E7v2 article, Oracle's T processors have indeed vastly improved. That is all nice and well but there is no reason why someone considering a Xeon E7 would switch. Oracle's sales seem to be mostly to people who are long time Oracle users. As far as I can see, OpenPOWER servers are the only real threat to Intel's server hegemony.
  • Kevin G - Saturday, May 9, 2015 - link

    Oracle does offer one reason to switch to SPARC: massive licensing discounts on Oracle software.

    If you're not using Oracle's software, then yeah, the SPARC platform is a very tough sell over x86 or POWER.
  • JohanAnandtech - Saturday, May 9, 2015 - link

    exactly. Good point.
  • PowerTrumps - Saturday, May 9, 2015 - link

    If you are running Oracle software you should know that IBM and Power are the largest platform which Oracle software runs on. Secondly, if running Oracle products licensed by the core, the only platform to control Oracle licensing is Power (not including Mainframe in this assertion). I have reduced Oracle licensing for customers anywhere from 4X to 10X. Do the math on that to appreciate those savings. Lastly, when I upgrade customers from one generation to another we talk about how much Oracle they can reduce. You don't hear that when upgrading from Sandy Bridge to Ivy Bridge to Haswell.
  • kgardas - Friday, May 8, 2015 - link

    I'm not sure about the T5, but Fujitsu's latest SPARC64-X+ is certainly able to outrun POWER8 and, by a wide margin, older Xeons as well. Just look at the SPEC rate results. It also won some SAP S&D 2-tier benchmark on absolute performance, so I'm glad that SPARC is still competitive too...
  • Kevin G - Saturday, May 9, 2015 - link

    The top SPARC benchmarks I've seen are using far more sockets, cores, threads and memory to get to that top spot. It is nice that the system can scale to such high socket counts (40) but only if you can actually fund a project that needs that absolute performance. Drop down to 16 socket where you can get twice the performance from POWER than SPARC with the same licensing cost, what advantage does SPARC have to make people switch?

    Even then, a system like SGI's UV2000 would fall into the same niche due to its ability to scale to insane socket counts, software licensing fees be damned.
  • kgardas - Tuesday, May 12, 2015 - link

    Kevin G, actually you are right and I made a mistake. It was not intentional; I was misled by the SPEC site claiming "24 cores, 4 chips, 6 cores/chip, 8 threads/core" for "IBM Power S824 (3.5 GHz, 24 core, RHEL)", so I thought this was a 4 socket setup and compared it with the Fujitsu M10-4, which won. Now I've just found that the IBM is a two socket system, which means it wins on a SPEC rate per socket basis of course. Price-wise IBM is also much cheaper than SPARC (if you don't run Oracle DB of course), so I keep my fingers crossed for OpenPOWER.
    Honestly, although this is really nice to see, I still have a feeling that this is the IBM hardware division's swan song. I would really like to be wrong here. Anyway, I still think that ARMv8 has a better chance of getting into Intel's business and being a real pain for Intel. On the other hand, if OpenPOWER is successful in the Chinese market, that would be good and would give us some chance of seeing lower-cost POWER machines too...
  • PowerTrumps - Saturday, May 9, 2015 - link

    yes, take a look at those benchmark results and you see the Fuji M10-4S requires 640 & 512 cores. Even the Oracle M6-32 uses 384 cores. The Fuji 512c example had 33% higher SAPS with 2X the cores. The M6-32 has 50% more cores to get 21% higher SAPS. Further, looking at the SAP benchmark as an indicator of core, chip & server performance shows that SPARC & Intel are roughly 1600 - 2200 SAPS per core compared to Power8 which is 5451 SAPS per core for the 80 core E870. To put this into context, the 80 core Power8 has slightly less than 1/2 the SAPS of the 640 core Fujitsu M10-4S. Think of ALL the costs associated with 640 cores vs 80...ok, 160 if we want to get the SAPS roughly equal. 4X more cores to get less than 2X the results.
  • PowerTrumps - Saturday, May 9, 2015 - link

    Oracle has been unable to develop a power core let alone a processor. What they have done is created servers with many cores and many threads albeit weak cores/threads. The S3 core was an improvement and no reason to think the S4 won't be decent either. However, the M7 will come (again, true to form) with 32 cores per socket. It will be like 8 mini clusters of 4 cores because they are unable to develop a single SMP chip with shared resources across all of the cores. As such, these mini clusters will have their own resources which will lead to latency and inefficiencies. Oracle is a software business and their goal is to run software on either the most cores possible or the most inefficient. They have both of these bases covered with their Intel and SPARC business.

    Also, performance per Watt is important for Intel because what you see is what you get. With Power though, you have strong single thread performance, strong multi-thread performance and tremendous consolidation efficiency thanks to the Power Hypervisor. That means ~200W doesn't matter when you can consolidate 2, 4, maybe 10 Intel chips at 135W each into a single Power chip.
  • tynopik - Friday, May 8, 2015 - link

    pg4 - datam ining
  • der - Friday, May 8, 2015 - link

    Woo...we're bout to have another GHz War here!
  • usernametaken76 - Friday, May 8, 2015 - link

    I'm sure you mean figuratively. We've been stuck between 4-5 GHz on POWER architecture for closing in on a decade.
  • zamroni - Friday, May 8, 2015 - link

    My conclusion is Samsung should buy AMD to reduce Intel dominance.
  • alpha754293 - Friday, May 8, 2015 - link

    It would have been interesting to see the LS-DYNA benchmark results again (so that you can compare them against some of the tests that you've run previously). But very interesting...
  • JohanAnandtech - Friday, May 8, 2015 - link

    Give me some help and we'll do that again on an updated version :-)
  • alpha754293 - Tuesday, May 12, 2015 - link

    Not a problem. You have my email address right? And if not, I'll just send you another email and we can get that going again. :) Thanks.
  • andychow - Friday, May 8, 2015 - link

    If Samsung bought AMD, they would lose the licence for both x86 and x86_64 production. It would in fact ensure Intel's dominance of the market.
  • Kevin G - Friday, May 8, 2015 - link

    The x86 license can be transferred as long as Intel signs off on the deal (and it is in their best interest to do so). What will probably happen is that if any company buys AMD, the new owner will enter a cross licensing agreement with Intel.
  • TheSocket - Friday, May 8, 2015 - link

    They sure wouldn't lose the x86-64 license since they own it and Intel is licensing it from AMD.
  • melgross - Saturday, May 9, 2015 - link

    But without the license from Intel, it is worthless. There's also the question of how that works. I believe that Intel doesn't need to license back the 64 bit extensions.
  • Kevin G - Monday, May 11, 2015 - link

    This is one of the reasons why it would be in Intel's best interest to let AMD be bought out with the 32 bit license intact. The 64 bit license/patents going to a third party that doesn't want to share would be a doomsday scenario for Intel. Legally it wouldn't affect anything currently on the market but it'd throw Intel's future roadmap into the trash.
  • Death666Angel - Saturday, May 9, 2015 - link

    Pretty sure some regulatory bodies would step in if Intel were the only x86 game in town. And x86-64 is AMD property.
  • JumpingJack - Saturday, May 9, 2015 - link

    Any patents on x86 are long expired, AMD only owns the IP related to the extension of the x86 not the instruction set.
  • patrickjp93 - Monday, May 11, 2015 - link

    Not true. The U.S. government has them locked up under special military-based protections. Absolutely no one can make and sell x86 without Intel's and the DOD's permission.
  • Kevin G - Monday, May 11, 2015 - link

    Got a source for that?

    I know that DoD did some validation on x86 many years ago. (The Pentium core used by Larrabee had the DoD changes incorporated.)
  • haplo602 - Friday, May 8, 2015 - link

    hmm ... where's the RAS feature comparison/test ? did I miss it in the article ?
  • TeXWiller - Friday, May 8, 2015 - link

    In the E7v3 vs POWER comparison table, there should be 32 PCIe lanes instead of 40 in the Xeon column.
  • TeXWiller - Friday, May 8, 2015 - link

    Additionally, it is the L3 in POWER8 that runs at half the core speed. The L2 runs at the core speed.
  • PowerTrumps - Saturday, May 9, 2015 - link

    I'm sure the author will update the article unless this was an Intel cheerleading piece.
  • name99 - Friday, May 8, 2015 - link

    The thing is called E7-8890. Not E7-5890?
    WTF Intel? Is your marketing team populated by utter idiots? Exactly what value is there in not following the same damn numbering scheme that your product line has followed for the past eight years or so?

    Something like that makes the chip look like there's a whole lot of "but this one goes up to 11" thinking going on at Intel...
  • name99 - Friday, May 8, 2015 - link

    OK, I get it. The first number indicates the number of glueless chips, not the micro-architecture generation. Instead we do that (apparently) with a v2 or v3 suffix.
    I still claim this is totally idiotic. Far more sensible would be to use the same scheme as the other Intel processors, and use a suffix like S2, S4, S8 to show the glueless SMP capabilities.
  • ZeDestructor - Friday, May 8, 2015 - link

    They've been using this convention since Westmere-EX actually, at which point they ditched their old convention of a prefix letter for power tier, followed by one digit for performance/scalability tier, followed by another digit for generation then the rest for individual models. Now we have 2xxx for dual socket, 4xxx for quad socket and 8xxx for 8+ sockets, and E3/E5/E7 for the scalability tier. I'm fine with either, though I have a slight preference for the current naming scheme because the generation is no longer mixed into the main model number.
  • Morawka - Saturday, May 9, 2015 - link

    man the power 8 is a beefy cpu... all that cache, you'd think it would walk all over intel.. but intel's superior cpu design wins
  • PowerTrumps - Saturday, May 9, 2015 - link

    please explain
  • tsk2k - Saturday, May 9, 2015 - link

    Where are the gaming benchmarks?
  • JohanAnandtech - Saturday, May 9, 2015 - link

    Is there still a game with software rendering? :-)
  • Gigaplex - Sunday, May 10, 2015 - link

    Llvmpipe on Linux gives a capable (feature wise) OpenGL implementation on the CPU.
  • Klimax - Saturday, May 9, 2015 - link

    Don't see POWER getting anywhere with that kind of TDP. There will be a dearth of datacenters and other hosting locations retooling for such a thing. And I suspect not many will take it even then, as cooling and power costs will be too damn high.

    Problem is, IBM can't go lower with TDP as architecture features enabling such performance are directly responsible for such TDP. (Just L1 consumes 2W to keep few cycles latency at high frequency)
  • Dmcq - Saturday, May 9, 2015 - link

    Well they'll sell where performance is an absolute must but they won't pose a problem to Intel as they won't take a large part of the market and they'd keep prices high. I see the main danger to Intel being in 64 bit ARMs eating the server market from below. I suppose one could have cheap and low power POWER machines to attack the main market but somehow it just seems unlikely with their background.
  • Guest8 - Saturday, May 9, 2015 - link

    Uh did you see Anandtech's reviews on the latest ARM server? The thing barely keeps up with an Avoton. Intel is well aware of ARM based servers and has preemptively disARMed the threat. If ARM could ever deliver Xeon class performance it would look like Power8.
  • melgross - Saturday, May 9, 2015 - link

    Chip TDP is mostly a concern for the chip itself. Other areas contribute far more waste heat than the CPU does.
  • PowerTrumps - Saturday, May 9, 2015 - link

    Power doesn't need to have a TDP of 1000W, but 200W is nothing given the performance and efficiency advantage of the processors and the Power hypervisor. When you can consolidate 2, 4 or even 10 two-socket Intel servers into a single two-socket Power8 server, the 10-server case is 10 x 2 x 135W = 2700W of CPU TDP versus 2 x 200W = 400W for the Power server. Power reduces the overall energy, cooling and rack space consumption.
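    The wattage comparison above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the comment: the 10:1 consolidation ratio, the 135W and 200W per-socket figures are the comment's own claims, and the function name is made up for illustration. It counts CPU TDP only, not memory, disks or cooling.

```python
def total_cpu_tdp_watts(servers: int, sockets_per_server: int, watts_per_socket: int) -> int:
    """Sum of CPU TDP across a group of identical servers (CPU sockets only)."""
    return servers * sockets_per_server * watts_per_socket

# Figures taken from the comment; the consolidation ratio is a claim, not a measurement.
intel_watts = total_cpu_tdp_watts(servers=10, sockets_per_server=2, watts_per_socket=135)
power_watts = total_cpu_tdp_watts(servers=1, sockets_per_server=2, watts_per_socket=200)

print(f"Intel: {intel_watts} W, Power: {power_watts} W, ratio: {intel_watts / power_watts:.2f}x")
```

    Changing the `servers` argument to 2 or 4 shows how quickly the advantage shrinks at lower consolidation ratios.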
  • KAlmquist - Saturday, May 9, 2015 - link

    $4115 E5-2699 (18C, 2.3 GHz (3.6 GHz turbo), max memory 768 GB)
    $5896 E7-8880 (18C, 2.3 GHz (3.1 GHz turbo), max memory 1536 GB)

    That's a big premium for the E7--enough that it probably doesn't make sense to buy an 8 socket system just to run a bunch of applications in parallel. The E7 makes sense only if you need more than 36 cores to have access to the same memory.
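    To put a number on that premium, here is a minimal Python sketch using only the two list prices quoted above; the variable names are illustrative and nothing else (platform cost, memory, licensing) is included.

```python
# List prices quoted in the comment, in US dollars.
e5_2699_price = 4115
e7_8880_price = 5896

premium_dollars = e7_8880_price - e5_2699_price
premium_percent = 100 * premium_dollars / e5_2699_price

print(f"E7 premium: ${premium_dollars} ({premium_percent:.1f}% over the E5)")
```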
  • PowerTrumps - Saturday, May 9, 2015 - link

    I really enjoyed the article as well as the many data and comparison charts. It is unfortunate that most of your statements, assessments and comparisons about Power, and between Intel and Power, were either wrong, misleading, not fully explained or out of context. I invite the author to contact me and I will be happy to walk you through all of this so you can update this article as well as consider a future article that shows the true advantage Power8 and OpenPower have in the data center and the greater value available to customers.
  • KAlmquist - Saturday, May 9, 2015 - link

    I would be surprised if anybody working for Anandtech is going to contact an anonymous commentator. You can point out portions of the article that you think are wrong or misleading in this comment section.

    To do a really good article on Power8, Anandtech needs a vendor to give Anandtech access to a system to review.
  • PowerTrumps - Sunday, May 10, 2015 - link

    Admittedly I assumed when I registered for the PowerTrumps account some time ago I used an email address which they could look up. But, your point is taken. Brett Murphy with Software Information Systems (aka SIS) www.thinksis.com. Email at [email protected]. If I pointed out all of the mistakes my comment would look like a blog which many don't appreciate. I have my own blog for that. I like well written articles and am happy to accept criticism or shortcomings with IBM Power - just use accurate data and don't misrepresent anything. Before Anandtech reviews a Power8 server, my assessment is they need to understand what makes Power tick and how it is different than Intel or SPARC for that matter. Hope they contact me.
  • thunng8 - Sunday, May 10, 2015 - link

    I too would like a more detailed review of the Power8.

    Some of the text in the article made me laugh on how wrong they are.

    For example, the great surprise that Intel is not on top... Well, Anandtech has never tested any Power systems before.

    And it is laughable to make any conclusions based on running of 7zip. Just about any serious enterprise server benchmark shows a greater than 2x performance advantage per core in favor of Power compared to the best Xeons. So that 50% advantage is way less than expected.

    Btw Power7 for most of its life bested Xeon in performance by very large margins. It is just now that IBM has opened up Power to other vendors that makes it exciting.
  • JohanAnandtech - Monday, May 11, 2015 - link

    I welcome constructive criticism. And yes, we only had access to an IBM Power8 dev machine, so we only got a small part of the machine (1 core/2GB).

    "Some of the text in the article made me laugh on how wrong they are."
    That is pretty low. Without any pointer or argument, nobody can check your claims. Please state your concerns or mail me.
  • thunng8 - Monday, May 11, 2015 - link

    Sorry about the language

    A couple points that are wrong on benchmarks:
    - The Power7 p270 is a 2 socket system with 2 processors in one socket (4 processors). It was designed to get more cores into 1 socket and not outright performance per processor. If you want to show the best quad processor on 4 socket system, then it would be this result:
    http://download.sap.com/download.epd?context=40E2D...

    - Your comment about Power7 needing more sockets to match Intel is not based on reality. IBM held the 8 socket lead in SAP SD from March 2010 with this result:
    http://download.sap.com/download.epd?context=40E2D...

    It wasn't surpassed by Intel until June 2014 with this result:
    http://download.sap.com/download.epd?context=40E2D...

    Note: Even the Power7 result from 2010 shows higher throughput per core than the just released Haswell server chips.

    And then 4 months later Power8 overtook it again. BTW, IBM recently announced the 12 core 4.02 GHz CPU in the E880, which should get an extra ~15% throughput per socket.

    - Power8 L2 cache runs at full core clock speed

    A point completely overlooked and what makes Power systems really excel is the efficiency of the Power hypervisor. IMO it is the biggest selling point of the Power ecosystem.
  • thunng8 - Monday, May 11, 2015 - link

    Another datapoint (not on spec site yet, but listed on the IBM e880 performance site):

    http://www-03.ibm.com/systems/power/hardware/e880/...

    SpecIntRate: 14400
    SpecfpRate: 11400

    Which makes it (per processor):
    SpecIntrate: 900
    SpecfpRate: 713
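
    The per-processor figures follow from simple division. A small Python sketch, assuming the processor count is whatever makes 14400 come out at 900 per processor (the count is inferred from the quoted numbers here, not stated in the comment):

```python
# System-level scores quoted from the IBM E880 performance page.
spec_int_rate = 14400
spec_fp_rate = 11400

# Inferred, not quoted: the processor count implied by 900 SpecIntRate per processor.
processors = spec_int_rate // 900

int_per_processor = spec_int_rate / processors
fp_per_processor = spec_fp_rate / processors

print(processors, int_per_processor, fp_per_processor)
```

    The floating point result is 712.5 per processor, which the list above rounds up to 713.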
  • thunng8 - Monday, May 11, 2015 - link

    Also, the ibm power 760 is the same deal with the p270.

    It is actually a 4 socket system with 2 processors per socket.

    Technical overview here:

    http://www.redbooks.ibm.com/redpapers/pdfs/redp498...
  • thunng8 - Wednesday, May 13, 2015 - link

    Well, it has been a few days since I listed quite a few of your misrepresentations of the data in comparison to POWER, and nothing has changed and there has been no reply at all.

    I find it hilarious that you can put this text in the article:
    "the new POWER8 has made the Enterprise line of IBM more competitive than ever. Gone are the days that IBM needed more CPU sockets than Intel to get the top spot."

    and still have it there when I've pointed out that over the last 5 years (or maybe longer, I couldn't be bothered looking further), Intel has overtaken POWER systems for only 4 months, i.e. 4 months out of 60+ months.
  • JlHADJOE - Sunday, May 10, 2015 - link

    "No less than 98% of the server shipments have been 'Intel inside'... From the revenue side, the RISC based systems are still good for slightly less than 20% of the $49 Billion (per year) server market*."

    Wow! So RISC has 2% market share and 20% revenue.
  • FunBunny2 - Monday, May 11, 2015 - link

    Gee. Sounds kinda like the Apple approach to production.
  • akula2 - Sunday, May 10, 2015 - link

    POWER8 is far better than Intel's counterpart.
    IBM is way ahead of Intel for the next generation computing with their Brain Chip.

    I hope Intel's share slips with the emerging ARM 64 bit CPU (A-72) in the Server space.
  • ats - Tuesday, May 12, 2015 - link

    Whoa, there is wrong then there is Brutalizer WRONG!

    First of all many IMDBs support full locking at multiple granularity including both TimesTen and SAP HANA. IMDBs are not read only and are used in the most critical performance transaction processing scenarios (because disk based DBs simply can't keep up!)

    Second, IMDBs are used for a variety of DB workloads from transaction processing to analytic workloads.

    Third, if your queries are taking hours, you are doing analytic workloads, not transaction processing. Transaction processing is the DB workload most dependent on locking functionality and requires real time responses. Analytic workloads are the least dependent on locking performance.

    Fourth, many IMDBs are designed and deployed as the sole DB layer, including SAP HANA and TimesTen. Both fully support shadowing to disk.

    Fifth, you can run businesses on **SCALE UP** servers like UV2K. Unless you now want to claim you can't run businesses on mainframes, Sun's large scale servers, Fujitsu's large scale servers, IBM's large scale servers, or HP's large scale servers.

    Sixth, if you think a UV2K is a cluster, you don't have enough knowledge to even post about this topic. UV2K is an SSI cache coherent SMP, no different than an Oracle SPARC M6 or an IBM P795.

    Seventh, you don't need a direct channel between sockets. You have never needed a direct channel between sockets. In fact the system that put Sun on the map, UE10K, did not have direct connections between each socket. In fact MANY MANY large scale sun systems have not had direct connections between sockets. If you actually knew anything about the history of big servers you would know that direct connections can be slower, using switches can be slower, and using torii and hypercubes can be slower, or they all can be faster. Looking at an interconnection network topology doesn't tell you jack. What matters is latency and latency vs load.

    Eighth, people who fail at math should probably not try to make math based arguments. To directly connect N sockets, each socket needs N-1 links, not N^2 links. And you should probably learn something about how bandwidth and latency work. The more you directly connect, the less bandwidth you have between each node and the higher the latency hot spotting becomes. Using min channel widths isn't necessarily the best solution. And actually, you can have throughput and low latency, it just impacts cost.

    Ninth, ScaleMP has no relation to SGI's UV2k. None.

    10th, more business software runs on X86 than anything else in the world. More DBs run on x86 than anything else in the world. And neither ScaleMP nor UV2k are scale out solutions. UV2K is a pure scale-up system. You might know that if you only had a clue.

    1) There are 8, 16, 32, 64, 128, AND 256 processor **SCALE-UP** x86 systems. And the x86 Superdome delivers higher performance than any previous HP scale-up system. And no, you don't need socket counts, you need performance. Socket counts are quite immaterial, and shrinking by the by.

    2) SGI UV2K is not a scale out system. It's an SSI Scale Up system. When you finally admit this, you'll be one step closer to not riding the short bus.

    And fyi, plenty of people use x86 for large SAP installations. In fact, x86 runs more SAP installations than everything else combined.

    Oh and: http://global.sap.com/solutions/benchmark/bweml-re...

    And just for fun, WE LAUGH AT YOUR PUNY ORACLE SAPS: http://global.sap.com/solutions/benchmark/sd3tier.... still under a million? You are being beaten by 8 socket servers, ouch that's gotta hurt!
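
    For the link-count arithmetic in the eighth point, a quick Python sketch, assuming a fully connected topology where every socket has one direct link to every other socket (function names are illustrative only):

```python
def links_per_socket(n_sockets: int) -> int:
    """Direct links each socket needs to reach every other socket."""
    return n_sockets - 1

def total_links(n_sockets: int) -> int:
    """Distinct point-to-point links in a fully connected system (each link shared by two sockets)."""
    return n_sockets * (n_sockets - 1) // 2

for n in (2, 4, 8, 16, 32):
    print(f"{n:>2} sockets: {links_per_socket(n):>2} links per socket, {total_links(n):>3} links in total")
```

    Per socket the growth is linear; only the system-wide total grows roughly with the square of the socket count, and even that is N(N-1)/2 rather than N^2.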
  • MyNuts - Tuesday, May 12, 2015 - link

    Great, i guess. Wheres the holograms and teleporters. I see just another calculator :(
  • MyNuts - Tuesday, May 12, 2015 - link

    Charles Babbage would be upset
  • quadibloc - Thursday, May 14, 2015 - link

    I'm shocked to hear that Oracle and IBM are charging more for their SPARC and PowerPC chips, respectively, than Intel is charging for comparable x86 chips - or, at least, I presume they are, if servers using those chips are more expensive. Since x86 has the enormous advantage of being able to run Microsoft Windows, the only way other ISAs can be viable is if they offer better performance or a lower price.
  • Kevin G - Thursday, May 14, 2015 - link

    Actually IBM comes in cheaper than Intel for comparable POWER8 hardware. IBM is now offering the processor to outside system builders, so the actual prices are somewhat known. Tyan used to have the raw prices on their site but I can't find them again.

    Regardless, this article indicates that they top out at $3000 which is less than equivalent Xeon E7's.
  • kgardas - Thursday, May 21, 2015 - link

    Sure, SPARC and POWER are (or, in POWER's case, were) more expensive, but hardware price is usually nothing in comparison with software price if you are running enterprise workloads. Also, Oracle's licensing price ratios favor SPARC over POWER/Itanium... Anyway, POWER8 looks so powerful that it may even be cheaper software-wise in comparison with SPARC, but that would need some clever Oracle DB benchmarking...
  • HighTech4US - Friday, May 15, 2015 - link

    Power 9 will be available when?
  • Phiro69 - Friday, May 15, 2015 - link

    I wanted to compare the E7's in this review to the E5's reviewed a few months back in your benchmark comparison tool, but I'm not seeing any of this data in it? Is it going to be there?
