51 Comments

  • ken.c - Friday, March 27, 2020 - link

    We lost a pair of mirrored drives in a mongodb server to this. They both just kicked the bucket at the same time. :)
  • olafgarten - Friday, March 27, 2020 - link

    RAID doesn't help when all the drives fail simultaneously!
  • InTheMidstOfTheInBeforeCrowd - Saturday, March 28, 2020 - link

    I agree. It would be rather fruitless for law enforcement to raid company premises in search of documents revealing illicit activities, only to find the company's storage array(s) well beyond their best-before date ;)
  • brontes - Saturday, March 28, 2020 - link

    Crazy! Are you from the future?

    > IMPORTANT: Due to the SSD failure not occurring until attaining 40,000 hours of operation and based on the dates these drives began shipping from HPE, these drives are NOT susceptible to failure until October 2020 at the earliest.
  • olafgarten - Saturday, March 28, 2020 - link

    That is incorrect. According to the source link, the first drives were shipped in late 2015, and so could possibly start failing now. Any drive put into operation from September 5th 2015 would fail.
  • olafgarten - Saturday, March 28, 2020 - link

    No edit facility, but there should be a 'today' at the end of the sentence
  • InTheMidstOfTheInBeforeCrowd - Sunday, March 29, 2020 - link

    Please read that blog article again. It's not exactly a "source". Note that the 05 September 2015 date mentioned is pure speculation, based on a blind assumption that a drive would have accrued 40,000 operational hours today. That is likely a misinterpretation of the notice they got from SanDisk/WD, confusing the event of the notice being published with the event of actual drives failing...
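    For what it's worth, this is the arithmetic the 05 September 2015 figure seems to rest on (assuming a drive powered on 24/7 and counting back from the late-March 2020 notice):

        40,000 h ÷ 24 h/day ≈ 1,667 days ≈ 4 years and 7 months
        28 March 2020 − 1,667 days ≈ 5 September 2015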
  • InTheMidstOfTheInBeforeCrowd - Sunday, March 29, 2020 - link

    Addendum: Also note that the RoHS conformity declarations for both the SDLTOCKR and SDLTOCKM series (check the supplier part number of the Dell/HPE SSDs...) were signed in June 2016, which would indicate that those SSDs were not sold in 2015 or earlier...
  • Gigaplex - Sunday, March 29, 2020 - link

    This was found because drives started failing. If the first failure couldn't occur before October 2020, then they wouldn't have spotted it.
  • 69369369 - Friday, March 27, 2020 - link

    HDD Master Race!
  • eastcoast_pete - Friday, March 27, 2020 - link

    Occurrences such as this - planned obsolescence baked right into SSD firmware - are just another reason why I like to keep backups on spinning rust for cold storage.
    Also, I guess they got the timing wrong; these SSDs bricked before the 5-year warranty was up (: tsk, tsk
  • FunBunny2 - Friday, March 27, 2020 - link

    "backups on spinning rust for cold storage"

    arguably, tape is more durable.
  • ballsystemlord - Friday, March 27, 2020 - link

    Not when the pets get a hold of it. :)
  • InTheMidstOfTheInBeforeCrowd - Saturday, March 28, 2020 - link

    You should have checked the sysadmin certifications of your pets before adopting them. Due diligence, my man... ;-P
  • eastcoast_pete - Sunday, March 29, 2020 - link

    I thought those certs looked shifty(: But then, they are great at shredding data at the hardware level...
  • Samus - Saturday, March 28, 2020 - link

    It is hard to wrap my head around this not being intentional. But by whom? HPE and EMC have a vested interest in keeping continuing support subscriptions in place, but this bug seems to be the direct result of a SanDisk QA failure. SanDisk has nothing to gain from this bug; it actually hurts their reputation. So maybe HPE and EMC REQUESTED similar “features”?
  • InTheMidstOfTheInBeforeCrowd - Saturday, March 28, 2020 - link

    Intentional? So the intention was to force a new firmware on the devices at time X or else they lose their data?

    Why? Doesn't matter... Never let questions like "Why?" get in the way of a "good" (=absurd) conspiracy theory.
  • FunBunny2 - Saturday, March 28, 2020 - link

    "conspiracy theory"

    but wasn't that the explanation of early Intel SSD bricking?
  • Samus - Sunday, March 29, 2020 - link

    That’s exactly what this situation reminded me of. I am not a conspiracy theorist; my career requires me to be fact-driven. But this just doesn’t add up when you consider such a ridiculous flaw in such a mission-critical scenario, and that HPE and EMC are the only two enterprise suppliers in their segment that require continuing support subscriptions for out-of-warranty hardware (typically 1-3 years, in other words before this bug would materialize), when every other competitor only discontinues free firmware and ongoing driver support when hardware hits EOL.
  • Kvaern1 - Sunday, March 29, 2020 - link

    Making older SSDs/GPUs/whatever perform worse via drivers, or not delivering driver updates after a certain time period has passed, are examples of planned obsolescence.

    Secret planned drive bricking (or any other undocumented "deliberate" self-destruction of any item you have procured) is NOT planned obsolescence, it's a planned crime.
  • Samus - Monday, March 30, 2020 - link

    Re-read my statement. The two companies that are seemingly the only enterprise equipment suppliers affected by these SSDs running this particular firmware are CONVENIENTLY the only two enterprise suppliers that strongarm their partners into maintenance agreements beyond the warranty period to receive what are otherwise free updates from virtually any other supplier.

    The crime here is that it still isn't clear whether EMC and HPE are providing these updates for out-of-warranty equipment. Everything else is, as I admitted, speculation, not conspiracy.
  • Gigaplex - Sunday, March 29, 2020 - link

    "But this just doesn’t add up when you consider such a ridiculous flaw in such a mission critical scenario"

    Such a ridiculous flaw in such a mission critical scenario makes even LESS sense if that flaw was intentional.
  • leexgx - Wednesday, July 8, 2020 - link

    The bug was due to a coding error (it should have been N but was N-1 in the code, which had something to do with what happens when 40k hours pass; see the sketch below). RAID is never a backup.

    You should have a secondary array on another server that's using completely different drives for server-to-server mirroring (real-time if needed, or every hour or day - it really depends on your requirements; for most, a 2am backup every day is enough).
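    A minimal sketch of how that kind of off-by-one around a 40,000-entry boundary can turn into a fatal fault (purely illustrative C; the table and guard are hypothetical, not the vendor's actual firmware):

        #include <stdint.h>
        #include <stdio.h>

        #define HOUR_BUCKETS 40000u            /* hypothetical table sized for hours 0..39999 */
        static uint8_t bucket[HOUR_BUCKETS];

        /* Buggy guard: compares with ">" where ">=" was needed, so hour 40,000
         * slips past the check and writes one element past the end of the table. */
        static int log_hour_buggy(uint32_t power_on_hours)
        {
            if (power_on_hours > HOUR_BUCKETS)      /* should be >= HOUR_BUCKETS */
                return -1;
            bucket[power_on_hours]++;               /* out of bounds at 40,000   */
            return 0;
        }

        /* Fixed guard: rejects anything past the last valid index (N-1). */
        static int log_hour_fixed(uint32_t power_on_hours)
        {
            if (power_on_hours >= HOUR_BUCKETS)
                return -1;
            bucket[power_on_hours]++;
            return 0;
        }

        int main(void)
        {
            printf("buggy(40000) = %d\n", log_hour_buggy(40000u)); /* "succeeds", corrupts memory */
            printf("fixed(40000) = %d\n", log_hour_fixed(40000u)); /* correctly rejected: -1      */
            return 0;
        }

    The whole failure mode hinges on a single comparison operator, which is exactly the kind of thing a boundary test at 39,999/40,000/40,001 hours would catch.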
  • oRAirwolf - Saturday, March 28, 2020 - link

    Hanlon's razor, my dude.
  • rrinker - Monday, March 30, 2020 - link

    It's entirely accidental - caused by the very common fault of programmers who don't understand the limits of various data types. All sorts of unintended consequences have happened because of these types of errors - including deaths, in the case of the 737 Max.

    It makes absolutely no sense for a company to purposely brick a device which is STILL UNDER WARRANTY - that's a recipe for killing the company if every single one of a product line fails before the warranty is up, leaving them on the hook for supplying replacements.
  • FunBunny2 - Monday, March 30, 2020 - link

    "It's entirely accidental - caused by the very common fault of programmers who don't understand the limits of various data types."

    there was a time when most commercial programs (COBOL, almost always) were written by HS graduates (or GEDs) who got a 'certificate' from some store-front 'programming school'. you can guess the result. these days, the C/Java/PHP crowd are largely as ignorant.
  • leexgx - Wednesday, July 8, 2020 - link

    This was a coding error: they used N-1 instead of just N, so when it hits 40k hours it throws some sort of internal hard error - every time it tries to read the 40k-hour value, the firmware hard-errors on boot-up. (This is why you should try not to use disks that all have the same uptime; as nearly impossible as it is, it could happen.)
  • InTheMidstOfTheInBeforeCrowd - Saturday, March 28, 2020 - link

    If by saying "planned obsolescence" you mean such a blunder potentially making the company, or the brand(s) the company sells, obsolete because almost nobody wants to buy their data-killing products anymore, then I agree. If you rather meant the commonly agreed-upon meaning of "planned obsolescence", well, please don't let me stop you wallowing in absurd theories.

    Also, I am quite curious about the physical law or whatever it is that allows building planned obsolescence into SSD firmwares, yet seemingly makes it impossible to build such into firmwares of HDDs. Please tell me more! (...goes to redirect response output to /dev/null)
  • FunBunny2 - Saturday, March 28, 2020 - link

    "yet seemingly makes it impossible to build such into firmwares of HDDs. "

    HDD vis-a-vis SSD has virtually no logic used in data R/W. it's just a bit of magnetism going back and forth. now, HDD manufacturers could well build the platter hub ball bearings with leftover BB gun shot, and the voice coils from $10 transistor radio speakers, of course.
  • InTheMidstOfTheInBeforeCrowd - Sunday, March 29, 2020 - link

    And that would stop a manufacturer from building planned obsolescence measures into an HDD? Because it is so much simpler than a SSD, therefore SSDs have planned obsolescence measures built-in, and HDDs have not? You know what is even simpler than an HDD? Good old traditional light bulbs. According to the logic of your argument, those light bulbs must have been immune from planned obsolescence. Dude, I have a bridge in Brooklyn to sell you...
  • InTheMidstOfTheInBeforeCrowd - Sunday, March 29, 2020 - link

    Correction of phrasing in my last comment: "Because it is so much simpler than a SSD, therefore SSDs have planned obsolescence measures built-in, and HDDs have not?" should be rather "Because it is so much simpler than a SSD, therefore SSDs can have planned obsolescence measures built-in, and HDDs would not allow that?"

    I am not trying to argue about whether SSDs or HDDs have actual planned obsolescence measures built in or not. I am (haphazardly, I guess) trying to dispel this ridiculous notion that SSDs are not trustworthy because they are seen as affected by planned obsolescence, whereas HDDs are seen as safe/unable to be affected by planned obsolescence.
  • edzieba - Monday, March 30, 2020 - link

    "HDD vis-a-vis SSD has virtually no logic used in data R/W. it's just a bit of magnetism going back and forth."

    I would advise looking inside an HDD made in the last 3 or so decades. You may be surprised to find that a copious amount of electronic processing is required to turn magnetic domains into addressable blocks.
  • StrangerGuy - Friday, March 27, 2020 - link

    How did this escape QA to begin with?
  • ABR - Saturday, March 28, 2020 - link

    That's what I'm wondering. Where is their HALT (Highly Accelerated Life Testing)?
  • shabby - Saturday, March 28, 2020 - link

    How do you accelerate time?
  • PreacherEddie - Saturday, March 28, 2020 - link

    It is zero sum. Every person who uses a time machine to go back in time allows a company to test products for MTBF.
  • FunBunny2 - Saturday, March 28, 2020 - link

    I believe it's called WARP drive. In a practical sense, many (hundreds, thousands?) are run 24/7 for some time period, and the total uptime hours across all devices are algorithmically massaged to MTBF. but you knew that, right?
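    Roughly, the "massaging" boils down to aggregate device-hours over observed failures, e.g.:

        MTBF ≈ total device-hours / number of failures
        1,000 drives × 1,000 h each = 1,000,000 device-hours; 2 failures → MTBF ≈ 500,000 h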
  • shabby - Saturday, March 28, 2020 - link

    Yes I did, but this drive specifically dies after 40,000 hours; MTBF testing won't find this flaw until a drive actually reaches that many hours.
  • FunBunny2 - Saturday, March 28, 2020 - link

    "Yes I did"

    Yes I did, too. I was answering the different question: "How do you accelerate time?" That's how it's done, in general.
  • Kvaern1 - Sunday, March 29, 2020 - link

    "How do you accelerate time?"

    You record something and watch it on FF.
  • LMF5000 - Sunday, March 29, 2020 - link

    In the semiconductor industry, some products have their time accelerated by elevated temperature and humidity. For hard disks and SSDs, no idea.
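    For temperature-driven wear-out, the usual tool is the Arrhenius acceleration factor (a generic reliability formula, not anything specific to these drives):

        AF = exp[ (Ea / k) × (1/T_use − 1/T_stress) ]

    with Ea the activation energy in eV, k = 8.617 × 10^-5 eV/K, and temperatures in kelvin. Stressing at 85 °C instead of 40 °C with an assumed Ea ≈ 0.7 eV gives roughly a 25× speed-up, so about a year of field wear fits into a couple of weeks of testing. None of that advances a firmware power-on-hours counter, though, which is why thermal acceleration alone would not have tripped this particular bug.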
  • leexgx - Wednesday, July 8, 2020 - link

    They can run value checks on the code in a simulation to test value boundaries, to make sure the output is valid.

    And Intel, or whoever makes the SSD, can make a firmware that allows changes to SMART numbers directly, so you can just set it to 50k hours for example and the SSD won't boot up if the N-1 bug is present (it should have just been N in this case, so it was basically a coding error).
  • Gigaplex - Monday, March 30, 2020 - link

    Because they can't wait for a 40,000 hour test to complete before shipping.
  • eastcoast_pete - Sunday, March 29, 2020 - link

    All that brings up an interesting question: how is SSD firmware bug-tested? Obviously, this is a bug, but one that doesn't show up for quite a while, so the drives work just fine until they hit that age. I would like to know a bit more about how SSD controller software is tested. Maybe a little backgrounder is in order?
  • dwbogardus - Monday, March 30, 2020 - link

    Usually in ASIC design or, in this case, SSD controller design, pre-silicon validation is done by running simulations that make a point of checking all the boundary conditions, like buffer overruns, FIFO underflows, and various limits, some of which are never expected to be reached in normal operation. Normal simulations would take way too long to run in order to hit those boundary conditions, so special test hooks often permit the validation engineers to preset values close to the limits and then do a few increments to reach the condition. Then they can verify correct operation for the condition. It can be tedious to check every instance, and perhaps some were missed.

    Whether the validation simulations are being run to check the controller, or the validation test simulations are being run to check the firmware, the principles are the same: check all the boundary conditions by presetting registers close to the limit, then increment to and through the limit, and verify expected behavior. That way you don't have to wait for years of "wall clock" time to reach the limits you need to validate.
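    A minimal sketch of that "preset near the limit, then step through it" pattern, using a tiny fake device model in place of a real controller simulation (the names and behavior here are hypothetical, not a vendor test API):

        #include <stdint.h>
        #include <stdio.h>

        /* Toy stand-in for the device under simulation: it "hard errors"
         * exactly at hour 40,000, modeling the reported defect.          */
        static uint32_t poh;                                   /* power-on hours */
        static void set_power_on_hours(uint32_t h) { poh = h; }
        static void tick_hour(void)                { poh++;   }
        static int  firmware_ok(void)              { return poh != 40000u; }

        int main(void)
        {
            const uint32_t limit = 40000u;
            int failures = 0;

            /* Preset just below the boundary instead of simulating ~4.5 years
             * of wall-clock time, then increment to and through the limit.   */
            set_power_on_hours(limit - 2u);
            for (uint32_t h = limit - 2u; h <= limit + 2u; h++) {
                if (!firmware_ok()) {
                    printf("FAIL at hour %u\n", (unsigned)poh);
                    failures++;
                }
                tick_hour();
            }
            printf("boundary sweep %s\n", failures ? "FAILED" : "passed");
            return 0;
        }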
  • FunBunny2 - Monday, March 30, 2020 - link

    in addition to the other, long, reply is the simple answer: the coders and analysts simply didn't confirm the design spec. kind of like those airplane crashes on "Air Disasters" where the crew skipped steps on a pre-flight checklist. or, of course, the analysts wrote the spec without checking with design requirements. in any case, this sort of error would be nearly impossible to find in production QA.
  • leexgx - Wednesday, July 8, 2020 - link

    Well, they obviously did checks 3-4 years later and found these bugs before they became a problem in the real world (not as bad as the 0MB bug on the old SandForce SSDs, which had a random chance at power-up to nuke the SSD and report 0MB of space - some sort of bug with the unique way SandForce layers a second level of compression mapping on top of its virtual LBA-to-NAND mapping, which would result in the whole drive becoming 0MB in very rare but specific cases).
  • Sivar - Monday, March 30, 2020 - link

    Linux network drivers, Oracle database, other SSD firmware -- how many times does this need to happen before developers stop making the same mistake?
    It isn't even a tricky fix. Use a larger integer! Count something larger (e.g. days instead of hours, packets instead of bytes)! Add a second integer that counts overruns of the first! Use a double or arbitrary precision value!
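    In the spirit of the first suggestion, a quick before/after with a hypothetical power-on-hours counter (nothing to do with this drive's actual data types):

        #include <stdint.h>
        #include <inttypes.h>
        #include <stdio.h>

        int main(void)
        {
            uint16_t hours16 = UINT16_MAX;   /* 16-bit counter: tops out at 65,535 h (~7.5 years) */
            uint64_t hours64 = UINT16_MAX;   /* 64-bit counter of the same quantity               */

            hours16++;                       /* wraps silently back to 0          */
            hours64++;                       /* keeps counting for ~2 × 10^15 years */

            printf("16-bit: %" PRIu16 "\n", hours16);   /* prints 0     */
            printf("64-bit: %" PRIu64 "\n", hours64);   /* prints 65536 */
            return 0;
        }

    Counting days instead of hours, or adding a second counter for overruns, buys similar headroom even with narrower types, as the comment above notes.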
  • wildbil1952 - Friday, March 5, 2021 - link

    It's not just Dell and HPE. We had this bite us on Cisco servers: an entire cluster, two ESXi hosts running SimpliVity. Even though the VMs were running on an entirely different array, when SimpliVity lost both drives - boom. Cluster down, and the SimpliVity reinstall would not see the old disks. Every VM is toast.
  • mikerobert110 - Thursday, March 31, 2022 - link

    Whoa! Thanks for all of this information. Well, we should do our own research about the topic.

    Also, people, I am here to share my amazing experience with the EMC DES-4122 practice test questions.
    https://www.test4practice.com/DES-4122-practice-te...
