51 Comments

  • ken.c - Friday, March 27, 2020 - link

    We lost a pair of mirrored drives in a mongodb server to this. They both just kicked the bucket at the same time. :)
  • olafgarten - Friday, March 27, 2020 - link

    RAID doesn't help when all the drives fail simultaneously!
  • InTheMidstOfTheInBeforeCrowd - Saturday, March 28, 2020 - link

    I agree. It would be rather fruitless for law enforcement to raid company premises in search of documents revealing illicit activities, only to find the company's storage array(s) well beyond their best-before date ;)
  • brontes - Saturday, March 28, 2020 - link

    Crazy! Are you from the future?

    > IMPORTANT: Due to the SSD failure not occurring until attaining 40,000 hours of operation and based on the dates these drives began shipping from HPE, these drives are NOT susceptible to failure until October 2020 at the earliest.
  • olafgarten - Saturday, March 28, 2020 - link

    That is incorrect. According to the source link, the first drives were shipped in late 2015, and so could possibly start failing now. Any drive put into operation from September 5th 2015 would fail.
  • olafgarten - Saturday, March 28, 2020 - link

    No edit facility, but there should be a 'today' at the end of the sentence
  • InTheMidstOfTheInBeforeCrowd - Sunday, March 29, 2020 - link

    Please read that blog article again. It's not exactly a "source". Note that the 05 September 2015 date mentioned is pure speculation, based on a blind assumption that a drive would have accrued 40,000 operational hours today. That is likely a misinterpretation of the notice they got from SanDisk/WD, confusing the event of the notice being published with the event of actual drives failing...
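    For what it's worth, this is the arithmetic the 05 September 2015 figure seems to rest on (assuming a drive powered on 24/7 and counting back from the late-March 2020 notice):

        40,000 h ÷ 24 h/day ≈ 1,667 days ≈ 4 years and 7 months
        28 March 2020 − 1,667 days ≈ 5 September 2015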
  • InTheMidstOfTheInBeforeCrowd - Sunday, March 29, 2020 - link

    Addendum: Also note that the RoHS conformity declarations for both the SDLTOCKR and SDLTOCKM series (check the supplier part number of the Dell/HPE SSDs...) were signed in June 2016, which would indicate that those SSDs were not sold in 2015 or earlier...
  • Gigaplex - Sunday, March 29, 2020 - link

    This was found because drives started failing. If the first failure couldn't occur before October 2020, then they wouldn't have spotted it.
  • 69369369 - Friday, March 27, 2020 - link

    HDD Master Race!
  • eastcoast_pete - Friday, March 27, 2020 - link

    Occurrences such as this - planned obsolescence baked right into SSD firmware - are just another reason why I like to keep backups on spinning rust for cold storage.
    Also, I guess they got the timing wrong; these SSDs bricked before the 5-year warranty was up (: tsk, tsk
  • FunBunny2 - Friday, March 27, 2020 - link

    "backups on spinning rust for cold storage"

    arguably, tape is more durable.
  • ballsystemlord - Friday, March 27, 2020 - link

    Not when the pets get a hold of it. :)
  • InTheMidstOfTheInBeforeCrowd - Saturday, March 28, 2020 - link

    You should have checked the sysadmin certifications of your pets before adopting them. Due diligence, my man... ;-P
  • eastcoast_pete - Sunday, March 29, 2020 - link

    I thought those certs looked shifty(: But then, they are great at shredding data at the hardware level...
  • Samus - Saturday, March 28, 2020 - link

    It is hard to wrap my head around this not being intentional. But by whom? HPE and EMC have a vested interest in keeping continuing support subscriptions in place, but this bug seems to be the direct result of a SanDisk QA failure. SanDisk has nothing to gain from this bug; it actually hurts their reputation. So maybe HPE and EMC REQUESTED similar “features”?
  • InTheMidstOfTheInBeforeCrowd - Saturday, March 28, 2020 - link

    Intentional? So the intention was to force a new firmware on the devices at time X or else they lose their data?

    Why? Doesn't matter... Never let questions like "Why?" get in the way of a "good" (=absurd) conspiracy theory.
  • FunBunny2 - Saturday, March 28, 2020 - link

    "conspiracy theory"

    but wasn't that the explanation of early Intel SSD bricking?
  • Samus - Sunday, March 29, 2020 - link

    That’s exactly what this situation reminded me of. I am not a conspiracy theorist; my career requires me to be fact-driven. But this just doesn’t add up when you consider such a ridiculous flaw in such a mission-critical scenario, and that HPE and EMC are the only two enterprise suppliers in their segment that require continuing support subscriptions for out-of-warranty hardware (typically 1-3 years, in other words before this bug would materialize), when every other competitor only discontinues free firmware and ongoing driver support when hardware hits EOL.
  • Kvaern1 - Sunday, March 29, 2020 - link

    Making older SSDs/GPUs/whatever perform worse via drivers, or not delivering driver updates after a certain time period has passed, are examples of planned obsolescence.

    Secret planned drive bricking (or any other undocumented "deliberate" self-destruction of any item you have procured) is NOT planned obsolescence, it's a planned crime.
  • Samus - Monday, March 30, 2020 - link

    Re-read my statement. The two companies that are seemingly the only enterprise equipment suppliers affected by these SSDs running this particular firmware are CONVENIENTLY the only two enterprise suppliers that strongarm their partners into maintenance agreements beyond the warranty period to receive what are otherwise free updates from virtually any other supplier.

    The crime here is that it still isn't clear whether EMC and HPE are providing these updates for out-of-warranty equipment. Everything else is, as I admitted, speculation, not conspiracy.
  • Gigaplex - Sunday, March 29, 2020 - link

    "But this just doesn’t add up when you consider such a ridiculous flaw in such a mission critical scenario"

    Such a ridiculous flaw in such a mission critical scenario makes even LESS sense if that flaw was intentional.
  • leexgx - Wednesday, July 8, 2020 - link

    The bug was due to a coding error (it should have been N but was N-1 in the code, which had something to do with what happens when 40k hours pass; see the sketch below). RAID is never a backup.

    You should have a secondary array on another server that's using completely different drives for server-to-server mirroring (real-time if needed, or every hour or day - it really depends on your requirements; for most, a 2am backup every day is enough).
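    A minimal sketch of how that kind of off-by-one around a 40,000-entry boundary can turn into a fatal fault (purely illustrative C; the table and guard are hypothetical, not the vendor's actual firmware):

        #include <stdint.h>
        #include <stdio.h>

        #define HOUR_BUCKETS 40000u            /* hypothetical table sized for hours 0..39999 */
        static uint8_t bucket[HOUR_BUCKETS];

        /* Buggy guard: compares with ">" where ">=" was needed, so hour 40,000
         * slips past the check and writes one element past the end of the table. */
        static int log_hour_buggy(uint32_t power_on_hours)
        {
            if (power_on_hours > HOUR_BUCKETS)      /* should be >= HOUR_BUCKETS */
                return -1;
            bucket[power_on_hours]++;               /* out of bounds at 40,000   */
            return 0;
        }

        /* Fixed guard: rejects anything past the last valid index (N-1). */
        static int log_hour_fixed(uint32_t power_on_hours)
        {
            if (power_on_hours >= HOUR_BUCKETS)
                return -1;
            bucket[power_on_hours]++;
            return 0;
        }

        int main(void)
        {
            printf("buggy(40000) = %d\n", log_hour_buggy(40000u)); /* "succeeds", corrupts memory */
            printf("fixed(40000) = %d\n", log_hour_fixed(40000u)); /* correctly rejected: -1      */
            return 0;
        }

    The whole failure mode hinges on a single comparison operator, which is exactly the kind of thing a boundary test at 39,999/40,000/40,001 hours would catch.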
  • oRAirwolf - Saturday, March 28, 2020 - link

    Hanlon's razor, my dude.
  • rrinker - Monday, March 30, 2020 - link

    It's entirely accidental - caused by the very common fault of programmers who don't understand the limits of various data types. All sorts of unintended consequences have happened because of these types of errors - including deaths, in the case of the 737 Max.

    It makes absolutely no sense for a company to purposely brick a device which is STILL UNDER WARRANTY - that's a recipe for killing the company if every single one of a product line fails before the warranty is up, leaving them on the hook for supplying replacements.
  • FunBunny2 - Monday, March 30, 2020 - link

    "It's entirely accidental - caused by the very common fault of programmers who don't understand the limits of various data types."

    there was a time when most commercial programs (COBOL, almost always) were written by HS graduates (or GEDs) who got a 'certificate' from some store-front 'programming school'. you can guess the result. these days, the C/Java/PHP crowd are largely as ignorant.
  • leexgx - Wednesday, July 8, 2020 - link

    This was a coding error: they used N-1 instead of just N, so when it hits 40k hours it throws some sort of internal hard error - every time it tries to read the 40k-hour value, the firmware hard-errors on boot-up. (This is why you should try not to use disks that all have the same uptime; as nearly impossible as it is, it could happen.)
  • InTheMidstOfTheInBeforeCrowd - Saturday, March 28, 2020 - link

    If by saying "planned obsolescence" you mean such a blunder potentially making the company, or the brand(s) the company sells, obsolete because almost nobody wants to buy their data-killing products anymore, then I agree. If you rather meant the commonly agreed-upon meaning of "planned obsolescence", well, please don't let me stop you wallowing in absurd theories.

    Also, I am quite curious about the physical law or whatever it is that allows building planned obsolescence into SSD firmwares, yet seemingly makes it impossible to build such into firmwares of HDDs. Please tell me more! (...goes to redirect response output to /dev/null)
  • FunBunny2 - Saturday, March 28, 2020 - link

    "yet seemingly makes it impossible to build such into firmwares of HDDs. "

    HDD vis-a-vis SSD has virtually no logic used in data R/W. it's just a bit of magnetism going back and forth. now, HDD manufacturers could well build the platter hub ball bearings with leftover BB gun shot, and the voice coils from $10 transistor radio speakers, of course.
  • InTheMidstOfTheInBeforeCrowd - Sunday, March 29, 2020 - link

    And that would stop a manufacturer from building planned obsolescence measures into an HDD? Because it is so much simpler than a SSD, therefore SSDs have planned obsolescence measures built-in, and HDDs have not? You know what is even simpler than an HDD? Good old traditional light bulbs. According to the logic of your argument, those light bulbs must have been immune from planned obsolescence. Dude, I have a bridge in Brooklyn to sell you...
  • InTheMidstOfTheInBeforeCrowd - Sunday, March 29, 2020 - link

    Correction of phrasing in my last comment: "Because it is so much simpler than a SSD, therefore SSDs have planned obsolescence measures built-in, and HDDs have not?" should be rather "Because it is so much simpler than a SSD, therefore SSDs can have planned obsolescence measures built-in, and HDDs would not allow that?"

    I am not trying to argue about whether SSDs or HDDs have actual planned obsolescence measures built in or not. I am (haphazardly, I guess) trying to dispel this ridiculous notion that SSDs are not trustworthy because they are seen as affected by planned obsolescence, whereas HDDs are seen as safe/unable to be affected by planned obsolescence.
  • edzieba - Monday, March 30, 2020 - link

    "HDD vis-a-vis SSD has virtually no logic used in data R/W. it's just a bit of magnetism going back and forth."

    I would advise looking inside an HDD made in the last 3 or so decades. You may be surprised to find that a copious amount of electronic processing is required to turn magnetic domains into addressable blocks.
  • StrangerGuy - Friday, March 27, 2020 - link

    How did this escape QA to begin with?
  • ABR - Saturday, March 28, 2020 - link

    That's what I'm wondering. Where is their HALT (Highly Accelerated Life Testing)?
  • shabby - Saturday, March 28, 2020 - link

    How do you accelerate time?
  • PreacherEddie - Saturday, March 28, 2020 - link

    It is zero sum. Every person who uses a time machine to go back in time allows a company to test products for MTBF.
  • FunBunny2 - Saturday, March 28, 2020 - link

    I believe it's called WARP drive. In a practical sense, many (hundreds, thousands?) are run 24/7 for some time period, and the total uptime hours across all devices are algorithmically massaged to MTBF. but you knew that, right?
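    Roughly, the "massaging" boils down to aggregate device-hours over observed failures, e.g.:

        MTBF ≈ total device-hours / number of failures
        1,000 drives × 1,000 h each = 1,000,000 device-hours; 2 failures → MTBF ≈ 500,000 h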
  • shabby - Saturday, March 28, 2020 - link

    Yes I did, but this drive specifically dies after 40,000 hours; MTBF testing won't find this flaw until a drive actually reaches that many hours.
  • FunBunny2 - Saturday, March 28, 2020 - link

    "Yes I did"

    Yes I did, too. I was answering the different question: "How do you accelerate time?" That's how it's done, in general.
  • Kvaern1 - Sunday, March 29, 2020 - link

    "How do you accelerate time?"

    You record something and watch it on FF.
  • LMF5000 - Sunday, March 29, 2020 - link

    In the semiconductor industry, some products have their time accelerated by elevated temperature and humidity. For hard disks and SSDs, no idea.
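    For temperature-driven wear-out, the usual tool is the Arrhenius acceleration factor (a generic reliability formula, not anything specific to these drives):

        AF = exp[ (Ea / k) × (1/T_use − 1/T_stress) ]

    with Ea the activation energy in eV, k = 8.617 × 10^-5 eV/K, and temperatures in kelvin. Stressing at 85 °C instead of 40 °C with an assumed Ea ≈ 0.7 eV gives roughly a 25× speed-up, so about a year of field wear fits into a couple of weeks of testing. None of that advances a firmware power-on-hours counter, though, which is why thermal acceleration alone would not have tripped this particular bug.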
  • leexgx - Wednesday, July 8, 2020 - link

    They can run value checks on the code in a simulation to test value boundaries, to make sure the output is valid.

    And Intel, or whoever makes the SSD, can make a firmware that allows changes to SMART numbers directly, so you can just set it to 50k hours for example and the SSD won't boot up if the N-1 bug is present (it should have just been N in this case, so it was basically a coding error).
  • Gigaplex - Monday, March 30, 2020 - link

    Because they can't wait for a 40,000 hour test to complete before shipping.
  • eastcoast_pete - Sunday, March 29, 2020 - link

    All that brings up an interesting question: how is SSD firmware bug-tested? Obviously, this is a bug, but one that doesn't show up for quite a while, so the drives work just fine until they hit that age. I would like to know a bit more about how SSD controller software is tested. Maybe a little backgrounder is in order?
  • dwbogardus - Monday, March 30, 2020 - link

    Usually in ASIC design or, in this case, SSD controller design, pre-silicon validation is done by running simulations that make a point of checking all the boundary conditions, like buffer overruns, FIFO underflows, and various limits, some of which are never expected to be reached in normal operation. Normal simulations would take way too long to run in order to hit those boundary conditions, so special test hooks often permit the validation engineers to preset values close to the limits and then do a few increments to reach the condition. Then they can verify correct operation for the condition. It can be tedious to check every instance, and perhaps some were missed.

    Whether the validation simulations are being run to check the controller, or the validation test simulations are being run to check the firmware, the principles are the same: check all the boundary conditions by presetting registers close to the limit, then increment to and through the limit, and verify expected behavior. That way you don't have to wait for years of "wall clock" time to reach the limits you need to validate.
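    A minimal sketch of that "preset near the limit, then step through it" pattern, using a tiny fake device model in place of a real controller simulation (the names and behavior here are hypothetical, not a vendor test API):

        #include <stdint.h>
        #include <stdio.h>

        /* Toy stand-in for the device under simulation: it "hard errors"
         * exactly at hour 40,000, modeling the reported defect.          */
        static uint32_t poh;                                   /* power-on hours */
        static void set_power_on_hours(uint32_t h) { poh = h; }
        static void tick_hour(void)                { poh++;   }
        static int  firmware_ok(void)              { return poh != 40000u; }

        int main(void)
        {
            const uint32_t limit = 40000u;
            int failures = 0;

            /* Preset just below the boundary instead of simulating ~4.5 years
             * of wall-clock time, then increment to and through the limit.   */
            set_power_on_hours(limit - 2u);
            for (uint32_t h = limit - 2u; h <= limit + 2u; h++) {
                if (!firmware_ok()) {
                    printf("FAIL at hour %u\n", (unsigned)poh);
                    failures++;
                }
                tick_hour();
            }
            printf("boundary sweep %s\n", failures ? "FAILED" : "passed");
            return 0;
        }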
  • FunBunny2 - Monday, March 30, 2020 - link

    in addition to the other, long, reply is the simple answer: the coders and analysts simply didn't confirm the design spec. kind of like those airplane crashes on "Air Disasters" where the crew skipped steps on a pre-flight checklist. or, of course, the analysts wrote the spec without checking with design requirements. in any case, this sort of error would be nearly impossible to find in production QA.
  • leexgx - Wednesday, July 8, 2020 - link

    Well, they obviously did checks 3-4 years later and found these bugs before they became a problem in the real world (not as bad as the 0MB bug on the old SandForce SSDs, which had a random chance at power-up to nuke the SSD and report 0MB of space - some sort of bug with the unique way SandForce layers a second level of compression mapping on top of its virtual LBA-to-NAND mapping, which would result in the whole drive becoming 0MB in very rare but specific cases).
  • Sivar - Monday, March 30, 2020 - link

    Linux network drivers, Oracle database, other SSD firmware -- how many times does this need to happen before developers stop making the same mistake?
    It isn't even a tricky fix. Use a larger integer! Count something larger (e.g. days instead of hours, packets instead of bytes)! Add a second integer that counts overruns of the first! Use a double or arbitrary precision value!
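    In the spirit of the first suggestion, a quick before/after with a hypothetical power-on-hours counter (nothing to do with this drive's actual data types):

        #include <stdint.h>
        #include <inttypes.h>
        #include <stdio.h>

        int main(void)
        {
            uint16_t hours16 = UINT16_MAX;   /* 16-bit counter: tops out at 65,535 h (~7.5 years) */
            uint64_t hours64 = UINT16_MAX;   /* 64-bit counter of the same quantity               */

            hours16++;                       /* wraps silently back to 0          */
            hours64++;                       /* keeps counting for ~2 × 10^15 years */

            printf("16-bit: %" PRIu16 "\n", hours16);   /* prints 0     */
            printf("64-bit: %" PRIu64 "\n", hours64);   /* prints 65536 */
            return 0;
        }

    Counting days instead of hours, or adding a second counter for overruns, buys similar headroom even with narrower types, as the comment above notes.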
  • wildbil1952 - Friday, March 5, 2021 - link

    It's not just Dell and HPE. We had this bite us on Cisco servers: an entire cluster, two ESXi hosts running SimpliVity. Even though the VMs were running on an entirely different array, when SimpliVity lost both drives - boom. Cluster down, and the SimpliVity reinstall would not see the old disks. Every VM is toast.
  • mikerobert110 - Thursday, March 31, 2022 - link

    Whoa! Thanks for all of this information. Well, we should do our own research about the topic.

    Also, people, I am here to share my amazing experience with the EMC DES-4122 practice test questions.
    https://www.test4practice.com/DES-4122-practice-te...
