
  • FunBunny2 - Thursday, September 5, 2019 - link

    RIP real, aka Relational, databases.
  • Billy Tallis - Thursday, September 5, 2019 - link

    They're still around, and always will be. But they're not the only game in town anymore, and most people understand that they're not always the best solution. Of course, there are also plenty of examples where people have gone overboard in rejecting relational databases.
  • shayne.oneill - Friday, September 6, 2019 - link

    "they're not always the best solution"

    As we've been (re)discovering, however, there aren't a lot of use cases where relational DBs are not the best solution. There's a good reason why non-relational databases were mostly abandoned in the 80s and 90s.
  • Urthor - Saturday, September 7, 2019 - link

    People will keep saying relational databases are dying as long as they keep mistakenly associating the term "relational" with tabular data.

    In reality, the principles behind relational databases are identical whether your data is stored in tables, objects, documents, or other parsable data, but people seem to think the term "relational" means SQL-92.
  • submux - Saturday, September 7, 2019 - link

    I think the most important factor is that there are far too many cases where a relational DB is treated as the only answer and ends up being abused. There is a massive amount of unstructured data which has been stored in blobs for ages. When developing an application such as one for managing a doctor's office, you would want to store structured patient data in a relational database, but you'd want to store pictures, lab data (genetic data, for example), X-rays, etc. in an object store.

    We don't need to do either/or and from an architectural perspective, running an SQL front-end on an object back end makes sense if only because most SQL servers have scalability issues. Legacy SQL ISAMs tend to shard very narrowly. By using Mongo, Couch, whatever, you can often scale out across many nodes. This is great for performance (maybe not for sync, but for query at least), and it makes it possible to simplify backup by keeping all data in a single data store.

    There is a holy grail to be discovered, and that's the means of making a great scalable database which handles structured and unstructured data equally well. This could be an SQL server with amazing blob handling, or a NoSQL server with amazing query processing... or maybe a hybrid. It should be deployable using Docker or K8s without any complex replication junk, and it should scale almost linearly with each additional node.

    I think this will happen soon. And these drives may make it more interesting if an object store ends up being how we can accelerate it.
  • ElishaBentzi - Thursday, September 5, 2019 - link

    yes, it's another nail in the coffin for databases. key-value / hash tables are almost all we need.
  • bcronce - Thursday, September 5, 2019 - link

    Key value datastore is almost all you need [for non relational data].

    Why do I say for non-relation data? Because humans, being humans, will make mistake after mistake if relational consistency is not enforced. Unless you have a simplistic use case or are dealing with the top 0.1% of programmers, you're going to have people corrupting your data.

    But but... "We don't have that issue!" you say. Ehhhh... not in my experience. People who think they're not corrupting data seem not to know when they are. "But we have all kinds of checks!" you say. Yeah, but unless you're in the top 0.1%, you probably have the wrong checks, or the checks are buggy and giving false negatives.

    As someone who deals with data integration and loves RDBMSs, I generally load data into an RDBMS. And even with data from the biggest players in the industry, I will find invalid relations in data that spans years.

    Buggy software is much easier to fix than corrupt data.

    The real irony is that many projects that don't want to use an RDBMS run into many issues and start adding all kinds of business logic to their data layer, only to reinvent parts of an RDBMS, poorly.
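
    To make that concrete: a minimal sketch with Python's built-in sqlite3 (the schema and names are made up) of the kind of dangling reference an RDBMS refuses and a plain key-value store happily accepts:

    ```python
    import sqlite3

    # Hypothetical schema: every order must reference an existing customer.
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("""CREATE TABLE orders (
                        id INTEGER PRIMARY KEY,
                        customer_id INTEGER NOT NULL REFERENCES customers(id))""")

    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'alice')")
    conn.execute("INSERT INTO orders (id, customer_id) VALUES (10, 1)")   # fine

    try:
        # The RDBMS refuses the dangling reference outright...
        conn.execute("INSERT INTO orders (id, customer_id) VALUES (11, 999)")
    except sqlite3.IntegrityError as e:
        print("caught by the database:", e)

    # ...whereas a plain key-value store happily stores the same bad data.
    kv = {}
    kv[b"order:11"] = b'{"customer_id": 999}'   # nothing checks that customer 999 exists
    ```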
  • FunBunny2 - Thursday, September 5, 2019 - link

    "The real irony is many projects that don't want to use a RDBMS, run into many issues and start adding all kinds of business logic to their data layer, only to reinvent parts of a RDBMS, poorly."

    the crux of the issue: ever since IDMS & IMS reared their hierarchical heads, client-side coders have sought to sabotage transaction control in the datastore, which is what distinguishes a database from a datastore. client-side coders seek to extend their employment to infinity. current industrial-strength RDBMS can handle petabyte storage, so it's not as if application-only flat files are necessary.
  • lkcl - Thursday, September 5, 2019 - link

    Lightning Memory-Mapped Database (LMDB) is a transactional key-value store based on copy-on-write shmem (shared memory) file capabilities and uses a B+ tree for data. the copy-on-write semantics mean that, unlike a standard key-value store which must copy the data, LMDB may give you a *direct* pointer to the actual data. consequently, performance for a 100 MEGABYTE value is as good as performance for a 100 BYTE value.

    its transactional capabilities work by allowing a (single) writer to write an alternative B+ tree whilst multiple readers use the current B+tree, all of which is safe to do because the copy-on-write semantics isolate the writer entirely from the readers. when the transaction is complete, a new B+-tree "root" is created, which *only new readers* will be allowed to see (not the ones that currently have a read transaction in progress).

    it is extremely powerful and very obtuse code that took one (inexperienced) reviewer a YEAR to complete a full audit (he simply could not believe the listed capabilities of LMDB).

    interestingly - sadly - i doubt very much whether LMDB's capabilities could take advantage of Samsung's KV SSDs. also, do look up Howard Chu's comments on RocksDB and LevelDB. they are extremely informative (and not very complimentary; if your business is going to be critically dependent on a key-value store, you need blunt, unfiltered, no-bullshit, *technically correct* advice).
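
    for the curious, here is roughly what using it looks like from python, assuming the py-lmdb binding (a minimal sketch; the path and keys are just examples):

    ```python
    import lmdb

    # open (or create) an environment; map_size is the maximum database size
    env = lmdb.open("/tmp/example-lmdb", map_size=1 << 30)

    # single writer: one write transaction at a time, committed when the
    # 'with' block exits cleanly
    with env.begin(write=True) as txn:
        txn.put(b"key-1", b"value-1")
        txn.put(b"key-2", b"value-2")

    # readers don't block the writer (and vice versa); each read transaction
    # sees the consistent B+ tree root that was current when it started
    with env.begin() as txn:
        print(txn.get(b"key-1"))            # b'value-1'
        for key, value in txn.cursor():     # keys come back in sorted order
            print(key, value)
    ```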
  • FunBunny2 - Friday, September 6, 2019 - link

    "its transactional capabilities work by allowing a (single) writer to write an alternative B+ tree whilst multiple readers use the current B+tree, all of which is safe to do because the copy-on-write semantics isolate the writer entirely from the readers. when the transaction is complete, a new B+-tree "root" is created, which *only new readers* will be allowed to see (not the ones that currently have a read transaction in progress)."

    this is not materially different from MVCC semantics, which have hobbled Oracle et al for a long time. Oracle eats servers for breakfast. sane transaction scoping, sans client intervention, with a locker (traditional) engine, will always be more parsimonious and efficient.
  • lkcl - Friday, September 6, 2019 - link

    you may be interested to know that an oracle employee tracked the LMDB wikipedia page for a long time, continuously trying to get it deleted. when they succeeded, i went into overdrive and spent three weeks rewriting it to a high standard. we have that unethical oracle employee to thank for that. whoopsie :)

    howard chu's perspective on write-journal transaction logging is both dead-accurate and extremely funny. he says, very simply, "well if you maintain a write journal you just wrote data twice and cut performance in half".

    oracle's approach is *in no way* more efficient. "eats servers for breakfast" - as in, it's so heavy it "consumes" them? as in, it puts such a heavy load onto machines that it destroys them?

    howard's LMDB code is so small it fits into the L1 cache of any modern processor (including smartphone ones). by designing the code to *be* efficient, the server happens to have more time to do things such as "read and write".

    interestingly, as long as fsync is enabled, it is literally impossible to corrupt the database. only if SSDs without power loss protection are used (such that the *SSD* corrupts the data) will the database become corrupted. that's with *no write journal because by design one is not needed*. you might lose some data (if the last write transaction did not get to write the new B+-tree root in time), but you will not *corrupt* the *existing* data due to a power-loss event.

    https://en.wikipedia.org/wiki/Multiversion_concurr...

    no.

    timestamps are *NOT* used in LMDB. there is absolutely no slow-down as the number of readers increases, because there is no actual "locking" on read-access. the only "locking" required is at the point where a writer closes a transaction and needs to update the "root" of the B+-tree.

    so if a reader happens to start a new read transaction just at the point where a writer happens to want to *close* a transaction, there is a *small* window of time in which the reader *might* be blocked (for a few hundred nanoseconds). if it misses that window (early) it gets the *old* root node. if it misses that window (late) it gets the *new* root node. either way there will be no delay.

    there are *NOT* "multiple versions of the same objects", either, because there is only one writer. however there is some garbage-collection going on: old B+-tree roots, pointing to older "versions", will hang around until the very last reader closes the read transaction. at that point, LMDB knows to throw those pages onto the "free" page list.

    so the only reason that "old" objects will be still around is because there's a reader *actually using them*. duh :)

    the difference in the level of efficiency between MVCC and what LMDB does is just enormous, and it's down to the extremely unusual choice to use copy-on-write shared memory semantics.

    you may be interested to know that the BTRFS filesystem is the only other well-known piece of software using copy-on-write. it's just so obscure from a computer science perspective that it was 20 years before i even became aware that shmem had a copy-on-write option!
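
    (a toy model of the root-swap idea, in python, purely to illustrate - this is emphatically *not* LMDB's actual implementation:)

    ```python
    # readers grab whatever root is current when they start; the single
    # writer never touches pages reachable from that root, it copies them
    class Page:
        def __init__(self, entries):
            self.entries = dict(entries)   # stand-in for a B+ tree page

    current_root = Page({b"counter": b"1"})

    reader_root = current_root             # a "read transaction" pins the old root

    # the writer copies the page(s) it modifies and builds a new root
    new_root = Page(current_root.entries)
    new_root.entries[b"counter"] = b"2"
    current_root = new_root                # atomic root swap = commit

    print(reader_root.entries[b"counter"])    # b'1' -- old snapshot, untouched
    print(current_root.entries[b"counter"])   # b'2' -- what new readers see
    ```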
  • lkcl - Friday, September 6, 2019 - link

    re-read that wikipedia page:

    "When an MVCC database needs to update a piece of data, it will not overwrite the original data item with new data, but instead creates a newer version of the data item. Thus there are multiple versions stored. "

    so yes... technically you are correct. however the actual implementation details are chalk and cheese. without using copy-on-write, the performance penalty is just enormous. *with* copy-on-write, MVCC semantics are an accidental beneficial *penalty-less* design side-effect.
  • bcronce - Friday, September 6, 2019 - link

    "you may be interested to know that the BTRFS filesystem is the only other well-known piece of software using copy-on-write"

    ZFS uses a merkle tree, which is also CoW because an individual branch of a merkle tree is effectively immutable. Any change by definition has to create a whole new branch. Git is another example of this same algorithm. You can't change history, but you can create a new history, and it is perfectly obvious that it is different.
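
    A rough illustration of why a merkle branch is effectively immutable (plain Python, hypothetical two-level tree):

    ```python
    import hashlib

    def h(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    # Leaves are addressed by the hash of their content; the parent (branch)
    # is addressed by the hash of its children's hashes.
    leaves = [b"block-A", b"block-B"]
    leaf_hashes = [h(x) for x in leaves]
    root = h("".join(leaf_hashes).encode())

    # "Changing" a leaf can't update anything in place -- it produces new
    # hashes all the way up, i.e. a brand-new branch with a new root.
    new_leaf_hashes = [h(b"block-A-modified"), leaf_hashes[1]]
    new_root = h("".join(new_leaf_hashes).encode())

    assert new_root != root   # the old history is still intact under the old root
    ```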
  • mode_13h - Sunday, September 8, 2019 - link

    And Sun/Oracle also wrote ZFS.
    : )
  • mode_13h - Sunday, September 8, 2019 - link

    > i went into overdrive, spent three weeks rewriting it to a high standard

    Thanks for that & your informative post.

    > interestingly, as long as fsync is enabled, it is literally impossible to corrupt the database. only if SSDs without power loss protection are used (such that the *SSD* corrupts the data) will the database become corrupted.

    Even with fsync(), I believe you still need to enable filesystems' write barriers - a common mount option.

    > you may be interested to know that the BTRFS filesystem is the only other well-known piece of software using copy-on-write.

    This is a very strong claim. I know Linux has long used CoW in its VM system to trigger page allocations. I'm not sure if it's still true, but it used to be that malloc() would return memory mapped to a shared page containing zeros. Only when you tried to initialize it would a write-fault cause the kernel to go out and find an exclusive page to put behind it. Now that I think about it, I'm not sure how advantageous that would really be, especially if using 4 kB pages - the overhead from all those page faults would surely add up.

    CoW semantics show up in a lot of places where you have ref-counting. Some libraries force a copy when you try to write to an object and find that the refcount is > 1 (meaning you don't have exclusive ownership).
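
    E.g. a toy copy-on-write container in Python (purely illustrative; real libraries typically keep the refcount on the shared buffer itself):

    ```python
    class CowList:
        """Share the underlying list until someone writes to it."""
        def __init__(self, data=None, _shared=None):
            # _shared is [list, refcount]; all owners point at the same pair.
            self._shared = _shared if _shared is not None else [list(data or []), 1]

        def copy(self):
            self._shared[1] += 1                 # another owner of the same buffer
            return CowList(_shared=self._shared)

        def __getitem__(self, i):
            return self._shared[0][i]            # reads never copy

        def __setitem__(self, i, value):
            if self._shared[1] > 1:              # not the exclusive owner:
                self._shared[1] -= 1             # detach...
                self._shared = [list(self._shared[0]), 1]   # ...and copy on write
            self._shared[0][i] = value

    a = CowList([1, 2, 3])
    b = a.copy()          # no data copied yet
    b[0] = 99             # the copy happens here
    print(a[0], b[0])     # 1 99
    ```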
  • mode_13h - Sunday, September 8, 2019 - link

    BTW, there's some hint of irony in bashing Oracle on one hand, and then seemingly praising BTRFS - also written by Oracle! That said, big companies are rarely all bad or all good.
  • bcronce - Friday, September 6, 2019 - link

    That is interesting. A single-writer DB can be useful for limited applications or heavy-read, low-write workloads. But then you also have the issue of needing to partition your data in order to scale up writes.

    Moral of the story. There are many interesting tools to solve many types of problems. Some are fundamentally a poor fit for certain problems, and others could be a good or bad fit depending on the design, implementation, and configuration.

    The most fun thing about document data stores is that everyone is using them as distributed data stores that need multi-document, cross-service transactional consistency - a much more difficult problem than multithreading. At least with multithreading, you're working in the same memory space, no messages are lost, and there are very small upper bounds on delayed messages.

    Engineers treat multithreading as some humanly impossible thing, but then blindly barrel head-first into distributed eventually-consistent data storage with no cross-service consistency protocols and assume anyone can write a "microservice" in a vacuum as a one-sprint project.
  • lkcl - Friday, September 6, 2019 - link

    yes, you may be interested to know that LMDB was initially designed purely for OpenLDAP (focussing as it does on read performance). howard's technical background would not let him write something that was inefficient, and that, surprisingly, makes it "not bad" (not great, just "not bad") for writes as well.

    OpenLDAP obviously has slurpd and can do data distribution. i agree 100%, it's technically an extremely challenging problem. not helped by developers not appreciating that network connections can disappear on you, and the *entire program* has to be designed around connection recovery and continuation.

    tricky! :)

    howard's comments on eventually-consistent databases are very funny. he says something like, "if you are going to use that [brain-dead technical approach] you might as well just use mongodb and hope that it writes the data". i wish i could remember where it was; it was incredibly funny. basically, someone suggested doing something which, technically, was as good as just throwing the data away.

    the last time i did a review of these eventually-consistent "data"bases, i was asked to throw as much data at them as possible in order to check maximum write rates. i was shocked to find that mongodb, after only 3 minutes, paused for around 30 seconds and would not let me write any more data. after another 2 minutes, it paused for 45 seconds. 2 minutes later, it paused for an entire minute. you can guess where that went. we did not go with mongodb.

    btw another really informative article by him: https://www.linkedin.com/pulse/actordb-distributed...
  • mode_13h - Sunday, September 8, 2019 - link

    > assume anyone can write a "microservice" in a vacuum as a one-sprint project.

    Lol, true.
  • submux - Saturday, September 7, 2019 - link

    Back end object storage can be unstructured and manageable, but the front end can be relational and enforce relationships.

    Then you get the scalability advantages of object storage rather than ISAM and you also get the advantages of SQL RDBMS.
  • jordanclock - Thursday, September 5, 2019 - link

    I'm curious as to why you would call relational databases "real" databases?
  • FunBunny2 - Thursday, September 5, 2019 - link

    the simple answer: RDBMS control the data independent of the clients; all of these other flat-file analogs leave control in the client, which is a problem.
  • satai - Sunday, September 8, 2019 - link

    Client/server architecture has very little to do with relationality.
  • FunBunny2 - Sunday, September 8, 2019 - link

    "Client/server architecture has very little to do with relationality."

    it's not relationality, per se, that's at issue. it just happens that, these days, SQL (relational, sort of) databases constitute 99.44% of datastores which are controlled by a central TPM. CICS - which, believe it or not, is still around, even on linux - is separate from the datastore. RDBMS/SQL engines incorporate the datastore. before Codd and the RM, both IDMS and IMS (again, still around) were/are transaction control engines for their datastores.

    doing transaction control from the client(s) is the disaster waiting to happen. that's what was figured out 50 years ago. kiddie koders fresh out of koder skool have no clue about data. they just want a sinecure pounding out LoC.
  • prisonerX - Thursday, September 5, 2019 - link

    Relational databases are built from key-value stores (generally b-trees) so your comment doesn't make much sense.

    What does make sense is moving KV stores closer to the hardware to improve performance, so this product is a great idea.
  • cosmotic - Thursday, September 5, 2019 - link

    The index might be a b-tree but the storage of the data almost definitely isn't a b-tree.
  • lkcl - Thursday, September 5, 2019 - link

    https://github.com/LMDB/sqlightning

    sqlite3 uses a btree for its data. replacing the btree algorithm with LMDB resulted in the "insert" test completing 1,000 times faster. this is due to a unique feature of LMDB, its "insert-at-end" capability.

    synchronous sequential and random writes were 25% better. asynchronous sequential writes about 15% worse. async random writes about 1% better. random reads 80% better. sequential reads 85% better. etc.
  • lkcl - Friday, September 6, 2019 - link

    https://blog.biokoda.com/post/133121776825/actordb...

    "How - SQLite

    Lets start with some basics. A basic unit of storage for a database is a page. Pages are generally 4k or 8k. An SQLite file is a sequence of pages one after another."
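
    you can poke at that page layer directly; a minimal sketch using python's built-in sqlite3 (the file name is just an example):

    ```python
    import sqlite3
    import struct

    db_path = "/tmp/pages-demo.db"   # hypothetical file
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS t (k TEXT PRIMARY KEY, v BLOB)")
    conn.commit()

    print(conn.execute("PRAGMA page_size").fetchone())    # e.g. (4096,)
    print(conn.execute("PRAGMA page_count").fetchone())   # database size in pages

    # The same number sits in the file itself: a 2-byte big-endian value at
    # offset 16 of the 100-byte header on page 1.
    with open(db_path, "rb") as f:
        header = f.read(100)
    print(struct.unpack(">H", header[16:18])[0])
    ```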
  • FunBunny2 - Sunday, September 8, 2019 - link

    "A basic unit of storage for a database is a page"

    yes and no. the unit of storage depends on the hardware and OS. it could be a page, an extent, or a row; it just depends on what the OS supports and what engine writers opt for. whether a write is implemented as a full-page re-write on a row change (update/delete/insert) is up to the engine writer and the OS capabilities. RM/SQL semantics make it the row (strictly speaking, the set, which may be a join, which itself may resolve as one row), and someday, Codd willing, all RDBMS will support that.
  • shayne.oneill - Friday, September 6, 2019 - link

    "Relational databases are built from key-value stores"

    Uh, no. No, they are not - they most definitely are not KV stores. You can implement a KV store on a relational database, but under the hood they look nothing like a KV store.
  • sercand - Friday, September 6, 2019 - link

    Most RDBMSs are really built from key-value stores like RocksDB. For example, CockroachDB has a great blog post about how SQL engines work with key-value stores: https://www.cockroachlabs.com/blog/sql-in-cockroac...
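
    The mapping is roughly "primary key embedded in the KV key, one entry per column" - a hand-wavy Python sketch of the idea (this key layout is invented, not CockroachDB's actual encoding):

    ```python
    # Encode rows of a hypothetical "users" table into an ordered key-value store.
    # The primary key goes into the KV key, so a table scan is just a range scan.
    store = {}   # stand-in for RocksDB / a KV SSD

    def put_row(table: str, pk: str, row: dict) -> None:
        for column, value in row.items():
            store[f"/{table}/{pk}/{column}"] = str(value)

    def get_row(table: str, pk: str) -> dict:
        prefix = f"/{table}/{pk}/"
        return {k[len(prefix):]: v for k, v in store.items() if k.startswith(prefix)}

    put_row("users", "42", {"name": "alice", "email": "alice@example.com"})
    print(get_row("users", "42"))   # {'name': 'alice', 'email': 'alice@example.com'}
    ```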
  • [email protected] - Thursday, September 5, 2019 - link

    I mean, this is a RocksDB replacement; relational databases like MySQL still need a storage layer, which is often RocksDB. This doesn't replace the database, it replaces the storage layer's interface to the drive.
  • lkcl - Friday, September 6, 2019 - link

    https://www.linkedin.com/pulse/lies-damn-statistic...

    be careful. rocksdb is a pile of s**t.
  • FunBunny2 - Sunday, September 8, 2019 - link

    "be careful. rocksdb is a pile of s**t."

    ah, dreadful partisanship there.
  • lkcl - Sunday, September 8, 2019 - link

    not really - do the research and you'll find that most benchmarks are done in an incompetent (and thus misleading) fashion, such as only running the test for 30 minutes and yet claiming full read coverage when, plainly, at the data rate published *by* the tester, 30 minutes is not enough time to physically read the entire dataset. yet, somehow, they claim to have managed it... turns out there's a large RAM cache which they failed to mention, which invalidates the entire benchmark.

    however in the case of RocksDB it's much worse than that: there are technical design flaws which it takes a huge amount of expertise to even fully understand. for that, you want the analysis of one of the world's leading experts - a core developer on OpenLDAP - Howard Chu.
  • trissylegs - Thursday, September 5, 2019 - link

    CockroachDB implements a Postgres-compatible database on top of a key-value store (using RocksDB).
    So you should be able to use these with Cockroach (if support were implemented).
  • insufferablejake - Friday, September 6, 2019 - link

    Not really. Relational DBs expose a relational schema; this doesn't dictate how they actually persist their data. For example, if B-trees are used [ref. https://www.sqlite.org/fileformat.html], then you can replace the 'pages' or nodes in a B-tree with the 2MB pages on an SSD.
  • Urthor - Saturday, September 7, 2019 - link

    Exactly. A relational database just means "exposes a relational schema", but people keep thinking that all relational databases "must" use the same under-the-hood technologies as MySQL or else they are not relational.

    As long as you expose that relational schema, your actual method of mapping physical media to the relational tables is totally irrelevant.
  • Sivar - Thursday, September 5, 2019 - link

    I am curious how the move from fixed block-oriented to dynamic-key-oriented storage was implemented in firmware. Isn't page size (smallest programmable unit) fixed at manufacture time? 4 bytes seems a little small for a physical page of flash memory.
  • Billy Tallis - Thursday, September 5, 2019 - link

    Page sizes for current flash memory tend to be on the order of 16kB. Even block-oriented SSDs have to do partial page programming a lot, and benefit when they can cache writes in order to program full pages at a time.

    I don't know what kind of data structures Samsung is using under the hood, but it has to be more complex than the flat address indirection tables used by typical SSDs, and probably looks more like a filesystem. So keys and length info are probably packed into some kind of index, while values (at least larger ones) probably get allocated whole NAND pages.
  • lkcl - Thursday, September 5, 2019 - link

    my guess is that they found a random piece of free software off the internet, ported it to run on an embedded linux distribution such as openwrt on what is almost certainly an ARM core on the PCB of the SSD, and slapped on a "server" version of the exact same API code they just released on github.

    that "server" version will receive the API calls over the NVMe bus, unpack them, and call whatever free software key-value library is being run on the SSD's processor.

    what would actually be far more sensible - and this is something that the RISE Group at IIT Madras is giving actual serious consideration to - would be to allow the *USERS* to run the Embedded OS *of their own choice* on the actual SSD's embedded processor.

    this would allow developers to put the best key-value store *OF THEIR CHOICE* on the actual bare metal (ok NVMe) SSD, whilst also allowing the manufacturers to actually still sell useful product and make a lot of money.
  • iaw4 - Friday, September 6, 2019 - link

    I would love to experiment with my own SSD firmware. I have been looking for such a device for a long time. Does anyone know where I could find a reasonably priced SSD that allows this?
  • lkcl - Friday, September 6, 2019 - link

    some quick google searches:
    https://news.ycombinator.com/item?id=8791274

    which in turn links here:
    http://www.openssd-project.org/

    that would be a good place to start, and it turns out that "The Indilinx's Barefoot™ controller is an ARM-based SATA controller used in numerous high-performance SSDs such as Corsair Memory's Extreme/Nova, Crucial Technology's M225, G.Skill's Falcon, A-RAM's Pro series, OCZ's "
  • iaw4 - Friday, September 6, 2019 - link

    sadly, the openssd project is a $3,500 card to experiment with software. it's great, but there seems to be no path to developing one's own firmware and loading it onto a reasonably priced commercial ssd. even for the indilinx chip, one would presumably still have to be able to sign code to upload it. oh, and indilinx is dead.
  • lkcl - Sunday, September 8, 2019 - link

    argh sigh, that's a pity. it means starting again from scratch, with a more up-to-date chip. well, sigh, this is precisely why IIT Madras want to actually develop their own SSD ASIC, with *actual* open standards, so that people can put the processing right where the data is.

    in the meantime, there is... sort-of... a way that you could produce your own sort-of SSD, and it's by using something like the Allwinner A20 (or the R40) and attach multiple ONFI NAND ICs to it. i believe that the SATA interface on the A20 and R40 can be "turned around" so that you can use that processor *as* a SATA client. it's a full ARM Cortex A7 Dual-Core (Quad Core for the R40) so is not messing about.

    once you have full control over the NAND ICs like that (running your own File system, or treating the entire SSD as a block device), the sky's the limit.

    this really isn't technically hard (except for dealing with wear-levelling, and there's NAND filesystem drivers to take care of that), and the sole major difference between the above and "an SSD" is... well... um... one's proprietary and mass-produced, and the other isn't.
  • FunBunny2 - Thursday, September 5, 2019 - link

    "I am curious how the move from fixed block-oriented to dynamic-key-oriented storage was implemented in firmware."

    There's nothing special about a fixed-block hard drive. This reads more like a clone of the IBM 360 C-K-D structure, which IBM has been emulating on PC hard drives for a couple of decades.
  • mode_13h - Sunday, September 8, 2019 - link

    I'm concerned about how well these will handle larger data. 2 MB sounds like a lot, but a lot of audio, video, and even image data is much larger than that. Any chance these drives will support partitions of both types - block and KV?

    There's also the issue of scalability beyond a single drive, but allegedly the standard allows for some kind of "linking" between drives?

    Finally, what's the solution for replication/fault-tolerance? Does that fall entirely on the client side of the API?
