How does the Log-Structured-Merge-Tree work?

If you are wondering why you should care about LSM-Trees: in one of my previous posts, Art of choosing a datastore, I briefly touched upon them. But this writeup is the best out there if you want to learn the inner workings of an LSM-Tree.

How does the Log-Structured-Merge-Tree work? This was a Quora answer by David Jeske. (I bookmarked this answer, but a few answers have disappeared from my bookmarks before, and sometimes writers get banned for some reason, so here I am quoting the entire answer.)

Whether LSM came from LSFS work is debatable; however, these concepts have been around since long before the 1990s. Check out the 1976 paper “Differential Files and their Applications to Large Databases”. This explains the higher-level concept behind LSM: “when writes are slow, defer them and do them in batches”.

One difficulty in answering your question is that the term “Log Structured Merge Tree” does not describe a single canonical and efficient algorithm the way the term “B-Tree” does. The LSMT algorithm as described in the paper has no specifics about how to manage merging sub-portions of the keyspace — a necessary element to making LSMT efficient. (If you don’t understand what that means, don’t worry; I’ll cover it in detail at [1]. Read on…)

I think it’s easiest to understand an LSM by understanding its main goal — higher write throughput — and how LSM achieves it relative to the b-tree. It’s also important to understand we’re talking about “blind writes”, that is, writes which do not require a read first.

Start with a B-tree. If you don’t understand that data-structure, go review it now. A B-tree, at its write-throughput limit, requires two IOs per random-key blind-write.

To understand why, consider the uncached case — the destination b-tree block is loaded (first IO), then modified, then written back (second IO). While the cached case has more subtlety, it still quickly approaches two IOs per random-record write as the dataset exceeds the size of physical RAM (see explanation [2] below).

How does LSM get better blind-write-throughput than the btree? 

In the general form described in the “differential files” paper — when writing is expensive, instead of trying to immediately write, defer into a “differential layer”. Reads consult both layers. Periodically, layers (or portions of layers) are merged efficiently, to make sure there are not too many layers — because each layer consulted hurts read performance. This is the conceptual basis of all sorted-map write-deferral strategies.
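The differential-layer idea can be sketched in a few lines of Python. This is only a conceptual model — the class name and the dict-based layers below are illustrative, not from any real system:

```python
# Minimal sketch of the "differential files" idea: defer expensive
# writes into a cheap differential layer, consult both layers on read,
# and periodically merge. All names here are made up for illustration.

class DeferredMap:
    def __init__(self):
        self.base = {}   # large, expensive-to-write layer
        self.diff = {}   # small differential layer that absorbs writes

    def put(self, key, value):
        # Blind write: lands only in the cheap differential layer.
        self.diff[key] = value

    def get(self, key):
        # Reads consult the newest layer first, then fall back.
        if key in self.diff:
            return self.diff[key]
        return self.base.get(key)

    def merge(self):
        # Periodic batch merge keeps the layer count small, since
        # every extra layer adds a lookup to the read path.
        self.base.update(self.diff)
        self.diff.clear()
```

Note the tradeoff already visible here: each deferred layer makes writes cheaper but adds one more place a read must look.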

So how many IOs per random-blind-write are there in an ideal LSM? I like to simplify this to (O * R / B), where O = # levels, R = record-size, and B = block size. If you have 100 byte records, 16k blocks, and 10 LSM levels, that turns out to be 0.06 IOs per record write. Yes, that’s about 32 times better than the b-tree for those parameters. Sounds nice, but this is just theoretical; now we need a concrete algorithm to get this.
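As a quick sanity check of that arithmetic, using the parameters from the text:

```python
# Back-of-the-envelope check of the (O * R / B) cost model.
levels = 10              # O: number of LSM levels
record_size = 100        # R: bytes per record
block_size = 16 * 1024   # B: bytes per block

ios_per_write = levels * record_size / block_size
print(ios_per_write)       # ~0.061 IOs per record write
print(2 / ios_per_write)   # ~32.8x better than the b-tree's 2 IOs
```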

In the 1996 paper, Log-Structured-Merge-Tree, a simplistic but concrete scheme is described using b-trees for each layer. These layers are designated C0-Cn. The newest C0 layer is an entirely in-memory btree, and assumes writes are also going to a WAL-style log for durability. Adjacent layers are periodically merged together in their entirety, in a process referred to as a ‘rolling merge’. I call this algorithm “naive LSMT”.
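A toy sketch of this naive scheme, with dicts standing in for the per-level b-trees, the WAL omitted, and the cascade simplified — all names are illustrative, not from the paper:

```python
# Toy model of a naive LSMT: an in-memory C0 plus levels C1..Cn,
# where a level is merged into the next *in its entirety*.

class NaiveLSMT:
    def __init__(self, c0_limit=4, n_levels=3):
        self.c0 = {}                                  # in-memory C0
        self.levels = [{} for _ in range(n_levels)]   # C1..Cn
        self.c0_limit = c0_limit

    def put(self, key, value):
        self.c0[key] = value        # blind writes land in memory
        if len(self.c0) >= self.c0_limit:
            self._rolling_merge()

    def get(self, key):
        # Newest layer wins; older levels are consulted in order.
        for layer in [self.c0, *self.levels]:
            if key in layer:
                return layer[key]
        return None

    def _rolling_merge(self):
        # The naive part: C0 merges into C1 wholesale; a real LSMT
        # would likewise cascade Ci into C(i+1) when Ci overflows.
        self.levels[0] = {**self.levels[0], **self.c0}
        self.c0 = {}
```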

If your goal is merely to understand the algorithm exactly as in the LSMT paper, then there it is. However, this is not particularly useful, as its performance is very sub-optimal in many common situations.

The main trouble with naive LSMT is the merging of entire levels en masse. This is ideal only when writes are perfectly random. If writes have *any* locality, then write-throughput in this scheme becomes proportionally worse at the same time that the b-tree’s becomes proportionally better. With localized writes, the b-tree gets to write to cache-local write-back cache blocks, while the naive LSMT still periodically rewrites everything in each Cn layer regardless of locality (this is a form of write-amplification).

[1] Now we’re back to the tricky part, how to manage efficiently merging only sub-portions of the key-space. There is no single accepted efficient algorithm. LevelDB, Shuttle-Trees / Fractal-Trees, and other algorithms are effectively all tackling this problem with different tradeoffs.

(a) LevelDB/RocksDB tackles it by liberally relying on a b-tree based intermediate layer, the filesystem. It divides layers up into non-key-overlapping segments (normally around 2MB), and it performs merges using one segment at Cn and all overlapping segments at C(n+1). RocksDB extends merges to handle cases where it’s more efficient to also include overlapping segments in C(n+2) or beyond. (The LSMs in Cassandra, HBase, and Hypertable are very close to the LevelDB “filesystem layered” approach.)
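The input-selection step of such a merge can be sketched roughly as follows, modeling each segment as just a (min_key, max_key) range; the function names are made up for illustration and none of the on-disk or manifest bookkeeping is shown:

```python
# Sketch of LevelDB-style compaction input selection: one segment
# from level n, plus every key-overlapping segment from level n+1.

def overlaps(a, b):
    # Two closed key ranges overlap unless one ends before the other begins.
    return not (a[1] < b[0] or b[1] < a[0])

def pick_compaction_inputs(segment, next_level):
    return [s for s in next_level if overlaps(segment, s)]

level_n_segment = ("d", "m")
level_n1 = [("a", "c"), ("b", "e"), ("f", "k"), ("l", "p"), ("q", "z")]
print(pick_compaction_inputs(level_n_segment, level_n1))
# → [('b', 'e'), ('f', 'k'), ('l', 'p')]
```

The selected segments are then merge-sorted into new non-overlapping segments at level n+1, leaving the rest of the keyspace untouched — which is exactly the sub-keyspace merging the naive LSMT lacks.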

There are several inefficiencies in the LevelDB approach. One is that it manages the map of all segments in a single large “manifest” stored as a separate filesystem file. Anytime the segment-set is modified, this entire manifest file must be rewritten. As the dataset becomes very large, the manifest becomes very large as well. To mitigate this, segment size can be increased, but that makes the minimum merge-granularity larger, causing more write-amplification (rewriting of unchanging data). Another is that it lives on top of an inode filesystem itself. Ideally, merging 12 segments would be 12 streaming reads and 1 long streaming write. However, because of the FS layer, there is no guarantee these files are physically linear on disk, and writing them requires periodic filesystem metadata writes.

(b) Shuttle-Trees and Fractal-Trees approach the problem from the opposite direction, adding write-deferral to the b-tree. While some don’t consider them LSMs, they are essentially attacking the same beast of balancing b-tree organization with write-deferral. They are tricky to understand, but in a nutshell, they are b-trees with specially designed “shuttle buffers” which allow writes to shuttle down portions of the b-tree’s block-splits more efficiently. Because their design supports much smaller block sizes than LevelDB does segment sizes, they don’t produce as much write-amplification. However, because their tree algorithm has rigid ratios of write-deferral space to tree size, the maximum write-deferral achievable is also bounded. (An ideal LSM with merging turned off can write as efficiently as a log file.)

(c) There are undoubtedly other partitioning schemes out there. Acunu’s stratified-trees appear to be a b-tree write-deferral approach. I’m personally researching my own block-hosted LSM approach which I call “Migrating Trees”.

Of course there are also other non-partitioned schemes.

bLSMTs offer some other interesting ideas, including a limited type of partitioning of the C0 generation referred to as “snow-shoveling”. It does not have a full partitioning solution; however, the paper is a very accessible overview and I highly recommend reading it.

Sqlite3’s LSMT, according to the wiki, is a naive unpartitioned LSMT, though it appears to use a form of snow-shoveling-like partitioning, but only during merge operations.


[2] Cached btree write-throughput is more complicated, but if writes are sufficiently random, and the dataset is bigger than RAM (even by a small factor), write throughput still approaches two IOs per record write. To understand why, imagine a write-back-cache completely full of dirty pages, and a database much larger than RAM. When we want to write a new random key, that block is very likely not in memory. In order to make space for it, we have to evict a dirty page from the write-back-cache (first IO), then we can load the block we need (second IO), and when we modify it, the cache is again full of dirty pages. We’ve reached the b-tree write-throughput “cliff”.
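This thought experiment is easy to simulate. The parameters below are arbitrary, chosen only so that RAM holds about 10% of the dataset; with a larger dataset-to-RAM ratio the cost climbs closer to 2:

```python
# Simulation of the write-back-cache cliff described above: a cache
# full of dirty blocks, random writes over a dataset larger than RAM.
# Each miss costs one eviction write plus one load read (~2 IOs).
import random

random.seed(0)
n_blocks = 10_000     # dataset size in blocks (10x bigger than cache)
cache_size = 1_000    # blocks that fit in RAM

cache = set(range(cache_size))   # start full of dirty pages
ios = 0
n_writes = 100_000

for _ in range(n_writes):
    block = random.randrange(n_blocks)
    if block not in cache:
        cache.pop()       # evict a dirty page: 1 write IO
        ios += 1
        cache.add(block)  # load the needed block: 1 read IO
        ios += 1
    # modifying the block leaves it dirty; the cache is full again

print(ios / n_writes)  # ~1.8 here; approaches 2 as the dataset dwarfs RAM
```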
