CockroachDB internship project: Speeding up some interleaved table deletes by a factor of 10 billion

Last summer, I did an internship with Cockroach Labs, makers of CockroachDB, a SQL database built for massive distribution. I was working on the SQL language semantics in Cockroach, and I was able to work on many different facets of the project in that area.

Overall, my theme for the summer was finding ways to improve the performance of mutation statements - that's your INSERTs, UPDATEs, and DELETEs. At the tail end of the internship, I was able to contribute a major performance gain by adding a fast path to a particular kind of DELETE, involving a kind of table called an interleaved table. This post is about this particular performance fix and everything about how it works.

All the work described in this post actually comes from this pull request.

What's an interleaved table, anyway?

When you search "interleaved table", the first results are CockroachDB's own documentation and that of competitor Google Cloud Spanner. Both have the same semantics, and revolve around the data in one table needing to have a strong relationship with another. This is generally the case when one data object "belongs" to another, possibly in a parent-child relationship. Formally, you can specify a "child" table to be interleaved in another "parent" table if and only if the parent's primary key is a prefix of the child's primary key.

Interleaving tables makes reads, joins, writes and other operations much faster when operating on both the parent and child tables, so they primarily should be used if you have two (or more) tables linked to each other, where you frequently have to access rows from both. They shouldn't be used recklessly, however: interleaved tables generally make reads and deletes over ranges a little slower, so when you don't have a strong data locality relationship as described above, specifying them as interleaved may not give you enough benefit for the trouble.

In Cockroach's case at least, specifying a table to be interleaved causes a change in the data's underlying representation to change in storage. CockroachDB uses a key-value store called RocksDB as its storage layer - rows of tables in the database are stored in RocksDB as one or more keys. Cockroach has a very good comprehensive introduction to how this mechanism works and extensive documentation about all its idiosyncrasies.

When a table is specified as interleaved, the keys for the primary index of the child table contain the key corresponding to their associated parent row as a prefix. Here's an example:

You can also have interleaved tables within interleaved tables, like this:

Because of how RocksDB stores its keys in sorted order (that's a bit of an oversimplification of what it does, actually), interleaving always puts the keys corresponding to the child table's primary key right in between the keys that correspond to the primary key of the row it's interleaved in. This leads to the aforemented speedups in the read, join, and write operations. The keys are much closer together and the database always knows precisely where the interleaved tables' keys were, and doesn't have to search.

At the time of the fix, however, some work still needed to be done to leverage this data transformation for the purpose of improving the performance of deletes.

Why are deletes slow?

A child table in an interleaved relationship usually has a foreign key reference to its parent table. When this foreign key reference is set as ON DELETE CASCADE, rows of the child table are deleted when their parent rows are deleted. In order for CockroachDB to perform a cascading delete on a row from parent table, it performs the following procedure, which is just the same procedure it performs for all foreign key cascading deletes:

It's very important to note that all of these checks are happening on a row-by-row basis. This is incredibly expensive; delete operations that required many foreign key checks were incredibly slow, even when deleting from the parent table of an interleaved relationship.

With interleaved tables, however, the way the keys are laid out allows for the creation of a fast path when the interleaved relationship is just right.

Vroom Vroom

The fast path works based on the premise that the child table's primary-key keys have the parent table's primary-key keys as a prefix. CockroachDB has a DelRange key-value operation that performs a series of deletes in RocksDB to speedily delete all the keys in a range of values. So, in order to delete a row from the parent table and all its corresponding child table rows, all you need to do is perform this "range delete" from the parent table key to the last key that has the key as its prefix. An illustration of this is below.

Of course, you can't just go using the fast path whenever there is an interleaved table --- when you do a key-value operation such as DelRange you are obviating the need for foreign-key and other safety checks, which need to be done beforehand in order for the fast path to work without possibly leaving you with a corrupted database. Luckily, checking whether the schemas are okay before doing the DelRange is still a lot cheaper than checking for foreign key constraints every time you delete a row.

In order to be able to perform the delete, the database has to make sure the following conditions are met:

To detect the fast path, I implemented a tree-walk (breadth-first search) through all of the schema information involved (with the parent table as the root) whenever a delete was performed on a table that had other tables interleaved in it. So this check would run whenever a user would do something like a DELETE FROM parent WHERE id > 4 or with any other WHERE clause. The check would check for the three criteria listed above, going through all of the tables interleaved in the parent table and adding any further interleaved relationships into a queue. Here's the relevant function in all its glory:

func canDeleteFastInterleaved(table TableDescriptor, fkTables sqlbase.TableLookupsByID) bool {
    // If there are no interleaved tables then don't take the fast path.
    // This avoids superfluous use of DelRange in cases where there isn't as much of a performance boost.
    hasInterleaved := false
    for _, idx := range table.AllNonDropIndexes() {
        if len(idx.InterleavedBy) > 0 {
            hasInterleaved = true
    if !hasInterleaved {
        return false
    // if the base table is interleaved in another table, fail
    for _, idx := range fkTables[table.ID].Table.AllNonDropIndexes() {
        if len(idx.Interleave.Ancestors) > 0 {
            return false
    interleavedQueue := []sqlbase.ID{table.ID}
    for len(interleavedQueue) > 0 {
        tableID := interleavedQueue[0]
        interleavedQueue = interleavedQueue[1:]
        if _, ok := fkTables[tableID]; !ok {
            return false
        if fkTables[tableID].Table == nil {
            return false
        for _, idx := range fkTables[tableID].Table.AllNonDropIndexes() {
            // Don't allow any secondary indexes
            // TODO(emmanuel): identify the cases where secondary indexes can still work with the fast path and allow them
            if idx.ID != fkTables[tableID].Table.PrimaryIndex.ID {
                return false
            // interleavedIdxs will contain all of the table and index IDs of the indexes interleaved in this one
            interleavedIdxs := make(map[sqlbase.ID]map[sqlbase.IndexID]struct{})
            for _, ref := range idx.InterleavedBy {
                if _, ok := interleavedIdxs[ref.Table]; !ok {
                    interleavedIdxs[ref.Table] = make(map[sqlbase.IndexID]struct{})
                interleavedIdxs[ref.Table][ref.Index] = struct{}{}
            // The index can't be referenced by anything that's not the interleaved relationship
            for _, ref := range idx.ReferencedBy {
                if _, ok := interleavedIdxs[ref.Table]; !ok {
                    return false
                if _, ok := interleavedIdxs[ref.Table][ref.Index]; !ok {
                    return false
                idx, err := fkTables[ref.Table].Table.FindIndexByID(ref.Index)
                if err != nil {
                    return false
                // All of these references MUST be ON DELETE CASCADE
                if idx.ForeignKey.OnDelete != sqlbase.ForeignKeyReference_CASCADE {
                    return false
            for _, ref := range idx.InterleavedBy {
                interleavedQueue = append(interleavedQueue, ref.Table)
    return true

Here's a higher-level description of how the check runs:

If the check passes, then instead of going through the foreign key-checking workflow, Cockroach will then issue a single KV delete range operation on the range of keys it needs to delete; it gets this range by doing a scan through of the keys and finding the ranges that satisfy the WHERE clause specified in the DELETE command.

Ordinarily, the scan request will return a bunch of key spans that will end at the key of the parent row that needs to be deleted, in the form of a pair of keys indicating the start and end. For the interleaved fast path we then need to just change the end key to the last key that has the end key as a prefix. Finding the range from a key to its "prefix end" seems like a pretty big dependency but a PrefixEnd function already existed Cockroach's KV API and I was able to use it precisely for this purpose.

That's all there was to it! Now, instead of going through foreign key checks every row, the database now just goes through all of the table schemas before starting to delete anything. Let's see how the performance improved.

Show me the numbers

To test out the performance of the fast path, a number of benchmarks were written in deleting large numbers of rows from interleaved tables in the following cases:

The benchmark just tests the performance of a delete after inserting 1000 rows into the parent table and one child row per parent row for each child table. Go benchmarks measure performance by running the operation many times and taking how much time on average a single operation takes. Here are the numbers:

TestNon-fast path timeFast path timeSpeedup (slow time / fast time)
Child50800162 ns/op0.08 ns/op635002025
Siblings1557171104 ns/op0.23 ns/op6770309147
Child and grandchild2397660316 ns/op0.22 ns/op10898455981.8

Disclaimer: the numbers here are somewhat idealized, because the benchmarks here were run on a single node, which is definitely not the situation that Cockroach will normally run in.

However, we still see an performance boost of a factor of 10 billion for the most complex case between the non-fast path and the fast path. This was absolutely huge! We went from second-scale to sub-nanosecond scale in one fell swoop.

Other than the huge performance boost, it's also interesting that while the fast path experiences a predictable slow-down from the single-child case to the cases with more than one interleaved table, the performance for the sibling and grandchild cases perform about the same. This is not true of the non-fast path deletes, which experience a slow down from the sibling to the grandchild case.

This is most likely because the fast path's bottleneck in performance is scanning through all of the keys, which would scale just with the number of keys that need to be deleted in total, while the non-fast path has a bottleneck of checking all of the foreign key relationships of every row being deleted. So not only is there a constant-factor performance boost, there's an algorithmic complexity boost as well.

Future directions

As this is a fast path, the natural next thing to do is to find more cases where this can work, so that we can see similar performance gains for more types of queries. Revisiting the criteria for the fast-path check, we identified that the criteria that could be relaxed first would be the condition that none of the tables could have secondary indices.

The problem with secondary indices is that if an interleaved table has a secondary index, this creates another key that's outside of the prefix range of the parent table. For example, if table with ID 73 is interleaved in table 54, then the child table's primary key would be something like Table/54/1/1/#/Table/73/1/1/0, and its secondary index would be something like Table/73/2/3/0, which isn't in the range. Using some extra scans, it's probably possible to make it so that the interleaved delete will cascade into these secondary index key ranges as well.

On a broader note, other fast paths can be discovered to optimize other types of queries to improve performance in other cases.

If you enjoyed this piece and are excited about this kind of work, I definitely recommend Cockroach Labs as an internship (or full-time) destination! Here's their open positions page.