mvcc 总共有两级索引,一个是在持久化存储中的 b+tree index,一个是在内存中维护的 b-tree secondary index。
在持久化存储中,目前 etcd 抽象了 backend 的设计(期望未来支持多种存储引擎),它封装了对 boltdb 的操作,而在 bolt backend 中实际使用的 key 为
在内存中的索引主要对应两个结构
下面我们展开讲讲上面的几个数据结构
// revBytesLen is the byte length of a normal revision.
// First 8 bytes is the revision.main in big-endian format. The 9th byte
// is a '_'. The last 8 bytes is the revision.sub in big-endian format.
const revBytesLen = 8 + 1 + 8
// A revision indicates modification of the key-value space.
// The set of changes that share same main revision changes the key-value space atomically.
type revision struct {
// main is the main revision of a set of changes that happen atomically.
main int64
// sub is the sub revision of a change in a set of changes that happen
// atomically. Each change has different increasing sub revision in that
// set.
sub int64
}
etcd 在 bolt backend 中存储的真正的 key 是 revision
而不是用户的 key,value 是 mvccpb.KeyValue
revBytesLen
中有一个重要的信息是,序列化后总共有 17 字节,因为在 main 和 sub 之间加入了 "_" 来分隔,main 和 sub 都是以 big-endian 格式序列化,这点与 wal 中的 lenField 不同
<aside> 💡 在删除某个用户 key 时候,会在序列化完成后的 revision raw bytes 后添加一个额外的 byte 't',这种情况下 bolt backend 中的 key 就有 18 字节
</aside>
func (tw *storeTxnWrite) put(key, value []byte, leaseID lease.LeaseID) {
...
ibytes := newRevBytes()
idxRev := revision{main: rev, sub: int64(len(tw.changes))}
// 1. 将 revision 序列化成 bytes
revToBytes(idxRev, ibytes)
// 2. 配置 keyValue
kv := mvccpb.KeyValue{}
d, err := kv.Marshal()
...
// 3. 写入 backend, key 为 revision,value 是 KeyValue
tw.tx.UnsafeSeqPut(keyBucketName, ibytes, d)
// 4. 写入 treeIndex,key 为用户 key
tw.s.kvindex.Put(key, idxRev)
}
func (tw *storeTxnWrite) delete(key []byte) {
ibytes := newRevBytes()
idxRev := revision{main: tw.beginRev + 1, sub: int64(len(tw.changes))}
revToBytes(idxRev, ibytes)
// 往 revision bytes 中写入 't' 字符
ibytes = appendMarkTombstone(nil, ibytes)
// 写入 backend,注意不是删除
tw.tx.UnsafeSeqPut(keyBucketName, ibytes, d)
// 修改 treeIndex 中的 keyIndex
err = tw.s.kvindex.Tombstone(key, idxRev)
...
}
// keyIndex stores the revisions of a key in the backend.
// Each keyIndex has at least one key generation.
// Each generation might have several key versions.
// Tombstone on a key appends an tombstone version at the end
// of the current generation and creates a new empty generation.
// Each version of a key has an index pointing to the backend.
type keyIndex struct {
key []byte
modified revision // the main rev of the last modification
generations []generation
}
// generation contains multiple revisions of a key.
type generation struct {
ver int64
created revision // when the generation is created (put in first revision).
revs []revision
}
keyIndex 为内存里索引中对应的结构体