etcd-raft: Reading the Implementation Source Code

Go开发大全

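[Introduction] Raft is a widely used and efficient distributed consensus algorithm. How does the etcd project implement it? This article walks through the implementation in detail.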

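This article sets out to explain how to use etcd's raft module. My original plan was to paste no implementation code at all and cover only how to use etcd-raft, that is, how to build Raft semantics on top of it. After reading raftexample, though, that turned out to be unrealistic: etcd-raft's external logic and its implementation are tightly coupled. So we have to start from the implementation, get a rough idea of the abstractions etcd-raft provides, and then go through the official raftexample, which should make it easier to build a simple Raft server of our own.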

The raft struct
raft's Config
The Storage interface and its in-memory implementation, MemoryStorage

etcd's Raft implementation lives in the etcd/raft directory, and most of it is concentrated in a few core files:

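raft.go: as the name suggests, this is the core. It contains, for example, the leader election logic and the handling of Raft messages.
node.go: can be thought of as one node of the Raft cluster. It is the type that clients mostly interact with, and it is responsible for heartbeats, proposals, the state machine, membership changes, and so on.
log.go: code for the Raft log, such as storing log entries.
raft.proto: defines Raft's core RPC data structures. Because protobuf is language-neutral, at least this part can be reused if you want to rewrite etcd raft in another language.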

type raft struct {
 // id is this raft node's own ID within the cluster
 id uint64

 Term uint64
 Vote uint64
 // readIndex-related structure, used to hold readIndex state
 readStates []ReadState

 // the log
 raftLog *raftLog

 maxMsgSize         uint64
 maxUncommittedSize uint64
 // TODO(tbg): rename to trk.
 prs tracker.ProgressTracker

 state StateType

 // isLearner is true if the local raft node is a learner.
 isLearner bool

 msgs []pb.Message

 // the leader id
 lead uint64
 // leadTransferee is id of the leader transfer target when its value is not zero.
 // Follow the procedure defined in raft thesis 3.10.
 leadTransferee uint64
 // Only one conf change may be pending (in the log, but not yet
 // applied) at a time. This is enforced via pendingConfIndex, which
 // is set to a value >= the log index of the latest pending
 // configuration change (if any). Config changes are only allowed to
 // be proposed if the leader's applied index is greater than this
 // value.
 pendingConfIndex uint64
 // an estimate of the size of the uncommitted tail of the Raft log. Used to
 // prevent unbounded log growth. Only maintained by the leader. Reset on
 // term changes.
 uncommittedSize uint64

 readOnly *readOnly

 // number of ticks since it reached last electionTimeout when it is leader
 // or candidate.
 // number of ticks since it reached last electionTimeout or received a
 // valid message from current leader when it is a follower.
 electionElapsed int

 // number of ticks since it reached last heartbeatTimeout.
 // only leader keeps heartbeatElapsed.
 heartbeatElapsed int

 checkQuorum bool
 preVote     bool

 heartbeatTimeout int
 electionTimeout  int
 // randomizedElectionTimeout is a random number between
 // [electiontimeout, 2 * electiontimeout - 1]. It gets reset
 // when raft changes its state to follower or candidate.
 randomizedElectionTimeout int
 disableProposalForwarding bool

 tick func()
 step stepFunc

 logger Logger
}

(Many of these fields look like configuration. Why not embed a conf struct?)
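To make the tick-based timing fields concrete, here is a minimal sketch of how electionElapsed and randomizedElectionTimeout could interact. The ticker type and its methods are hypothetical stand-ins written for this article, not etcd's code; only the range [electionTimeout, 2*electionTimeout-1] comes from the struct comment above.

```go
package main

import (
	"fmt"
	"math/rand"
)

// ticker is a hypothetical, stripped-down stand-in for the timing fields of
// the raft struct. It is not etcd's type.
type ticker struct {
	electionTimeout           int
	randomizedElectionTimeout int
	electionElapsed           int
	rng                       *rand.Rand
}

// resetRandomizedTimeout picks a value in
// [electionTimeout, 2*electionTimeout-1], as the struct comment describes.
func (t *ticker) resetRandomizedTimeout() {
	t.randomizedElectionTimeout = t.electionTimeout + t.rng.Intn(t.electionTimeout)
}

// tick advances the logical clock by one step and reports whether the
// randomized election timeout has elapsed. On timeout the counter restarts.
func (t *ticker) tick() bool {
	t.electionElapsed++
	if t.electionElapsed >= t.randomizedElectionTimeout {
		t.electionElapsed = 0
		return true
	}
	return false
}

// ticksUntilTimeout builds a ticker from a seed and counts how many ticks
// pass before the first timeout fires.
func ticksUntilTimeout(electionTimeout int, seed int64) int {
	t := &ticker{electionTimeout: electionTimeout, rng: rand.New(rand.NewSource(seed))}
	t.resetRandomizedTimeout()
	n := 1
	for !t.tick() {
		n++
	}
	return n
}

func main() {
	for seed := int64(0); seed < 3; seed++ {
		fmt.Println("ticks until election timeout:", ticksUntilTimeout(10, seed))
	}
}
```

Because every node draws its own random value, two followers are unlikely to time out on the same tick, which is the usual way Raft avoids repeated split votes.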

raftLog && Storage

The more important structure is raftLog:

type raftLog struct {
 // storage contains all stable entries since the last snapshot.
 storage Storage

 // unstable contains all unstable entries and snapshot.
 // they will be saved into storage.
 unstable unstable

 // committed is the highest log position that is known to be in
 // stable storage on a quorum of nodes.
 committed uint64
 // applied is the highest log position that the application has
 // been instructed to apply to its state machine.
 // Invariant: applied <= committed
 applied uint64

 logger Logger

 // maxNextEntsSize is the maximum aggregate byte size of the messages
 // returned from calls to nextEnts.
 maxNextEntsSize uint64
}

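committed and applied both come from the paper.
storage is an important interface: it abstracts the storage layer that raft depends on. You can (and should) implement it yourself; etcd also ships a MemoryStorage. In effect, Storage is the interface for reading data that has already been persisted.
unstable is an in-memory write buffer for the log, which makes log replication convenient. It keeps all of its entries in an in-memory slice: on the leader these are the entries created for client requests; on a follower they are the entries replicated from the leader.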
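The relationship between committed and applied can be sketched in a few lines. The logCursor type below is hypothetical and only illustrates the invariant applied <= committed stated in the struct comments; the real raftLog does considerably more.

```go
package main

import "fmt"

// logCursor is a hypothetical miniature of the two cursors kept by raftLog.
type logCursor struct {
	committed uint64
	applied   uint64
}

// nextToApply returns the half-open index range [lo, hi) of entries that are
// committed but not yet applied. ok is false when there is nothing to apply.
func (c *logCursor) nextToApply() (lo, hi uint64, ok bool) {
	if c.applied >= c.committed {
		return 0, 0, false
	}
	return c.applied + 1, c.committed + 1, true
}

// appliedTo moves the applied cursor forward. It refuses to move backwards
// or past committed, which keeps the invariant applied <= committed.
func (c *logCursor) appliedTo(i uint64) error {
	if i < c.applied || i > c.committed {
		return fmt.Errorf("applied(%d) out of range [%d, %d]", i, c.applied, c.committed)
	}
	c.applied = i
	return nil
}

func main() {
	c := &logCursor{committed: 7, applied: 4}
	lo, hi, ok := c.nextToApply()
	fmt.Println(lo, hi, ok)
	fmt.Println(c.appliedTo(9)) // rejected: 9 is beyond committed
	fmt.Println(c.appliedTo(7)) // accepted
}
```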

// unstable.entries[i] has raft log position i+unstable.offset.
// Note that unstable.offset may be less than the highest log
// position in storage; this means that the next write to storage
// might need to truncate the log before persisting unstable.entries.
type unstable struct {
 // the incoming unstable snapshot, if any.
 snapshot *pb.Snapshot
 // all entries that have not yet been written to storage.
 entries []pb.Entry
 offset  uint64

 logger Logger
}
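The comment "unstable.entries[i] has raft log position i+unstable.offset" is easier to follow with a small model. The sketch below uses hypothetical entry and unstableLog types and shows three cases an append can hit: a contiguous append, a replacement of the whole buffer, and a truncation of a conflicting tail. It is loosely modeled on the idea, not copied from etcd.

```go
package main

import "fmt"

// entry is a hypothetical, minimal log entry.
type entry struct {
	Index uint64
	Term  uint64
}

// unstableLog is a hypothetical miniature of the unstable buffer:
// entries[i] holds raft log position i+offset.
type unstableLog struct {
	entries []entry
	offset  uint64
}

// at returns the entry at raft log position i, if the buffer holds it.
func (u *unstableLog) at(i uint64) (entry, bool) {
	if i < u.offset || i >= u.offset+uint64(len(u.entries)) {
		return entry{}, false
	}
	return u.entries[i-u.offset], true
}

// truncateAndAppend adds ents, dropping any buffered entries that conflict.
func (u *unstableLog) truncateAndAppend(ents []entry) {
	if len(ents) == 0 {
		return
	}
	first := ents[0].Index
	switch {
	case first == u.offset+uint64(len(u.entries)):
		// Directly follows the buffer: plain append.
		u.entries = append(u.entries, ents...)
	case first <= u.offset:
		// Starts at or before the buffer: replace everything.
		u.offset = first
		u.entries = append([]entry(nil), ents...)
	default:
		// Starts inside the buffer: keep [offset, first) and append.
		kept := u.entries[:first-u.offset]
		u.entries = append(append([]entry(nil), kept...), ents...)
	}
}

func main() {
	u := &unstableLog{offset: 5}
	u.truncateAndAppend([]entry{{Index: 5, Term: 1}, {Index: 6, Term: 1}, {Index: 7, Term: 1}})
	u.truncateAndAppend([]entry{{Index: 6, Term: 2}}) // conflicting tail is replaced
	e, ok := u.at(6)
	fmt.Println(e, ok, len(u.entries))
}
```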

That is the structure of unstable. Now let's look closely at Storage:

// Storage is an interface that may be implemented by the application
// to retrieve log entries from storage.
//
// If any Storage method returns an error, the raft instance will
// become inoperable and refuse to participate in elections; the
// application is responsible for cleanup and recovery in this case.
type Storage interface {
 // TODO(tbg): split this into two interfaces, LogStorage and StateStorage.

 // InitialState returns the saved HardState and ConfState information.
 InitialState() (pb.HardState, pb.ConfState, error)
 // Entries returns a slice of log entries in the range [lo,hi).
 // MaxSize limits the total size of the log entries returned, but
 // Entries returns at least one entry if any.
 Entries(lo, hi, maxSize uint64) ([]pb.Entry, error)
 // Term returns the term of entry i, which must be in the range
 // [FirstIndex()-1, LastIndex()]. The term of the entry before
 // FirstIndex is retained for matching purposes even though the
 // rest of that entry may not be available.
 Term(i uint64) (uint64, error)
 // LastIndex returns the index of the last entry in the log.
 LastIndex() (uint64, error)
 // FirstIndex returns the index of the first log entry that is
 // possibly available via Entries (older entries have been incorporated
 // into the latest Snapshot; if storage only contains the dummy entry the
 // first log entry is not available).
 FirstIndex() (uint64, error)
 // Snapshot returns the most recent snapshot.
 // If snapshot is temporarily unavailable, it should return ErrSnapshotTemporarilyUnavailable,
 // so raft state machine could know that Storage needs some time to prepare
 // snapshot and call Snapshot later.
 Snapshot() (pb.Snapshot, error)
}

Storage is meant to define the storage layer: it holds a contiguous log and, logically, can produce a snapshot.
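To get a feel for the index contract in those comments (the [lo,hi) range, and Term being valid from FirstIndex()-1), here is a minimal in-memory sketch. Everything in it (memLog, its method names, the two error values) is hypothetical and simplified: it ignores maxSize, HardState, and snapshots, and only accepts contiguous appends. The one idea borrowed from the interface comments is the dummy entry at position 0 that remembers the index and term of the last compacted position.

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a hypothetical, minimal log entry.
type entry struct {
	Index uint64
	Term  uint64
	Data  []byte
}

var (
	errCompacted   = errors.New("requested index is unavailable due to compaction")
	errUnavailable = errors.New("requested entry at index is unavailable")
)

// memLog is a hypothetical in-memory log. ents[0] is a dummy entry, so
// ents[i] has log position ents[0].Index + i.
type memLog struct {
	ents []entry
}

func newMemLog() *memLog { return &memLog{ents: make([]entry, 1)} }

func (m *memLog) firstIndex() uint64 { return m.ents[0].Index + 1 }
func (m *memLog) lastIndex() uint64  { return m.ents[0].Index + uint64(len(m.ents)) - 1 }

// term accepts i in [firstIndex()-1, lastIndex()].
func (m *memLog) term(i uint64) (uint64, error) {
	offset := m.ents[0].Index
	if i < offset {
		return 0, errCompacted
	}
	if i > m.lastIndex() {
		return 0, errUnavailable
	}
	return m.ents[i-offset].Term, nil
}

// entries returns the entries in the half-open range [lo, hi).
func (m *memLog) entries(lo, hi uint64) ([]entry, error) {
	offset := m.ents[0].Index
	if lo <= offset {
		return nil, errCompacted
	}
	if lo > hi || hi > m.lastIndex()+1 {
		return nil, errUnavailable
	}
	return m.ents[lo-offset : hi-offset], nil
}

// appendContiguous only accepts entries that directly follow lastIndex.
func (m *memLog) appendContiguous(ents ...entry) error {
	for _, e := range ents {
		if e.Index != m.lastIndex()+1 {
			return fmt.Errorf("gap: got index %d, want %d", e.Index, m.lastIndex()+1)
		}
		m.ents = append(m.ents, e)
	}
	return nil
}

// compact discards entries up to and including i; entry i becomes the new
// dummy entry, keeping only its index and term.
func (m *memLog) compact(i uint64) error {
	offset := m.ents[0].Index
	if i <= offset {
		return errCompacted
	}
	if i > m.lastIndex() {
		return errUnavailable
	}
	kept := append([]entry(nil), m.ents[i-offset:]...)
	kept[0].Data = nil
	m.ents = kept
	return nil
}

func main() {
	m := newMemLog()
	for i := uint64(1); i <= 5; i++ {
		if err := m.appendContiguous(entry{Index: i, Term: 1}); err != nil {
			panic(err)
		}
	}
	fmt.Println(m.firstIndex(), m.lastIndex())
	fmt.Println(m.compact(3), m.firstIndex(), m.lastIndex())
}
```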

HardState is the part of the state that the original paper says must be persistent, and ConfState is the cluster's configuration.

message HardState {
 optional uint64 term   = 1 [(gogoproto.nullable) = false];
 optional uint64 vote   = 2 [(gogoproto.nullable) = false];
 optional uint64 commit = 3 [(gogoproto.nullable) = false];
}

The snapshot itself is a blob of serialized bytes, carried together with the snapshot's metadata.

Conf

// Config contains the parameters to start a raft.
type Config struct {
 // ID is the identity of the local raft. ID cannot be 0.
 ID uint64

 // peers contains the IDs of all nodes (including self) in the raft cluster. It
 // should only be set when starting a new raft cluster. Restarting raft from
 // previous configuration will panic if peers is set. peer is private and only
 // used for testing right now.
 peers []uint64

 // learners contains the IDs of all learner nodes (including self if the
 // local node is a learner) in the raft cluster. learners only receives
 // entries from the leader node. It does not vote or promote itself.
 learners []uint64

 ElectionTick int
 HeartbeatTick int

 // Storage is the storage for raft. raft generates entries and states to be
 // stored in storage. raft reads the persisted entries and states out of
 // Storage when it needs. raft reads out the previous state and configuration
 // out of storage when restarting.
 Storage Storage
 Applied uint64

 MaxSizePerMsg uint64
 MaxCommittedSizePerReady uint64
 MaxUncommittedEntriesSize uint64
 MaxInflightMsgs int

 // CheckQuorum specifies if the leader should check quorum activity. Leader
 // steps down when quorum is not active for an electionTimeout.
 CheckQuorum bool
 PreVote bool
 ReadOnlyOption ReadOnlyOption

 // Logger is the logger used for raft log. For multinode which can host
 // multiple raft group, each raft group can have its own logger
 Logger Logger
 DisableProposalForwarding bool
}

These are raft's configuration options. Most of them either appear in the paper or are ordinary Raft parameters, so they are easy to understand.
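ElectionTick and HeartbeatTick are counted in ticks, not in wall-clock time, so their meaning depends on how often the application calls Tick. A minimal sketch of the arithmetic follows, using a hypothetical miniConfig type. The sanity checks are the ones I would expect any such config to need and are an assumption, not a transcription of etcd's validation; the values 10 and 1 are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// miniConfig is a hypothetical subset of Config, for illustration only.
type miniConfig struct {
	ID            uint64
	ElectionTick  int
	HeartbeatTick int
}

// validate applies basic sanity checks: a non-zero ID, a positive heartbeat
// interval, and an election timeout longer than the heartbeat interval.
func (c miniConfig) validate() error {
	if c.ID == 0 {
		return errors.New("cannot use 0 as id")
	}
	if c.HeartbeatTick <= 0 {
		return errors.New("heartbeat tick must be greater than 0")
	}
	if c.ElectionTick <= c.HeartbeatTick {
		return errors.New("election tick must be greater than heartbeat tick")
	}
	return nil
}

// timeoutsMillis converts tick counts into milliseconds for a given tick
// interval. The election timeout is randomized, so a range is returned.
func (c miniConfig) timeoutsMillis(tickMillis int) (heartbeat, electionMin, electionMax int) {
	heartbeat = c.HeartbeatTick * tickMillis
	electionMin = c.ElectionTick * tickMillis
	electionMax = (2*c.ElectionTick - 1) * tickMillis
	return heartbeat, electionMin, electionMax
}

func main() {
	c := miniConfig{ID: 1, ElectionTick: 10, HeartbeatTick: 1}
	if err := c.validate(); err != nil {
		panic(err)
	}
	hb, lo, hi := c.timeoutsMillis(100) // one tick every 100 ms
	fmt.Printf("heartbeat=%dms election=[%dms, %dms]\n", hb, lo, hi)
}
```

With a 100 ms tick, these illustrative values give a 100 ms heartbeat and an election timeout between 1000 ms and 1900 ms.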

Node

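Node is the abstraction of a single node. The raft package has a Node interface and a node type that implements it. From node we get the Ready channel; the application takes values from this channel and processes them.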

// Ready encapsulates the entries and messages that are ready to read,
// be saved to stable storage, committed or sent to other peers.
// All fields in Ready are read-only.
type Ready struct {
 // The current volatile state of a Node.
 // SoftState will be nil if there is no update.
 // It is not required to consume or store SoftState.
 *SoftState

 // The current state of a Node to be saved to stable storage BEFORE
 // Messages are sent.
 // HardState will be equal to empty state if there is no update.
 pb.HardState

 // ReadStates can be used for node to serve linearizable read requests locally
 // when its applied index is greater than the index in ReadState.
 // Note that the readState will be returned when raft receives msgReadIndex.
 // The returned is only valid for the request that requested to read.
 ReadStates []ReadState

 // Entries specifies entries to be saved to stable storage BEFORE
 // Messages are sent.
 Entries []pb.Entry

 // Snapshot specifies the snapshot to be saved to stable storage.
 Snapshot pb.Snapshot

 // CommittedEntries specifies entries to be committed to a
 // store/state-machine. These have previously been committed to stable
 // store.
 CommittedEntries []pb.Entry

 // Messages specifies outbound messages to be sent AFTER Entries are
 // committed to stable storage.
 // If it contains a MsgSnap message, the application MUST report back to raft
 // when the snapshot has been received or has failed by calling ReportSnapshot.
 Messages []pb.Message

 // MustSync indicates whether the HardState and Entries must be synchronously
 // written to disk or if an asynchronous write is permissible.
 MustSync bool
}

At this point the application handles the whole Ready as one batch.
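The MustSync field deserves a word. One plausible rule for when a synchronous write is needed is sketched below: sync whenever new entries are being persisted or the term or vote changed, while a change to the commit index alone can be written lazily. I believe this is close to what the raft package does, but treat that as my assumption; hardState and mustSync here are hypothetical names.

```go
package main

import "fmt"

// hardState is a hypothetical mirror of the three HardState fields.
type hardState struct {
	Term   uint64
	Vote   uint64
	Commit uint64
}

// mustSync reports whether the write should be flushed to disk before
// messages go out: new log entries, a new term, or a new vote all require
// it, because Raft treats currentTerm, votedFor, and the log as persistent
// state. A commit index change alone does not.
func mustSync(st, prev hardState, newEntries int) bool {
	return newEntries != 0 || st.Vote != prev.Vote || st.Term != prev.Term
}

func main() {
	prev := hardState{Term: 2, Vote: 1, Commit: 10}
	fmt.Println(mustSync(hardState{Term: 2, Vote: 1, Commit: 12}, prev, 0)) // commit only
	fmt.Println(mustSync(hardState{Term: 3, Vote: 0, Commit: 10}, prev, 0)) // term changed
	fmt.Println(mustSync(prev, prev, 4))                                    // new entries
}
```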

raftexample

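This server essentially builds a pipeline of channels: the HTTP handler receives a request and passes it to the node; once the node has finished the replication logic, it sends the message to kvstore, which applies it to the state machine.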
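The shape of that pipeline can be sketched without any raft code at all. In the sketch below, fakeConsensus is a hypothetical stand-in for the raft node: it simply forwards every proposal to the commit channel in order, where a real node would first replicate it to a quorum. The channel names follow raftexample; the rest is an assumption made for illustration.

```go
package main

import "fmt"

// fakeConsensus stands in for the raft node. It "commits" each proposal
// immediately and in order, then closes commitC when proposals end.
func fakeConsensus(proposeC <-chan string, commitC chan<- *string) {
	for p := range proposeC {
		p := p // copy so that each pointer refers to its own value
		commitC <- &p
	}
	close(commitC)
}

// runPipeline wires client -> proposeC -> consensus -> commitC -> store and
// returns the values in the order the store applied them.
func runPipeline(proposals []string) []string {
	proposeC := make(chan string)
	commitC := make(chan *string)

	go fakeConsensus(proposeC, commitC)

	// The HTTP handler's role: push client requests into proposeC.
	go func() {
		for _, p := range proposals {
			proposeC <- p
		}
		close(proposeC)
	}()

	// The kvstore's role: apply whatever comes out of commitC.
	var applied []string
	for c := range commitC {
		applied = append(applied, *c)
	}
	return applied
}

func main() {
	fmt.Println(runPipeline([]string{"a=1", "b=2", "a=3"}))
}
```

The point of the design is that the store never sees a proposal directly: it only sees what comes out of the commit side, so every replica applies the same sequence.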

httpKVAPI

// Handler for a http based key-value store backed by raft
type httpKVAPI struct {
 store       *kvstore
 confChangeC chan<- raftpb.ConfChange
}

It is started as follows:

// serveHttpKVAPI starts a key-value server with a GET/PUT API and listens.
func serveHttpKVAPI(kv *kvstore, port int, confChangeC chan<- raftpb.ConfChange, errorC <-chan error) {
 srv := http.Server{
  Addr: ":" + strconv.Itoa(port),
  Handler: &httpKVAPI{
   store:       kv,
   confChangeC: confChangeC,
  },
 }
 go func() {
  if err := srv.ListenAndServe(); err != nil {
   log.Fatal(err)
  }
 }()

 // exit when raft goes down
 if err, ok := <-errorC; ok {
  log.Fatal(err)
 }
}

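When it receives a conf change, it parses the request and sends it to confChangeC.
When it receives a request that modifies data, it calls h.store.Propose(key, string(v)), which internally runs s.proposeC <- buf.String() to send the message to proposeC.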
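The buf.String() in that send is a gob-encoded key/value pair, and the apply side later decodes it with gob again. A minimal round-trip sketch follows; the kv field names Key and Val match the decoding code in kvstore, while encodeKV and decodeKV are hypothetical helper names introduced here.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// kv is the unit that travels through proposeC and commitC.
type kv struct {
	Key string
	Val string
}

// encodeKV serializes a key/value pair with gob and returns it as a string,
// which is the form that would be sent on the proposal channel.
func encodeKV(k, v string) (string, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(kv{Key: k, Val: v}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// decodeKV reverses encodeKV, as the apply side would after a commit.
func decodeKV(data string) (kv, error) {
	var out kv
	err := gob.NewDecoder(bytes.NewBufferString(data)).Decode(&out)
	return out, err
}

func main() {
	s, err := encodeKV("foo", "bar")
	if err != nil {
		panic(err)
	}
	got, err := decodeKV(s)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", got)
}
```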

raft node

commitC, errorC, snapshotterReady := newRaftNode(*id, strings.Split(*cluster, ","), *join, getSnapshot, proposeC, confChangeC)

raftNode is not defined by the raft library; it is defined by the example program. The relevant code:

// newRaftNode initiates a raft instance and returns a committed log entry
// channel and error channel. Proposals for log updates are sent over the
// provided the proposal channel. All log entries are replayed over the
// commit channel, followed by a nil message (to indicate the channel is
// current), then new log entries. To shutdown, close proposeC and read errorC.
func newRaftNode(id int, peers []string, join bool, getSnapshot func() ([]byte, error), proposeC <-chan string,
 confChangeC <-chan raftpb.ConfChange) (<-chan *string, <-chan error, <-chan *snap.Snapshotter) {

 commitC := make(chan *string)
 errorC := make(chan error)

 rc := &raftNode{
  proposeC:    proposeC,
  confChangeC: confChangeC,
  commitC:     commitC,
  errorC:      errorC,
  id:          id,
  peers:       peers,
  join:        join,
  waldir:      fmt.Sprintf("raftexample-%d", id),
  snapdir:     fmt.Sprintf("raftexample-%d-snap", id),
  getSnapshot: getSnapshot,
  snapCount:   defaultSnapshotCount,
  stopc:       make(chan struct{}),
  httpstopc:   make(chan struct{}),
  httpdonec:   make(chan struct{}),

  snapshotterReady: make(chan *snap.Snapshotter, 1),
  // rest of structure populated after WAL replay
 }
 go rc.startRaft()
 return commitC, errorC, rc.snapshotterReady
}

The raftNode struct looks like this:

// A key-value stream backed by raft
type raftNode struct {
 proposeC    <-chan string            // proposed messages (k,v)
 confChangeC <-chan raftpb.ConfChange // proposed cluster config changes
 commitC     chan<- *string           // entries committed to log (k,v)
 errorC      chan<- error             // errors from raft session

 id          int      // client ID for raft session
 peers       []string // raft peer URLs
 join        bool     // node is joining an existing cluster
 waldir      string   // path to WAL directory
 snapdir     string   // path to snapshot directory
 getSnapshot func() ([]byte, error)
 lastIndex   uint64 // index of log at start

 confState     raftpb.ConfState
 snapshotIndex uint64
 appliedIndex  uint64

 // raft backing for the commit/error channel
 node        raft.Node
 raftStorage *raft.MemoryStorage
 wal         *wal.WAL

 snapshotter      *snap.Snapshotter
 snapshotterReady chan *snap.Snapshotter // signals when snapshotter is ready

 snapCount uint64
 transport *rafthttp.Transport
 stopc     chan struct{} // signals proposal channel closed
 httpstopc chan struct{} // signals http server to shutdown
 httpdonec chan struct{} // signals http server shutdown complete
}

The important part is serveChannels:

func (rc *raftNode) serveChannels() {
 snap, err := rc.raftStorage.Snapshot()
 if err != nil {
  panic(err)
 }
 rc.confState = snap.Metadata.ConfState
 rc.snapshotIndex = snap.Metadata.Index
 rc.appliedIndex = snap.Metadata.Index

 defer rc.wal.Close()

 ticker := time.NewTicker(100 * time.Millisecond)
 defer ticker.Stop()

 // send proposals over raft
 go func() {
  confChangeCount := uint64(0)

  for rc.proposeC != nil && rc.confChangeC != nil {
   select {
   case prop, ok := <-rc.proposeC:
    if !ok {
     rc.proposeC = nil
    } else {
     // blocks until accepted by raft state machine
     rc.node.Propose(context.TODO(), []byte(prop))
    }

   case cc, ok := <-rc.confChangeC:
    if !ok {
     rc.confChangeC = nil
    } else {
     confChangeCount++
     cc.ID = confChangeCount
     rc.node.ProposeConfChange(context.TODO(), cc)
    }
   }
  }
  // client closed channel; shutdown raft if not already
  close(rc.stopc)
 }()

 // event loop on raft state machine updates
 for {
  select {
  case <-ticker.C:
   rc.node.Tick()

  // store raft entries to wal, then publish over commit channel
  case rd := <-rc.node.Ready():
   rc.wal.Save(rd.HardState, rd.Entries)
   if !raft.IsEmptySnap(rd.Snapshot) {
    rc.saveSnap(rd.Snapshot)
    rc.raftStorage.ApplySnapshot(rd.Snapshot)
    rc.publishSnapshot(rd.Snapshot)
   }
   rc.raftStorage.Append(rd.Entries)
   rc.transport.Send(rd.Messages)
   if ok := rc.publishEntries(rc.entriesToApply(rd.CommittedEntries)); !ok {
    rc.stop()
    return
   }
   rc.maybeTriggerSnapshot()
   rc.node.Advance()

  case err := <-rc.transport.ErrorC:
   rc.writeError(err)
   return

  case <-rc.stopc:
   rc.stop()
   return
  }
 }
}

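Here the requests received on proposeC are handed to the node. The main loop then handles Ready. When a Ready arrives, it does the following in order:

Write the WAL first
Then save the snapshot
Append rd.Entries to raftStorage
Send the messages through the transport (network) layer
Apply the committed entries
Call Advance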
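One Go idiom in the proposal goroutine of serveChannels is worth isolating: when a channel is closed, the code sets it to nil. A receive from a nil channel blocks forever, so that select case is effectively switched off, and because the loop condition requires both channels to be non-nil, the loop ends as soon as either one is closed. A minimal sketch, with a hypothetical forward function and an int standing in for raftpb.ConfChange:

```go
package main

import "fmt"

// forward drains two channels with the same loop shape as the proposal
// goroutine: it stops as soon as either channel is closed.
func forward(proposeC <-chan string, confChangeC <-chan int) (props []string, ccs []int) {
	for proposeC != nil && confChangeC != nil {
		select {
		case p, ok := <-proposeC:
			if !ok {
				proposeC = nil // closed: disable this case and end the loop
			} else {
				props = append(props, p)
			}
		case cc, ok := <-confChangeC:
			if !ok {
				confChangeC = nil
			} else {
				ccs = append(ccs, cc)
			}
		}
	}
	return props, ccs
}

func main() {
	proposeC := make(chan string, 2)
	confChangeC := make(chan int) // never written to in this demo
	proposeC <- "a"
	proposeC <- "b"
	close(proposeC)
	props, ccs := forward(proposeC, confChangeC)
	fmt.Println(props, len(ccs))
}
```

Without the nil assignment, a closed channel would make its case ready on every iteration and the loop would spin.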

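The relevant fields of Ready are:

pb.HardState: the highest term this node has seen, whom it voted for in that term, and the commit index it knows of
Messages: outbound messages to be sent to peers
CommittedEntries: entries that are committed but not yet applied to the state machine
Snapshot: the snapshot that needs to be persisted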

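The application has to handle a Ready as follows:

Persist HardState, Entries, and Snapshot to storage.
Send Messages (the msgs field mentioned above) to the other peers without blocking.
Apply CommittedEntries (committed but not yet applied) to the state machine.
If CommittedEntries contains a membership-change entry, call the node's ApplyConfChange() method to let the node know. (This differs from the Raft paper, where a node applies a membership change as soon as it receives the log entry.)
Call Node.Advance() to tell the raft node that this batch of updates has been processed, the state has advanced, and the next Ready can be delivered.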
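The steps above can be sketched as a small, library-free model that records what it would do. All types here (entry, ready, app) are hypothetical; real code would call the WAL, the transport, and the Node methods where this sketch only appends to a trace. It also shows the common guard of skipping committed entries whose index is not beyond the last applied index.

```go
package main

import (
	"fmt"
	"strings"
)

type entryType int

const (
	entryNormal entryType = iota
	entryConfChange
)

// entry and ready are hypothetical, minimal stand-ins for pb.Entry and Ready.
type entry struct {
	Index uint64
	Type  entryType
}

type ready struct {
	Entries          []entry
	CommittedEntries []entry
	Messages         []string
}

// app records the order of operations instead of performing real I/O.
type app struct {
	appliedIndex uint64
	trace        []string
}

// handleReady follows the order: persist, send, apply, then advance.
func (a *app) handleReady(rd ready) {
	// 1. Persist state and new entries before anything leaves the node.
	a.trace = append(a.trace, fmt.Sprintf("persist:%d", len(rd.Entries)))
	// 2. Send outbound messages to peers.
	a.trace = append(a.trace, fmt.Sprintf("send:%d", len(rd.Messages)))
	// 3. Apply committed entries, skipping any that were already applied.
	for _, e := range rd.CommittedEntries {
		if e.Index <= a.appliedIndex {
			continue
		}
		if e.Type == entryConfChange {
			// Real code would call the node's ApplyConfChange here.
			a.trace = append(a.trace, fmt.Sprintf("confchange:%d", e.Index))
		} else {
			a.trace = append(a.trace, fmt.Sprintf("apply:%d", e.Index))
		}
		a.appliedIndex = e.Index
	}
	// 4. Tell the node this batch is done.
	a.trace = append(a.trace, "advance")
}

func (a *app) traceString() string { return strings.Join(a.trace, ",") }

func main() {
	a := &app{appliedIndex: 1}
	a.handleReady(ready{
		Entries:          []entry{{Index: 3}},
		CommittedEntries: []entry{{Index: 1}, {Index: 2, Type: entryConfChange}},
		Messages:         []string{"append to peer 2"},
	})
	fmt.Println(a.traceString())
}
```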

This is really just driving the state machine, but it feels heavily coupled, so it is important to understand how the code is layered.

kvstore

kvstore is where entries are actually applied. Its most important method is:

func (s *kvstore) readCommits(commitC <-chan *string, errorC <-chan error) {
 for data := range commitC {
  if data == nil {
   // done replaying log; new data incoming
   // OR signaled to load snapshot
   snapshot, err := s.snapshotter.Load()
   if err == snap.ErrNoSnapshot {
    return
   }
   if err != nil {
    log.Panic(err)
   }
   log.Printf("loading snapshot at term %d and index %d", snapshot.Metadata.Term, snapshot.Metadata.Index)
   if err := s.recoverFromSnapshot(snapshot.Data); err != nil {
    log.Panic(err)
   }
   continue
  }

  var dataKv kv
  dec := gob.NewDecoder(bytes.NewBufferString(*data))
  if err := dec.Decode(&dataKv); err != nil {
   log.Fatalf("raftexample: could not decode message (%v)", err)
  }
  s.mu.Lock()
  s.kvStore[dataKv.Key] = dataKv.Val
  s.mu.Unlock()
 }
 if err, ok := <-errorC; ok {
  log.Fatal(err)
 }
}

This is effectively the apply thread.
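The nil value on commitC acts as a sentinel that means "reload from the snapshot" rather than "apply this entry". The sketch below isolates that pattern. It is deliberately simplified and differs from readCommits: the snapshot is a plain map passed in as a hypothetical parameter, entries are "key=value" strings instead of gob, and a nil never makes the function return.

```go
package main

import (
	"fmt"
	"strings"
)

// applyCommits drains commitC into a fresh store. A nil pointer is a
// sentinel: the store is replaced by a copy of the snapshot. Any other value
// is parsed as "key=value" and applied.
func applyCommits(commitC <-chan *string, snapshot map[string]string) map[string]string {
	store := map[string]string{}
	for data := range commitC {
		if data == nil {
			// Sentinel: discard current state and restore from the snapshot.
			store = make(map[string]string, len(snapshot))
			for k, v := range snapshot {
				store[k] = v
			}
			continue
		}
		parts := strings.SplitN(*data, "=", 2)
		if len(parts) != 2 {
			continue // malformed entry; this sketch just skips it
		}
		store[parts[0]] = parts[1]
	}
	return store
}

func main() {
	commitC := make(chan *string, 3)
	a, b := "x=1", "y=2"
	commitC <- &a
	commitC <- nil // snapshot signal
	commitC <- &b
	close(commitC)
	fmt.Println(applyCommits(commitC, map[string]string{"s": "0"}))
}
```

Using a pointer type for the channel is what makes the sentinel possible: an empty string could be a legitimate payload, but nil cannot.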

Reposted from: mwish
zhuanlan.zhihu.com/p/138563359
