Kafka 2.2 source code analysis: follower replica truncateToEpochEndOffsets



Overview


After the follower replica sends an OffsetsForLeaderEpochRequest to the leader replica, the leader replica returns the startOffset of the smallest epoch greater than the requestedLeaderEpoch, i.e. the end offset of the requested epoch (or its own log end offset if the requested epoch is still the leader's latest epoch).


The follower replica then truncates its log based on the returned leaderEpoch and offset.

First, an example to illustrate how the follower replica requests OffsetsForLeaderEpoch from the leader replica:

A(leader, epoch=1): 1, 2, 3, 4, 5, 6

A cache: leaderEpoch = 1, startOffset = 1

B(follower): 1, 2, 3, 4

B cache: leaderEpoch = 1, startOffset = 1

=============================================

B(leader, epoch=2): 1, 2, 3, 4, 5, 6, 7

B cache:

leaderEpoch = 1, startOffset = 1

leaderEpoch = 2, startOffset = 5

After A goes down, B becomes the new leader; A later comes back up. By then B, as the new leader, has appended new data, and B's leaderEpochCache has gained a new entry (leaderEpoch=2, startOffset=5).

When A starts replicating from B, it requests epoch 1. B looks up epoch 2 (the smallest epoch greater than 1) and returns the corresponding startOffset of 5. On receiving the response, A truncates all of its records with offset >= 5 (here offsets 5 and 6), updates its fetch offset to 5, and resumes fetching. B returns the data at offsets 5, 6 and 7 with epoch 2; while appending these records, A sees epoch 2 in the data and adds the new entry (epoch=2, startOffset=5) to its own leaderEpochCache.
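To make B's lookup concrete, here is a minimal, self-contained sketch of how a requested epoch maps to an end offset. It mirrors the numbers in the example but is a simplification for illustration, not Kafka's actual LeaderEpochFileCache:

// Simplified illustration, NOT Kafka's LeaderEpochFileCache
object EpochLookupSketch {
  // B's cache: (leaderEpoch, startOffset) entries in ascending epoch order
  val epochEntries: Seq[(Int, Long)] = Seq((1, 1L), (2, 5L))
  val logEndOffset: Long = 8L // B has appended offsets up to 7, so its LEO is 8

  // End offset for a requested epoch = startOffset of the smallest epoch greater than it,
  // or the log end offset when the requested epoch is still the latest one
  def endOffsetFor(requestedEpoch: Int): Long =
    epochEntries
      .collectFirst { case (epoch, startOffset) if epoch > requestedEpoch => startOffset }
      .getOrElse(logEndOffset)

  def main(args: Array[String]): Unit = {
    println(endOffsetFor(1)) // 5 -> A truncates every record with offset >= 5
    println(endOffsetFor(2)) // 8 -> nothing to truncate
  }
}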

Below we walk through the source code to see how the follower replica truncates its log based on the returned leaderEpoch and offset.


AbstractFetcherThread#maybeTruncate method

private def maybeTruncate(): Unit = {
    val (partitionsWithEpochs, partitionsWithoutEpochs) = fetchTruncatingPartitions()
    if (partitionsWithEpochs.nonEmpty) {
      truncateToEpochEndOffsets(partitionsWithEpochs)
    }
    if (partitionsWithoutEpochs.nonEmpty) {
      truncateToHighWatermark(partitionsWithoutEpochs)
    }
  }
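fetchTruncatingPartitions and truncateToHighWatermark are not shown here. Roughly, partitions in the truncating state that know their latest local leader epoch take the epoch-based path, and the rest fall back to truncating at the high watermark. A hypothetical, simplified sketch of that split (the real PartitionFetchState carries more fields):

object TruncatingPartitionsSketch {
  // Hypothetical, simplified fetch state for illustration only
  final case class FetchStateSketch(isTruncating: Boolean, latestEpoch: Option[Int])

  // Split truncating partitions into those with a known latest epoch (epoch-based truncation)
  // and those without one (high-watermark truncation)
  def split(states: Map[String, FetchStateSketch]): (Map[String, Int], Set[String]) = {
    val truncating = states.filter { case (_, state) => state.isTruncating }
    val withEpochs = truncating.collect {
      case (tp, state) if state.latestEpoch.isDefined => tp -> state.latestEpoch.get
    }
    val withoutEpochs = truncating.collect { case (tp, state) if state.latestEpoch.isEmpty => tp }.toSet
    (withEpochs, withoutEpochs)
  }
}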


AbstractFetcherThread#truncateToEpochEndOffsets method

This method sends an OffsetsForLeaderEpochRequest to the leader replica, which returns the EpochEndOffset for each partition.

Each partition's log is then truncated to the indicated offset. The entire truncation is performed under a lock to guard against leadership changes.

/**
    * - Build a leader epoch fetch based on partitions that are in the Truncating phase
    * - Send OffsetsForLeaderEpochRequest, retrieving the latest offset for each partition's
    *   leader epoch. This is the offset the follower should truncate to ensure
    *   accurate log replication.
    * - Finally truncate the logs for partitions in the truncating phase and mark them
    *   truncation complete. Do this within a lock to ensure no leadership changes can
    *   occur during truncation.
    */
  private def truncateToEpochEndOffsets(latestEpochsForPartitions: Map[TopicPartition, EpochData]): Unit = {
    // Send OffsetsForLeaderEpochRequest to the leader replica; returns a Map[TopicPartition, EpochEndOffset]
    val endOffsets = fetchEpochEndOffsets(latestEpochsForPartitions)
    //Ensure we hold a lock during truncation.
    inLock(partitionMapLock) {
      //Check no leadership and no leader epoch changes happened whilst we were unlocked, fetching epochs
      // Filter the Map[TopicPartition, EpochEndOffset]: verify that the leader replica did not change
      // between sending the OffsetsForLeaderEpochRequest and receiving the response, and keep only the entries that pass this check
      val epochEndOffsets = endOffsets.filter { case (tp, _) =>
        val curPartitionState = partitionStates.stateValue(tp)
        val partitionEpochRequest = latestEpochsForPartitions.get(tp).getOrElse {
          throw new IllegalStateException(
            s"Leader replied with partition $tp not requested in OffsetsForLeaderEpoch request")
        }
        val leaderEpochInRequest = partitionEpochRequest.currentLeaderEpoch.get
        curPartitionState != null && leaderEpochInRequest == curPartitionState.currentLeaderEpoch
      }

      val ResultWithPartitions(fetchOffsets, partitionsWithError) = maybeTruncateToEpochEndOffsets(epochEndOffsets)
      handlePartitionsWithErrors(partitionsWithError)
      // Finally, mark truncation as complete
      updateFetchOffsetAndMaybeMarkTruncationComplete(fetchOffsets)
    }
  }


AbstractFetcherThread#maybeTruncateToEpochEndOffsets method

  private def maybeTruncateToEpochEndOffsets(fetchedEpochs: Map[TopicPartition, EpochEndOffset]): ResultWithPartitions[Map[TopicPartition, OffsetTruncationState]] = {
    val fetchOffsets = mutable.HashMap.empty[TopicPartition, OffsetTruncationState]
    val partitionsWithError = mutable.HashSet.empty[TopicPartition]

    fetchedEpochs.foreach { case (tp, leaderEpochOffset) =>
      try {
        leaderEpochOffset.error match {
          case Errors.NONE =>
            val offsetTruncationState = getOffsetTruncationState(tp, leaderEpochOffset)
            truncate(tp, offsetTruncationState)
            fetchOffsets.put(tp, offsetTruncationState)

          case Errors.FENCED_LEADER_EPOCH =>
            onPartitionFenced(tp)

          case error =>
            info(s"Retrying leaderEpoch request for partition $tp as the leader reported an error: $error")
            partitionsWithError += tp
        }
      } catch {
        case e: KafkaStorageException =>
          info(s"Failed to truncate $tp", e)
          partitionsWithError += tp
      }
    }

    ResultWithPartitions(fetchOffsets, partitionsWithError)
  }
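getOffsetTruncationState, called above, computes the truncation offset; its logic lives in AbstractFetcherThread and also covers undefined epochs and a high-watermark fallback. As a rough sketch of the common case only, the follower truncates to the smaller of the leader's reported end offset and its own log end offset:

// Simplified illustration only, not the full AbstractFetcherThread.getOffsetTruncationState
object TruncationOffsetSketch {
  final case class OffsetTruncationStateSketch(offset: Long, truncationCompleted: Boolean)

  def truncationOffset(leaderEpochEndOffset: Long, followerLogEndOffset: Long): OffsetTruncationStateSketch =
    OffsetTruncationStateSketch(math.min(leaderEpochEndOffset, followerLogEndOffset), truncationCompleted = true)

  def main(args: Array[String]): Unit =
    // Running example: the leader reports endOffset = 5 and A's LEO is 7, so A truncates at 5
    println(truncationOffset(leaderEpochEndOffset = 5L, followerLogEndOffset = 7L))
}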


ReplicaFetcherThread#truncate method

/**
   * Truncate the log for each partition's epoch based on leader's returned epoch and offset.
   * The logic for finding the truncation offset is implemented in AbstractFetcherThread.getOffsetTruncationState
   */
  override def truncate(tp: TopicPartition, offsetTruncationState: OffsetTruncationState): Unit = {
    val replica = replicaMgr.localReplicaOrException(tp)
    val partition = replicaMgr.getPartition(tp).get
    partition.truncateTo(offsetTruncationState.offset, isFuture = false)

    if (offsetTruncationState.offset < replica.highWatermark.messageOffset)
      warn(s"Truncating $tp to offset ${offsetTruncationState.offset} below high watermark " +
        s"${replica.highWatermark.messageOffset}")

    // mark the future replica for truncation only when we do last truncation
    if (offsetTruncationState.truncationCompleted)
      replicaMgr.replicaAlterLogDirsManager.markPartitionsForTruncation(brokerConfig.brokerId, tp,
        offsetTruncationState.offset)
  }


Partition#truncateTo method

Truncates the local log of this partition to the given offset and checkpoints the recovery point to that offset.

  /**
    * Truncate the local log of this partition to the specified offset and checkpoint the recovery point to this offset
    *
    * @param offset offset to be used for truncation
    * @param isFuture True iff the truncation should be performed on the future log of this partition
    */
  def truncateTo(offset: Long, isFuture: Boolean) {
    // The read lock is needed to prevent the follower replica from being truncated while ReplicaAlterDirThread
    // is executing maybeDeleteAndSwapFutureReplica() to replace follower replica with the future replica.
    inReadLock(leaderIsrUpdateLock) {
      logManager.truncateTo(Map(topicPartition -> offset), isFuture = isFuture)
    }
  }


LogManager#truncateTo method

Truncates the local logs of the given partitions to the specified offsets and checkpoints the recovery points to those offsets.

 /**
   * Truncate the partition logs to the specified offsets and checkpoint the recovery point to this offset
   *
   * @param partitionOffsets Partition logs that need to be truncated
   * @param isFuture True iff the truncation should be performed on the future log of the specified partitions
   */
  def truncateTo(partitionOffsets: Map[TopicPartition, Long], isFuture: Boolean) {
    val affectedLogs = ArrayBuffer.empty[Log]
    for ((topicPartition, truncateOffset) <- partitionOffsets) {
      val log = {
        if (isFuture)
          futureLogs.get(topicPartition)
        else
          currentLogs.get(topicPartition)
      }
      // If the log does not exist, skip it
      if (log != null) {
        //May need to abort and pause the cleaning of the log, and resume after truncation is done.
        val needToStopCleaner = cleaner != null && truncateOffset < log.activeSegment.baseOffset
        if (needToStopCleaner && !isFuture)
          cleaner.abortAndPauseCleaning(topicPartition)
        try {
          if (log.truncateTo(truncateOffset))
            affectedLogs += log
          if (needToStopCleaner && !isFuture)
            cleaner.maybeTruncateCheckpoint(log.dir.getParentFile, topicPartition, log.activeSegment.baseOffset)
        } finally {
          if (needToStopCleaner && !isFuture) {
            cleaner.resumeCleaning(Seq(topicPartition))
            info(s"Compaction for partition $topicPartition is resumed")
          }
        }
      }
    }

    for ((dir, logs) <- affectedLogs.groupBy(_.dir.getParentFile)) {
      checkpointRecoveryOffsetsAndCleanSnapshot(dir, logs)
    }
  }


Log#truncateTo method

Truncates the Log's activeSegment at targetOffset so that every offset remaining in the Log is smaller than targetOffset, and sets the Log's LEO to targetOffset. It also removes every EpochEntry in the LeaderEpochFileCache whose startOffset is greater than or equal to targetOffset.

 /**
   * Truncate this log so that it ends with the greatest offset < targetOffset.
   *
   * @param targetOffset The offset to truncate to, an upper bound on all offsets in the log after truncation is complete.
   * @return True iff targetOffset < logEndOffset
   */
  private[log] def truncateTo(targetOffset: Long): Boolean = {
    maybeHandleIOException(s"Error while truncating log to offset $targetOffset for $topicPartition in dir ${dir.getParent}") {
      if (targetOffset < 0)
        throw new IllegalArgumentException(s"Cannot truncate partition $topicPartition to a negative offset (%d).".format(targetOffset))
      if (targetOffset >= logEndOffset) {
        info(s"Truncating to $targetOffset has no effect as the largest offset in the log is ${logEndOffset - 1}")
        false
      } else {
        // Truncation only happens when targetOffset is smaller than the LEO
        info(s"Truncating to offset $targetOffset")
        lock synchronized {
          checkIfMemoryMappedBufferClosed()
          if (segments.firstEntry.getValue.baseOffset > targetOffset) {
            truncateFullyAndStartAt(targetOffset)
          } else {
            val deletable = logSegments.filter(segment => segment.baseOffset > targetOffset)
            deletable.foreach(deleteSegment)
            // Truncate the Log's activeSegment at targetOffset
            activeSegment.truncateTo(targetOffset)
            // Set the Log's LEO to targetOffset
            updateLogEndOffset(targetOffset)
            // Move the Log's recoveryPoint down to targetOffset if targetOffset is smaller
            this.recoveryPoint = math.min(targetOffset, this.recoveryPoint)
            // Move the Log's logStartOffset down to targetOffset if targetOffset is smaller
            this.logStartOffset = math.min(targetOffset, this.logStartOffset)
            // Remove every EpochEntry in the LeaderEpochFileCache whose startOffset is >= targetOffset
            leaderEpochCache.foreach(_.truncateFromEnd(targetOffset))
            loadProducerState(targetOffset, reloadFromCleanShutdown = false)
          }
          true
        }
      }
    }
  }
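The leaderEpochCache.foreach(_.truncateFromEnd(targetOffset)) call keeps the epoch cache consistent with the shortened log: entries whose startOffset is at or beyond targetOffset can no longer own any offset. A minimal sketch of that semantics (simplified; the real LeaderEpochFileCache also rewrites its checkpoint file):

object EpochCacheTruncationSketch {
  // Drop every (epoch, startOffset) entry whose startOffset is >= targetOffset
  def truncateFromEnd(entries: Seq[(Int, Long)], targetOffset: Long): Seq[(Int, Long)] =
    entries.filter { case (_, startOffset) => startOffset < targetOffset }

  def main(args: Array[String]): Unit =
    // Truncating the cache Seq((1, 1L), (2, 5L)) at targetOffset = 5 leaves only (1, 1L)
    println(truncateFromEnd(Seq((1, 1L), (2, 5L)), targetOffset = 5L))
}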


LogSegment#truncateTo method

Truncates the offset index, the timestamp index, and the transaction index to the given offset, then truncates the log file itself at the corresponding physical position.

  /**
   * Truncate off all index and log entries with offsets >= the given offset.
   * If the given offset is larger than the largest message in this segment, do nothing.
   *
   * @param offset The offset to truncate to
   * @return The number of log bytes truncated
   */
  @nonthreadsafe
  def truncateTo(offset: Long): Int = {
    // Do offset translation before truncating the index to avoid needless scanning
    // in case we truncate the full index
    val mapping = translateOffset(offset)
    // Truncate the offset index to the given offset
    offsetIndex.truncateTo(offset)
    // Truncate the timestamp index to the given offset
    timeIndex.truncateTo(offset)
    // Truncate the transaction index to the given offset
    txnIndex.truncateTo(offset)

    // After truncation, reset and allocate more space for the (new currently active) index
    offsetIndex.resize(offsetIndex.maxIndexSize)
    timeIndex.resize(timeIndex.maxIndexSize)

    val bytesTruncated = if (mapping == null) 0 else log.truncateTo(mapping.position)
    if (log.sizeInBytes == 0) {
      created = time.milliseconds
      rollingBasedTimestamp = None
    }

    bytesSinceLastIndexEntry = 0
    if (maxTimestampSoFar >= 0)
      loadLargestTimestamp()
    bytesTruncated
  }

ReplicaFetcherThread#fetchEpochEndOffsets method

  override def fetchEpochEndOffsets(partitions: Map[TopicPartition, EpochData]): Map[TopicPartition, EpochEndOffset] = {

    if (partitions.isEmpty) {
      debug("Skipping leaderEpoch request since all partitions do not have an epoch")
      return Map.empty
    }
    // Build the OffsetsForLeaderEpochRequest
    val epochRequest = new OffsetsForLeaderEpochRequest.Builder(offsetForLeaderEpochRequestVersion, partitions.asJava)
    debug(s"Sending offset for leader epoch request $epochRequest")

    try {
      // Send the OffsetsForLeaderEpochRequest to the leader replica
      val response = leaderEndpoint.sendRequest(epochRequest)
      val responseBody = response.responseBody.asInstanceOf[OffsetsForLeaderEpochResponse]
      debug(s"Received leaderEpoch response $response")
      responseBody.responses.asScala
    } catch {
      case t: Throwable =>
        warn(s"Error when sending leader epoch request for $partitions", t)

        // if we get any unexpected exception, mark all partitions with an error
        val error = Errors.forException(t)
        partitions.map { case (tp, _) =>
          tp -> new EpochEndOffset(error, UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET)
        }
    }
  }

OffsetsForLeaderEpochRequest constructor

    public OffsetsForLeaderEpochRequest(Struct struct, short version) {
        super(ApiKeys.OFFSET_FOR_LEADER_EPOCH, version);
        epochsByPartition = new HashMap<>();
        for (Object topicAndEpochsObj : struct.get(TOPICS)) {
            Struct topicAndEpochs = (Struct) topicAndEpochsObj;
            String topic = topicAndEpochs.get(TOPIC_NAME);
            for (Object partitionAndEpochObj : topicAndEpochs.get(PARTITIONS)) {
                Struct partitionAndEpoch = (Struct) partitionAndEpochObj;
                int partitionId = partitionAndEpoch.get(PARTITION_ID);
                int leaderEpoch = partitionAndEpoch.get(LEADER_EPOCH);
                // Get the current leader epoch
                Optional<Integer> currentEpoch = RequestUtils.getLeaderEpoch(partitionAndEpoch, CURRENT_LEADER_EPOCH);
                TopicPartition tp = new TopicPartition(topic, partitionId);
                epochsByPartition.put(tp, new PartitionData(currentEpoch, leaderEpoch));
            }
        }
    }
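For reference, the per-partition payload for the running example could be assembled as below, using only the constructors visible in the code above. Treat this as an illustrative sketch: the request version is a placeholder, while ReplicaFetcherThread derives offsetForLeaderEpochRequestVersion from the inter-broker protocol version.

import java.util.{Collections, Optional}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.requests.OffsetsForLeaderEpochRequest

object OffsetsForLeaderEpochRequestSketch {
  def main(args: Array[String]): Unit = {
    val tp = new TopicPartition("topic-a", 0)
    // A asks for the end offset of epoch 1 while its current leader epoch is 2
    val epochData = new OffsetsForLeaderEpochRequest.PartitionData(Optional.of(Int.box(2)), 1)
    val partitions = Collections.singletonMap(tp, epochData)
    // Same Builder call shape as in fetchEpochEndOffsets above; version 3 is only a placeholder
    val builder = new OffsetsForLeaderEpochRequest.Builder(3.toShort, partitions)
    println(builder)
  }
}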



Copyright notice: this is an original article by qq_26222859, released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when reposting.