How to resolve current committed offsets differing from current available offsets?
Question
I am attempting to read Avro data from Kafka using Spark Structured Streaming, but I receive the following error message:
Streaming Query Exception caught!: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted.
=== Streaming Query ===
Identifier: [id = 8b54c92d-6bbc-4dbc-84d0-55b762c21ba2, runId = 4bc92b3c-343e-4886-b0bc-0777b89f9ec8]
Current Committed Offsets: {KafkaV2[Subscribe[customer-avro4]]: {"customer-avro":{"0":17}}}
Current Available Offsets: {KafkaV2[Subscribe[customer-avro4]]: {"customer-avro":{"0":20}}}
Current State: ACTIVE
Thread State: RUNNABLE
Any idea what the issue might be and how to resolve it? The code is the following (inspired by the xebia-france spark-structured-streaming-blog). Actually, I think it ran earlier already, but now there is a problem.
import com.databricks.spark.avro.SchemaConverters
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryException

object AvroConsumer {
  private val topic = "customer-avro4"
  private val kafkaUrl = "http://localhost:9092"
  private val schemaRegistryUrl = "http://localhost:8081"

  private val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
  private val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)

  private val avroSchema = schemaRegistryClient.getLatestSchemaMetadata(topic + "-value").getSchema
  private val sparkSchema = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("ConfluentConsumer")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    spark.udf.register("deserialize", (bytes: Array[Byte]) =>
      DeserializerWrapper.deserializer.deserialize(bytes)
    )

    val kafkaDataFrame = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaUrl)
      .option("subscribe", topic)
      .load()

    val valueDataFrame = kafkaDataFrame.selectExpr("""deserialize(value) AS message""")

    import org.apache.spark.sql.functions._

    val formattedDataFrame = valueDataFrame.select(
      from_json(col("message"), sparkSchema.dataType).alias("parsed_value"))
      .select("parsed_value.*")

    val writer = formattedDataFrame
      .writeStream
      .format("parquet")
      .option("checkpointLocation", "hdfs://localhost:9000/data/spark/parquet/checkpoint")

    while (true) {
      val query = writer.start("hdfs://localhost:9000/data/spark/parquet/total")
      try {
        query.awaitTermination()
      }
      catch {
        case e: StreamingQueryException => println("Streaming Query Exception caught!: " + e);
      }
    }
  }

  object DeserializerWrapper {
    val deserializer: AvroDeserializer = kafkaAvroDeserializer
  }

  class AvroDeserializer extends AbstractKafkaAvroDeserializer {
    def this(client: SchemaRegistryClient) {
      this()
      this.schemaRegistry = client
    }

    override def deserialize(bytes: Array[Byte]): String = {
      val genericRecord = super.deserialize(bytes).asInstanceOf[GenericRecord]
      genericRecord.toString
    }
  }
}
Answer 1
Score: 0
Figured it out - the problem was not, as I had thought, with the Spark-Kafka integration directly, but with the checkpoint information inside the HDFS filesystem. Deleting and recreating the checkpoint folder in HDFS solved it for me.
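For reference, below is a minimal sketch (in Scala, using the Hadoop FileSystem API) of how the checkpoint directory could be deleted and recreated programmatically. The NameNode URI and path are assumptions taken from the checkpointLocation in the question, so adjust them to your setup. Keep in mind that removing the checkpoint discards the committed offsets, so on restart the query reprocesses data according to its startingOffsets option.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ResetCheckpoint {
  def main(args: Array[String]): Unit = {
    // Assumed values, matching the checkpointLocation from the question; adjust for your cluster.
    val hdfsUri = "hdfs://localhost:9000"
    val checkpointDir = new Path("/data/spark/parquet/checkpoint")

    val fs = FileSystem.get(new URI(hdfsUri), new Configuration())

    // Recursively delete the old checkpoint state (offsets, commits, metadata, sources).
    if (fs.exists(checkpointDir)) {
      fs.delete(checkpointDir, true)
    }

    // Recreate an empty directory so the next start of the query begins with fresh state.
    fs.mkdirs(checkpointDir)
    fs.close()
  }
}

The same can be done from the command line with hdfs dfs -rm -r /data/spark/parquet/checkpoint followed by hdfs dfs -mkdir -p /data/spark/parquet/checkpoint.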