How to resolve current committed offsets differing from current available offsets?

Question


I am attempting to read Avro data from Kafka using Spark Structured Streaming, but I receive the following error message:

    Streaming Query Exception caught!: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted.
    === Streaming Query ===
    Identifier: [id = 8b54c92d-6bbc-4dbc-84d0-55b762c21ba2, runId = 4bc92b3c-343e-4886-b0bc-0777b89f9ec8]
    Current Committed Offsets: {KafkaV2[Subscribe[customer-avro4]]: {"customer-avro":{"0":17}}}
    Current Available Offsets: {KafkaV2[Subscribe[customer-avro4]]: {"customer-avro":{"0":20}}}
    Current State: ACTIVE
    Thread State: RUNNABLE

Any idea what the issue might be and how to resolve it? The code is the following (inspired by xebia-france's spark-structured-streaming-blog). I think it actually ran earlier, but now there is a problem.

    import com.databricks.spark.avro.SchemaConverters
    import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
    import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
    import org.apache.avro.Schema
    import org.apache.avro.generic.GenericRecord
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryException

    object AvroConsumer {
      private val topic = "customer-avro4"
      private val kafkaUrl = "http://localhost:9092"
      private val schemaRegistryUrl = "http://localhost:8081"

      // Fetch the latest value schema for the topic from the Confluent Schema Registry
      // and convert it to a Spark SQL schema.
      private val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
      private val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
      private val avroSchema = schemaRegistryClient.getLatestSchemaMetadata(topic + "-value").getSchema
      private val sparkSchema = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))

      def main(args: Array[String]): Unit = {
        val spark = SparkSession
          .builder
          .appName("ConfluentConsumer")
          .master("local[*]")
          .getOrCreate()

        spark.sparkContext.setLogLevel("ERROR")

        // UDF that turns the Avro-encoded Kafka value into its string (JSON) representation.
        spark.udf.register("deserialize", (bytes: Array[Byte]) =>
          DeserializerWrapper.deserializer.deserialize(bytes)
        )

        val kafkaDataFrame = spark
          .readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", kafkaUrl)
          .option("subscribe", topic)
          .load()

        val valueDataFrame = kafkaDataFrame.selectExpr("""deserialize(value) AS message""")

        import org.apache.spark.sql.functions._

        // Parse the deserialized string back into columns using the schema derived above.
        val formattedDataFrame = valueDataFrame.select(
          from_json(col("message"), sparkSchema.dataType).alias("parsed_value"))
          .select("parsed_value.*")

        val writer = formattedDataFrame
          .writeStream
          .format("parquet")
          .option("checkpointLocation", "hdfs://localhost:9000/data/spark/parquet/checkpoint")

        // Restart the query whenever it fails with a StreamingQueryException.
        while (true) {
          val query = writer.start("hdfs://localhost:9000/data/spark/parquet/total")
          try {
            query.awaitTermination()
          } catch {
            case e: StreamingQueryException => println("Streaming Query Exception caught!: " + e)
          }
        }
      }

      object DeserializerWrapper {
        val deserializer: AvroDeserializer = kafkaAvroDeserializer
      }

      class AvroDeserializer extends AbstractKafkaAvroDeserializer {
        def this(client: SchemaRegistryClient) {
          this()
          this.schemaRegistry = client
        }

        override def deserialize(bytes: Array[Byte]): String = {
          val genericRecord = super.deserialize(bytes).asInstanceOf[GenericRecord]
          genericRecord.toString
        }
      }
    }

Answer 1

Score: 0

Figured it out - the problem was not, as I had thought, with the Spark-Kafka integration directly, but with the checkpoint information inside the HDFS filesystem. Deleting and recreating the checkpoint folder in HDFS solved it for me.
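
For reference, a minimal sketch of that cleanup, assuming the checkpoint path used in the question's writer and a reachable HDFS namenode (the object name ResetCheckpoint is just for illustration). The same thing can be done from the command line with `hdfs dfs -rm -r hdfs://localhost:9000/data/spark/parquet/checkpoint`. Keep in mind that deleting the checkpoint also discards the stored Kafka offsets, so the restarted query picks its starting position from the source's startingOffsets option (latest by default for a new streaming query) rather than resuming where it left off.

    import java.net.URI

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch: wipe and recreate the Structured Streaming checkpoint directory.
    // The path below is the one from the question; adjust it for your cluster.
    object ResetCheckpoint {
      def main(args: Array[String]): Unit = {
        val checkpointDir = "hdfs://localhost:9000/data/spark/parquet/checkpoint"

        val conf = new Configuration()
        val fs   = FileSystem.get(new URI(checkpointDir), conf)
        val path = new Path(checkpointDir)

        // Recursively delete the old checkpoint (offsets, commits, metadata)...
        if (fs.exists(path)) {
          fs.delete(path, true)
        }
        // ...and recreate an empty directory for the next run of the query.
        fs.mkdirs(path)

        fs.close()
      }
    }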
