英文:
Sending data to Azure Event Hub using Synapse Spark
问题
在使用 PySpark 在 Synapse Analytics Studio 上工作时,我能够读取 Event Hub 消息,但无法生产消息。在使用 save 方法时出现错误。
### %pip install azure-eventhub
import json
connectionString = "Endpoint=sb://::hidden::"
ehConf = { }
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
# 创建起始位置
startingEventPosition = {
"offset": -1,
"seqNo": -1, #不使用
"enqueuedTime": None, #不使用
"isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)
df = spark.read.format("eventhubs").options(**ehConf).load()
display(df)
这段代码可以正常工作,我可以看到包含序列号、正文等完整消息。
现在,我想从数据框写入到事件中心。
df1 = spark.read.parquet(silver_path) # 通过 display(df1) 确认有数据
df1 \
.select(struct(*[c for c in df1.columns]).alias("body")) \
.write \
.format("eventhubs") \
.options(**ehConf) \
.save()
结果如下:
Py4JJavaError: An error occurred while calling o4089.save.
: java.lang.NoSuchMethodError: org.apache.spark.sql.AnalysisException.<init>(Ljava/lang/String;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;)V
at org.apache.spark.sql.eventhubs.EventHubsWriter$.validateQuery(EventHubsWriter.scala:58)
...
提前致以问候。
英文:
Working on Synapse Analytics Studio using PySpark and while I was able to read the Event Hub messages, I am not able to produce messages. Getting an error when using the save method.
### %pip install azure-eventhub
import json
connectionString = "Endpoint=sb://::hidden::"
ehConf = { }
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
# Create the positions
startingEventPosition = {
"offset": -1,
"seqNo": -1, #not in use
"enqueuedTime": None, #not in use
"isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)
df = spark.read.format("eventhubs").options(**ehConf).load()
display(df)
Works... I can see the messages complete with sequenceNumber, body, et al
Now, I want to write to the event hub from a data frame.
df1 = spark.read.parquet(silver_path) # confirmed to have data via display(df1)
df1 \
.select(struct(*[c for c in df1.columns]).alias("body")) \
.write \
.format("eventhubs") \
.options(**ehConf) \
.save()
Results in:
Py4JJavaError: An error occurred while calling o4089.save.
: java.lang.NoSuchMethodError: org.apache.spark.sql.AnalysisException.<init>(Ljava/lang/String;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;)V
at org.apache.spark.sql.eventhubs.EventHubsWriter$.validateQuery(EventHubsWriter.scala:58)
at org.apache.spark.sql.eventhubs.EventHubsWriter$.write(EventHubsWriter.scala:70)
at org.apache.spark.sql.eventhubs.EventHubsSourceProvider.createRelation(EventHubsSourceProvider.scala:124)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:108)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:183)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:97)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:104)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:104)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:88)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:136)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:901)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:382)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Regards in advance.
答案1
得分: 0
I tried in my environment I got same error when writing dataframe to event hub.
要解决此错误,请尝试以下代码。
import pyspark.sql.functions as F
df1 \
.select(F.to_json(F.struct("*")).alias("body")) \
.write \
.format("eventhubs") \
.options(**ehConf) \
.save()
Here my df1
dataframe contains a sample of 5 rows.
在这里,我的 df1
数据帧包含了5行示例数据。
I have read the data from event hub, and you can see 5 new rows are ingested into event hub from dataframe df1
.
我已经从事件中心读取了数据,您可以看到从数据帧 df1
中摄取了5行新数据到事件中心。
英文:
I tried in my environment I got same error when writing dataframe to event hub.
To resolve this error, try the below code.
import pyspark.sql.functions as F
df1 \
.select(F.to_json(F.struct("*")).alias("body")) \
.write \
.format("eventhubs") \
.options(**ehConf) \
.save()
Here my df1
dataframe contains a sample of 5 rows.
I have read the data from event hub, and you can see 5 new rows are ingested into event hub from dataframe df1
.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论