Sending data to Azure Event Hub using Synapse Spark

Question

I am working in Synapse Analytics Studio using PySpark. While I am able to read Event Hub messages, I am not able to produce them: I get an error when calling the save method.

# %pip install azure-eventhub
import json

connectionString = "Endpoint=sb://::hidden::"
ehConf = {}
# The connector expects the connection string to be encrypted
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)

# Create the starting position (read from the start of the stream)
startingEventPosition = {
  "offset": -1,
  "seqNo": -1,            # not in use
  "enqueuedTime": None,   # not in use
  "isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)
df = spark.read.format("eventhubs").options(**ehConf).load()
display(df)

This works: I can see the complete messages, with sequenceNumber, body, et al.

Now I want to write to the Event Hub from a DataFrame.

from pyspark.sql.functions import struct

df1 = spark.read.parquet(silver_path)  # confirmed to have data via display(df1)
df1 \
  .select(struct(*[c for c in df1.columns]).alias("body")) \
  .write \
  .format("eventhubs") \
  .options(**ehConf) \
  .save()

Results in:

Py4JJavaError: An error occurred while calling o4089.save.
: java.lang.NoSuchMethodError: org.apache.spark.sql.AnalysisException.<init>(Ljava/lang/String;Lscala/Option;Lscala/Option;Lscala/Option;Lscala/Option;)V
	at org.apache.spark.sql.eventhubs.EventHubsWriter$.validateQuery(EventHubsWriter.scala:58)
	at org.apache.spark.sql.eventhubs.EventHubsWriter$.write(EventHubsWriter.scala:70)
	at org.apache.spark.sql.eventhubs.EventHubsSourceProvider.createRelation(EventHubsSourceProvider.scala:124)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:108)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:183)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:97)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:104)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:104)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:88)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:136)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:901)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:382)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)

Thanks in advance.

Answer 1

Score: 0

I tried this in my environment and got the same error when writing a DataFrame to Event Hub.

To resolve the error, try the code below, which serializes each row to a JSON string for the body column.

import pyspark.sql.functions as F

# Pack every column into a struct, then serialize it to a JSON string;
# the Event Hubs writer expects a "body" column of string or binary type.
df1 \
  .select(F.to_json(F.struct("*")).alias("body")) \
  .write \
  .format("eventhubs") \
  .options(**ehConf) \
  .save()

Here my df1 DataFrame contains a sample of 5 rows.

I then read the data back from the Event Hub and confirmed that the 5 new rows from DataFrame df1 were ingested; a minimal read-back sketch follows.
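This is a minimal sketch, assuming the same ehConf defined in the question; the connector returns the body column as binary, so it is cast to a string to inspect the JSON payloads that were just written.

# Minimal read-back sketch (assumes the same ehConf as above).
# The connector returns body as binary, so cast it to a string
# to inspect the JSON payloads produced by the write.
import pyspark.sql.functions as F

df_check = spark.read.format("eventhubs").options(**ehConf).load()
display(df_check.select(F.col("body").cast("string").alias("body_json")))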

