2023-07-17 17:51:57

Spark streaming forEachBatch giving inconsistent/unordered result while writing to database

Question

Problem:
I am receiving data for multiple tables/schemas in a single stream.
After separating the data by table, I open a parallel write stream for each table.

The function I use in foreachBatch is:

def writeToAurora(df, batch_id, tableName):
    df = df.persist()
    stagingTable = f'{str(tableName.lower())}_delta'

    df.write \
            .mode("overwrite") \
            .format("jdbc") \
            .option("truncate", "true") \
            .option("driver", DB_conf['DRIVER']) \
            .option("batchsize", 1000) \
            .option("url", DB_conf['URL']) \
            .option("dbtable", stagingTable) \
            .option("user", DB_conf['USER_ID']) \
            .option("password", DB_conf['PASSWORD']) \
            .save()
    df.unpersist()

The logic to open multiple write streams is:

data_df = spark.readStream.format("kinesis") \
    .option("streamName", stream_name) \
    .option("startingPosition", initial_position) \
    .load()

# Distinguishing table wise df
distinctTables = ['Table1', 'Table2', 'Table3']
tablesDF = {table: data_df.filter(f"TableName = '{table}'") for table in distinctTables}

# Processing Each Table
for table, tableDF in tablesDF.items():
    df = tableDF.withColumn('csvData', F.from_csv('finalData', schema=tableSchema[table], options={'sep': '|', 'quote': '"'})) \
        .select('csvData.*')

    vars()[table+'_query'] = df.writeStream \
                .trigger(processingTime='120 seconds') \
                .foreachBatch(lambda fdf, batch_id: writeToAurora(fdf, batch_id, table)) \
                .option("checkpointLocation", f"s3://{bucket}/temporary/checkpoint/{table}") \
                .start()

for table in tablesDF.keys():
    eval(table+'_query').awaitTermination()

Issue:
When running the above code, Table1's data is sometimes loaded into Table2, and the mapping is different each time the code runs.
The pairing between each DataFrame and the table it should be loaded into is not maintained.

I need help understanding why this is happening.

Answer 1

Score: 3

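This is caused by late binding in the lambda you pass to foreachBatch: the lambda looks up the loop variable table when it is called for a micro-batch, not when it is created, so it sees whatever value table holds at that moment.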
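The effect can be reproduced without Spark. The following is a minimal sketch in plain Python; write is a hypothetical stand-in for the batch writer and is not part of the original code. It shows that every lambda created in the loop shares the same variable, so the destination depends on when the callback runs.

```python
def write(data, table_name):
    """Hypothetical stand-in for a batch writer: report the destination table."""
    return f"{data} -> {table_name}"


tables = ["t0", "t1", "t2"]

# Each lambda refers to the variable t itself, not to the value t had
# when the lambda was created.
callbacks = {}
for t in tables:
    callbacks[t] = lambda data: write(data, t)

# The loop has finished, so t is "t2" for every callback.
late_bound = [callbacks[name](f"{name}-data") for name in tables]
print(late_bound)

# Rebinding t later changes where every callback writes.
t = "t0"
after_rebind = callbacks["t2"]("t2-data")
print(after_rebind)
```

In the question's code the callbacks run later, whenever a micro-batch fires, and the second loop reuses the name table, which is probably why the destination varies from run to run.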

Here's an example. It tries to write every stream to "t2" and fails (in practice only the "t2" table is written, but with the "t0" data):

from pyspark.sql.functions import col, concat, lit

def writeToTable(df, epochId, table_name):
    df.write.mode("overwrite").saveAsTable(f"custanwo.dsci.stream_test_{table_name}")

# spark is assumed to be an existing SparkSession
data_df = spark.readStream.format("rate").load()
data_df = (data_df
           .selectExpr("value % 10 as key")
           .groupBy("key")
           .count()
           .withColumn("t", concat(lit("t"), (col("key") % 3).astype("string")))
)

table_names = ["t0", "t1", "t2"]
table_df = {t: data_df.filter(f"t = '{t}'") for t in table_names}

for t, df in table_df.items():
    # Late binding: t is looked up when the lambda is called, not here
    vars()[f"{t}_query"] = (df
                            .writeStream
                            .foreachBatch(lambda df, epochId: writeToTable(df, epochId, t))
                            .outputMode("update")
                            .start()
                            )

There are a few ways to resolve this. One is to use functools.partial, which binds the current value of t when the callback is created:

from functools import partial

from pyspark.sql.functions import col, concat, lit

def writeToTable(df, epochId, table_name):
    df.write.mode("overwrite").saveAsTable(f"custanwo.dsci.stream_test_{table_name}")

# spark is assumed to be an existing SparkSession
data_df = spark.readStream.format("rate").load()
data_df = (data_df
           .selectExpr("value % 10 as key")
           .groupBy("key")
           .count()
           .withColumn("t", concat(lit("t"), (col("key") % 3).astype("string")))
)

table_names = ["t0", "t1", "t2"]
table_df = {t: data_df.filter(f"t = '{t}'") for t in table_names}

for t, df in table_df.items():
    # partial captures the value of t at this iteration
    vars()[f"{t}_query"] = (df
                            .writeStream
                            .foreachBatch(partial(writeToTable, table_name=t))
                            .outputMode("update")
                            .start()
                            )
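Two other common ways to bind the value early are a default argument on the lambda and a small factory function. The sketch below uses plain Python so it runs without Spark; write and make_writer are hypothetical names, and it assumes the callback is invoked with two positional arguments (the batch data and the epoch id), as foreachBatch does.

```python
def write(data, epoch_id, table_name):
    """Hypothetical stand-in for a batch writer: report the destination table."""
    return f"{data} -> {table_name}"


tables = ["t0", "t1", "t2"]

# Option A: a default argument is evaluated once, when the lambda is defined.
by_default = {
    t: (lambda data, epoch_id, table_name=t: write(data, epoch_id, table_name))
    for t in tables
}


# Option B: a factory function gives each callback its own enclosing scope.
def make_writer(table_name):
    def writer(data, epoch_id):
        return write(data, epoch_id, table_name)
    return writer


by_factory = {t: make_writer(t) for t in tables}

results_default = [by_default[t]("rows", 0) for t in tables]
results_factory = [by_factory[t]("rows", 0) for t in tables]
print(results_default)
print(results_factory)
```

Both variants keep the table name fixed per callback; partial is usually the shortest to write, while the factory leaves room for extra per-table setup.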

In your code, add from functools import partial and rewrite your writeStream (inside the for loop) to:

    vars()[table+'_query'] = df.writeStream \
                .trigger(processingTime='120 seconds') \
                .foreachBatch(partial(writeToAurora, tableName=table)) \
                .option("checkpointLocation", f"s3://{bucket}/temporary/checkpoint/{table}") \
                .start()
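As a possible tidy-up, the query handles can be kept in a dictionary instead of going through vars() and eval, which also avoids reusing the loop variable later. This is only a sketch: start_table_queries and checkpoint_root are hypothetical names, and frames is assumed to map each table name to its already parsed streaming DataFrame.

```python
from functools import partial


def start_table_queries(frames, write_fn, checkpoint_root):
    """Start one streaming query per table and return the handles by table name.

    frames: dict of table name -> streaming DataFrame (assumed already parsed).
    write_fn: callback with signature (df, batch_id, tableName).
    checkpoint_root: hypothetical base path for per-table checkpoints.
    """
    queries = {}
    for table, frame in frames.items():
        queries[table] = (
            frame.writeStream
            .trigger(processingTime="120 seconds")
            # partial fixes tableName to this iteration's value
            .foreachBatch(partial(write_fn, tableName=table))
            .option("checkpointLocation", f"{checkpoint_root}/{table}")
            .start()
        )
    return queries
```

It could be called as queries = start_table_queries(parsed_frames, writeToAurora, f"s3://{bucket}/temporary/checkpoint"), followed by spark.streams.awaitAnyTermination() to block until one of the queries stops.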

Answer 2

Score: -1

def writeToAurora(df, batch_id, tableName):
    df = df.withColumn("TableName", F.lit(tableName))  # Add the TableName column to the DataFrame
    df = df.persist()
    stagingTable = f'{str(tableName.lower())}_delta'

    df.write \
        .mode("overwrite") \
        .format("jdbc") \
        .option("truncate", "true") \
        .option("driver", DB_conf['DRIVER']) \
        .option("batchsize", 1000) \
        .option("url", DB_conf['URL']) \
        .option("dbtable", stagingTable) \
        .option("user", DB_conf['USER_ID']) \
        .option("password", DB_conf['PASSWORD']) \
        .save()
    df.unpersist()

With this change, the DataFrame df passed to the writeToAurora function will have an additional column named "TableName" containing the name of the table to which the data belongs. The writeToAurora function will then use this information to write the data to the appropriate staging table in Aurora.
