February 19, 2023, 09:44:14

How to generate Pyspark dynamic frame name dynamically

Question

I have a table with the data shown in the diagram. I want to store query results in DataFrames whose names are generated dynamically.

For example, in the code below I want to create two DataFrames named dnb_df and es_df, store the result of each read in them, and print the structure of each DataFrame.

When I run the code below, I get this error:

> SyntaxError: can't assign to operator (TestGlue2.py, line 66)


import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import regexp_replace, col


args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
#sc.setLogLevel('DEBUG')
glueContext = GlueContext(sc)
spark = glueContext.spark_session

#logger = glueContext.get_logger()
#logger.DEBUG('Hello Glue')
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

client = boto3.client('glue', region_name='XXXXXX')
response = client.get_connection(Name='XXXXXX')
connection_properties = response['Connection']['ConnectionProperties']
URL = connection_properties['JDBC_CONNECTION_URL']
url_list = URL.split("/")
host = "{}".format(url_list[-2][:-5])
new_host = host.split('@', 1)[1]
port = url_list[-2][-4:]
database = "{}".format(url_list[-1])
Oracle_Username = "{}".format(connection_properties['USERNAME'])
Oracle_Password = "{}".format(connection_properties['PASSWORD'])

#print("Oracle_Username:",Oracle_Username)
#print("Oracle_Password:",Oracle_Password)
print("Host:", host)
print("New Host:", new_host)
print("Port:", port)
print("Database:", database)
Oracle_jdbc_url = "jdbc:oracle:thin:@//" + new_host + ":" + port + "/" + database
print("Oracle_jdbc_url:", Oracle_jdbc_url)
source_df = spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("dbtable", "(select * from schema.table order by VENDOR_EXECUTION_ORDER) ").option("user", Oracle_Username).option("password", Oracle_Password).load()
vendor_data = source_df.collect()
for row in vendor_data:
    vendor_query = row.SRC_QUERY
    row.VENDOR_NAME+'_df' = spark.read.format("jdbc").option("url",
        Oracle_jdbc_url).option("dbtable", vendor_query).option("user",
        Oracle_Username).option("password", Oracle_Password).load()
    print(row.VENDOR_NAME+'_df')

Added use case in picture

Answer 1

Score: 1

Update: As discussed in the comments, your requirement is to go one step further and join each of these DataFrames with another DataFrame.

for row in vendor_data:
  rowAsDict = row.asDict()
  # Here you can use any variable, as rowAsDict is not going to be used anywhere else anyway
  rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"] = spark.sql(rowAsDict["SOURCE_QUERY"])
  main_dataframe = main_dataframe.join(rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"], "acc_id")

[Screenshots omitted: input main_dataframe, source_df, View1 and View2, output main_dataframe]
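
The loop above folds each vendor DataFrame into main_dataframe one join at a time. A minimal sketch of the same fold, assuming the vendor DataFrames have already been collected into a list and that each one has the acc_id column used above; `join_all` is a hypothetical helper name, not a PySpark function:

```python
from functools import reduce


def join_all(main_frame, vendor_frames, key="acc_id"):
    """Join every frame in vendor_frames onto main_frame on `key`, in order.

    Only relies on each object having a .join(other, on) method, which is
    how DataFrame.join is called in the loop above.
    """
    return reduce(
        lambda left, right: left.join(right, key),
        vendor_frames,
        main_frame,  # starting value: nothing is joined if the list is empty
    )
```

With an empty list the function returns main_frame unchanged, which matches what the loop does when vendor_data has no rows.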

If I understood correctly, you need to generate the VENDOR_NAME_DF names dynamically.

You can't assign to the Row object, and assigning a DataFrame to a Row wouldn't be useful anyway, because you can't create a DataFrame with a column of type DataFrame.

You can, however, convert the row to a dict with asDict and use that instead.

This would work:

vendor_data = source_df.collect()

for row in vendor_data:
  rowAsDict = row.asDict()
  # Replace this with spark.read() or any way to create a Dataframe
  rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"] = spark.sql(rowAsDict["SOURCE_QUERY"])
  rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"].show()

[Screenshots omitted: input Source_DF, result of SOURCE_QUERY, and output of rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"].show()]

Final rowAsDict:

{'VENDOR_NAME': 'Name1', 'SOURCE_QUERY': 'select * from view1', 'Name1_df': DataFrame[id: string, date: string, Code: string]}
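
A variation on the same idea, sketched under assumptions: instead of adding the DataFrame to each row's own dict, keep a single dict keyed by the generated name, so every DataFrame is still reachable after the loop ends. `load_frames` and its `loader` argument are hypothetical names, not part of PySpark or Glue; `loader` stands in for whatever creates the DataFrame (spark.sql, or the JDBC read from the question). The attribute names VENDOR_NAME and SRC_QUERY are taken from the question's code.

```python
def load_frames(rows, loader, suffix="_df"):
    """Return {"<VENDOR_NAME>_df": loader(<SRC_QUERY>)} for every row.

    rows   : iterable of objects exposing VENDOR_NAME and SRC_QUERY as
             attributes (the Row objects returned by collect() do).
    loader : hypothetical callable that turns a query string into a
             DataFrame; it is passed in so this sketch does not need Spark.
    """
    frames = {}
    for row in rows:
        # A string expression is a valid dict key, unlike an assignment target.
        frames[row.VENDOR_NAME + suffix] = loader(row.SRC_QUERY)
    return frames
```

In the Glue job from the question this might be called as `frames = load_frames(vendor_data, lambda q: spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("dbtable", q).option("user", Oracle_Username).option("password", Oracle_Password).load())`, after which `frames["dnb_df"].printSchema()` would print the structure of that DataFrame.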

Answer 2

Score: 1

Add the last two statements below to your for loop and you should get the results.
The first registers a temp view under the dynamic DataFrame name.
The second shows the data in that temp view.

for row in vendor_data:
    vendor_query = row.SRC_QUERY
    spark.read.format("jdbc").option("url",
        Oracle_jdbc_url).option("dbtable", vendor_query).option("user",
        Oracle_Username).option("password", Oracle_Password).load().createOrReplaceTempView(row.VENDOR_NAME+'_df')
    spark.sql("select * from "+row.VENDOR_NAME+"_df").show()
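
Because the vendor name ends up inside a SQL statement as the view name, it may be worth normalizing it first. The helper below is only a sketch, under the assumption that vendor names could contain spaces or punctuation; `view_name_for` is a hypothetical name, not a Spark API.

```python
import re


def view_name_for(vendor_name, suffix="_df"):
    """Build a temp-view name made only of letters, digits and underscores."""
    # Collapse every run of non-word characters into a single underscore.
    cleaned = re.sub(r"\W+", "_", vendor_name.strip()).strip("_")
    if not cleaned:
        raise ValueError("vendor name has no usable characters")
    # Avoid a name that starts with a digit by adding a fixed prefix.
    if cleaned[0].isdigit():
        cleaned = "v_" + cleaned
    return cleaned + suffix
```

The same returned name would then be passed to createOrReplaceTempView and reused when building the "select * from ..." query, so the two always agree.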
