Finding the total time it takes for a dataframe write to an ADLS path?
Question
I write 100+ dataframes in a loop. How do I log the total duration a single dataframe takes to write a CSV to an ADLS path?

I would like to store this information in a table so that I can check which dataframe needs optimization.

Sample code that writes a TSV to a Datalake path:
```scala
dataFrame
  .repartition(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("sep", colDelim)
  .option("quoteAll", true)
  .option("escape", "\"")
  .mode("overwrite")
  .save(filePath + fileName)
```
Answer 1

Score: 0
You can use the PySpark code below to get the duration of each dataframe write operation. I am assuming you have a list of dataframes.
```python
import time

log = []
for i, df in enumerate(dataframe_list):
    # save() blocks until the write finishes, so the wall-clock delta
    # around it captures the full write duration.
    # Note: this example uses a space separator; for a TSV use "\t".
    start_time = time.time()
    df.repartition(1).write.format("csv") \
        .option("header", "true") \
        .option("sep", " ") \
        .option("quoteAll", "true") \
        .option("escape", "\"") \
        .mode("overwrite") \
        .save("/mnt/csv/dataframe_" + str(i))
    end_time = time.time()
    duration = end_time - start_time
    each_df_time = {
        "DF_name": "dataframe" + str(i),
        "time_taken": duration
    }
    log.append(each_df_time)

# Collect the per-dataframe timings into a Spark dataframe.
log_df = spark.createDataFrame(log, schema="DF_name STRING, time_taken DOUBLE")
display(log_df)
```
Output:
![Output table with DF_name and time_taken columns](https://i.imgur.com/tQvlfzK.png)
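Since the asker also wants the timings stored in a table, here is a minimal follow-up sketch, assuming a Databricks/Hive metastore is available; `write_durations` is a hypothetical table name:

```python
# Append this run's timings to a metastore table so runs can be compared later.
log_df.write.mode("append").saveAsTable("write_durations")

# Find the slowest writes, i.e. the dataframes that may need optimization.
spark.sql(
    "SELECT DF_name, time_taken FROM write_durations ORDER BY time_taken DESC"
).show()
```

Because `save()` is a blocking action, the measured duration also includes the shuffle forced by `repartition(1)`.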