Finding total time it takes for dataframe write in ADLS path?

Question


I write 100+ dataframes in a loop. How do I log the total duration each dataframe takes to write a CSV to an ADLS path?
I would like to store this information in a table so that I can check which dataframes need optimization.

Sample code to write a TSV to a Data Lake path:

```scala
dataFrame
  .repartition(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("sep", colDelim)
  .option("quoteAll", true)
  .option("escape", "\"")
  .mode("overwrite")
  .save(filePath + fileName)
```

Answer 1

Score: 0

Here, you can use the PySpark code below to get the duration of each dataframe write operation. Because `.save()` is an action, it blocks until the whole Spark write job for that dataframe has finished, so wall-clock timing on the driver captures the full write.

I am assuming you have a list of dataframes:

```python
import time

log = []

for i, df in enumerate(dataframe_list):
    # Wall-clock timestamp just before the write starts.
    start_time = time.time()

    # .save() is an action, so this call blocks until the entire
    # write job for this dataframe has finished.
    # (This demo uses a space separator; the question's TSV would use "\t".)
    df.repartition(1).write.format("csv") \
        .option("header", "true") \
        .option("sep", " ") \
        .option("quoteAll", "true") \
        .option("escape", "\"") \
        .mode("overwrite") \
        .save("/mnt/csv/dataframe_" + str(i))

    end_time = time.time()
    duration = end_time - start_time  # elapsed seconds for this dataframe

    each_df_time = {
        "DF_name": "dataframe" + str(i),
        "time_taken": duration
    }
    log.append(each_df_time)

# Collect all timings into one dataframe so slow writes are easy to spot.
log_df = spark.createDataFrame(log, schema="DF_name STRING, time_taken DOUBLE")
display(log_df)
```

Output:

![在此处输入图像描述](https://i.imgur.com/tQvlfzK.png)
