Finding total time it takes for dataframe write in ADLS path?


Question

I write 100+ dataframes in a loop. How do I log the total duration each individual dataframe takes to write to an ADLS path?
I would like to store this information in a table so I can see which dataframe needs optimization.

Sample code to write a TSV to a Datalake path:

```scala
dataFrame
  .repartition(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("sep", colDelim)
  .option("quoteAll", true)
  .option("escape", "\"")
  .mode("overwrite")
  .save(filePath + fileName)
```

Answer 1

Score: 0

You can use the PySpark code below to get the duration of each dataframe write operation:

I am assuming you have a list of dataframes. Because `save()` is an action, each call blocks until the write completes, so wall-clock timing around it captures the full write duration.

```python
import time

log = []
for i, df in enumerate(dataframe_list):
    start_time = time.time()
    # Time the write; use "\t" as the separator for a true TSV
    df.repartition(1).write.format("csv") \
        .option("header", "true") \
        .option("sep", " ") \
        .option("quoteAll", "true") \
        .option("escape", "\"") \
        .mode("overwrite") \
        .save("/mnt/csv/dataframe_" + str(i))
    end_time = time.time()
    duration = end_time - start_time
    each_df_time = {
        "DF_name": "dataframe" + str(i),
        "time_taken": duration
    }
    log.append(each_df_time)

log_df = spark.createDataFrame(log, schema="DF_name STRING, time_taken DOUBLE")
display(log_df)
```
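To see which dataframe needs optimization, you can also sort the log by duration before displaying it. A small usage sketch over the `log_df` built above:

```python
# Show the slowest writes first
log_df.orderBy("time_taken", ascending=False).show()
```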

Output:

![enter image description here](https://i.imgur.com/tQvlfzK.png)
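If you want to keep these timings across runs, you could also persist the log to a table rather than only displaying it. A minimal sketch, assuming a metastore is available (as on Databricks) and using a hypothetical table name `df_write_durations`:

```python
# Append this run's timings to a metastore table (table name is hypothetical)
log_df.write.mode("append").saveAsTable("df_write_durations")
```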

huangapple
  • Posted on 2023-06-01 14:20:15
  • Original link: https://go.coder-hub.com/76379160.html