# How to extract parameters from PySpark reading process?
# Question
I am reading data (CSV format) with PySpark in this way:
```python
spark.read.format('csv').option('header', 'true').load(my_path)
```
where `my_path` is something like:
```
s3://my_root/my_datasource/2012-01-01/
s3://my_root/my_datasource/2012-01-02/
s3://my_root/my_datasource/2012-01-03/
s3://my_root/my_datasource/2012-01-04/
...
s3://my_root/my_datasource/<until today>/
```
As you can see, there is a separate folder per day, named in `YYYY-MM-DD` format.
If I read the data above with:
```python
spark.read.format('csv').option('header', 'true').load('s3://my_root/my_datasource')
```
I will be able to process the whole dataset. However, when I do that, how can I extract the corresponding date (`2012-01-01`, `2012-01-02`, etc.), given that it is not present in the data itself but only as the folder name?
# Answer 1
**Score**: 1
Use the **`regexp_extract`** function with the regular expression `"(\\d{4}-\\d{2}-\\d{2})"` to capture the date part.

To get the folder path, use Spark's **`input_file_name()`** function.

**Example:**
```python
from pyspark.sql.functions import col, input_file_name, regexp_extract

# sample dataframe; the paths are stored in a column here to simulate them.
# On data actually loaded from files, input_file_name() returns the source
# path for each row (on this in-memory sample it returns an empty string).
df = spark.createDataFrame(
    [('s3://my_root/my_datasource/2012-01-01/',),
     ('s3://my_root/my_datasource/2012-01-02/',)],
    ['folder_path'])
df = df.withColumn("input_folder_path", input_file_name())

# extract the YYYY-MM-DD part of the path into a dt column
df1 = df.withColumn("dt", regexp_extract(col("folder_path"), "(\\d{4}-\\d{2}-\\d{2})", 1))
df1.select("folder_path", "dt").show(10, False)
# +--------------------------------------+----------+
# |folder_path                           |dt        |
# +--------------------------------------+----------+
# |s3://my_root/my_datasource/2012-01-01/|2012-01-01|
# |s3://my_root/my_datasource/2012-01-02/|2012-01-02|
# +--------------------------------------+----------+
```
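
Applied to the question's setup, a minimal sketch could combine the two functions directly on the loaded data (the `s3://my_root/my_datasource` location is taken from the question; this assumes the date appears exactly once in each file's path):

```python
from pyspark.sql.functions import input_file_name, regexp_extract

# read every daily folder at once, then recover the date from each row's source path
df = spark.read.format('csv').option('header', 'true').load('s3://my_root/my_datasource')
df = df.withColumn("dt", regexp_extract(input_file_name(), "(\\d{4}-\\d{2}-\\d{2})", 1))
```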