How to extract parameters from the PySpark reading process?


Question

I am reading data in CSV format with PySpark in this way:

```python
spark.read.format('csv').option('header', 'true').load(my_path)
```

where `my_path` is something like:

    s3://my_root/my_datasource/2012-01-01/
    s3://my_root/my_datasource/2012-01-02/
    s3://my_root/my_datasource/2012-01-03/
    s3://my_root/my_datasource/2012-01-04/
    ...
    s3://my_root/my_datasource/<until today>/

As you can see, there is a separate folder (S3 prefix) per day, named `YYYY-MM-DD`.
If I read the data above with:

    spark.read.format('csv').option('header', 'true').load('s3://my_root/my_datasource')

I will be able to process the whole dataset. However, when I do that, how can I extract the corresponding date (`2012-01-01`, `2012-01-02`, etc.)? It is not present in the data itself, only in the folder name.


# Answer 1
**Score**: 1

Use the **`regexp_extract`** function with the regular expression `"(\\d{4}-\\d{2}-\\d{2})"` to capture the date part.

To get the folder path of each input row, use Spark's **`input_file_name()`** function.

**Example**:

```python
from pyspark.sql.functions import col, input_file_name, regexp_extract

# Create a sample DataFrame whose column holds the folder paths
df = spark.createDataFrame(
    [('s3://my_root/my_datasource/2012-01-01/',),
     ('s3://my_root/my_datasource/2012-01-02/',)],
    ['folder_path'])

# input_file_name() returns the source-file path for rows read from
# files; for this in-memory sample it returns an empty string
df = df.withColumn("input_folder_path", input_file_name())

# Extract the YYYY-MM-DD segment of the path into a new column
df1 = df.withColumn("dt", regexp_extract(col("folder_path"), "(\\d{4}-\\d{2}-\\d{2})", 1))

df1.select("folder_path", "dt").show(10, False)
# +--------------------------------------+----------+
# |folder_path                           |dt        |
# +--------------------------------------+----------+
# |s3://my_root/my_datasource/2012-01-01/|2012-01-01|
# |s3://my_root/my_datasource/2012-01-02/|2012-01-02|
# +--------------------------------------+----------+
```
