# How to extract parameters from PySpark reading process?
# Question
I am reading data (CSV format) with PySpark in this way:
```python
spark.read.format('csv').option('header', 'true').load(my_path)
```
where `my_path` is something like:
```
s3://my_root/my_datasource/2012-01-01/
s3://my_root/my_datasource/2012-01-02/
s3://my_root/my_datasource/2012-01-03/
s3://my_root/my_datasource/2012-01-04/
...
s3://my_root/my_datasource/<until today>/
```
As you can see, there is a separate folder per day, named in `YYYY-MM-DD` format.
If I read the data above with:
```python
spark.read.format('csv').option('header', 'true').load('s3://my_root/my_datasource')
```
I will be able to process the whole dataset. However, when I do that, how can I extract the corresponding date (`2012-01-01`, `2012-01-02`, etc.), given that it is not present in the data itself but only as the folder name?
# Answer 1
**Score**: 1
Use the **`regexp_extract`** function with the regular expression `"(\\d{4}-\\d{2}-\\d{2})"` to capture the date part.

To get the folder path, use Spark's **`input_file_name()`** function.

**Example:**
```python
from pyspark.sql.functions import col, input_file_name, regexp_extract

# sample dataframe; the paths are stored in a column here to simulate them.
# On data actually loaded from files, input_file_name() returns the source
# path for each row (on this in-memory sample it returns an empty string).
df = spark.createDataFrame(
    [('s3://my_root/my_datasource/2012-01-01/',),
     ('s3://my_root/my_datasource/2012-01-02/',)],
    ['folder_path'])
df = df.withColumn("input_folder_path", input_file_name())

# extract the YYYY-MM-DD part of the path into a dt column
df1 = df.withColumn("dt", regexp_extract(col("folder_path"), "(\\d{4}-\\d{2}-\\d{2})", 1))
df1.select("folder_path", "dt").show(10, False)
# +--------------------------------------+----------+
# |folder_path                           |dt        |
# +--------------------------------------+----------+
# |s3://my_root/my_datasource/2012-01-01/|2012-01-01|
# |s3://my_root/my_datasource/2012-01-02/|2012-01-02|
# +--------------------------------------+----------+
```
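
Applied to the question's setup, a minimal sketch could combine the two functions directly on the loaded data (the `s3://my_root/my_datasource` location is taken from the question; this assumes the date appears exactly once in each file's path):

```python
from pyspark.sql.functions import input_file_name, regexp_extract

# read every daily folder at once, then recover the date from each row's source path
df = spark.read.format('csv').option('header', 'true').load('s3://my_root/my_datasource')
df = df.withColumn("dt", regexp_extract(input_file_name(), "(\\d{4}-\\d{2}-\\d{2})", 1))
```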