How to read selected partitions in Spark
I have a table partitioned on the ldt column (format YYYY-MM-dd-HH-mm-ss), and the following are its partitions.
SHOW partitions test.emp_table;
partition
ldt=2023-02-26-00-47-01
ldt=2023-02-26-01-27-40
ldt=2023-02-26-23-48-06
How can I read from a subset of those partitions using wildcards, i.e. load all partitions for a particular day or hour? The following isn't working:
select * from test.emp_table where ldt = "2023-02-26-%"
Answer 1
Score: 1
I would suggest using comparison operators:
select * from test.emp_table where ldt >= '2023-02-26' and ldt < '2023-02-27'
Alternatively, when using the Python API, it may be convenient to generate a list of partition values and use isin:
emp_table.filter(F.col('ldt').isin(dates_list))
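A minimal sketch of how dates_list could be built for one day from the output of SHOW PARTITIONS. The table and column names come from the question; collecting the partition list into Python before filtering is an assumption about the workflow, and the helper partitions_for_day is hypothetical.

```python
def partitions_for_day(partition_rows, day):
    """Parse 'ldt=<value>' rows and keep values whose date prefix matches."""
    values = [row.split("=", 1)[1] for row in partition_rows]
    return [v for v in values if v.startswith(day)]

# Partition strings as shown by SHOW PARTITIONS in the question
rows = [
    "ldt=2023-02-26-00-47-01",
    "ldt=2023-02-26-01-27-40",
    "ldt=2023-02-26-23-48-06",
]

dates_list = partitions_for_day(rows, "2023-02-26")
# dates_list == ['2023-02-26-00-47-01', '2023-02-26-01-27-40', '2023-02-26-23-48-06']

# In Spark, the rows would come from the catalog and feed the filter:
#   rows = [r.partition for r in spark.sql("SHOW PARTITIONS test.emp_table").collect()]
#   emp_table.filter(F.col("ldt").isin(dates_list))
```

Because isin compares exact values, the list must contain the full partition strings; the prefix matching happens while building the list, not in the filter itself.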
Answer 2
Score: 1
Are you open to using the PySpark API? If so, you can read files using a glob filter.
Relevant documentation:
https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#path-global-filter
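A sketch of the glob approach, assuming the table's files sit under partition directories named ldt=<value> (the base path and Parquet format below are assumptions). The fnmatch check only illustrates which partition directories the pattern selects; the actual read happens in Spark.

```python
from fnmatch import fnmatch

# Partition directory names, as listed in the question
partition_dirs = [
    "ldt=2023-02-26-00-47-01",
    "ldt=2023-02-26-01-27-40",
    "ldt=2023-02-26-23-48-06",
]

# Glob pattern selecting every partition of 2023-02-26
pattern = "ldt=2023-02-26-*"
matched = [d for d in partition_dirs if fnmatch(d, pattern)]
# matched contains all three directories above

# With PySpark, the pattern can go directly into the load path
# (hypothetical base path, assuming Parquet files):
#   df = spark.read.parquet("/data/test.db/emp_table/" + pattern)
# The pathGlobFilter option from the linked docs filters file names
# during listing and does not change partition discovery itself.
```

Note that pathGlobFilter applies to file names rather than partition directories, so for selecting whole partitions the glob usually belongs in the path itself.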