How to read selected partitions of a table in Spark

Question

I have a table partitioned on the ldt column (format YYYY-MM-dd-HH-mm-ss), and the following are the partitions.

SHOW partitions test.emp_table;

partition
ldt=2023-02-26-00-47-01
ldt=2023-02-26-01-27-40
ldt=2023-02-26-23-48-06

How can I read from a subset of those partitions using wildcards? That is, how do I load all the partitions for a particular day or hour? The following doesn't work:

select * from test.emp_table where ldt = "2023-02-26-%"

Answer 1

Score: 1

I would suggest using comparison operators:

select * from test.emp_table where ldt >= '2023-02-26' and ldt < '2023-02-27'

Alternatively, when using the Python API, it may be convenient to generate a list of partition values and use the isin method:

emp_table.filter(F.col('ldt').isin(dates_list))

Answer 2

Score: 1

Are you open to using the PySpark API? If so, you can read the files using a glob filter.

Relevant documentation:

https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#path-global-filter
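Note that pathGlobFilter, the option in the linked documentation, filters individual file names and does not change partition discovery. To select whole partition directories, a Hadoop-style glob can instead go in the load path itself. A minimal sketch of that variant follows, assuming the table is stored as Parquet; the warehouse location /warehouse/test.db/emp_table is a hypothetical placeholder and day_df is an illustrative name.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Globs in load paths are expanded by Hadoop, so this reads every
# ldt=2023-02-26-* partition directory under the (hypothetical) table path.
# Setting basePath keeps ldt available as a column in the result.
day_df = (
    spark.read
    .option("basePath", "/warehouse/test.db/emp_table")
    .parquet("/warehouse/test.db/emp_table/ldt=2023-02-26-*")
)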

