How to read selected partitions of a table in Spark

Question

I have a table partitioned on the ldt column (format YYYY-MM-dd-HH-mm-ss), and the following are the partitions.

SHOW partitions test.emp_table;

partition
ldt=2023-02-26-00-47-01
ldt=2023-02-26-01-27-40
ldt=2023-02-26-23-48-06

How can I read from a subset of those partitions using wildcards? That is, how do I load all the partitions for a particular day or hour? The following doesn't work:

select * from test.emp_table where ldt = "2023-02-26-%"

Answer 1

Score: 1

I would suggest using comparison operators:

select * from test.emp_table where ldt >= '2023-02-26' and ldt < '2023-02-27'

Alternatively, when using the Python API, it may be convenient to generate a list of partition values and use the isin method:

emp_table.filter(F.col('ldt').isin(dates_list))

Answer 2

Score: 1

Are you open to using the PySpark API? If so, you can read the files using a glob filter.

Relevant documentation:

https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#path-global-filter
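Note that pathGlobFilter, the option in the linked documentation, filters individual file names and does not change partition discovery. To select whole partition directories, a Hadoop-style glob can instead go in the load path itself. A minimal sketch of that variant follows, assuming the table is stored as Parquet; the warehouse location /warehouse/test.db/emp_table is a hypothetical placeholder and day_df is an illustrative name.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Globs in load paths are expanded by Hadoop, so this reads every
# ldt=2023-02-26-* partition directory under the (hypothetical) table path.
# Setting basePath keeps ldt available as a column in the result.
day_df = (
    spark.read
    .option("basePath", "/warehouse/test.db/emp_table")
    .parquet("/warehouse/test.db/emp_table/ldt=2023-02-26-*")
)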

