How do I apply a filter on a map type column in a Pyarrow table while loading?

Question

I have a file written in the Delta Lake/Parquet format which has a map as one of its columns. The map stores various properties of the row entry in a "property_name": "property_value" format. I'd like to filter on a particular property stored in this map column, preferably before loading the table into memory via predicate pushdown, if available.

This was my attempt to solve the problem by accessing the key-value data through nested fields:

1. Code to create the example parquet file:

    import pyarrow.parquet as pq
    import pyarrow as pa
    import pandas as pd

    data = {'Name': ['Name1', 'Name2', 'Name3'],
            'Trial_Map': [{'a': 'a1', 'b': 'b1'}, {'a': 'a2', 'b': 'b2'}, {'a': 'a1', 'b': 'b3'}]}
    df = pd.DataFrame(data)

    schema = pa.schema([
        ('Name', pa.string()),
        ('Trial_Map', pa.map_(pa.string(), pa.string()))
    ])

    table = pa.Table.from_pandas(df, schema=schema)
    writer = pq.ParquetWriter("example.parquet", table.schema)
    writer.write_table(table)
    writer.close()
2. Code that attempts filtering while reading the file into a table:

    import pyarrow.parquet as pq
    import pyarrow.dataset as ds

    condition = ((ds.field("Trial_Map", "Trial_Map", "key") == "a") &
                 (ds.field("Trial_Map", "Trial_Map", "value") == "a1"))
    table = pq.read_table("example.parquet", filters=condition)
    print(table.schema)

However, this code gave me the following error:

    pyarrow.lib.ArrowNotImplementedError: Function 'struct_field' has no kernel matching input types (map<string, string ('Trial_Map')>)

I would appreciate any help in resolving this error, or pointers to other methods of performing this pre-loading filter. Thanks for your time!

Answer 1

Score: 0

PyArrow's predicate pushdown feature allows you to filter data during the Parquet file reading process, reducing the amount of data loaded into memory. However, it seems that PyArrow does not currently support predicate pushdown for nested fields such as map columns.

To work around this limitation, you can consider using Apache Spark with PySpark, which provides more advanced functionality for querying and manipulating nested data structures. Spark can evaluate filters on complex nested types like maps, and its optimizer pushes predicates down to the Parquet scan where possible.

Here's an example of how you can achieve the desired filtering using PySpark:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.getOrCreate()

    # Read the Parquet file into a DataFrame
    df = spark.read.parquet("example.parquet")

    # Explode the map column into individual key-value pairs; exploding a
    # map yields two columns, so the alias needs two names
    df_exploded = df.select("Name", explode("Trial_Map").alias("key", "value"))

    # Apply the filter condition on the exploded DataFrame
    filtered_df = df_exploded.filter((col("key") == "a") & (col("value") == "a1"))

    # Show the filtered results
    filtered_df.show()

    # You can also write the filtered DataFrame back to Parquet if needed
    filtered_df.write.parquet("filtered_example.parquet")

In this code, we first read the Parquet file into a DataFrame using PySpark. Then we explode the map column into separate key and value columns using the explode function. After that, we apply the desired filter condition on the exploded DataFrame using the filter function. Finally, we show the filtered results and optionally write them back to a Parquet file.

Using PySpark gives you more flexibility in handling complex data types and performing advanced operations on nested fields like maps.
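If you would rather keep each row's original map column instead of exploding it into key-value rows, you can look the key up in place. A sketch, reusing the df from above and assuming Spark 2.4+ where element_at accepts a map column:

    from pyspark.sql.functions import col, element_at

    # Look up the value stored under key "a" directly in the map column and
    # filter on it, leaving the rows and the map column intact; rows without
    # an "a" key yield null and are filtered out
    filtered_df = df.filter(element_at(col("Trial_Map"), "a") == "a1")
    filtered_df.show()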

Answer 2

Score: 0

PyArrow currently doesn't support directly selecting the values for a certain key using a nested field referenced (as you were trying with ds.field(&quot;Trial_Map&quot;, &quot;key&quot;)), but there is a compute function that allows selecting those values, i.e. "map_lookup".

If we can assume that each key occurs only once in each map element (i.e. no duplicates per row), we can define a filter like this:

    import pyarrow.compute as pc

    map_filter = pc.map_lookup(pc.field("Trial_Map"), pa.scalar("a"), "first") == "a1"

Using it on your example, you can see it working in practice:

    >>> pq.read_table("example.parquet").to_pandas()
        Name           Trial_Map
    0  Name1  [(a, a1), (b, b1)]
    1  Name2  [(a, a2), (b, b2)]
    2  Name3  [(a, a1), (b, b3)]
    >>> pq.read_table("example.parquet", filters=map_filter).to_pandas()
        Name           Trial_Map
    0  Name1  [(a, a1), (b, b1)]
    1  Name3  [(a, a1), (b, b3)]
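The same filter expression can also be passed to the dataset API directly, which is equivalent to the read_table call above. A sketch, reusing map_filter from before:

    import pyarrow.dataset as ds

    # Build a dataset over the file and apply the expression as a scan filter
    dataset = ds.dataset("example.parquet", format="parquet")
    filtered = dataset.to_table(filter=map_filter)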

As shown above, this only works when there is a single "a" key per map element, because I have to specify "first" to get the first value for key "a" in each element. If there can be multiple occurrences, you can specify "all" instead, but then you get a ListArray as the result. Applying the compute function directly on the table illustrates this:

    >>> pc.map_lookup(table["Trial_Map"].chunk(0), pa.scalar("a"), "first")
    <pyarrow.lib.StringArray object at 0x7f099d2b8040>
    [
      "a1",
      "a2",
      "a1"
    ]
    >>> pc.map_lookup(table["Trial_Map"].chunk(0), pa.scalar("a"), "all")
    <pyarrow.lib.ListArray object at 0x7f099d3134c0>
    [
      [
        "a1"
      ],
      [
        "a2"
      ],
      [
        "a1"
      ]
    ]

And then if we have this ListArray, the element-wise equality == "a1" doesn't work out of the box (there is an open enhancement request to add a function that checks whether a list contains some value: https://github.com/apache/arrow/issues/33295).
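Until such a function exists, one workaround is to flatten the ListArray and map the matches back to their parent rows. A sketch, assuming the single-chunk example table from the question; note this runs after the file has been read, so it is not predicate pushdown:

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    table = pq.read_table("example.parquet")

    # Collect *all* values for key "a": one list entry per row
    values = pc.map_lookup(table["Trial_Map"].chunk(0), pa.scalar("a"), "all")

    # Flatten the lists and record which row each flattened value came from
    flat = pc.list_flatten(values)
    parents = pc.list_parent_indices(values)

    # Keep the parent row indices whose value matches, then take those rows
    matching_rows = pc.unique(pc.filter(parents, pc.equal(flat, "a1")))
    filtered = table.take(matching_rows)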
