June 22, 2023 07:15:26

ChunkedArray.Index on secondary column of table not working | ArrowTypeError

Question

1: https://i.stack.imgur.com/2h3qb.png
2: https://i.stack.imgur.com/KWE3x.png
3: https://arrow.apache.org/docs/python/generated/pyarrow.ChunkedArray.html#pyarrow.ChunkedArray.index
4: https://i.stack.imgur.com/9TNnz.png

I am trying to implement a shortest path algorithm using pyarrow (first for unweighted graphs, then for weighted graphs).

I am stuck on the part where I need to verify whether the target node is among the neighbors of the current node.

My data looks like this [1]:

I have three columns: node, neighboring nodes and visited. The node column contains the name of each node in the graph. The neighboring nodes column contains an array of the names of the nodes that are directly connected to the node. The visited column contains a boolean value that indicates whether the node has been visited or not during a traversal algorithm.

In my example, I set the start node to 12160432. To obtain its neighboring nodes, I used the pc.filter function to retrieve the table circled in red below [2]:

The next step is to check whether we have reached the target node; otherwise I will have to check the neighbors of the neighbors of my current node.

To check whether the target is in the array, I wanted to use the ChunkedArray index function [3] as follows:

filtered_graph['neighboring_nodes'].index(10000001)

but I got the following error:
"ArrowTypeError: Could not convert 10000001 with type int: was not a sequence or recognized null for conversion to list type"

I also tried passing a pyarrow scalar instead:

target_node = pa.scalar(10000001, type=pa.int64())
filtered_graph['neighboring_nodes'].index(target_node)

but got the same error.

Note: When using the "node" column, the index function works as intended [4]:

(-1 means value not found)

I appreciate any guidance you can offer!

Answer 1

Score: 1

The index function is not implemented for arrays/chunked arrays of list type (e.g. pa.list_(pa.int64())):

>>> pa.chunked_array([pa.array([[1,2,3], [4,5,6]])]).index([1,2,3])
ArrowNotImplementedError: Function 'index' has no kernel matching input types (list<item: int64>)

Even if it worked, it would only let you search for a list of values that exactly matches the whole list stored in a row, whereas you are trying to check whether a single element is contained in a row's list.

You can achieve this, but it requires massaging the data a bit with pyarrow.compute:

import pyarrow as pa
import pyarrow.compute as pc
table = pa.table({'node': [1,4], 'neighbors': [[2,3], [5,6]]})
# One entry per neighbor, across all rows
flat_neighbors = pc.list_flatten(table['neighbors'])
# For each flattened neighbor, the position of the row it came from
flat_neighbors_index = pc.list_parent_indices(table['neighbors'])
# For each flattened neighbor, the node that owns it
flat_neightbors_parent = table['node'].take(flat_neighbors_index)

Which looks like this:

flat_neighbors	flat_neighbors_index	flat_neightbors_parent
2	0	1
3	0	1
5	1	4
6	1	4

And then you can look up nodes and find their parents:

parents_of_2 = flat_neightbors_parent.filter(
    pc.equal(flat_neighbors, 2)
)
parents_of_2.to_pylist() # Returns [1]

Answer 2

Score: 0

Thank you, 0x26res, for your response! Following your logic, I also found a neat little trick:

import pyarrow as pa
import pyarrow.compute as pc
table = pa.table({'node': [1,4], 'neighbors': [[2,3], [5,6]]})
flat_neighbors = pc.list_flatten(table['neighbors'])
values_to_check = pa.array([2])
mask = pc.is_in(flat_neighbors, value_set=values_to_check)
pc.any(mask)  # returns <pyarrow.BooleanScalar: True>
