ChunkedArray.Index on secondary column of table not working | ArrowTypeError

Question

Links referenced in the question:
[1]: https://i.stack.imgur.com/2h3qb.png
[2]: https://i.stack.imgur.com/KWE3x.png
[3]: https://arrow.apache.org/docs/python/generated/pyarrow.ChunkedArray.html#pyarrow.ChunkedArray.index
[4]: https://i.stack.imgur.com/9TNnz.png

I am trying to implement a shortest-path algorithm using pyarrow (first for unweighted graphs, then for weighted graphs).

I am stuck on the step where I need to check whether the target node is among the neighbors of the current node.

My data looks like this:
[1] (screenshot of the table)
I have three columns: node, neighboring_nodes and visited. The node column contains the name of each node in the graph. The neighboring_nodes column contains an array of the names of the nodes that are directly connected to that node. The visited column contains a boolean that indicates whether the node has already been visited during the traversal.

In my example, I set the start node to 12160432. To obtain its neighboring nodes, I used the pc.filter function to retrieve the part of the table circled in red below:

[2] (screenshot of the filtered table)

The next step is to check whether we have reached the target node; if not, I will have to check the neighbors of the neighbors of the current node.

To check whether the target is in the array, I wanted to use the ChunkedArray.index function [3], as follows:

filtered_graph['neighboring_nodes'].index(10000001)

but I got the following error:
"ArrowTypeError: Could not convert 10000001 with type int: was not a sequence or recognized null for conversion to list type"

I also tried passing the target as a pyarrow scalar:

target_node = pa.scalar(10000001, type=pa.int64())
filtered_graph['neighboring_nodes'].index(target_node)

but I got the same error.

Note: When using the "node" column, the index function works as intended:
[4] (screenshot of the result)
(-1 means the value was not found)

I would appreciate any guidance you can offer!

Answer 1

Score: 1

The index function is not implemented for arrays/chunked arrays of a list type (e.g. pa.list_(pa.int64())):

>>> pa.chunked_array([pa.array([[1,2,3], [4,5,6]])]).index([1,2,3])
ArrowNotImplementedError: Function 'index' has no kernel matching input types (list<item: int64>)

Even if it worked, it would only let you search for a row whose list exactly matches the list you pass in, whereas you are trying to check whether a single element is contained in a row's list.

You can achieve this, but it requires massaging the data a bit using pyarrow.compute:

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({'node': [1, 4], 'neighbors': [[2, 3], [5, 6]]})

# One entry per neighbor, with the list structure removed.
flat_neighbors = pc.list_flatten(table['neighbors'])
# For each flattened entry, the index of the row it came from.
flat_neighbors_index = pc.list_parent_indices(table['neighbors'])
# For each flattened entry, the node that owns it.
flat_neighbors_parent = table['node'].take(flat_neighbors_index)

Which looks like this:

flat_neighbors  flat_neighbors_index  flat_neighbors_parent
2               0                     1
3               0                     1
5               1                     4
6               1                     4

And then you can look up nodes and find their parents:

parents_of_2 = flat_neighbors_parent.filter(
    pc.equal(flat_neighbors, 2)
)
parents_of_2.to_pylist()  # Returns [1]

Answer 2

Score: 0

Thank you, 0x26res, for your response!
Following your logic, I also found a neat little trick:

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({'node': [1, 4], 'neighbors': [[2, 3], [5, 6]]})

flat_neighbors = pc.list_flatten(table['neighbors'])
values_to_check = pa.array([2])

mask = pc.is_in(flat_neighbors, value_set=values_to_check)
pc.any(mask)  # returns <pyarrow.BooleanScalar: True>

Posted by huangapple on 2023-06-22 07:15:26
When reposting, please keep the link to this article: https://go.coder-hub.com/76527700.html