Filter on `list(Int64)` dtype in polars

Question

Say I have

In [20]: df = pl.DataFrame({'a': [[1,2,3], [1,4,2], [1,3,3]], 'b': [4,2,1]})

In [21]: df
Out[21]:
shape: (3, 2)
┌───────────┬─────┐
│ a         ┆ b   │
│ ---       ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪═════╡
│ [1, 2, 3] ┆ 4   │
│ [1, 4, 2] ┆ 2   │
│ [1, 3, 3] ┆ 1   │
└───────────┴─────┘

I'd like to keep only the rows where 'a' equals [1,2,3].

I've tried

In [23]: df.filter(pl.col('a')==[1,2,3])

but it raises

ArrowErrorException: NotYetImplemented("Casting from Int64 to LargeList(Field { name: \"item\", data_type: Int64, is_nullable: true, metadata: {} }) not supported")

Answer 1

Score: 2

Judging by the error, this comparison does not seem to be implemented yet.

However, you can write your own filter function that builds the condition one list position at a time and combines the results into a single boolean expression:

from functools import reduce
import polars as pl

def filterList(c: pl.Expr, l: list) -> pl.Expr:
    # AND together one comparison per position in l
    return reduce(lambda a, b: a & b, [c.list.get(idx) == item for idx, item in enumerate(l)])

Or, if you prefer an ordinary for loop:

def filterList(c: pl.Expr, l: list) -> pl.Expr:
    res = pl.lit(True)
    for idx, item in enumerate(l):
        res = res & (c.list.get(idx) == item)
    return res

Then simply call:

df.filter(filterList(pl.col('a'), [1,2,3]))

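This should give the right result even if some lists in the original dataframe are shorter than the target, because .get(idx) returns null for an out-of-range index (at least in the Polars version this was written against), and a null condition is treated as false by filter.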

Answer 2

Score: 2

You can hash the list column, hash a literal list, and then compare the two hashes (equal hashes do not strictly guarantee equal lists, although collisions are unlikely):

df.filter(pl.col('a').hash() == pl.lit([[1,2,3]]).hash())

huangapple
  • Posted on August 10, 2023, 20:31:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76875762.html