英文:
In a polars dataframe, filter a column of type list by another column of type list
问题
以下是您要翻译的内容:
我有一个 [tag:polars] 数据帧如下:
示例输入:
import polars as pl
d = pl.DataFrame()
df = d.with_columns(pl.lit([1]).alias('user_id'),
pl.lit([[1,2,3,4]]).alias('items'),
pl.lit([[3,4,5,6]]).alias('popular_items'))
┌─────────────┬─────────────┬───────────────┐
│ user_id ┆ items ┆ popular_items │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] │
╞═════════════╪═════════════╪═══════════════╡
│ 1 ┆[1, 2, 3, 4] ┆ [3, 4, 5, 6] │
└─────────────┴─────────────┴───────────────┘
我想要通过删除每个 user_id
中在 items
列中的任何项目来筛选 popular_items
列。
我一直在尝试让它工作,但由于各种问题一直没有成功。很可能是因为我过于复杂化了事情。
预期的输出应如下所示:
┌─────────────┬─────────────┬───────────────┬───────────┐
│ user_id ┆ items ┆ popular_items ┆ suggested │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] ┆ list[i64] │
╞═════════════╪═════════════╪═══════════════╪═══════════╡
│ 1 ┆ [1, 2, 3, 4]┆ [3, 4, 5, 6] ┆ [5, 6] │
└─────────────┴─────────────┴───────────────┴───────────┘
看起来解决方案应该很简单,但在尝试了不同的方法后,似乎一直不成功。
任何帮助将不胜感激!
英文:
I have a [tag:polars] dataframe as below:
Example input:
import polars as pl
d = pl.DataFrame()
df = d.with_columns(pl.lit([1]).alias('user_id'),
pl.lit([[1,2,3,4]]).alias('items'),
pl.lit([[3,4,5,6]]).alias('popular_items'))
┌─────────────┬─────────────┬───────────────┐
│ user_id ┆ items ┆ popular_items │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] │
╞═════════════╪═════════════╪═══════════════╡
│ 1 ┆[1, 2, 3, 4] ┆ [3, 4, 5, 6] │
└─────────────┴─────────────┴───────────────┘
I want to filter popular_items
column by removing any items that are in items
column for each user_id
I have been trying to get it to work but have been unsuccessful due to various issues. In all likelihood, I am probably overcomplicating things.
The expected output should be as follows:
┌─────────────┬─────────────┬───────────────┬───────────┐
│ user_id ┆ items ┆ popular_items ┆ suggested │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] ┆ list[i64] │
╞═════════════╪═════════════╪═══════════════╪═══════════╡
│ 1 ┆ [1, 2, 3, 4]┆ [3, 4, 5, 6] ┆ [5, 6] │
└─────────────┴─────────────┴───────────────┴───────────┘
It seems like the solution should be simple, but it seems to escape me after some time now trying different things.
Any help would be greatly appreciated!
答案1
得分: 1
我明白了,以下是您提供的代码的中文翻译部分:
我不确定是否有一种“简单”的方法来做到这一点。
您可以展开列表并使用`.is_in`来检查是否在其他列表中。
(df.explode("popular_items")
.with_columns(is_in = pl.col("popular_items").is_in("items").is_not()))
形状:(4,4)
┌─────────┬─────────────┬───────────────┬───────┐
│ user_id ┆ items ┆ popular_items ┆ is_in │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ i64 ┆ bool │
╞═════════╪═════════════╪═══════════════╪═══════╡
│ 1 ┆ [1, 2, … 4] ┆ 3 ┆ false │
│ 1 ┆ [1, 2, … 4] ┆ 4 ┆ false │
│ 1 ┆ [1, 2, … 4] ┆ 5 ┆ true │
│ 1 ┆ [1, 2, … 4] ┆ 6 ┆ true │
└─────────┴─────────────┴───────────────┴───────┘
然后,您可以使用.filter
筛选,转回列表,并进行.join
操作。
df.join(
(df.explode("popular_items")
.filter(pl.col("popular_items").is_in("items").is_not())
.groupby("user_id")
.agg(suggested="popular_items")),
on="user_id"
)
形状:(1,4)
┌─────────┬─────────────┬───────────────┬───────────┐
│ user_id ┆ items ┆ popular_items ┆ suggested │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] ┆ list[i64] │
╞═════════╪═════════════╪═══════════════╪═══════════╡
│ 1 ┆ [1, 2, … 4] ┆ [3, 4, … 6] ┆ [5, 6] │
└─────────┴─────────────┴───────────────┴───────────┘
也许更好的方法是展开两个列表并使用[反连接](https://pola-rs.github.io/polars-book/user-guide/transformations/joins/#anti-join)来计算“差异”。
difference = (
df.explode("popular_items")
.join(
df.explode("items"),
left_on=["user_id", "popular_items"],
right_on=["user_id", "items"],
how="anti"
)
.groupby("user_id", maintain_order=True)
.agg(suggested="popular_items")
)
df.join(difference, on="user_id")
希望这能帮助您理解这些代码的中文翻译部分。
<details>
<summary>英文:</summary>
I'm not sure if there is a "simple" way to do this.
You can explode the list and use `.is_in` to check if it is in the other.
(df.explode("popular_items")
.with_columns(is_in = pl.col("popular_items").is_in("items").is_not()))
shape: (4, 4)
┌─────────┬─────────────┬───────────────┬───────┐
│ user_id ┆ items ┆ popular_items ┆ is_in │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ i64 ┆ bool │
╞═════════╪═════════════╪═══════════════╪═══════╡
│ 1 ┆ [1, 2, … 4] ┆ 3 ┆ false │
│ 1 ┆ [1, 2, … 4] ┆ 4 ┆ false │
│ 1 ┆ [1, 2, … 4] ┆ 5 ┆ true │
│ 1 ┆ [1, 2, … 4] ┆ 6 ┆ true │
└─────────┴─────────────┴───────────────┴───────┘
Which you could `.filter`, turn back into a list, and `.join`.
df.join(
(df.explode("popular_items")
.filter(pl.col("popular_items").is_in("items").is_not())
.groupby("user_id")
.agg(suggested="popular_items")),
on="user_id"
)
shape: (1, 4)
┌─────────┬─────────────┬───────────────┬───────────┐
│ user_id ┆ items ┆ popular_items ┆ suggested │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] ┆ list[i64] │
╞═════════╪═════════════╪═══════════════╪═══════════╡
│ 1 ┆ [1, 2, … 4] ┆ [3, 4, … 6] ┆ [5, 6] │
└─────────┴─────────────┴───────────────┴───────────┘
Perhaps it's better to explode both lists and use an [Anti-join](https://pola-rs.github.io/polars-book/user-guide/transformations/joins/#anti-join) to compute the "difference".
difference = (
df.explode("popular_items")
.join(
df.explode("items"),
left_on=["user_id", "popular_items"],
right_on=["user_id", "items"],
how="anti"
)
.groupby("user_id", maintain_order=True)
.agg(suggested="popular_items")
)
df.join(difference, on="user_id")
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论