2023年5月7日 21:29:54go评论111阅读模式

英文:

In a polars dataframe, filter a column of type list by another column of type list

问题

以下是您要翻译的内容：

我有一个 [tag:polars] 数据帧如下：

示例输入：

import polars as pl
d = pl.DataFrame()
df = d.with_columns(pl.lit([1]).alias('user_id'), 
                    pl.lit([[1,2,3,4]]).alias('items'),
                    pl.lit([[3,4,5,6]]).alias('popular_items'))
┌─────────────┬─────────────┬───────────────┐
│ user_id     ┆ items       ┆ popular_items │
│ ---         ┆ ---         ┆ ---           │
│ i64         ┆ list[i64]   ┆ list[i64]     │
╞═════════════╪═════════════╪═══════════════╡
│ 1           ┆[1, 2, 3, 4] ┆ [3, 4, 5, 6]  │
└─────────────┴─────────────┴───────────────┘

我想要通过删除每个 user_id 中在 items 列中的任何项目来筛选 popular_items 列。

我一直在尝试让它工作，但由于各种问题一直没有成功。很可能是因为我过于复杂化了事情。

预期的输出应如下所示：

┌─────────────┬─────────────┬───────────────┬───────────┐
│ user_id     ┆ items       ┆ popular_items ┆ suggested │
│ ---         ┆ ---         ┆ ---           ┆ ---       │
│ i64         ┆ list[i64]   ┆ list[i64]     ┆ list[i64] │
╞═════════════╪═════════════╪═══════════════╪═══════════╡
│ 1           ┆ [1, 2, 3, 4]┆ [3, 4, 5, 6]  ┆ [5, 6]    │
└─────────────┴─────────────┴───────────────┴───────────┘

看起来解决方案应该很简单，但在尝试了不同的方法后，似乎一直不成功。

任何帮助将不胜感激！

英文:

I have a [tag:polars] dataframe as below:

Example input:

import polars as pl
d = pl.DataFrame()
df = d.with_columns(pl.lit([1]).alias(&#39;user_id&#39;), 
                    pl.lit([[1,2,3,4]]).alias(&#39;items&#39;),
                    pl.lit([[3,4,5,6]]).alias(&#39;popular_items&#39;))
┌─────────────┬─────────────┬───────────────┐
│ user_id     ┆ items       ┆ popular_items │
│ ---         ┆ ---         ┆ ---           │
│ i64         ┆ list[i64]   ┆ list[i64]     │
╞═════════════╪═════════════╪═══════════════╡
│ 1           ┆[1, 2, 3, 4] ┆ [3, 4, 5, 6]  │
└─────────────┴─────────────┴───────────────┘

I want to filter popular_items column by removing any items that are in items column for each user_id

I have been trying to get it to work but have been unsuccessful due to various issues. In all likelihood, I am probably overcomplicating things.

The expected output should be as follows:

┌─────────────┬─────────────┬───────────────┬───────────┐
│ user_id     ┆ items       ┆ popular_items ┆ suggested │
│ ---         ┆ ---         ┆ ---           ┆ ---       │
│ i64         ┆ list[i64]   ┆ list[i64]     ┆ list[i64] │
╞═════════════╪═════════════╪═══════════════╪═══════════╡
│ 1           ┆ [1, 2, 3, 4]┆ [3, 4, 5, 6]  ┆ [5, 6]    │
└─────────────┴─────────────┴───────────────┴───────────┘

It seems like the solution should be simple, but it seems to escape me after some time now trying different things.

Any help would be greatly appreciated!

答案1

得分: 1

我明白了，以下是您提供的代码的中文翻译部分：

我不确定是否有一种“简单”的方法来做到这一点。
您可以展开列表并使用`.is_in`来检查是否在其他列表中。

(df.explode("popular_items")
.with_columns(is_in = pl.col("popular_items").is_in("items").is_not()))


形状：（4，4）
┌─────────┬─────────────┬───────────────┬───────┐
│ user_id ┆ items       ┆ popular_items ┆ is_in │
│ ---     ┆ ---         ┆ ---           ┆ ---   │
│ i64     ┆ list[i64]   ┆ i64           ┆ bool  │
╞═════════╪═════════════╪═══════════════╪═══════╡
│ 1       ┆ [1, 2, … 4] ┆ 3             ┆ false │
│ 1       ┆ [1, 2, … 4] ┆ 4             ┆ false │
│ 1       ┆ [1, 2, … 4] ┆ 5             ┆ true  │
│ 1       ┆ [1, 2, … 4] ┆ 6             ┆ true  │
└─────────┴─────────────┴───────────────┴───────┘

然后，您可以使用.filter筛选，转回列表，并进行.join操作。

df.join(
   (df.explode("popular_items")
      .filter(pl.col("popular_items").is_in("items").is_not())
      .groupby("user_id")
      .agg(suggested="popular_items")),
   on="user_id"
)

形状：（1，4）
┌─────────┬─────────────┬───────────────┬───────────┐
│ user_id ┆ items ┆ popular_items ┆ suggested │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] ┆ list[i64] │
╞═════════╪═════════════╪═══════════════╪═══════════╡
│ 1 ┆ [1, 2, … 4] ┆ [3, 4, … 6] ┆ [5, 6] │
└─────────┴─────────────┴───────────────┴───────────┘


也许更好的方法是展开两个列表并使用[反连接](https://pola-rs.github.io/polars-book/user-guide/transformations/joins/#anti-join)来计算“差异”。

difference = (
df.explode("popular_items")
.join(
df.explode("items"),
left_on=["user_id", "popular_items"],
right_on=["user_id", "items"],
how="anti"
)
.groupby("user_id", maintain_order=True)
.agg(suggested="popular_items")
)

df.join(difference, on="user_id")


希望这能帮助您理解这些代码的中文翻译部分。
<details>
<summary>英文:</summary>
I&#39;m not sure if there is a &quot;simple&quot; way to do this.
You can explode the list and use `.is_in` to check if it is in the other.

(df.explode("popular_items")
.with_columns(is_in = pl.col("popular_items").is_in("items").is_not()))

shape: (4, 4)
┌─────────┬─────────────┬───────────────┬───────┐
│ user_id ┆ items ┆ popular_items ┆ is_in │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ i64 ┆ bool │
╞═════════╪═════════════╪═══════════════╪═══════╡
│ 1 ┆ [1, 2, … 4] ┆ 3 ┆ false │
│ 1 ┆ [1, 2, … 4] ┆ 4 ┆ false │
│ 1 ┆ [1, 2, … 4] ┆ 5 ┆ true │
│ 1 ┆ [1, 2, … 4] ┆ 6 ┆ true │
└─────────┴─────────────┴───────────────┴───────┘

Which you could `.filter`, turn back into a list, and `.join`.

df.join(
(df.explode("popular_items")
.filter(pl.col("popular_items").is_in("items").is_not())
.groupby("user_id")
.agg(suggested="popular_items")),
on="user_id"
)

shape: (1, 4)
┌─────────┬─────────────┬───────────────┬───────────┐
│ user_id ┆ items ┆ popular_items ┆ suggested │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] ┆ list[i64] │
╞═════════╪═════════════╪═══════════════╪═══════════╡
│ 1 ┆ [1, 2, … 4] ┆ [3, 4, … 6] ┆ [5, 6] │
└─────────┴─────────────┴───────────────┴───────────┘

Perhaps it&#39;s better to explode both lists and use an [Anti-join](https://pola-rs.github.io/polars-book/user-guide/transformations/joins/#anti-join) to compute the &quot;difference&quot;.

df.join(difference, on="user_id")


</details>

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在Polars数据框中，通过另一列类型为列表来筛选一个类型为列表的列。

问题

答案1

Python web-scraping issue: 无法从特定URL检索正文部分数据

Flask-Login在Dash应用中表现不如预期。

如何根据多列来将数据框的一列与另一列进行匹配？

在Python中转换时区

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。