2023年4月20日 07:26:27go评论76阅读模式

英文:

How to get an element index in a list column, if element is specified in a different column

问题

I have a dataframe, where one column a is a list, and another column b contains a value that's in a. I need to create the column c which contains index of the element in b in list a.

df = pl.DataFrame({'a': [[1, 2, 3], [4, 5, 2], [6, 2, 7]], 'b': [3, 4, 2]})
print(df)

所以结果的数据框看起来像下面这样：

shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a         ┆ b   ┆ a.index(b) │
│ ---       ┆ --- ┆ ---        │
│ list[i64] ┆ i64 ┆ i64        │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3   ┆ 2          │
│ [4, 5, 2] ┆ 4   ┆ 0          │
│ [6, 2, 7] ┆ 2   ┆ 1          │
└───────────┴─────┴────────────┘

所有 a 中的元素在同一行内都是唯一的，b 保证在 a 中。

英文:

I have a dataframe, where one column a is a list, and another column b contains a value that's in a. I need to create the column c which contains index of the element in b in list a

df = pl.DataFrame({&#39;a&#39;: [[1, 2, 3], [4, 5, 2], [6, 2, 7]], &#39;b&#39;: [3, 4, 2]})
print(df)
shape: (3, 2)
┌───────────┬─────┐
│ a         ┆ b   │
│ ---       ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪═════╡
│ [1, 2, 3] ┆ 3   │
│ [4, 5, 2] ┆ 4   │
│ [6, 2, 7] ┆ 2   │
└───────────┴─────┘

so resulting dataframe looks like following

shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a         ┆ b   ┆ a.index(b) │
│ ---       ┆ --- ┆ ---        │
│ list[i64] ┆ i64 ┆ i64        │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3   ┆ 2          │
│ [4, 5, 2] ┆ 4   ┆ 0          │
│ [6, 2, 7] ┆ 2   ┆ 1          │
└───────────┴─────┴────────────┘

all elements of a are unique within the row, and b is guaranteed to be in a.

答案1

得分: 1

This works but not sure how it scales performance-wise. I tried arr.eval but can't refer to another column in there, I tried exploding within one expression to use pl.arg_where or the like but couldn't figure out height mismatches, etc.

My answer makes a separate DataFrame with the answer by generating a row-count to do an eventual over expression and an explode of the DataFrame, so a with_context is needed overall (it's faster than a concat):

df.lazy().with_context(
    df.lazy()
    .with_row_count()
    .explode("a")
    .select(c=pl.arg_where(pl.col("a") == pl.col("b")).over("row_nr"))
).select("a", "b", "c").collect()

shape: (3, 3)
┌───────────┬─────┬─────┐
│ a         ┆ b   ┆ c   │
│ ---       ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ u32 │
╞═══════════╪═════╪═════╡
│ [1, 2, 3] ┆ 3   ┆ 2   │
│ [4, 5, 2] ┆ 4   ┆ 0   │
│ [6, 2, 7] ┆ 2   ┆ 1   │
└───────────┴─────┴─────┘

Open to improvements!

英文:

df.lazy().with_context(
    df.lazy()
    .with_row_count()
    .explode(&quot;a&quot;)
    .select(c=pl.arg_where(pl.col(&quot;a&quot;) == pl.col(&quot;b&quot;)).over(&quot;row_nr&quot;))
).select(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;).collect()

shape: (3, 3)
┌───────────┬─────┬─────┐
│ a         ┆ b   ┆ c   │
│ ---       ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ u32 │
╞═══════════╪═════╪═════╡
│ [1, 2, 3] ┆ 3   ┆ 2   │
│ [4, 5, 2] ┆ 4   ┆ 0   │
│ [6, 2, 7] ┆ 2   ┆ 1   │
└───────────┴─────┴─────┘

Open to improvements!

答案2

得分: 1

以下是您提供的内容的中文翻译：

我们想要创建两个辅助列。一个只是a的复制，另一个是a的索引值。然后，我们按照它们各自分列，然后过滤出b等于acopy的情况。最后，我们通过删除acopy来清理它

(
    df
        .with_columns(
            pl.col('a').alias('acopy'), 
            pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
        .explode(['acopy',"a.index(b)"])
        .filter(pl.col('acopy')==pl.col('b'))
        .drop('acopy')
)

形状：(3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3 ┆ 2 │
│ [4, 5, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘

我们可以使这个方法更加健壮，使其不再依赖于唯一性的保证（取第一个实例），并且允许b的值不在a中，同时返回一个包含空值的行。

所以，假设我们从以下数据开始：

df = pl.DataFrame({'a': [[1, 2, 4], [4, 4, 2], [6, 2, 7]], 'b': [3, 4, 2]})

然后我们可以这样做：

(
    df
        .with_row_count('i')
        .with_columns(
            pl.col('a').alias('acopy'), 
            pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
        .explode(['acopy',"a.index(b)"])
        .with_columns(
            (pl.when(pl.col('acopy')==pl.col('b'))
                .then(pl.col("a.index(b)"))
                .otherwise(None)).alias("a.index(b)")
            )
        .groupby('i', maintain_order=True)
        .agg(
            pl.exclude('a.index(b)').first(),
            pl.col('a.index(b)').min())
        .drop(['i', 'acopy'])
)

形状：(3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 4] ┆ 3 ┆ null │
│ [4, 4, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘

基本上，这将filter替换为了when.then.otherwise和groupby。为了使groupby正常工作，我们必须在开头创建一个行索引辅助列。

英文:

We want to make two helper columns. one is just a copy of a and the other is the index values of a. We then explode by each of them and then filter for when b==acopy. Lastly, we clean it up by dropping our acopy

(
    df
        .with_columns(
            pl.col(&#39;a&#39;).alias(&#39;acopy&#39;), 
            pl.arange(0,pl.col(&#39;a&#39;).arr.lengths()).alias(&quot;a.index(b)&quot;))
        .explode([&#39;acopy&#39;,&quot;a.index(b)&quot;])
        .filter(pl.col(&#39;acopy&#39;)==pl.col(&#39;b&#39;))
        .drop(&#39;acopy&#39;)
)
shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a         ┆ b   ┆ a.index(b) │
│ ---       ┆ --- ┆ ---        │
│ list[i64] ┆ i64 ┆ i64        │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3   ┆ 2          │
│ [4, 5, 2] ┆ 4   ┆ 0          │
│ [6, 2, 7] ┆ 2   ┆ 1          │
└───────────┴─────┴────────────┘

We can make this robust to dropping your guarantees about uniqueness (taking the first instance) and also allowing for the b value to not be in a while still returning the row with a null.

So, say we started with:

df = pl.DataFrame({&#39;a&#39;: [[1, 2, 4], [4, 4, 2], [6, 2, 7]], &#39;b&#39;: [3, 4, 2]})

then we could do

(
    df
        .with_row_count(&#39;i&#39;)
        .with_columns(
            pl.col(&#39;a&#39;).alias(&#39;acopy&#39;), 
            pl.arange(0,pl.col(&#39;a&#39;).arr.lengths()).alias(&quot;a.index(b)&quot;))
        .explode([&#39;acopy&#39;,&quot;a.index(b)&quot;])
        .with_columns(
            (pl.when(pl.col(&#39;acopy&#39;)==pl.col(&#39;b&#39;))
                .then(pl.col(&quot;a.index(b)&quot;))
                .otherwise(None)).alias(&quot;a.index(b)&quot;)
            )
        .groupby(&#39;i&#39;, maintain_order=True)
        .agg(
            pl.exclude(&#39;a.index(b)&#39;).first(),
            pl.col(&#39;a.index(b)&#39;).min())
        .drop([&#39;i&#39;, &#39;acopy&#39;])
)
shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a         ┆ b   ┆ a.index(b) │
│ ---       ┆ --- ┆ ---        │
│ list[i64] ┆ i64 ┆ i64        │
╞═══════════╪═════╪════════════╡
│ [1, 2, 4] ┆ 3   ┆ null       │
│ [4, 4, 2] ┆ 4   ┆ 0          │
│ [6, 2, 7] ┆ 2   ┆ 1          │
└───────────┴─────┴────────────┘

Essentially this trades the filter for a when.then.otherwise and groupby. For the groupby to work we have to create a row index helper column at the beginning.

答案3

得分: 1

以下是翻译好的部分：

有一个关于 arr.index() 的开放特性请求。
你可以使用 .explode 来展开列表，并使用 .join 来匹配它们。

df = df.with_row_count()
df.join(
   df.explode("a")
     .with_row_count("idx")
     .with_columns(pl.col("idx").cumcount().over("row_nr")),
   left_on=["row_nr", "b"],
   right_on=["row_nr", "a"],
   how="left"
)

形状：(3, 5)

┌────────┬───────────┬─────┬─────┬─────────┐
│ row_nr ┆ a         ┆ b   ┆ idx ┆ b_right │
│ ---    ┆ ---       ┆ --- ┆ --- ┆ ---     │
│ u32    ┆ list[i64] ┆ i64 ┆ u32 ┆ i64     │
╞════════╪═══════════╪═════╪═════╪═════════╡
│ 0      ┆ [1, 2, 3] ┆ 3   ┆ 2   ┆ 3       │
│ 1      ┆ [4, 5, 2] ┆ 4   ┆ 0   ┆ 4       │
│ 2      ┆ [6, 2, 7] ┆ 2   ┆ 1   ┆ 2       │
└────────┴───────────┴─────┴─────┴─────────┘

英文:

There is an open feature request for arr.index()

You can .explode the lists and use .join to match them.

df = df.with_row_count()
df.join(
   df.explode(&quot;a&quot;)
     .with_row_count(&quot;idx&quot;)
     .with_columns(pl.col(&quot;idx&quot;).cumcount().over(&quot;row_nr&quot;)),
   left_on=[&quot;row_nr&quot;, &quot;b&quot;],
   right_on=[&quot;row_nr&quot;, &quot;a&quot;],
   how=&quot;left&quot;
)

shape: (3, 5)
┌────────┬───────────┬─────┬─────┬─────────┐
│ row_nr ┆ a         ┆ b   ┆ idx ┆ b_right │
│ ---    ┆ ---       ┆ --- ┆ --- ┆ ---     │
│ u32    ┆ list[i64] ┆ i64 ┆ u32 ┆ i64     │
╞════════╪═══════════╪═════╪═════╪═════════╡
│ 0      ┆ [1, 2, 3] ┆ 3   ┆ 2   ┆ 3       │
│ 1      ┆ [4, 5, 2] ┆ 4   ┆ 0   ┆ 4       │
│ 2      ┆ [6, 2, 7] ┆ 2   ┆ 1   ┆ 2       │
└────────┴───────────┴─────┴─────┴─────────┘

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何在列表列中获取元素索引，如果元素在不同列中指定

问题

答案1

答案2

答案3

‘DataFrame’ 对象没有 ‘merge’ 属性

将一个Excel列的数据拆分成两列，使用Python。

pandas列的除法操作会返回多列。

Vectorize the assignment of a column in a pandas dataframe where a custom index has many rows and the column value is set using all rows in the index

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

发表评论