如何在列表列中获取元素索引,如果元素在不同列中指定

huangapple go评论55阅读模式
英文:

How to get an element index in a list column, if element is specified in a different column

问题

I have a dataframe, where one column a is a list, and another column b contains a value that's in a. I need to create the column c which contains index of the element in b in list a.

df = pl.DataFrame({'a': [[1, 2, 3], [4, 5, 2], [6, 2, 7]], 'b': [3, 4, 2]})
print(df)

所以结果的数据框看起来像下面这样:

shape: (3, 3)
┌───────────┬─────┬────────────┐
 a          b    a.index(b) 
 ---        ---  ---        
 list[i64]  i64  i64        
╞═══════════╪═════╪════════════╡
 [1, 2, 3]  3    2          
 [4, 5, 2]  4    0          
 [6, 2, 7]  2    1          
└───────────┴─────┴────────────┘

所有 a 中的元素在同一行内都是唯一的,b 保证在 a 中。

英文:

I have a dataframe, where one column a is a list, and another column b contains a value that's in a. I need to create the column c which contains index of the element in b in list a

df = pl.DataFrame({'a': [[1, 2, 3], [4, 5, 2], [6, 2, 7]], 'b': [3, 4, 2]})
print(df)

shape: (3, 2)
┌───────────┬─────┐
│ a         ┆ b   │
│ ---       ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪═════╡
│ [1, 2, 3] ┆ 3   │
│ [4, 5, 2] ┆ 4   │
│ [6, 2, 7] ┆ 2   │
└───────────┴─────┘

so resulting dataframe looks like following

shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a         ┆ b   ┆ a.index(b) │
│ ---       ┆ --- ┆ ---        │
│ list[i64] ┆ i64 ┆ i64        │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3   ┆ 2          │
│ [4, 5, 2] ┆ 4   ┆ 0          │
│ [6, 2, 7] ┆ 2   ┆ 1          │
└───────────┴─────┴────────────┘

all elements of a are unique within the row, and b is guaranteed to be in a.

答案1

得分: 1

This works but not sure how it scales performance-wise. I tried arr.eval but can't refer to another column in there, I tried exploding within one expression to use pl.arg_where or the like but couldn't figure out height mismatches, etc.

My answer makes a separate DataFrame with the answer by generating a row-count to do an eventual over expression and an explode of the DataFrame, so a with_context is needed overall (it's faster than a concat):

df.lazy().with_context(
    df.lazy()
    .with_row_count()
    .explode("a")
    .select(c=pl.arg_where(pl.col("a") == pl.col("b")).over("row_nr"))
).select("a", "b", "c").collect()
shape: (3, 3)
┌───────────┬─────┬─────┐
│ a         ┆ b   ┆ c   │
│ ---       ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ u32 │
╞═══════════╪═════╪═════╡
│ [1, 2, 3] ┆ 3   ┆ 2   │
│ [4, 5, 2] ┆ 4   ┆ 0   │
│ [6, 2, 7] ┆ 2   ┆ 1   │
└───────────┴─────┴─────┘

Open to improvements!

英文:

This works but not sure how it scales performance-wise. I tried arr.eval but can't refer to another column in there, I tried exploding within one expression to use pl.arg_where or the like but couldn't figure out height mismatches, etc.

My answer makes a separate DataFrame with the answer by generating a row-count to do an eventual over expression and an explode of the DataFrame, so a with_context is needed overall (it's faster than a concat):

df.lazy().with_context(
    df.lazy()
    .with_row_count()
    .explode("a")
    .select(c=pl.arg_where(pl.col("a") == pl.col("b")).over("row_nr"))
).select("a", "b", "c").collect()
shape: (3, 3)
┌───────────┬─────┬─────┐
│ a         ┆ b   ┆ c   │
│ ---       ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ u32 │
╞═══════════╪═════╪═════╡
│ [1, 2, 3] ┆ 3   ┆ 2   │
│ [4, 5, 2] ┆ 4   ┆ 0   │
│ [6, 2, 7] ┆ 2   ┆ 1   │
└───────────┴─────┴─────┘

Open to improvements!

答案2

得分: 1

以下是您提供的内容的中文翻译:

我们想要创建两个辅助列。一个只是a的复制,另一个是a的索引值。然后,我们按照它们各自分列,然后过滤出b等于acopy的情况。最后,我们通过删除acopy来清理它

(
    df
        .with_columns(
            pl.col('a').alias('acopy'), 
            pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
        .explode(['acopy',"a.index(b)"])
        .filter(pl.col('acopy')==pl.col('b'))
        .drop('acopy')
)

形状:(3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3 ┆ 2 │
│ [4, 5, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘

我们可以使这个方法更加健壮,使其不再依赖于唯一性的保证(取第一个实例),并且允许b的值不在a中,同时返回一个包含空值的行。

所以,假设我们从以下数据开始:

df = pl.DataFrame({'a': [[1, 2, 4], [4, 4, 2], [6, 2, 7]], 'b': [3, 4, 2]})

然后我们可以这样做:

(
    df
        .with_row_count('i')
        .with_columns(
            pl.col('a').alias('acopy'), 
            pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
        .explode(['acopy',"a.index(b)"])
        .with_columns(
            (pl.when(pl.col('acopy')==pl.col('b'))
                .then(pl.col("a.index(b)"))
                .otherwise(None)).alias("a.index(b)")
            )
        .groupby('i', maintain_order=True)
        .agg(
            pl.exclude('a.index(b)').first(),
            pl.col('a.index(b)').min())
        .drop(['i', 'acopy'])
)

形状:(3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 4] ┆ 3 ┆ null │
│ [4, 4, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘

基本上,这将filter替换为了when.then.otherwisegroupby。为了使groupby正常工作,我们必须在开头创建一个行索引辅助列。

英文:

We want to make two helper columns. one is just a copy of a and the other is the index values of a. We then explode by each of them and then filter for when b==acopy. Lastly, we clean it up by dropping our acopy

(
    df
        .with_columns(
            pl.col('a').alias('acopy'), 
            pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
        .explode(['acopy',"a.index(b)"])
        .filter(pl.col('acopy')==pl.col('b'))
        .drop('acopy')
)

shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a         ┆ b   ┆ a.index(b) │
│ ---       ┆ --- ┆ ---        │
│ list[i64] ┆ i64 ┆ i64        │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3   ┆ 2          │
│ [4, 5, 2] ┆ 4   ┆ 0          │
│ [6, 2, 7] ┆ 2   ┆ 1          │
└───────────┴─────┴────────────┘

We can make this robust to dropping your guarantees about uniqueness (taking the first instance) and also allowing for the b value to not be in a while still returning the row with a null.

So, say we started with:

df = pl.DataFrame({'a': [[1, 2, 4], [4, 4, 2], [6, 2, 7]], 'b': [3, 4, 2]})

then we could do

(
    df
        .with_row_count('i')
        .with_columns(
            pl.col('a').alias('acopy'), 
            pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
        .explode(['acopy',"a.index(b)"])
        .with_columns(
            (pl.when(pl.col('acopy')==pl.col('b'))
                .then(pl.col("a.index(b)"))
                .otherwise(None)).alias("a.index(b)")
            )
        .groupby('i', maintain_order=True)
        .agg(
            pl.exclude('a.index(b)').first(),
            pl.col('a.index(b)').min())
        .drop(['i', 'acopy'])
)

shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a         ┆ b   ┆ a.index(b) │
│ ---       ┆ --- ┆ ---        │
│ list[i64] ┆ i64 ┆ i64        │
╞═══════════╪═════╪════════════╡
│ [1, 2, 4] ┆ 3   ┆ null       │
│ [4, 4, 2] ┆ 4   ┆ 0          │
│ [6, 2, 7] ┆ 2   ┆ 1          │
└───────────┴─────┴────────────┘

Essentially this trades the filter for a when.then.otherwise and groupby. For the groupby to work we have to create a row index helper column at the beginning.

答案3

得分: 1

以下是翻译好的部分:

  • 有一个关于 arr.index()开放特性请求

  • 你可以使用 .explode 来展开列表,并使用 .join 来匹配它们。

df = df.with_row_count()

df.join(
   df.explode("a")
     .with_row_count("idx")
     .with_columns(pl.col("idx").cumcount().over("row_nr")),
   left_on=["row_nr", "b"],
   right_on=["row_nr", "a"],
   how="left"
)
  • 形状:(3, 5)
┌────────┬───────────┬─────┬─────┬─────────┐
│ row_nr ┆ a         ┆ b   ┆ idx ┆ b_right │
│ ---    ┆ ---       ┆ --- ┆ --- ┆ ---     │
│ u32    ┆ list[i64] ┆ i64 ┆ u32 ┆ i64     │
╞════════╪═══════════╪═════╪═════╪═════════╡
│ 0      ┆ [1, 2, 3] ┆ 3   ┆ 2   ┆ 3       │
│ 1      ┆ [4, 5, 2] ┆ 4   ┆ 0   ┆ 4       │
│ 2      ┆ [6, 2, 7] ┆ 2   ┆ 1   ┆ 2       │
└────────┴───────────┴─────┴─────┴─────────┘
英文:

There is an open feature request for arr.index()

You can .explode the lists and use .join to match them.

df = df.with_row_count()

df.join(
   df.explode("a")
     .with_row_count("idx")
     .with_columns(pl.col("idx").cumcount().over("row_nr")),
   left_on=["row_nr", "b"],
   right_on=["row_nr", "a"],
   how="left"
)
shape: (3, 5)
┌────────┬───────────┬─────┬─────┬─────────┐
│ row_nr ┆ a         ┆ b   ┆ idx ┆ b_right │
│ ---    ┆ ---       ┆ --- ┆ --- ┆ ---     │
│ u32    ┆ list[i64] ┆ i64 ┆ u32 ┆ i64     │
╞════════╪═══════════╪═════╪═════╪═════════╡
│ 0      ┆ [1, 2, 3] ┆ 3   ┆ 2   ┆ 3       │
│ 1      ┆ [4, 5, 2] ┆ 4   ┆ 0   ┆ 4       │
│ 2      ┆ [6, 2, 7] ┆ 2   ┆ 1   ┆ 2       │
└────────┴───────────┴─────┴─────┴─────────┘

huangapple
  • 本文由 发表于 2023年4月20日 07:26:27
  • 转载请务必保留本文链接:https://go.coder-hub.com/76059509.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定