如何在列表列中获取元素索引,如果元素在不同列中指定

huangapple go评论75阅读模式
英文:

How to get an element index in a list column, if element is specified in a different column

问题

I have a dataframe, where one column a is a list, and another column b contains a value that's in a. I need to create the column c which contains index of the element in b in list a.

  1. df = pl.DataFrame({'a': [[1, 2, 3], [4, 5, 2], [6, 2, 7]], 'b': [3, 4, 2]})
  2. print(df)

所以结果的数据框看起来像下面这样:

  1. shape: (3, 3)
  2. ┌───────────┬─────┬────────────┐
  3. a b a.index(b)
  4. --- --- ---
  5. list[i64] i64 i64
  6. ╞═══════════╪═════╪════════════╡
  7. [1, 2, 3] 3 2
  8. [4, 5, 2] 4 0
  9. [6, 2, 7] 2 1
  10. └───────────┴─────┴────────────┘

所有 a 中的元素在同一行内都是唯一的,b 保证在 a 中。

英文:

I have a dataframe, where one column a is a list, and another column b contains a value that's in a. I need to create the column c which contains index of the element in b in list a

  1. df = pl.DataFrame({'a': [[1, 2, 3], [4, 5, 2], [6, 2, 7]], 'b': [3, 4, 2]})
  2. print(df)
  3. shape: (3, 2)
  4. ┌───────────┬─────┐
  5. a b
  6. --- ---
  7. list[i64] i64
  8. ╞═══════════╪═════╡
  9. [1, 2, 3] 3
  10. [4, 5, 2] 4
  11. [6, 2, 7] 2
  12. └───────────┴─────┘

so resulting dataframe looks like following

  1. shape: (3, 3)
  2. ┌───────────┬─────┬────────────┐
  3. a b a.index(b)
  4. --- --- ---
  5. list[i64] i64 i64
  6. ╞═══════════╪═════╪════════════╡
  7. [1, 2, 3] 3 2
  8. [4, 5, 2] 4 0
  9. [6, 2, 7] 2 1
  10. └───────────┴─────┴────────────┘

all elements of a are unique within the row, and b is guaranteed to be in a.

答案1

得分: 1

This works but not sure how it scales performance-wise. I tried arr.eval but can't refer to another column in there, I tried exploding within one expression to use pl.arg_where or the like but couldn't figure out height mismatches, etc.

My answer makes a separate DataFrame with the answer by generating a row-count to do an eventual over expression and an explode of the DataFrame, so a with_context is needed overall (it's faster than a concat):

  1. df.lazy().with_context(
  2. df.lazy()
  3. .with_row_count()
  4. .explode("a")
  5. .select(c=pl.arg_where(pl.col("a") == pl.col("b")).over("row_nr"))
  6. ).select("a", "b", "c").collect()
  1. shape: (3, 3)
  2. ┌───────────┬─────┬─────┐
  3. a b c
  4. --- --- ---
  5. list[i64] i64 u32
  6. ╞═══════════╪═════╪═════╡
  7. [1, 2, 3] 3 2
  8. [4, 5, 2] 4 0
  9. [6, 2, 7] 2 1
  10. └───────────┴─────┴─────┘

Open to improvements!

英文:

This works but not sure how it scales performance-wise. I tried arr.eval but can't refer to another column in there, I tried exploding within one expression to use pl.arg_where or the like but couldn't figure out height mismatches, etc.

My answer makes a separate DataFrame with the answer by generating a row-count to do an eventual over expression and an explode of the DataFrame, so a with_context is needed overall (it's faster than a concat):

  1. df.lazy().with_context(
  2. df.lazy()
  3. .with_row_count()
  4. .explode("a")
  5. .select(c=pl.arg_where(pl.col("a") == pl.col("b")).over("row_nr"))
  6. ).select("a", "b", "c").collect()
  1. shape: (3, 3)
  2. ┌───────────┬─────┬─────┐
  3. a b c
  4. --- --- ---
  5. list[i64] i64 u32
  6. ╞═══════════╪═════╪═════╡
  7. [1, 2, 3] 3 2
  8. [4, 5, 2] 4 0
  9. [6, 2, 7] 2 1
  10. └───────────┴─────┴─────┘

Open to improvements!

答案2

得分: 1

以下是您提供的内容的中文翻译:

我们想要创建两个辅助列。一个只是a的复制,另一个是a的索引值。然后,我们按照它们各自分列,然后过滤出b等于acopy的情况。最后,我们通过删除acopy来清理它

  1. (
  2. df
  3. .with_columns(
  4. pl.col('a').alias('acopy'),
  5. pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
  6. .explode(['acopy',"a.index(b)"])
  7. .filter(pl.col('acopy')==pl.col('b'))
  8. .drop('acopy')
  9. )

形状:(3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3 ┆ 2 │
│ [4, 5, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘

我们可以使这个方法更加健壮,使其不再依赖于唯一性的保证(取第一个实例),并且允许b的值不在a中,同时返回一个包含空值的行。

所以,假设我们从以下数据开始:

  1. df = pl.DataFrame({'a': [[1, 2, 4], [4, 4, 2], [6, 2, 7]], 'b': [3, 4, 2]})

然后我们可以这样做:

  1. (
  2. df
  3. .with_row_count('i')
  4. .with_columns(
  5. pl.col('a').alias('acopy'),
  6. pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
  7. .explode(['acopy',"a.index(b)"])
  8. .with_columns(
  9. (pl.when(pl.col('acopy')==pl.col('b'))
  10. .then(pl.col("a.index(b)"))
  11. .otherwise(None)).alias("a.index(b)")
  12. )
  13. .groupby('i', maintain_order=True)
  14. .agg(
  15. pl.exclude('a.index(b)').first(),
  16. pl.col('a.index(b)').min())
  17. .drop(['i', 'acopy'])
  18. )

形状:(3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 4] ┆ 3 ┆ null │
│ [4, 4, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘

基本上,这将filter替换为了when.then.otherwisegroupby。为了使groupby正常工作,我们必须在开头创建一个行索引辅助列。

英文:

We want to make two helper columns. one is just a copy of a and the other is the index values of a. We then explode by each of them and then filter for when b==acopy. Lastly, we clean it up by dropping our acopy

  1. (
  2. df
  3. .with_columns(
  4. pl.col('a').alias('acopy'),
  5. pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
  6. .explode(['acopy',"a.index(b)"])
  7. .filter(pl.col('acopy')==pl.col('b'))
  8. .drop('acopy')
  9. )
  10. shape: (3, 3)
  11. ┌───────────┬─────┬────────────┐
  12. a b a.index(b)
  13. --- --- ---
  14. list[i64] i64 i64
  15. ╞═══════════╪═════╪════════════╡
  16. [1, 2, 3] 3 2
  17. [4, 5, 2] 4 0
  18. [6, 2, 7] 2 1
  19. └───────────┴─────┴────────────┘

We can make this robust to dropping your guarantees about uniqueness (taking the first instance) and also allowing for the b value to not be in a while still returning the row with a null.

So, say we started with:

  1. df = pl.DataFrame({'a': [[1, 2, 4], [4, 4, 2], [6, 2, 7]], 'b': [3, 4, 2]})

then we could do

  1. (
  2. df
  3. .with_row_count('i')
  4. .with_columns(
  5. pl.col('a').alias('acopy'),
  6. pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
  7. .explode(['acopy',"a.index(b)"])
  8. .with_columns(
  9. (pl.when(pl.col('acopy')==pl.col('b'))
  10. .then(pl.col("a.index(b)"))
  11. .otherwise(None)).alias("a.index(b)")
  12. )
  13. .groupby('i', maintain_order=True)
  14. .agg(
  15. pl.exclude('a.index(b)').first(),
  16. pl.col('a.index(b)').min())
  17. .drop(['i', 'acopy'])
  18. )
  19. shape: (3, 3)
  20. ┌───────────┬─────┬────────────┐
  21. a b a.index(b)
  22. --- --- ---
  23. list[i64] i64 i64
  24. ╞═══════════╪═════╪════════════╡
  25. [1, 2, 4] 3 null
  26. [4, 4, 2] 4 0
  27. [6, 2, 7] 2 1
  28. └───────────┴─────┴────────────┘

Essentially this trades the filter for a when.then.otherwise and groupby. For the groupby to work we have to create a row index helper column at the beginning.

答案3

得分: 1

以下是翻译好的部分:

  • 有一个关于 arr.index()开放特性请求

  • 你可以使用 .explode 来展开列表,并使用 .join 来匹配它们。

  1. df = df.with_row_count()
  2. df.join(
  3. df.explode("a")
  4. .with_row_count("idx")
  5. .with_columns(pl.col("idx").cumcount().over("row_nr")),
  6. left_on=["row_nr", "b"],
  7. right_on=["row_nr", "a"],
  8. how="left"
  9. )
  • 形状:(3, 5)
  1. ┌────────┬───────────┬─────┬─────┬─────────┐
  2. row_nr a b idx b_right
  3. --- --- --- --- ---
  4. u32 list[i64] i64 u32 i64
  5. ╞════════╪═══════════╪═════╪═════╪═════════╡
  6. 0 [1, 2, 3] 3 2 3
  7. 1 [4, 5, 2] 4 0 4
  8. 2 [6, 2, 7] 2 1 2
  9. └────────┴───────────┴─────┴─────┴─────────┘
英文:

There is an open feature request for arr.index()

You can .explode the lists and use .join to match them.

  1. df = df.with_row_count()
  2. df.join(
  3. df.explode("a")
  4. .with_row_count("idx")
  5. .with_columns(pl.col("idx").cumcount().over("row_nr")),
  6. left_on=["row_nr", "b"],
  7. right_on=["row_nr", "a"],
  8. how="left"
  9. )
  1. shape: (3, 5)
  2. ┌────────┬───────────┬─────┬─────┬─────────┐
  3. row_nr a b idx b_right
  4. --- --- --- --- ---
  5. u32 list[i64] i64 u32 i64
  6. ╞════════╪═══════════╪═════╪═════╪═════════╡
  7. 0 [1, 2, 3] 3 2 3
  8. 1 [4, 5, 2] 4 0 4
  9. 2 [6, 2, 7] 2 1 2
  10. └────────┴───────────┴─────┴─────┴─────────┘

huangapple
  • 本文由 发表于 2023年4月20日 07:26:27
  • 转载请务必保留本文链接:https://go.coder-hub.com/76059509.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定