英文:
How to get an element index in a list column, if element is specified in a different column
问题
I have a dataframe, where one column a
is a list, and another column b
contains a value that's in a
. I need to create the column c
which contains index of the element in b
in list a
.
df = pl.DataFrame({'a': [[1, 2, 3], [4, 5, 2], [6, 2, 7]], 'b': [3, 4, 2]})
print(df)
所以结果的数据框看起来像下面这样:
shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3 ┆ 2 │
│ [4, 5, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘
所有 a
中的元素在同一行内都是唯一的,b
保证在 a
中。
英文:
I have a dataframe, where one column a
is a list, and another column b
contains a value that's in a
. I need to create the column c
which contains index of the element in b
in list a
df = pl.DataFrame({'a': [[1, 2, 3], [4, 5, 2], [6, 2, 7]], 'b': [3, 4, 2]})
print(df)
shape: (3, 2)
┌───────────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪═════╡
│ [1, 2, 3] ┆ 3 │
│ [4, 5, 2] ┆ 4 │
│ [6, 2, 7] ┆ 2 │
└───────────┴─────┘
so resulting dataframe looks like following
shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3 ┆ 2 │
│ [4, 5, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘
all elements of a
are unique within the row, and b
is guaranteed to be in a
.
答案1
得分: 1
This works but not sure how it scales performance-wise. I tried arr.eval
but can't refer to another column in there, I tried exploding within one expression to use pl.arg_where
or the like but couldn't figure out height mismatches, etc.
My answer makes a separate DataFrame with the answer by generating a row-count to do an eventual over
expression and an explode
of the DataFrame, so a with_context
is needed overall (it's faster than a concat
):
df.lazy().with_context(
df.lazy()
.with_row_count()
.explode("a")
.select(c=pl.arg_where(pl.col("a") == pl.col("b")).over("row_nr"))
).select("a", "b", "c").collect()
shape: (3, 3)
┌───────────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ u32 │
╞═══════════╪═════╪═════╡
│ [1, 2, 3] ┆ 3 ┆ 2 │
│ [4, 5, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴─────┘
Open to improvements!
英文:
This works but not sure how it scales performance-wise. I tried arr.eval
but can't refer to another column in there, I tried exploding within one expression to use pl.arg_where
or the like but couldn't figure out height mismatches, etc.
My answer makes a separate DataFrame with the answer by generating a row-count to do an eventual over
expression and an explode
of the DataFrame, so a with_context
is needed overall (it's faster than a concat
):
df.lazy().with_context(
df.lazy()
.with_row_count()
.explode("a")
.select(c=pl.arg_where(pl.col("a") == pl.col("b")).over("row_nr"))
).select("a", "b", "c").collect()
shape: (3, 3)
┌───────────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ u32 │
╞═══════════╪═════╪═════╡
│ [1, 2, 3] ┆ 3 ┆ 2 │
│ [4, 5, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴─────┘
Open to improvements!
答案2
得分: 1
以下是您提供的内容的中文翻译:
我们想要创建两个辅助列。一个只是a
的复制,另一个是a
的索引值。然后,我们按照它们各自分列,然后过滤出b
等于acopy
的情况。最后,我们通过删除acopy
来清理它
(
df
.with_columns(
pl.col('a').alias('acopy'),
pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
.explode(['acopy',"a.index(b)"])
.filter(pl.col('acopy')==pl.col('b'))
.drop('acopy')
)
形状:(3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3 ┆ 2 │
│ [4, 5, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘
我们可以使这个方法更加健壮,使其不再依赖于唯一性的保证(取第一个实例),并且允许b
的值不在a
中,同时返回一个包含空值的行。
所以,假设我们从以下数据开始:
df = pl.DataFrame({'a': [[1, 2, 4], [4, 4, 2], [6, 2, 7]], 'b': [3, 4, 2]})
然后我们可以这样做:
(
df
.with_row_count('i')
.with_columns(
pl.col('a').alias('acopy'),
pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
.explode(['acopy',"a.index(b)"])
.with_columns(
(pl.when(pl.col('acopy')==pl.col('b'))
.then(pl.col("a.index(b)"))
.otherwise(None)).alias("a.index(b)")
)
.groupby('i', maintain_order=True)
.agg(
pl.exclude('a.index(b)').first(),
pl.col('a.index(b)').min())
.drop(['i', 'acopy'])
)
形状:(3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 4] ┆ 3 ┆ null │
│ [4, 4, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘
基本上,这将filter
替换为了when.then.otherwise
和groupby
。为了使groupby
正常工作,我们必须在开头创建一个行索引辅助列。
英文:
We want to make two helper columns. one is just a copy of a
and the other is the index values of a
. We then explode by each of them and then filter for when b
==acopy
. Lastly, we clean it up by dropping our acopy
(
df
.with_columns(
pl.col('a').alias('acopy'),
pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
.explode(['acopy',"a.index(b)"])
.filter(pl.col('acopy')==pl.col('b'))
.drop('acopy')
)
shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 3] ┆ 3 ┆ 2 │
│ [4, 5, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘
We can make this robust to dropping your guarantees about uniqueness (taking the first instance) and also allowing for the b
value to not be in a
while still returning the row with a null.
So, say we started with:
df = pl.DataFrame({'a': [[1, 2, 4], [4, 4, 2], [6, 2, 7]], 'b': [3, 4, 2]})
then we could do
(
df
.with_row_count('i')
.with_columns(
pl.col('a').alias('acopy'),
pl.arange(0,pl.col('a').arr.lengths()).alias("a.index(b)"))
.explode(['acopy',"a.index(b)"])
.with_columns(
(pl.when(pl.col('acopy')==pl.col('b'))
.then(pl.col("a.index(b)"))
.otherwise(None)).alias("a.index(b)")
)
.groupby('i', maintain_order=True)
.agg(
pl.exclude('a.index(b)').first(),
pl.col('a.index(b)').min())
.drop(['i', 'acopy'])
)
shape: (3, 3)
┌───────────┬─────┬────────────┐
│ a ┆ b ┆ a.index(b) │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ i64 ┆ i64 │
╞═══════════╪═════╪════════════╡
│ [1, 2, 4] ┆ 3 ┆ null │
│ [4, 4, 2] ┆ 4 ┆ 0 │
│ [6, 2, 7] ┆ 2 ┆ 1 │
└───────────┴─────┴────────────┘
Essentially this trades the filter
for a when.then.otherwise
and groupby
. For the groupby
to work we have to create a row index helper column at the beginning.
答案3
得分: 1
以下是翻译好的部分:
-
有一个关于
arr.index()
的 开放特性请求。 -
你可以使用
.explode
来展开列表,并使用.join
来匹配它们。
df = df.with_row_count()
df.join(
df.explode("a")
.with_row_count("idx")
.with_columns(pl.col("idx").cumcount().over("row_nr")),
left_on=["row_nr", "b"],
right_on=["row_nr", "a"],
how="left"
)
- 形状:(3, 5)
┌────────┬───────────┬─────┬─────┬─────────┐
│ row_nr ┆ a ┆ b ┆ idx ┆ b_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ list[i64] ┆ i64 ┆ u32 ┆ i64 │
╞════════╪═══════════╪═════╪═════╪═════════╡
│ 0 ┆ [1, 2, 3] ┆ 3 ┆ 2 ┆ 3 │
│ 1 ┆ [4, 5, 2] ┆ 4 ┆ 0 ┆ 4 │
│ 2 ┆ [6, 2, 7] ┆ 2 ┆ 1 ┆ 2 │
└────────┴───────────┴─────┴─────┴─────────┘
英文:
There is an open feature request for arr.index()
You can .explode
the lists and use .join
to match them.
df = df.with_row_count()
df.join(
df.explode("a")
.with_row_count("idx")
.with_columns(pl.col("idx").cumcount().over("row_nr")),
left_on=["row_nr", "b"],
right_on=["row_nr", "a"],
how="left"
)
shape: (3, 5)
┌────────┬───────────┬─────┬─────┬─────────┐
│ row_nr ┆ a ┆ b ┆ idx ┆ b_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ list[i64] ┆ i64 ┆ u32 ┆ i64 │
╞════════╪═══════════╪═════╪═════╪═════════╡
│ 0 ┆ [1, 2, 3] ┆ 3 ┆ 2 ┆ 3 │
│ 1 ┆ [4, 5, 2] ┆ 4 ┆ 0 ┆ 4 │
│ 2 ┆ [6, 2, 7] ┆ 2 ┆ 1 ┆ 2 │
└────────┴───────────┴─────┴─────┴─────────┘
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论