April 10, 2023 21:34:18

Mark rows of one dataframe based on values from another dataframe

Question

I have the following problem. Say I have two dataframes:

import polars as pl

df1 = pl.DataFrame({'a': range(10)})
df2 = pl.DataFrame({'b': [[1, 3], [5, 6], [8, 9]], 'tags': ['aa', 'bb', 'cc']})
print(df1)
print(df2)
shape: (10, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
│ 1   │
│ 2   │
│ 3   │
│ 4   │
│ 5   │
│ 6   │
│ 7   │
│ 8   │
│ 9   │
└─────┘
shape: (3, 2)
┌───────────┬──────┐
│ b         ┆ tags │
│ ---       ┆ ---  │
│ list[i64] ┆ str  │
╞═══════════╪══════╡
│ [1, 3]    ┆ aa   │
│ [5, 6]    ┆ bb   │
│ [8, 9]    ┆ cc   │
└───────────┴──────┘

I need to mark/tag rows in dataframe df1 based on the values in dataframe df2, so that I get the following dataframe:

print(pl.DataFrame({'a': range(10), 'tag': ['NA', 'aa', 'aa', 'aa', 'NA', 'bb', 'bb', 'NA', 'cc', 'cc']}))
shape: (10, 2)
┌─────┬─────┐
│ a   ┆ tag │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 0   ┆ NA  │
│ 1   ┆ aa  │
│ 2   ┆ aa  │
│ 3   ┆ aa  │
│ 4   ┆ NA  │
│ 5   ┆ bb  │
│ 6   ┆ bb  │
│ 7   ┆ NA  │
│ 8   ┆ cc  │
│ 9   ┆ cc  │
└─────┴─────┘

Each list in column b of df2 gives the inclusive start and end values of column a in df1 that should be tagged with the corresponding value in column tags.

Thanks

Answer 1

Score: 3

You could build each range with pl.arange, .explode it, and use a left join.

df1.join(
   # Expand each [start, end] pair in b into its full inclusive range,
   # then explode so there is one row per value
   df2.with_columns(
      pl.arange(pl.col("b").arr.first(), pl.col("b").arr.last() + 1)
   ).explode("b"),
   left_on="a",
   right_on="b",
   how="left"  # keep every row of df1; unmatched rows get null
)

shape: (10, 2)
┌─────┬──────┐
│ a   ┆ tags │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 0   ┆ null │
│ 1   ┆ aa   │
│ 2   ┆ aa   │
│ 3   ┆ aa   │
│ 4   ┆ null │
│ 5   ┆ bb   │
│ 6   ┆ bb   │
│ 7   ┆ null │
│ 8   ┆ cc   │
│ 9   ┆ cc   │
└─────┴──────┘
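
To make the logic of this first approach concrete, here is a minimal plain-Python sketch of the same idea: expand each inclusive range into individual keys, then look each value of a up against them. It deliberately avoids Polars so it does not depend on a particular library version (the snippet above reflects the API at the time of the answer; as far as I know, later Polars releases moved list operations from .arr to .list, so those calls may need adjusting). The helper name tag_by_explode is hypothetical and only for illustration.

```python
# Plain-Python sketch of "explode the ranges, then left join".
# The data mirrors df1 and df2 from the question; tag_by_explode is a
# hypothetical helper, not part of Polars.

def tag_by_explode(values, ranges, tags):
    # "Explode": one lookup entry per integer in each inclusive [start, end].
    lookup = {}
    for (start, end), tag in zip(ranges, tags):
        for key in range(start, end + 1):
            # Assumption: with overlapping ranges a real join would emit one
            # row per match; this sketch simply keeps the first tag seen.
            lookup.setdefault(key, tag)
    # "Left join": every value is kept; unmatched values get None (null).
    return [(v, lookup.get(v)) for v in values]


a = list(range(10))
b = [[1, 3], [5, 6], [8, 9]]
tags = ["aa", "bb", "cc"]

result = tag_by_explode(a, b, tags)
for row in result:
    print(row)
```

The cost of this approach grows with the total width of the ranges, since every covered value becomes its own row (or dictionary entry here), which is worth keeping in mind for very wide ranges.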

If the ranges don't overlap, another option is to reshape df2:

df2.with_columns(
   pl.col("b").arr.to_struct()
     .struct.rename_fields(["start", "end"])
).unnest("b")

shape: (3, 3)
┌───────┬─────┬──────┐
│ start ┆ end ┆ tags │
│ ---   ┆ --- ┆ ---  │
│ i64   ┆ i64 ┆ str  │
╞═══════╪═════╪══════╡
│ 1     ┆ 3   ┆ aa   │
│ 5     ┆ 6   ┆ bb   │
│ 8     ┆ 9   ┆ cc   │
└───────┴─────┴──────┘

And use .join_asof:

df1.join_asof(
   df2.with_columns(
      pl.col("b").arr.to_struct().struct.rename_fields(["start", "end"])
   ).unnest("b"),
   left_on="a",
   right_on="end",
   strategy="forward"  # match each a to the nearest range whose end is >= a
).with_columns(
   # Keep the tag only when a actually falls inside [start, end]
   pl.when(pl.col("a").is_between("start", "end"))
     .then(pl.col("tags"))
)

shape: (10, 4)
┌─────┬───────┬─────┬──────┐
│ a   ┆ start ┆ end ┆ tags │
│ --- ┆ ---   ┆ --- ┆ ---  │
│ i64 ┆ i64   ┆ i64 ┆ str  │
╞═════╪═══════╪═════╪══════╡
│ 0   ┆ 1     ┆ 3   ┆ null │
│ 1   ┆ 1     ┆ 3   ┆ aa   │
│ 2   ┆ 1     ┆ 3   ┆ aa   │
│ 3   ┆ 1     ┆ 3   ┆ aa   │
│ 4   ┆ 5     ┆ 6   ┆ null │
│ 5   ┆ 5     ┆ 6   ┆ bb   │
│ 6   ┆ 5     ┆ 6   ┆ bb   │
│ 7   ┆ 8     ┆ 9   ┆ null │
│ 8   ┆ 8     ┆ 9   ┆ cc   │
│ 9   ┆ 8     ┆ 9   ┆ cc   │
└─────┴───────┴─────┴──────┘
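
The forward as-of match followed by the is_between check can be sketched in plain Python as well, which also shows why the ranges must not overlap: each value gets exactly one candidate range (the one with the smallest end that is >= the value), so a second overlapping range would never be considered. This is only an illustration of the logic under that assumption, not how Polars implements join_asof; tag_by_asof is a hypothetical helper name.

```python
from bisect import bisect_left

# Plain-Python sketch of a forward as-of match on "end", followed by a
# containment check. tag_by_asof is a hypothetical helper, not a Polars API.

def tag_by_asof(values, intervals):
    # intervals: (start, end, tag) tuples, assumed non-overlapping.
    intervals = sorted(intervals, key=lambda row: row[1])
    ends = [end for _, end, _ in intervals]
    tagged = []
    for v in values:
        # Forward strategy: first interval whose end is >= v.
        i = bisect_left(ends, v)
        # Keep the tag only if v also lies at or after that interval's start.
        if i < len(intervals) and intervals[i][0] <= v:
            tagged.append((v, intervals[i][2]))
        else:
            tagged.append((v, None))
    return tagged


intervals = [(1, 3, "aa"), (5, 6, "bb"), (8, 9, "cc")]
result_asof = tag_by_asof(range(10), intervals)
for row in result_asof:
    print(row)
```

Unlike the explode approach, this keeps one row per range regardless of how wide the ranges are, at the price of requiring non-overlapping ranges.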
