Mark rows of one dataframe based on values from another dataframe

Question

I have the following problem. Say I have two dataframes:

    import polars as pl

    df1 = pl.DataFrame({'a': range(10)})
    df2 = pl.DataFrame({'b': [[1, 3], [5, 6], [8, 9]], 'tags': ['aa', 'bb', 'cc']})
    print(df1)
    print(df2)

    shape: (10, 1)
    ┌─────┐
    │ a   │
    │ --- │
    │ i64 │
    ╞═════╡
    │ 0   │
    │ 1   │
    │ 2   │
    │ 3   │
    │ 4   │
    │ 5   │
    │ 6   │
    │ 7   │
    │ 8   │
    │ 9   │
    └─────┘
    shape: (3, 2)
    ┌───────────┬──────┐
    │ b         ┆ tags │
    │ ---       ┆ ---  │
    │ list[i64] ┆ str  │
    ╞═══════════╪══════╡
    │ [1, 3]    ┆ aa   │
    │ [5, 6]    ┆ bb   │
    │ [8, 9]    ┆ cc   │
    └───────────┴──────┘

I need to tag rows in df1 based on the values in df2, so that I get the following dataframe:

    print(pl.DataFrame({'a': range(10), 'tag': ['NA', 'aa', 'aa', 'aa', 'NA', 'bb', 'bb', 'NA', 'cc', 'cc']}))

    shape: (10, 2)
    ┌─────┬─────┐
    │ a   ┆ tag │
    │ --- ┆ --- │
    │ i64 ┆ str │
    ╞═════╪═════╡
    │ 0   ┆ NA  │
    │ 1   ┆ aa  │
    │ 2   ┆ aa  │
    │ 3   ┆ aa  │
    │ 4   ┆ NA  │
    │ 5   ┆ bb  │
    │ 6   ┆ bb  │
    │ 7   ┆ NA  │
    │ 8   ┆ cc  │
    │ 9   ┆ cc  │
    └─────┴─────┘

Each list in column b of df2 gives the start and end values (inclusive) of column a in df1 that should be tagged with the corresponding value from column tags.

Thanks

Answer 1

Score: 3

You could build each range with pl.arange, .explode it into one row per value, and use a left join:

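    # Written for the Polars release current in April 2023. Later releases
    # renamed the list namespace from `.arr` to `.list` and provide
    # `pl.int_ranges` for per-row ranges.
    df1.join(
        df2.with_columns(
            # Replace each [start, end] pair with the full inclusive range.
            pl.arange(pl.col("b").arr.first(), pl.col("b").arr.last() + 1)
        ).explode("b"),  # one row per covered value
        left_on="a",
        right_on="b",
        how="left",  # rows of df1 without a match get a null tag
    )

    shape: (10, 2)
    ┌─────┬──────┐
    │ a   ┆ tags │
    │ --- ┆ ---  │
    │ i64 ┆ str  │
    ╞═════╪══════╡
    │ 0   ┆ null │
    │ 1   ┆ aa   │
    │ 2   ┆ aa   │
    │ 3   ┆ aa   │
    │ 4   ┆ null │
    │ 5   ┆ bb   │
    │ 6   ┆ bb   │
    │ 7   ┆ null │
    │ 8   ┆ cc   │
    │ 9   ┆ cc   │
    └─────┴──────┘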
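
To make the mechanics concrete, here is a minimal plain-Python sketch (no Polars involved) of what the explode-and-left-join approach amounts to: every inclusive [start, end] pair is expanded into the integers it covers, and each value of a then looks up its tag or falls back to None. The helper name `tag_by_expansion` is made up for this illustration and is not part of any library.

```python
def tag_by_expansion(values, ranges, tags):
    """Expand each inclusive [start, end] range, then look up every value."""
    lookup = {}
    for (start, end), tag in zip(ranges, tags):
        # Counterpart of building the per-row range and exploding it.
        for covered in range(start, end + 1):
            lookup[covered] = tag
    # Counterpart of the left join: values without a match get None.
    # Note: a dict keeps one tag per value, whereas a real left join
    # would duplicate rows if ranges overlapped.
    return [lookup.get(value) for value in values]


result = tag_by_expansion(range(10), [[1, 3], [5, 6], [8, 9]], ["aa", "bb", "cc"])
print(result)
```

The cost grows with the total width of the ranges, since every covered value is materialised before the lookup.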

If the ranges don't overlap, another option is to reshape df2:

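    df2.with_columns(
        # Turn each two-element list into a struct, then name its fields.
        # (`.arr` is the list namespace in the Polars release used here.)
        pl.col("b").arr.to_struct()
        .struct.rename_fields(["start", "end"])
    ).unnest("b")  # split the struct into separate start and end columns

    shape: (3, 3)
    ┌───────┬─────┬──────┐
    │ start ┆ end ┆ tags │
    │ ---   ┆ --- ┆ ---  │
    │ i64   ┆ i64 ┆ str  │
    ╞═══════╪═════╪══════╡
    │ 1     ┆ 3   ┆ aa   │
    │ 5     ┆ 6   ┆ bb   │
    │ 8     ┆ 9   ┆ cc   │
    └───────┴─────┴──────┘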

Then use .join_asof with strategy="forward", and keep the tag only where a falls inside the matched range:

    df1.join_asof(
        df2.with_columns(
            pl.col("b").arr.to_struct().struct.rename_fields(["start", "end"])
        ).unnest("b"),
        left_on="a",
        right_on="end",
        strategy="forward",  # match each a to the nearest end that is >= a
    ).with_columns(
        # Keep the tag only when a lies inside [start, end]; otherwise null.
        pl.when(pl.col("a").is_between("start", "end"))
        .then(pl.col("tags"))
    )

    shape: (10, 4)
    ┌─────┬───────┬─────┬──────┐
    │ a   ┆ start ┆ end ┆ tags │
    │ --- ┆ ---   ┆ --- ┆ ---  │
    │ i64 ┆ i64   ┆ i64 ┆ str  │
    ╞═════╪═══════╪═════╪══════╡
    │ 0   ┆ 1     ┆ 3   ┆ null │
    │ 1   ┆ 1     ┆ 3   ┆ aa   │
    │ 2   ┆ 1     ┆ 3   ┆ aa   │
    │ 3   ┆ 1     ┆ 3   ┆ aa   │
    │ 4   ┆ 5     ┆ 6   ┆ null │
    │ 5   ┆ 5     ┆ 6   ┆ bb   │
    │ 6   ┆ 5     ┆ 6   ┆ bb   │
    │ 7   ┆ 8     ┆ 9   ┆ null │
    │ 8   ┆ 8     ┆ 9   ┆ cc   │
    │ 9   ┆ 8     ┆ 9   ┆ cc   │
    └─────┴───────┴─────┴──────┘
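
The logic behind the forward as-of match can be sketched in plain Python, which may help show why the ranges must not overlap. This is only an illustration under the assumption that the intervals are sorted by their end value. The helper name `tag_by_asof` is hypothetical, and `bisect_left` from the standard library stands in for the as-of lookup.

```python
from bisect import bisect_left


def tag_by_asof(values, intervals):
    """intervals: (start, end, tag) tuples, sorted by end and non-overlapping."""
    ends = [end for _, end, _ in intervals]
    result = []
    for value in values:
        # Forward strategy: index of the first interval whose end is >= value.
        i = bisect_left(ends, value)
        if i == len(intervals):
            result.append(None)  # no interval ends at or after this value
            continue
        start, end, tag = intervals[i]
        # Counterpart of the when/then step: keep the tag only inside the range.
        result.append(tag if start <= value <= end else None)
    return result


intervals = [(1, 3, "aa"), (5, 6, "bb"), (8, 9, "cc")]
result = tag_by_asof(range(10), intervals)
print(result)
```

Each value is compared against exactly one candidate interval, the one with the nearest end at or after it. If ranges overlapped, a second interval that also covers the value but ends later would never be considered.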

Posted by huangapple on April 10, 2023, 21:34:18
When reposting, please keep the link to this article: https://go.coder-hub.com/75977591.html