使用Polars基于变量子集创建所有组合。

huangapple go评论102阅读模式
英文:

Create all combinations based on a subset of variables with Polars?

问题

以下是您提供的代码部分的翻译:

  1. import polars as pl
  2. df = pl.DataFrame(
  3. {
  4. "country": ["France", "France", "UK", "UK", "Spain"],
  5. "year": [2020, 2021, 2019, 2020, 2022],
  6. "value": [1, 2, 3, 4, 5],
  7. }
  8. )
  9. df
  10. shape: (5, 3)
  11. ┌─────────┬──────┬───────┐
  12. country year value
  13. --- --- ---
  14. str i64 i64
  15. ╞═════════╪══════╪═══════╡
  16. France 2020 1
  17. France 2021 2
  18. UK 2019 3
  19. UK 2020 4
  20. Spain 2022 5
  21. └─────────┴──────┴───────┘
  1. import time
  2. tic = time.perf_counter()
  3. (
  4. df
  5. .select("country")
  6. .unique()
  7. .join(df.select("year").unique(), how="cross")
  8. .join(df, how="left", on=["country", "year"])
  9. )
  10. toc = time.perf_counter()
  11. print(f"Lazy eval: {toc - tic:0.4f} seconds")
  12. shape: (36, 4)
  13. ┌───────┬─────────┬──────┬───────┐
  14. country year value
  15. --- --- ---
  16. str i64 i64
  17. ╞═════════╪══════╪═══════╡
  18. Spain 2021 null
  19. Spain 2022 5
  20. Spain 2019 null
  21. Spain 2020 null
  22. UK 2021 null
  23. UK 2022 null
  24. UK 2019 3
  25. UK 2020 4
  26. └───────┴─────────┴──────┴───────┘

希望这有助于您理解代码。如果您有任何其他问题,请随时提出。

英文:

I have a DataFrame that looks like this:

  1. import polars as pl
  2. df = pl.DataFrame(
  3. {
  4. "country": ["France", "France", "UK", "UK", "Spain"],
  5. "year": [2020, 2021, 2019, 2020, 2022],
  6. "value": [1, 2, 3, 4, 5],
  7. }
  8. )
  9. df
  10. shape: (5, 3)
  11. ┌─────────┬──────┬───────┐
  12. country year value
  13. --- --- ---
  14. str i64 i64
  15. ╞═════════╪══════╪═══════╡
  16. France 2020 1
  17. France 2021 2
  18. UK 2019 3
  19. UK 2020 4
  20. Spain 2022 5
  21. └─────────┴──────┴───────┘

I'd like to make a balanced panel by creating all country-year pairs. In R, I could use tidyr::complete() for this, but I didn't find a built-in way to do this in Polars. Is there something like this? If not, what would be the fastest way to mimick it?

Expected output:

  1. shape: (12, 3)
  2. ┌─────────┬──────┬───────┐
  3. country year value
  4. --- --- ---
  5. str i64 i64
  6. ╞═════════╪══════╪═══════╡
  7. France 2019 null
  8. France 2020 1
  9. France 2021 2
  10. France 2022 null
  11. UK 2019 3
  12. UK 2020 4
  13. UK 2021 null
  14. UK 2022 null
  15. Spain 2019 null
  16. Spain 2020 null
  17. Spain 2021 null
  18. Spain 2022 5
  19. └─────────┴──────┴───────┘


Edit: the example above is quite simple because it only has 2 vars to complete but it started being trickier with 3 vars and I don't see how to adapt the pivot() + melt():

  1. import polars as pl
  2. df = pl.DataFrame(
  3. {
  4. "orig": ["France", "France", "UK", "UK", "Spain"],
  5. "dest": ["Japan", "Vietnam", "Japan", "China", "China"],
  6. "year": [2020, 2021, 2019, 2020, 2022],
  7. "value": [1, 2, 3, 4, 5],
  8. }
  9. )
  10. df
  11. shape: (5, 4)
  12. ┌────────┬─────────┬──────┬───────┐
  13. orig dest year value
  14. --- --- --- ---
  15. str str i64 i64
  16. ╞════════╪═════════╪══════╪═══════╡
  17. France Japan 2020 1
  18. France Vietnam 2021 2
  19. UK Japan 2019 3
  20. UK China 2020 4
  21. Spain China 2022 5
  22. └────────┴─────────┴──────┴───────┘

While the original works, it is much slower than tidyr::complete() (66ms for Polars, 1.8ms for tidyr::complete()):

  1. import time
  2. tic = time.perf_counter()
  3. (
  4. df
  5. .select("orig")
  6. .unique()
  7. .join(df.select("dest").unique(), how="cross")
  8. .join(df.select("year").unique(), how="cross")
  9. .join(df, how="left", on=["country", "year"])
  10. )
  11. toc = time.perf_counter()
  12. print(f"Lazy eval: {toc - tic:0.4f} seconds")
  13. shape: (36, 4)
  14. ┌───────┬─────────┬──────┬───────┐
  15. orig dest year value
  16. --- --- --- ---
  17. str str i64 i64
  18. ╞═══════╪═════════╪══════╪═══════╡
  19. Spain Japan 2021 null
  20. Spain Japan 2022 null
  21. Spain Japan 2019 null
  22. Spain Japan 2020 null
  23. UK Vietnam 2021 null
  24. UK Vietnam 2022 null
  25. UK Vietnam 2019 null
  26. UK Vietnam 2020 null
  27. └───────┴─────────┴──────┴───────┘
  28. >>>
  29. >>> toc = time.perf_counter()
  30. >>> print(f"Lazy eval: {toc - tic:0.4f} seconds")
  31. Lazy eval: 0.0669 seconds

In R:

  1. test <- data.frame(
  2. orig = c("France", "France", "UK", "UK", "Spain"),
  3. dest = c("Japan", "Vietnam", "Japan", "China", "China"),
  4. year = c(2020, 2021, 2019, 2020, 2022),
  5. value = c(1, 2, 3, 4, 5)
  6. )
  7. bench::mark(
  8. test = tidyr::complete(test, orig, dest, year),
  9. iterations = 100
  10. )
  11. #> # A tibble: 1 × 6
  12. #> expression min median `itr/sec` mem_alloc `gc/sec`
  13. #> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
  14. #> 1 test 1.61ms 1.81ms 496. 4.6MB 10.1

答案1

得分: 2

也许有一个更简单的方法,但组合是唯一值的交叉连接。

  1. df.select('country').unique().join(
  2. df.select('year').unique(),
  3. how='cross'
  4. )
  1. shape: (12, 2)
  2. ┌─────────┬──────┐
  3. country year
  4. --- ---
  5. str i64
  6. ╞═════════╪══════╡
  7. UK 2021
  8. UK 2022
  9. UK 2019
  10. UK 2020
  11. Spain 2021
  12. Spain 2022
  13. Spain 2019
  14. Spain 2020
  15. France 2021
  16. France 2022
  17. France 2019
  18. France 2020
  19. └─────────┴──────┘

然后,您可以与原始数据进行左连接:

  1. df.select('country').unique().join(
  2. df.select('year').unique(),
  3. how='cross'
  4. ).join(df, how='left', on=['country', 'year'])
  1. shape: (12, 3)
  2. ┌─────────┬──────┬───────┐
  3. country year value
  4. --- --- ---
  5. str i64 i64
  6. ╞═════════╪══════╪═══════╡
  7. UK 2021 null
  8. UK 2019 3
  9. UK 2022 null
  10. UK 2020 4
  11. France 2021 2
  12. France 2019 null
  13. France 2022 null
  14. France 2020 1
  15. Spain 2021 null
  16. Spain 2019 null
  17. Spain 2022 5
  18. Spain 2020 null
  19. └─────────┴──────┴───────┘
英文:

Perhaps there is a simpler way but the combinations are a cross join of the unique values.

  1. df.select('country').unique().join(
  2. df.select('year').unique(),
  3. how = 'cross'
  4. )
  1. shape: (12, 2)
  2. ┌─────────┬──────┐
  3. country year
  4. --- ---
  5. str i64
  6. ╞═════════╪══════╡
  7. UK 2021
  8. UK 2022
  9. UK 2019
  10. UK 2020
  11. Spain 2021
  12. Spain 2022
  13. Spain 2019
  14. Spain 2020
  15. France 2021
  16. France 2022
  17. France 2019
  18. France 2020
  19. └─────────┴──────┘

Which you can left join with the original:

  1. df.select('country').unique().join(
  2. df.select('year').unique(),
  3. how = 'cross'
  4. ).join(df, how='left', on=['country', 'year'])
  1. shape: (12, 3)
  2. ┌─────────┬──────┬───────┐
  3. country year value
  4. --- --- ---
  5. str i64 i64
  6. ╞═════════╪══════╪═══════╡
  7. UK 2021 null
  8. UK 2019 3
  9. UK 2022 null
  10. UK 2020 4
  11. France 2021 2
  12. France 2019 null
  13. France 2022 null
  14. France 2020 1
  15. Spain 2021 null
  16. Spain 2019 null
  17. Spain 2022 5
  18. Spain 2020 null
  19. └─────────┴──────┴───────┘

答案2

得分: 2

以下是已翻译的代码部分:

  1. (
  2. df.select(pl.col(["orig", "dest", "year"]).unique().sort().implode())
  3. .explode("orig")
  4. .explode("dest")
  5. .explode("year")
  6. .join(df, how="left", on=["orig", "dest", "year"])
  7. )
英文:

As pointed out by @jqurious in the comments of their answer, it is faster to use .implode() and .explode() (it isn't faster with the small example I gave but I can clearly see the difference with larger data):

  1. (
  2. df.select(pl.col(["orig", "dest", "year"]).unique().sort().implode())
  3. .explode("orig")
  4. .explode("dest")
  5. .explode("year")
  6. .join(df, how="left", on=["orig", "dest", "year"])
  7. )

huangapple
  • 本文由 发表于 2023年7月4日 20:45:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/76612786.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定