使用Polars基于变量子集创建所有组合。

huangapple go评论74阅读模式
英文:

Create all combinations based on a subset of variables with Polars?

问题

以下是您提供的代码部分的翻译:

import polars as pl

df = pl.DataFrame(
    {
        "country": ["France", "France", "UK", "UK", "Spain"],
        "year": [2020, 2021, 2019, 2020, 2022],
        "value": [1, 2, 3, 4, 5],
    }
)

df

shape: (5, 3)
┌─────────┬──────┬───────┐
 country  year  value 
 ---      ---   ---   
 str      i64   i64   
╞═════════╪══════╪═══════╡
 France   2020  1     
 France   2021  2     
 UK       2019  3     
 UK       2020  4     
 Spain    2022  5     
└─────────┴──────┴───────┘
import time

tic = time.perf_counter()
(
    df
    .select("country")
    .unique()
    .join(df.select("year").unique(), how="cross")
    .join(df, how="left", on=["country", "year"])
)
toc = time.perf_counter()
print(f"Lazy eval: {toc - tic:0.4f} seconds")

shape: (36, 4)
┌───────┬─────────┬──────┬───────┐
 country  year  value 
 ---      ---   ---   
 str      i64   i64   
╞═════════╪══════╪═══════╡
 Spain    2021  null  
 Spain    2022  5     
 Spain    2019  null  
 Spain    2020  null  
                   
 UK       2021  null  
 UK       2022  null  
 UK       2019  3     
 UK       2020  4     
└───────┴─────────┴──────┴───────┘

希望这有助于您理解代码。如果您有任何其他问题,请随时提出。

英文:

I have a DataFrame that looks like this:

import polars as pl

df = pl.DataFrame(
    {
        "country": ["France", "France", "UK", "UK", "Spain"],
        "year": [2020, 2021, 2019, 2020, 2022],
        "value": [1, 2, 3, 4, 5],
    }
)

df

shape: (5, 3)
┌─────────┬──────┬───────┐
 country  year  value 
 ---      ---   ---   
 str      i64   i64   
╞═════════╪══════╪═══════╡
 France   2020  1     
 France   2021  2     
 UK       2019  3     
 UK       2020  4     
 Spain    2022  5     
└─────────┴──────┴───────┘

I'd like to make a balanced panel by creating all country-year pairs. In R, I could use tidyr::complete() for this, but I didn't find a built-in way to do this in Polars. Is there something like this? If not, what would be the fastest way to mimick it?

Expected output:

shape: (12, 3)
┌─────────┬──────┬───────┐
 country  year  value 
 ---      ---   ---   
 str      i64   i64   
╞═════════╪══════╪═══════╡
 France   2019  null  
 France   2020  1     
 France   2021  2     
 France   2022  null  
 UK       2019  3     
 UK       2020  4     
 UK       2021  null  
 UK       2022  null  
 Spain    2019  null  
 Spain    2020  null  
 Spain    2021  null  
 Spain    2022  5     
└─────────┴──────┴───────┘


Edit: the example above is quite simple because it only has 2 vars to complete but it started being trickier with 3 vars and I don't see how to adapt the pivot() + melt():

import polars as pl

df = pl.DataFrame(
    {
        "orig": ["France", "France", "UK", "UK", "Spain"],
        "dest": ["Japan", "Vietnam", "Japan", "China", "China"],
        "year": [2020, 2021, 2019, 2020, 2022],
        "value": [1, 2, 3, 4, 5],
    }
)
df

shape: (5, 4)
┌────────┬─────────┬──────┬───────┐
 orig    dest     year  value 
 ---     ---      ---   ---   
 str     str      i64   i64   
╞════════╪═════════╪══════╪═══════╡
 France  Japan    2020  1     
 France  Vietnam  2021  2     
 UK      Japan    2019  3     
 UK      China    2020  4     
 Spain   China    2022  5     
└────────┴─────────┴──────┴───────┘

While the original works, it is much slower than tidyr::complete() (66ms for Polars, 1.8ms for tidyr::complete()):

import time

tic = time.perf_counter()
(
    df
    .select("orig")
    .unique()
    .join(df.select("dest").unique(), how="cross")
    .join(df.select("year").unique(), how="cross")
    .join(df, how="left", on=["country", "year"])
)
toc = time.perf_counter()
print(f"Lazy eval: {toc - tic:0.4f} seconds")

shape: (36, 4)
┌───────┬─────────┬──────┬───────┐
 orig   dest     year  value 
 ---    ---      ---   ---   
 str    str      i64   i64   
╞═══════╪═════════╪══════╪═══════╡
 Spain  Japan    2021  null  
 Spain  Japan    2022  null  
 Spain  Japan    2019  null  
 Spain  Japan    2020  null  
                         
 UK     Vietnam  2021  null  
 UK     Vietnam  2022  null  
 UK     Vietnam  2019  null  
 UK     Vietnam  2020  null  
└───────┴─────────┴──────┴───────┘
>>>
>>> toc = time.perf_counter()
>>> print(f"Lazy eval: {toc - tic:0.4f} seconds")
Lazy eval: 0.0669 seconds

In R:

test <- data.frame(
  orig = c("France", "France", "UK", "UK", "Spain"),
  dest = c("Japan", "Vietnam", "Japan", "China", "China"),
  year = c(2020, 2021, 2019, 2020, 2022),
  value = c(1, 2, 3, 4, 5)
)

bench::mark(
  test = tidyr::complete(test, orig, dest, year),
  iterations = 100
)
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 test         1.61ms   1.81ms      496.     4.6MB     10.1

答案1

得分: 2

也许有一个更简单的方法,但组合是唯一值的交叉连接。

df.select('country').unique().join(
   df.select('year').unique(),
   how='cross'
)
shape: (12, 2)
┌─────────┬──────┐
│ country ┆ year │
│ ---     ┆ ---  │
│ str     ┆ i64  │
╞═════════╪══════╡
│ UK      ┆ 2021 │
│ UK      ┆ 2022 │
│ UK      ┆ 2019 │
│ UK      ┆ 2020 │
│ Spain   ┆ 2021 │
│ Spain   ┆ 2022 │
│ Spain   ┆ 2019 │
│ Spain   ┆ 2020 │
│ France  ┆ 2021 │
│ France  ┆ 2022 │
│ France  ┆ 2019 │
│ France  ┆ 2020 │
└─────────┴──────┘

然后,您可以与原始数据进行左连接:

df.select('country').unique().join(
   df.select('year').unique(),
   how='cross'
).join(df, how='left', on=['country', 'year'])
shape: (12, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ ---     ┆ ---  ┆ ---   │
│ str     ┆ i64  ┆ i64   │
╞═════════╪══════╪═══════╡
│ UK      ┆ 2021 ┆ null  │
│ UK      ┆ 2019 ┆ 3     │
│ UK      ┆ 2022 ┆ null  │
│ UK      ┆ 2020 ┆ 4     │
│ France  ┆ 2021 ┆ 2     │
│ France  ┆ 2019 ┆ null  │
│ France  ┆ 2022 ┆ null  │
│ France  ┆ 2020 ┆ 1     │
│ Spain   ┆ 2021 ┆ null  │
│ Spain   ┆ 2019 ┆ null  │
│ Spain   ┆ 2022 ┆ 5     │
│ Spain   ┆ 2020 ┆ null  │
└─────────┴──────┴───────┘
英文:

Perhaps there is a simpler way but the combinations are a cross join of the unique values.

df.select('country').unique().join(
df.select('year').unique(),
how = 'cross'
)
shape: (12, 2)
┌─────────┬──────┐
│ country ┆ year │
│ ---     ┆ ---  │
│ str     ┆ i64  │
╞═════════╪══════╡
│ UK      ┆ 2021 │
│ UK      ┆ 2022 │
│ UK      ┆ 2019 │
│ UK      ┆ 2020 │
│ Spain   ┆ 2021 │
│ Spain   ┆ 2022 │
│ Spain   ┆ 2019 │
│ Spain   ┆ 2020 │
│ France  ┆ 2021 │
│ France  ┆ 2022 │
│ France  ┆ 2019 │
│ France  ┆ 2020 │
└─────────┴──────┘

Which you can left join with the original:

df.select('country').unique().join(
df.select('year').unique(),
how = 'cross'
).join(df, how='left', on=['country', 'year'])
shape: (12, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ ---     ┆ ---  ┆ ---   │
│ str     ┆ i64  ┆ i64   │
╞═════════╪══════╪═══════╡
│ UK      ┆ 2021 ┆ null  │
│ UK      ┆ 2019 ┆ 3     │
│ UK      ┆ 2022 ┆ null  │
│ UK      ┆ 2020 ┆ 4     │
│ France  ┆ 2021 ┆ 2     │
│ France  ┆ 2019 ┆ null  │
│ France  ┆ 2022 ┆ null  │
│ France  ┆ 2020 ┆ 1     │
│ Spain   ┆ 2021 ┆ null  │
│ Spain   ┆ 2019 ┆ null  │
│ Spain   ┆ 2022 ┆ 5     │
│ Spain   ┆ 2020 ┆ null  │
└─────────┴──────┴───────┘

答案2

得分: 2

以下是已翻译的代码部分:

(
    df.select(pl.col(["orig", "dest", "year"]).unique().sort().implode())
    .explode("orig")
    .explode("dest")
    .explode("year")
    .join(df, how="left", on=["orig", "dest", "year"])
)
英文:

As pointed out by @jqurious in the comments of their answer, it is faster to use .implode() and .explode() (it isn't faster with the small example I gave but I can clearly see the difference with larger data):

(
    df.select(pl.col(["orig", "dest", "year"]).unique().sort().implode())
    .explode("orig")
    .explode("dest")
    .explode("year")
    .join(df, how="left", on=["orig", "dest", "year"])
)

huangapple
  • 本文由 发表于 2023年7月4日 20:45:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/76612786.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定