英文:
Create all combinations based on a subset of variables with Polars?
问题
以下是您提供的代码部分的翻译:
import polars as pl
df = pl.DataFrame(
{
"country": ["France", "France", "UK", "UK", "Spain"],
"year": [2020, 2021, 2019, 2020, 2022],
"value": [1, 2, 3, 4, 5],
}
)
df
shape: (5, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════════╪══════╪═══════╡
│ France ┆ 2020 ┆ 1 │
│ France ┆ 2021 ┆ 2 │
│ UK ┆ 2019 ┆ 3 │
│ UK ┆ 2020 ┆ 4 │
│ Spain ┆ 2022 ┆ 5 │
└─────────┴──────┴───────┘
import time
tic = time.perf_counter()
(
df
.select("country")
.unique()
.join(df.select("year").unique(), how="cross")
.join(df, how="left", on=["country", "year"])
)
toc = time.perf_counter()
print(f"Lazy eval: {toc - tic:0.4f} seconds")
shape: (36, 4)
┌───────┬─────────┬──────┬───────┐
│ country │ year │ value │
│ --- │ --- │ --- │
│ str │ i64 │ i64 │
╞═════════╪══════╪═══════╡
│ Spain │ 2021 │ null │
│ Spain │ 2022 │ 5 │
│ Spain │ 2019 │ null │
│ Spain │ 2020 │ null │
│ … │ … │ … │
│ UK │ 2021 │ null │
│ UK │ 2022 │ null │
│ UK │ 2019 │ 3 │
│ UK │ 2020 │ 4 │
└───────┴─────────┴──────┴───────┘
希望这有助于您理解代码。如果您有任何其他问题,请随时提出。
英文:
I have a DataFrame
that looks like this:
import polars as pl
df = pl.DataFrame(
{
"country": ["France", "France", "UK", "UK", "Spain"],
"year": [2020, 2021, 2019, 2020, 2022],
"value": [1, 2, 3, 4, 5],
}
)
df
shape: (5, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════════╪══════╪═══════╡
│ France ┆ 2020 ┆ 1 │
│ France ┆ 2021 ┆ 2 │
│ UK ┆ 2019 ┆ 3 │
│ UK ┆ 2020 ┆ 4 │
│ Spain ┆ 2022 ┆ 5 │
└─────────┴──────┴───────┘
I'd like to make a balanced panel by creating all country-year pairs. In R, I could use tidyr::complete()
for this, but I didn't find a built-in way to do this in Polars. Is there something like this? If not, what would be the fastest way to mimick it?
Expected output:
shape: (12, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════════╪══════╪═══════╡
│ France ┆ 2019 ┆ null │
│ France ┆ 2020 ┆ 1 │
│ France ┆ 2021 ┆ 2 │
│ France ┆ 2022 ┆ null │
│ UK ┆ 2019 ┆ 3 │
│ UK ┆ 2020 ┆ 4 │
│ UK ┆ 2021 ┆ null │
│ UK ┆ 2022 ┆ null │
│ Spain ┆ 2019 ┆ null │
│ Spain ┆ 2020 ┆ null │
│ Spain ┆ 2021 ┆ null │
│ Spain ┆ 2022 ┆ 5 │
└─────────┴──────┴───────┘
Edit: the example above is quite simple because it only has 2 vars to complete but it started being trickier with 3 vars and I don't see how to adapt the pivot()
+ melt()
:
import polars as pl
df = pl.DataFrame(
{
"orig": ["France", "France", "UK", "UK", "Spain"],
"dest": ["Japan", "Vietnam", "Japan", "China", "China"],
"year": [2020, 2021, 2019, 2020, 2022],
"value": [1, 2, 3, 4, 5],
}
)
df
shape: (5, 4)
┌────────┬─────────┬──────┬───────┐
│ orig ┆ dest ┆ year ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
╞════════╪═════════╪══════╪═══════╡
│ France ┆ Japan ┆ 2020 ┆ 1 │
│ France ┆ Vietnam ┆ 2021 ┆ 2 │
│ UK ┆ Japan ┆ 2019 ┆ 3 │
│ UK ┆ China ┆ 2020 ┆ 4 │
│ Spain ┆ China ┆ 2022 ┆ 5 │
└────────┴─────────┴──────┴───────┘
While the original works, it is much slower than tidyr::complete()
(66ms for Polars, 1.8ms for tidyr::complete()
):
import time
tic = time.perf_counter()
(
df
.select("orig")
.unique()
.join(df.select("dest").unique(), how="cross")
.join(df.select("year").unique(), how="cross")
.join(df, how="left", on=["country", "year"])
)
toc = time.perf_counter()
print(f"Lazy eval: {toc - tic:0.4f} seconds")
shape: (36, 4)
┌───────┬─────────┬──────┬───────┐
│ orig ┆ dest ┆ year ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
╞═══════╪═════════╪══════╪═══════╡
│ Spain ┆ Japan ┆ 2021 ┆ null │
│ Spain ┆ Japan ┆ 2022 ┆ null │
│ Spain ┆ Japan ┆ 2019 ┆ null │
│ Spain ┆ Japan ┆ 2020 ┆ null │
│ … ┆ … ┆ … ┆ … │
│ UK ┆ Vietnam ┆ 2021 ┆ null │
│ UK ┆ Vietnam ┆ 2022 ┆ null │
│ UK ┆ Vietnam ┆ 2019 ┆ null │
│ UK ┆ Vietnam ┆ 2020 ┆ null │
└───────┴─────────┴──────┴───────┘
>>>
>>> toc = time.perf_counter()
>>> print(f"Lazy eval: {toc - tic:0.4f} seconds")
Lazy eval: 0.0669 seconds
In R:
test <- data.frame(
orig = c("France", "France", "UK", "UK", "Spain"),
dest = c("Japan", "Vietnam", "Japan", "China", "China"),
year = c(2020, 2021, 2019, 2020, 2022),
value = c(1, 2, 3, 4, 5)
)
bench::mark(
test = tidyr::complete(test, orig, dest, year),
iterations = 100
)
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 test 1.61ms 1.81ms 496. 4.6MB 10.1
答案1
得分: 2
也许有一个更简单的方法,但组合是唯一值的交叉连接。
df.select('country').unique().join(
df.select('year').unique(),
how='cross'
)
shape: (12, 2)
┌─────────┬──────┐
│ country ┆ year │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════╪══════╡
│ UK ┆ 2021 │
│ UK ┆ 2022 │
│ UK ┆ 2019 │
│ UK ┆ 2020 │
│ Spain ┆ 2021 │
│ Spain ┆ 2022 │
│ Spain ┆ 2019 │
│ Spain ┆ 2020 │
│ France ┆ 2021 │
│ France ┆ 2022 │
│ France ┆ 2019 │
│ France ┆ 2020 │
└─────────┴──────┘
然后,您可以与原始数据进行左连接:
df.select('country').unique().join(
df.select('year').unique(),
how='cross'
).join(df, how='left', on=['country', 'year'])
shape: (12, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════════╪══════╪═══════╡
│ UK ┆ 2021 ┆ null │
│ UK ┆ 2019 ┆ 3 │
│ UK ┆ 2022 ┆ null │
│ UK ┆ 2020 ┆ 4 │
│ France ┆ 2021 ┆ 2 │
│ France ┆ 2019 ┆ null │
│ France ┆ 2022 ┆ null │
│ France ┆ 2020 ┆ 1 │
│ Spain ┆ 2021 ┆ null │
│ Spain ┆ 2019 ┆ null │
│ Spain ┆ 2022 ┆ 5 │
│ Spain ┆ 2020 ┆ null │
└─────────┴──────┴───────┘
英文:
Perhaps there is a simpler way but the combinations are a cross join of the unique values.
df.select('country').unique().join(
df.select('year').unique(),
how = 'cross'
)
shape: (12, 2)
┌─────────┬──────┐
│ country ┆ year │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════╪══════╡
│ UK ┆ 2021 │
│ UK ┆ 2022 │
│ UK ┆ 2019 │
│ UK ┆ 2020 │
│ Spain ┆ 2021 │
│ Spain ┆ 2022 │
│ Spain ┆ 2019 │
│ Spain ┆ 2020 │
│ France ┆ 2021 │
│ France ┆ 2022 │
│ France ┆ 2019 │
│ France ┆ 2020 │
└─────────┴──────┘
Which you can left join with the original:
df.select('country').unique().join(
df.select('year').unique(),
how = 'cross'
).join(df, how='left', on=['country', 'year'])
shape: (12, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════════╪══════╪═══════╡
│ UK ┆ 2021 ┆ null │
│ UK ┆ 2019 ┆ 3 │
│ UK ┆ 2022 ┆ null │
│ UK ┆ 2020 ┆ 4 │
│ France ┆ 2021 ┆ 2 │
│ France ┆ 2019 ┆ null │
│ France ┆ 2022 ┆ null │
│ France ┆ 2020 ┆ 1 │
│ Spain ┆ 2021 ┆ null │
│ Spain ┆ 2019 ┆ null │
│ Spain ┆ 2022 ┆ 5 │
│ Spain ┆ 2020 ┆ null │
└─────────┴──────┴───────┘
答案2
得分: 2
以下是已翻译的代码部分:
(
df.select(pl.col(["orig", "dest", "year"]).unique().sort().implode())
.explode("orig")
.explode("dest")
.explode("year")
.join(df, how="left", on=["orig", "dest", "year"])
)
英文:
As pointed out by @jqurious in the comments of their answer, it is faster to use .implode()
and .explode()
(it isn't faster with the small example I gave but I can clearly see the difference with larger data):
(
df.select(pl.col(["orig", "dest", "year"]).unique().sort().implode())
.explode("orig")
.explode("dest")
.explode("year")
.join(df, how="left", on=["orig", "dest", "year"])
)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论