Posted: 2023-07-04 20:45:01

Create all combinations based on a subset of variables with Polars?

Question

I have a DataFrame that looks like this:

import polars as pl
df = pl.DataFrame(
    {
        "country": ["France", "France", "UK", "UK", "Spain"],
        "year": [2020, 2021, 2019, 2020, 2022],
        "value": [1, 2, 3, 4, 5],
    }
)
df
shape: (5, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ ---     ┆ ---  ┆ ---   │
│ str     ┆ i64  ┆ i64   │
╞═════════╪══════╪═══════╡
│ France  ┆ 2020 ┆ 1     │
│ France  ┆ 2021 ┆ 2     │
│ UK      ┆ 2019 ┆ 3     │
│ UK      ┆ 2020 ┆ 4     │
│ Spain   ┆ 2022 ┆ 5     │
└─────────┴──────┴───────┘

I'd like to make a balanced panel by creating all country-year pairs. In R, I could use tidyr::complete() for this, but I couldn't find a built-in equivalent in Polars. Does one exist? If not, what would be the fastest way to mimic it?

Expected output:

shape: (12, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ ---     ┆ ---  ┆ ---   │
│ str     ┆ i64  ┆ i64   │
╞═════════╪══════╪═══════╡
│ France  ┆ 2019 ┆ null  │
│ France  ┆ 2020 ┆ 1     │
│ France  ┆ 2021 ┆ 2     │
│ France  ┆ 2022 ┆ null  │
│ UK      ┆ 2019 ┆ 3     │
│ UK      ┆ 2020 ┆ 4     │
│ UK      ┆ 2021 ┆ null  │
│ UK      ┆ 2022 ┆ null  │
│ Spain   ┆ 2019 ┆ null  │
│ Spain   ┆ 2020 ┆ null  │
│ Spain   ┆ 2021 ┆ null  │
│ Spain   ┆ 2022 ┆ 5     │
└─────────┴──────┴───────┘

Edit: the example above is simple because it has only 2 variables to complete. It gets trickier with 3 variables, and I don't see how to adapt the pivot() + melt() approach:

import polars as pl
df = pl.DataFrame(
    {
        "orig": ["France", "France", "UK", "UK", "Spain"],
        "dest": ["Japan", "Vietnam", "Japan", "China", "China"],
        "year": [2020, 2021, 2019, 2020, 2022],
        "value": [1, 2, 3, 4, 5],
    }
)
df
shape: (5, 4)
┌────────┬─────────┬──────┬───────┐
│ orig   ┆ dest    ┆ year ┆ value │
│ ---    ┆ ---     ┆ ---  ┆ ---   │
│ str    ┆ str     ┆ i64  ┆ i64   │
╞════════╪═════════╪══════╪═══════╡
│ France ┆ Japan   ┆ 2020 ┆ 1     │
│ France ┆ Vietnam ┆ 2021 ┆ 2     │
│ UK     ┆ Japan   ┆ 2019 ┆ 3     │
│ UK     ┆ China   ┆ 2020 ┆ 4     │
│ Spain  ┆ China   ┆ 2022 ┆ 5     │
└────────┴─────────┴──────┴───────┘

While the cross-join approach works, it is much slower than tidyr::complete() on this example (66 ms for Polars, 1.8 ms for tidyr::complete()):

import time
tic = time.perf_counter()
(
    df
    .select("orig")
    .unique()
    .join(df.select("dest").unique(), how="cross")
    .join(df.select("year").unique(), how="cross")
    # join back on all three key columns of this DataFrame
    .join(df, how="left", on=["orig", "dest", "year"])
)
toc = time.perf_counter()
print(f"Lazy eval: {toc - tic:0.4f} seconds")
shape: (36, 4)
┌───────┬─────────┬──────┬───────┐
│ orig  ┆ dest    ┆ year ┆ value │
│ ---   ┆ ---     ┆ ---  ┆ ---   │
│ str   ┆ str     ┆ i64  ┆ i64   │
╞═══════╪═════════╪══════╪═══════╡
│ Spain ┆ Japan   ┆ 2021 ┆ null  │
│ Spain ┆ Japan   ┆ 2022 ┆ null  │
│ Spain ┆ Japan   ┆ 2019 ┆ null  │
│ Spain ┆ Japan   ┆ 2020 ┆ null  │
│ …     ┆ …       ┆ …    ┆ …     │
│ UK    ┆ Vietnam ┆ 2021 ┆ null  │
│ UK    ┆ Vietnam ┆ 2022 ┆ null  │
│ UK    ┆ Vietnam ┆ 2019 ┆ null  │
│ UK    ┆ Vietnam ┆ 2020 ┆ null  │
└───────┴─────────┴──────┴───────┘
Lazy eval: 0.0669 seconds

In R:

test <- data.frame(
  orig = c("France", "France", "UK", "UK", "Spain"),
  dest = c("Japan", "Vietnam", "Japan", "China", "China"),
  year = c(2020, 2021, 2019, 2020, 2022),
  value = c(1, 2, 3, 4, 5)
)
bench::mark(
  test = tidyr::complete(test, orig, dest, year),
  iterations = 100
)
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 test         1.61ms   1.81ms      496.     4.6MB     10.1

Answer 1

Score: 2

Perhaps there is a simpler way, but the combinations are a cross join of the unique values of each column.

df.select('country').unique().join(
    df.select('year').unique(),
    how='cross'
)

shape: (12, 2)
┌─────────┬──────┐
│ country ┆ year │
│ ---     ┆ ---  │
│ str     ┆ i64  │
╞═════════╪══════╡
│ UK      ┆ 2021 │
│ UK      ┆ 2022 │
│ UK      ┆ 2019 │
│ UK      ┆ 2020 │
│ Spain   ┆ 2021 │
│ Spain   ┆ 2022 │
│ Spain   ┆ 2019 │
│ Spain   ┆ 2020 │
│ France  ┆ 2021 │
│ France  ┆ 2022 │
│ France  ┆ 2019 │
│ France  ┆ 2020 │
└─────────┴──────┘

You can then left join this with the original DataFrame:

df.select('country').unique().join(
    df.select('year').unique(),
    how='cross'
).join(df, how='left', on=['country', 'year'])

shape: (12, 3)
┌─────────┬──────┬───────┐
│ country ┆ year ┆ value │
│ ---     ┆ ---  ┆ ---   │
│ str     ┆ i64  ┆ i64   │
╞═════════╪══════╪═══════╡
│ UK      ┆ 2021 ┆ null  │
│ UK      ┆ 2019 ┆ 3     │
│ UK      ┆ 2022 ┆ null  │
│ UK      ┆ 2020 ┆ 4     │
│ France  ┆ 2021 ┆ 2     │
│ France  ┆ 2019 ┆ null  │
│ France  ┆ 2022 ┆ null  │
│ France  ┆ 2020 ┆ 1     │
│ Spain   ┆ 2021 ┆ null  │
│ Spain   ┆ 2019 ┆ null  │
│ Spain   ┆ 2022 ┆ 5     │
│ Spain   ┆ 2020 ┆ null  │
└─────────┴──────┴───────┘

Answer 2

Score: 2

As pointed out by @jqurious in the comments on their answer, it is faster to use .implode() and .explode(). It isn't faster on the small example I gave, but the difference is clear with larger data:

(
    df.select(pl.col(["orig", "dest", "year"]).unique().sort().implode())
    .explode("orig")
    .explode("dest")
    .explode("year")
    .join(df, how="left", on=["orig", "dest", "year"])
)
