How can I recreate the following queries using Polars syntax?

Question
I am currently trying to learn the Polars syntax by working through SQL/Pandas queries that I have done in the past.
Here is the SQL Query:
WITH fav_item_cte AS
(
SELECT
s.customer_id,
m.product_name,
COUNT(m.product_id) AS order_count,
DENSE_RANK() OVER(PARTITION BY s.customer_id ORDER BY COUNT(s.customer_id) DESC) AS rank
FROM dannys_diner.menu AS m
JOIN dannys_diner.sales AS s
ON m.product_id = s.product_id
GROUP BY s.customer_id, m.product_name
)
SELECT
customer_id,
product_name,
order_count
FROM fav_item_cte
WHERE rank = 1;
Here is the Pandas DataFrame after the initial inner join, to allow for an MRE:
import datetime
import pandas as pd

df = pd.DataFrame(
{
"customer_id": ['A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'C',
'C', 'C'],
"order_id": [
datetime.date(2021, 1, 1), datetime.date(2021, 1, 4),
datetime.date(2021, 1, 11), datetime.date(2021, 1, 1),
datetime.date(2021, 1, 7), datetime.date(2021, 1, 1),
datetime.date(2021, 1, 2), datetime.date(2021, 1, 10),
datetime.date(2021, 1, 11), datetime.date(2021, 1, 11),
datetime.date(2021, 1, 16), datetime.date(2021, 2, 1),
datetime.date(2021, 1, 1), datetime.date(2021, 1, 1),
datetime.date(2021, 1, 7)
],
"join_date": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
"product_name": ['sushi', 'sushi', 'sushi', 'curry', 'curry', 'curry', 'curry',
'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen',
'ramen'],
"price": [10, 10, 10, 15, 15, 15, 15, 12, 12, 12, 12, 12, 12, 12, 12],
}
)
Here is the pandas code that I used:
df = (
    df.groupby(["customer_id", "product_name"])
    # note: the MRE df above has no "product_id" column, so count "product_name" instead
    .agg(order_count=("product_name", "count"))
    .reset_index()
    .assign(
        rank=lambda df_: df_.groupby("customer_id")["order_count"].rank(
            method="dense", ascending=False
        )
    )
    .query("rank == 1")
    .sort_values(["customer_id", "product_name"])
)
And here is the output that I was seeking:
df = pd.DataFrame(
{
"customer_id": ['A', 'B', 'B', 'B', 'C'],
"product_name": ['ramen', 'curry', 'ramen', 'sushi', 'ramen'],
"order_count": [3, 2, 2, 2, 3],
"rank": [1, 1, 1, 1, 1]
}
)
Here is the Polars code that I have so far:
import polars as pl

dfp = pl.DataFrame(
{
"customer_id": ['A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'C',
'C', 'C'],
"order_id": [
datetime.date(2021, 1, 1), datetime.date(2021, 1, 4),
datetime.date(2021, 1, 11), datetime.date(2021, 1, 1),
datetime.date(2021, 1, 7), datetime.date(2021, 1, 1),
datetime.date(2021, 1, 2), datetime.date(2021, 1, 10),
datetime.date(2021, 1, 11), datetime.date(2021, 1, 11),
datetime.date(2021, 1, 16), datetime.date(2021, 2, 1),
datetime.date(2021, 1, 1), datetime.date(2021, 1, 1),
datetime.date(2021, 1, 7)
],
"join_date": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
"product_name": ['sushi', 'sushi', 'sushi', 'curry', 'curry', 'curry', 'curry',
'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen',
'ramen'],
"price": [10, 10, 10, 15, 15, 15, 15, 12, 12, 12, 12, 12, 12, 12, 12],
}
)
dfp = (
    dfp.groupby(["customer_id", "product_name"])
    .agg(order_count=pl.col("product_name").count())
    .sort(by=["customer_id", "order_count"], descending=[False, True])
    .with_columns(
        pl.col("order_count").rank(method="dense", descending=True).alias("rank")
    )
    .filter(pl.col("rank") == 1)
)
My Polars query doesn't account for/pick up on the repeated "customer B" orders, where three items are tied at two orders each. Here is that output, minus the `.filter(pl.col("rank") == 1)`:

| customer_id (str) | product_name (str) | order_count (u32) | rank (u32) |
|---|---|---|---|
| "A" | "ramen" | 3 | 1 |
| "A" | "curry" | 2 | 2 |
| "A" | "sushi" | 1 | 3 |
| "B" | "curry" | 2 | 2 |
| "B" | "sushi" | 2 | 2 |
| "B" | "ramen" | 2 | 2 |
| "C" | "ramen" | 3 | 1 |
Edit: Based on @jqurious's comment below, the following might be the best translation, but I am still curious whether it is the best way to do this in Polars.
dfp = (
    sales_pl.join(menu_pl, on="product_id")
    .groupby(["customer_id", "product_name"])
    .agg(order_count=pl.col("product_name").count())
    .sort(by=["customer_id", "order_count"], descending=[False, True])
    .with_columns(
        pl.col("order_count")
        .rank(method="dense", descending=True)
        .over("customer_id")
        .alias("rank")
    )
    .filter(pl.col("rank") == 1)
)
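One simplification that I think should be safe (a sketch, untested): since `.rank(...).over(...)` depends only on the values and not on row order, the `.sort` only affects display order and could move to the end:
# Hedged sketch of the same query: rank first, sort last. Assumes
# rank(method="dense") is order-independent, which I believe it is.
dfp = (
    sales_pl.join(menu_pl, on="product_id")
    .groupby(["customer_id", "product_name"])
    .agg(order_count=pl.col("product_name").count())
    .with_columns(
        pl.col("order_count")
        .rank(method="dense", descending=True)
        .over("customer_id")
        .alias("rank")
    )
    .filter(pl.col("rank") == 1)
    .sort(["customer_id", "product_name"])
)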
Answer 1
Score: 1
Update:
Apparently the statistical term for the most common / most frequent value(s) in a dataset is the mode:
(df.groupby("customer_id")
   .agg(pl.col("product_name").mode()))
shape: (3, 2)
┌─────────────┬─────────────────────────────┐
│ customer_id ┆ product_name │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════════╪═════════════════════════════╡
│ A ┆ ["ramen"] │
│ B ┆ ["curry", "ramen", "sushi"] │
│ C ┆ ["ramen"] │
└─────────────┴─────────────────────────────┘
Unfortunately, there doesn't seem to be a way to get the count at the same time - which seems like it could be a useful option to have.
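One possible workaround (a sketch, assuming the `df` from the question): add a second expression to the same `agg`, since `.unique_counts().max()` is the size of the largest group, i.e. the count shared by the mode value(s):
import polars as pl

# Sketch: .mode() returns the most frequent value(s) per group, and
# .unique_counts().max() returns the count of that most frequent value.
(df.groupby("customer_id")
   .agg(
       pl.col("product_name").mode(),
       pl.col("product_name").unique_counts().max().alias("order_count"),
   ))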
Perhaps there is a simpler way - but you could filter and use `.value_counts()`:
(df.groupby("customer_id")
   .agg(
       pl.col("product_name")
       .filter(pl.col("product_name").is_in(pl.col("product_name").mode()))
       .value_counts())
   .explode("product_name")
   .unnest("product_name"))
shape: (5, 3)
┌─────────────┬──────────────┬────────┐
│ customer_id ┆ product_name ┆ counts │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞═════════════╪══════════════╪════════╡
│ A ┆ ramen ┆ 3 │
│ C ┆ ramen ┆ 3 │
│ B ┆ ramen ┆ 2 │
│ B ┆ curry ┆ 2 │
│ B ┆ sushi ┆ 2 │
└─────────────┴──────────────┴────────┘
It looks like your query finds the max `product_name` count per `customer_id` (including ties).
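If so, a rank-free sketch of the same idea (using the question's joined `df`): count each pair, then keep the rows that match the per-customer maximum:
import polars as pl

# Sketch: group sizes per (customer_id, product_name); the window max keeps
# every tied product for each customer without an explicit rank column.
(df.groupby(["customer_id", "product_name"])
   .agg(pl.count().alias("order_count"))
   .filter(pl.col("order_count") == pl.col("order_count").max().over("customer_id"))
   .sort(["customer_id", "product_name"]))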
I thought `.value_counts()` may provide a more direct solution:
(df.groupby("customer_id")
.agg(pl.col("product_name").value_counts(sort=True)))
shape: (3, 2)
┌─────────────┬─────────────────────────────────────┐
│ customer_id ┆ product_name │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════════╪═════════════════════════════════════╡
│ C ┆ [{"ramen",3}] │
│ A ┆ [{"ramen",3}, {"curry",2}, {"sus... │
│ B ┆ [{"sushi",2}, {"curry",2}, {"ram... │
└─────────────┴─────────────────────────────────────┘
But I can't figure out a simpler way to unpack/filter than:
(df.groupby("customer_id")
   .agg(pl.col("product_name").value_counts(sort=True))
   .explode("product_name")
   .unnest("product_name")
   .filter(pl.col("counts") == pl.col("counts").first().over("customer_id"))
   .sort(pl.all()))
shape: (5, 3)
┌─────────────┬──────────────┬────────┐
│ customer_id ┆ product_name ┆ counts │
│ ---         ┆ ---          ┆ ---    │
│ str         ┆ str          ┆ u32    │
╞═════════════╪══════════════╪════════╡
│ A           ┆ ramen        ┆ 3      │
│ B           ┆ curry        ┆ 2      │
│ B           ┆ ramen        ┆ 2      │
│ B           ┆ sushi        ┆ 2      │
│ C           ┆ ramen        ┆ 3      │
└─────────────┴──────────────┴────────┘
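A small variant, if you'd rather not depend on `value_counts(sort=True)` placing the largest count first: compare against the per-customer maximum instead (a sketch, same `df`):
import polars as pl

# Sketch: same unpack as above, but the filter uses a window max, so the
# result doesn't rely on the order of the structs inside each list.
(df.groupby("customer_id")
   .agg(pl.col("product_name").value_counts())
   .explode("product_name")
   .unnest("product_name")
   .filter(pl.col("counts") == pl.col("counts").max().over("customer_id"))
   .sort(pl.all()))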
It also looks like you could use `.value_counts()` in pandas:
# count each (customer_id, product_name) pair, keeping customer_id as the index
counts = df.value_counts(["customer_id", "product_name"]).reset_index(1)
# keep the pairs whose count equals the per-customer maximum
counts[counts[0] == counts.groupby(level=0)[0].transform("max")]
            product_name  0
customer_id
A                  ramen  3
C                  ramen  3
B                  curry  2
B                  ramen  2
B                  sushi  2