How can I recreate the following queries using Polars syntax?

Question


I am currently trying to learn the Polars syntax by working through SQL/Pandas queries that I have done in the past.

Here is the SQL Query:

WITH fav_item_cte AS (
    SELECT
        s.customer_id,
        m.product_name,
        COUNT(m.product_id) AS order_count,
        DENSE_RANK() OVER (
            PARTITION BY s.customer_id
            ORDER BY COUNT(s.customer_id) DESC
        ) AS rank
    FROM dannys_diner.menu AS m
    JOIN dannys_diner.sales AS s
        ON m.product_id = s.product_id
    GROUP BY s.customer_id, m.product_name
)
SELECT
    customer_id,
    product_name,
    order_count
FROM fav_item_cte
WHERE rank = 1;
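
The piece that needs a direct Polars analogue is the DENSE_RANK() OVER (PARTITION BY ...) window. In Polars this maps to a rank expression evaluated over a window; a minimal sketch of just that expression, assuming an order_count column has already been aggregated:

import polars as pl

# DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY order_count DESC)
rank_expr = (
    pl.col("order_count")
    .rank(method="dense", descending=True)  # dense rank, largest count first
    .over("customer_id")                    # the PARTITION BY
    .alias("rank")
)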

Here is the Pandas DataFrame after the initial inner join, to allow for an MRE (note that it carries product_name but not product_id):

import datetime

import pandas as pd

df = pd.DataFrame(
    {
        "customer_id": [
            'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'
        ],
        "order_id": [
            datetime.date(2021, 1, 1), datetime.date(2021, 1, 4),
            datetime.date(2021, 1, 11), datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 7), datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 2), datetime.date(2021, 1, 10),
            datetime.date(2021, 1, 11), datetime.date(2021, 1, 11),
            datetime.date(2021, 1, 16), datetime.date(2021, 2, 1),
            datetime.date(2021, 1, 1), datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 7),
        ],
        "join_date": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
        "product_name": [
            'sushi', 'sushi', 'sushi', 'curry', 'curry', 'curry', 'curry',
            'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen'
        ],
        "price": [10, 10, 10, 15, 15, 15, 15, 12, 12, 12, 12, 12, 12, 12, 12],
    }
)

Here is the Pandas Code that I used:

df = (
    df.groupby(["customer_id", "product_name"])
    # the MRE frame has no product_id column, so count product_name instead
    .agg(order_count=("product_name", "count"))
    .reset_index()
    .assign(
        rank=lambda df_: df_.groupby("customer_id")["order_count"].rank(
            method="dense", ascending=False
        )
    )
    .query("rank == 1")
    .sort_values(["customer_id", "product_name"])
)

And here is the output that I was seeking:

df = pd.DataFrame(
    {
        "customer_id": ['A', 'B', 'B', 'B', 'C'],
        "product_name": ['ramen', 'curry', 'ramen', 'sushi', 'ramen'],
        "order_count": [3, 2, 2, 2, 3],
        "rank": [1, 1, 1, 1, 1],
    }
)

Here is the Polars code that I have so far.

import polars as pl

dfp = pl.DataFrame(
    {
        "customer_id": [
            'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'
        ],
        "order_id": [
            datetime.date(2021, 1, 1), datetime.date(2021, 1, 4),
            datetime.date(2021, 1, 11), datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 7), datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 2), datetime.date(2021, 1, 10),
            datetime.date(2021, 1, 11), datetime.date(2021, 1, 11),
            datetime.date(2021, 1, 16), datetime.date(2021, 2, 1),
            datetime.date(2021, 1, 1), datetime.date(2021, 1, 1),
            datetime.date(2021, 1, 7),
        ],
        "join_date": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
        "product_name": [
            'sushi', 'sushi', 'sushi', 'curry', 'curry', 'curry', 'curry',
            'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen'
        ],
        "price": [10, 10, 10, 15, 15, 15, 15, 12, 12, 12, 12, 12, 12, 12, 12],
    }
)
dfp = (
    dfp.groupby(["customer_id", "product_name"])
    .agg(order_count=pl.col("product_name").count())
    .sort(by=["customer_id", "order_count"], descending=[False, True])
    .with_columns(
        pl.col("order_count").rank(method="dense", descending=True).alias("rank")
    )
    .filter(pl.col("rank") == 1)
)

My Polars query doesn't pick up on the repeated "customer B" orders that are tied at two orders each for three items: without a window, .rank() is computed over the entire order_count column rather than within each customer_id. Here is that output minus the .filter(pl.col("rank") == 1):

shape: (7, 4)
┌─────────────┬──────────────┬─────────────┬──────┐
│ customer_id ┆ product_name ┆ order_count ┆ rank │
│ ---         ┆ ---          ┆ ---         ┆ ---  │
│ str         ┆ str          ┆ u32         ┆ u32  │
╞═════════════╪══════════════╪═════════════╪══════╡
│ A           ┆ ramen        ┆ 3           ┆ 1    │
│ A           ┆ curry        ┆ 2           ┆ 2    │
│ A           ┆ sushi        ┆ 1           ┆ 3    │
│ B           ┆ curry        ┆ 2           ┆ 2    │
│ B           ┆ sushi        ┆ 2           ┆ 2    │
│ B           ┆ ramen        ┆ 2           ┆ 2    │
│ C           ┆ ramen        ┆ 3           ┆ 1    │
└─────────────┴──────────────┴─────────────┴──────┘

Edit: Based on @jqurious's comment below, the following might be the best translation, but I am still curious whether it is the idiomatic way to do this in Polars:


dfp = (
    sales_pl.join(menu_pl, on="product_id")
    .groupby(["customer_id", "product_name"])
    .agg(order_count=pl.col("product_name").count())
    .sort(by=["customer_id", "order_count"], descending=[False, True])
    .with_columns(
        pl.col("order_count")
        .rank(method="dense", descending=True)
        .over("customer_id")
        .alias("rank")
    )
    .filter(pl.col("rank") == 1)
)
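
For larger inputs, the same pipeline should also work lazily; a sketch assuming sales_pl and menu_pl are eager DataFrames matching the SQL schema above:

# Lazy version of the translation above; the plan is optimized at .collect().
dfp = (
    sales_pl.lazy()
    .join(menu_pl.lazy(), on="product_id")
    .groupby(["customer_id", "product_name"])
    .agg(order_count=pl.col("product_name").count())
    .with_columns(
        pl.col("order_count")
        .rank(method="dense", descending=True)
        .over("customer_id")
        .alias("rank")
    )
    .filter(pl.col("rank") == 1)
    .collect()
)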

Answer 1

Score: 1


Update:
Apparently the statistical term for the most common / most frequent value(s) in a dataset is the mode.

(df.groupby("customer_id")
   .agg(pl.col("product_name").mode()))
shape: (3, 2)
┌─────────────┬─────────────────────────────┐
│ customer_id ┆ product_name                │
│ ---         ┆ ---                         │
│ str         ┆ list[str]                   │
╞═════════════╪═════════════════════════════╡
│ A           ┆ ["ramen"]                   │
│ B           ┆ ["curry", "ramen", "sushi"] │
│ C           ┆ ["ramen"]                   │
└─────────────┴─────────────────────────────┘

Unfortunately, there doesn't seem to be a way to get the count at the same time - which seems like it could be a useful option to have.
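
One workaround (a sketch, not a built-in option) is to explode the per-customer modes and join the (customer, product) counts back on:

# `df` is the frame from the question; `modes` and `counts` are hypothetical
# intermediate names used only for this sketch.
modes = (
    df.groupby("customer_id")
      .agg(pl.col("product_name").mode())
      .explode("product_name")   # one row per modal product
)
counts = df.groupby(["customer_id", "product_name"]).agg(counts=pl.count())
modes.join(counts, on=["customer_id", "product_name"])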

Perhaps there is a simpler way - but you could filter and use .value_counts():

(df.groupby("customer_id")
   .agg(
      pl.col("product_name")
        .filter(pl.col("product_name").is_in(pl.col("product_name").mode()))
        .value_counts())
   .explode("product_name")
   .unnest("product_name"))
shape: (5, 3)
┌─────────────┬──────────────┬────────┐
│ customer_id ┆ product_name ┆ counts │
│ ---         ┆ ---          ┆ ---    │
│ str         ┆ str          ┆ u32    │
╞═════════════╪══════════════╪════════╡
│ A           ┆ ramen        ┆ 3      │
│ C           ┆ ramen        ┆ 3      │
│ B           ┆ ramen        ┆ 2      │
│ B           ┆ curry        ┆ 2      │
│ B           ┆ sushi        ┆ 2      │
└─────────────┴──────────────┴────────┘

It looks like your query finds the max product_name count per customer_id (including ties).

I thought .value_counts() might provide a more direct solution:

(df.groupby("customer_id")
   .agg(pl.col("product_name").value_counts(sort=True)))
shape: (3, 2)
┌─────────────┬─────────────────────────────────────┐
│ customer_id ┆ product_name                        │
│ ---         ┆ ---                                 │
│ str         ┆ list[struct[2]]                     │
╞═════════════╪═════════════════════════════════════╡
│ C           ┆ [{"ramen",3}]                       │
│ A           ┆ [{"ramen",3}, {"curry",2}, {"sus... │
│ B           ┆ [{"sushi",2}, {"curry",2}, {"ram... │
└─────────────┴─────────────────────────────────────┘

But I can't figure out a simpler way to unpack/filter than:

(df.groupby("customer_id")
   .agg(pl.col("product_name").value_counts(sort=True))
   .explode("product_name")
   .unnest("product_name")
   .filter(pl.col("counts") == pl.col("counts").first().over("customer_id"))
   .sort(pl.all()))
shape: (5, 3)
┌─────────────┬──────────────┬────────┐
│ customer_id ┆ product_name ┆ counts │
│ ---         ┆ ---          ┆ ---    │
│ str         ┆ str          ┆ u32    │
╞═════════════╪══════════════╪════════╡
│ A           ┆ ramen        ┆ 3      │
│ B           ┆ curry        ┆ 2      │
│ B           ┆ ramen        ┆ 2      │
│ B           ┆ sushi        ┆ 2      │
│ C           ┆ ramen        ┆ 3      │
└─────────────┴──────────────┴────────┘
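
Note that the .first().over() comparison leans on value_counts(sort=True) placing the largest count first within each list; comparing against a window max should give the same rows without that ordering assumption (a sketch):

(df.groupby("customer_id")
   .agg(pl.col("product_name").value_counts())
   .explode("product_name")
   .unnest("product_name")
   # keep the rows tied for the per-customer maximum count
   .filter(pl.col("counts") == pl.col("counts").max().over("customer_id"))
   .sort(pl.all()))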

It also looks like you could use .value_counts() in pandas:

counts = df.value_counts(["customer_id", "product_name"]).reset_index(1)
counts[counts[0] == counts.groupby(level=0)[0].transform("max")]
            product_name  0
customer_id                
A                  ramen  3
C                  ramen  3
B                  curry  2
B                  ramen  2
B                  sushi  2
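
Depending on the pandas version, the counts column from DataFrame.value_counts() comes out named 0 (older releases) or count (pandas 2.0+); renaming the Series up front sidesteps that (a sketch):

counts = (
    df.value_counts(["customer_id", "product_name"])
      .rename("order_count")   # give the counts Series a stable name
      .reset_index()
)
counts[
    counts["order_count"]
    == counts.groupby("customer_id")["order_count"].transform("max")
]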
