How can I recreate the following queries using Polars syntax?

Question
I am currently trying to learn the Polars syntax by working through SQL/Pandas queries that I have done in the past.
Here is the SQL Query:
WITH fav_item_cte AS
(
SELECT
s.customer_id,
m.product_name,
COUNT(m.product_id) AS order_count,
DENSE_RANK() OVER(PARTITION BY s.customer_id ORDER BY COUNT(s.customer_id) DESC) AS rank
FROM dannys_diner.menu AS m
JOIN dannys_diner.sales AS s
ON m.product_id = s.product_id
GROUP BY s.customer_id, m.product_name
)
SELECT
customer_id,
product_name,
order_count
FROM fav_item_cte
WHERE rank = 1;
Here is the Pandas DataFrame after the initial inner join, to allow for an MRE:
import datetime
import pandas as pd

df = pd.DataFrame(
{
"customer_id": ['A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'C',
'C', 'C'],
"order_id": [
datetime.date(2021, 1, 1), datetime.date(2021, 1, 4),
datetime.date(2021, 1, 11), datetime.date(2021, 1, 1),
datetime.date(2021, 1, 7), datetime.date(2021, 1, 1),
datetime.date(2021, 1, 2), datetime.date(2021, 1, 10),
datetime.date(2021, 1, 11), datetime.date(2021, 1, 11),
datetime.date(2021, 1, 16), datetime.date(2021, 2, 1),
datetime.date(2021, 1, 1), datetime.date(2021, 1, 1),
datetime.date(2021, 1, 7)
],
"join_date": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
"product_name": ['sushi', 'sushi', 'sushi', 'curry', 'curry', 'curry', 'curry',
'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen',
'ramen'],
"price": [10, 10, 10, 15, 15, 15, 15, 12, 12, 12, 12, 12, 12, 12, 12],
}
)
Here is the pandas code that I used:
df = (
    df.groupby(["customer_id", "product_name"])
    # note: the MRE df above has no "product_id" column, so count "product_name" instead
    .agg(order_count=("product_name", "count"))
    .reset_index()
    .assign(
        rank=lambda df_: df_.groupby("customer_id")["order_count"].rank(
            method="dense", ascending=False
        )
    )
    .query("rank == 1")
    .sort_values(["customer_id", "product_name"])
)
And here is the output that I was seeking:
df = pd.DataFrame(
{
"customer_id": ['A', 'B', 'B', 'B', 'C'],
"product_name": ['ramen', 'curry', 'ramen', 'sushi', 'ramen'],
"order_count": [3, 2, 2, 2, 3],
"rank": [1, 1, 1, 1, 1]
}
)
Here is the Polars code that I have so far:
import polars as pl

dfp = pl.DataFrame(
{
"customer_id": ['A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'C',
'C', 'C'],
"order_id": [
datetime.date(2021, 1, 1), datetime.date(2021, 1, 4),
datetime.date(2021, 1, 11), datetime.date(2021, 1, 1),
datetime.date(2021, 1, 7), datetime.date(2021, 1, 1),
datetime.date(2021, 1, 2), datetime.date(2021, 1, 10),
datetime.date(2021, 1, 11), datetime.date(2021, 1, 11),
datetime.date(2021, 1, 16), datetime.date(2021, 2, 1),
datetime.date(2021, 1, 1), datetime.date(2021, 1, 1),
datetime.date(2021, 1, 7)
],
"join_date": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
"product_name": ['sushi', 'sushi', 'sushi', 'curry', 'curry', 'curry', 'curry',
'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen', 'ramen',
'ramen'],
"price": [10, 10, 10, 15, 15, 15, 15, 12, 12, 12, 12, 12, 12, 12, 12],
}
)
dfp = (
    dfp.groupby(["customer_id", "product_name"])
    .agg(order_count=pl.col("product_name").count())
    .sort(by=["customer_id", "order_count"], descending=[False, True])
    .with_columns(
        pl.col("order_count").rank(method="dense", descending=True).alias("rank")
    )
    .filter(pl.col("rank") == 1)
)
My Polars query doesn't account for/pick up on the repeated "customer B" orders, where three items are tied at two orders each. Here is that output, minus the `.filter(pl.col("rank") == 1)`:

| customer_id (str) | product_name (str) | order_count (u32) | rank (u32) |
|---|---|---|---|
| "A" | "ramen" | 3 | 1 |
| "A" | "curry" | 2 | 2 |
| "A" | "sushi" | 1 | 3 |
| "B" | "curry" | 2 | 2 |
| "B" | "sushi" | 2 | 2 |
| "B" | "ramen" | 2 | 2 |
| "C" | "ramen" | 3 | 1 |
Edit: Based on @jqurious's comment below, the following might be the best translation, but I am still curious whether it is the best way to do this in Polars.
dfp = (
    sales_pl.join(menu_pl, on="product_id")
    .groupby(["customer_id", "product_name"])
    .agg(order_count=pl.col("product_name").count())
    .sort(by=["customer_id", "order_count"], descending=[False, True])
    .with_columns(
        pl.col("order_count")
        .rank(method="dense", descending=True)
        .over("customer_id")
        .alias("rank")
    )
    .filter(pl.col("rank") == 1)
)
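One simplification that I think should be safe (a sketch, untested): since `.rank(...).over(...)` depends only on the values and not on row order, the `.sort` only affects display order and could move to the end:
# Hedged sketch of the same query: rank first, sort last. Assumes
# rank(method="dense") is order-independent, which I believe it is.
dfp = (
    sales_pl.join(menu_pl, on="product_id")
    .groupby(["customer_id", "product_name"])
    .agg(order_count=pl.col("product_name").count())
    .with_columns(
        pl.col("order_count")
        .rank(method="dense", descending=True)
        .over("customer_id")
        .alias("rank")
    )
    .filter(pl.col("rank") == 1)
    .sort(["customer_id", "product_name"])
)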
Answer 1
Score: 1
Update:
Apparently the statistical term for the most common / most frequent value(s) in a dataset is the mode:
(df.groupby("customer_id")
   .agg(pl.col("product_name").mode()))
shape: (3, 2)
┌─────────────┬─────────────────────────────┐
│ customer_id ┆ product_name │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════════╪═════════════════════════════╡
│ A ┆ ["ramen"] │
│ B ┆ ["curry", "ramen", "sushi"] │
│ C ┆ ["ramen"] │
└─────────────┴─────────────────────────────┘
Unfortunately, there doesn't seem to be a way to get the count at the same time - which seems like it could be a useful option to have.
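One possible workaround (a sketch, assuming the `df` from the question): add a second expression to the same `agg`, since `.unique_counts().max()` is the size of the largest group, i.e. the count shared by the mode value(s):
import polars as pl

# Sketch: .mode() returns the most frequent value(s) per group, and
# .unique_counts().max() returns the count of that most frequent value.
(df.groupby("customer_id")
   .agg(
       pl.col("product_name").mode(),
       pl.col("product_name").unique_counts().max().alias("order_count"),
   ))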
Perhaps there is a simpler way - but you could filter and use `.value_counts()`:
(df.groupby("customer_id")
   .agg(
       pl.col("product_name")
       .filter(pl.col("product_name").is_in(pl.col("product_name").mode()))
       .value_counts())
   .explode("product_name")
   .unnest("product_name"))
shape: (5, 3)
┌─────────────┬──────────────┬────────┐
│ customer_id ┆ product_name ┆ counts │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞═════════════╪══════════════╪════════╡
│ A ┆ ramen ┆ 3 │
│ C ┆ ramen ┆ 3 │
│ B ┆ ramen ┆ 2 │
│ B ┆ curry ┆ 2 │
│ B ┆ sushi ┆ 2 │
└─────────────┴──────────────┴────────┘
It looks like your query finds the max `product_name` count per `customer_id` (including ties).
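If so, a rank-free sketch of the same idea (using the question's joined `df`): count each pair, then keep the rows that match the per-customer maximum:
import polars as pl

# Sketch: group sizes per (customer_id, product_name); the window max keeps
# every tied product for each customer without an explicit rank column.
(df.groupby(["customer_id", "product_name"])
   .agg(pl.count().alias("order_count"))
   .filter(pl.col("order_count") == pl.col("order_count").max().over("customer_id"))
   .sort(["customer_id", "product_name"]))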
I thought `.value_counts()` may provide a more direct solution:
(df.groupby("customer_id")
.agg(pl.col("product_name").value_counts(sort=True)))
shape: (3, 2)
┌─────────────┬─────────────────────────────────────┐
│ customer_id ┆ product_name │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════════╪═════════════════════════════════════╡
│ C ┆ [{"ramen",3}] │
│ A ┆ [{"ramen",3}, {"curry",2}, {"sus... │
│ B ┆ [{"sushi",2}, {"curry",2}, {"ram... │
└─────────────┴─────────────────────────────────────┘
But I can't figure out a simpler way to unpack/filter than:
(df.groupby("customer_id")
   .agg(pl.col("product_name").value_counts(sort=True))
   .explode("product_name")
   .unnest("product_name")
   .filter(pl.col("counts") == pl.col("counts").first().over("customer_id"))
   .sort(pl.all()))
shape: (5, 3)
┌─────────────┬──────────────┬────────┐
│ customer_id ┆ product_name ┆ counts │
│ ---         ┆ ---          ┆ ---    │
│ str         ┆ str          ┆ u32    │
╞═════════════╪══════════════╪════════╡
│ A           ┆ ramen        ┆ 3      │
│ B           ┆ curry        ┆ 2      │
│ B           ┆ ramen        ┆ 2      │
│ B           ┆ sushi        ┆ 2      │
│ C           ┆ ramen        ┆ 3      │
└─────────────┴──────────────┴────────┘
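A small variant, if you'd rather not depend on `value_counts(sort=True)` placing the largest count first: compare against the per-customer maximum instead (a sketch, same `df`):
import polars as pl

# Sketch: same unpack as above, but the filter uses a window max, so the
# result doesn't rely on the order of the structs inside each list.
(df.groupby("customer_id")
   .agg(pl.col("product_name").value_counts())
   .explode("product_name")
   .unnest("product_name")
   .filter(pl.col("counts") == pl.col("counts").max().over("customer_id"))
   .sort(pl.all()))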
It also looks like you could use `.value_counts()` in pandas:
# count each (customer_id, product_name) pair, keeping customer_id as the index
counts = df.value_counts(["customer_id", "product_name"]).reset_index(1)
# keep the pairs whose count equals the per-customer maximum
counts[counts[0] == counts.groupby(level=0)[0].transform("max")]
            product_name  0
customer_id
A                  ramen  3
C                  ramen  3
B                  curry  2
B                  ramen  2
B                  sushi  2