Polars syntax for Pandas complex queries
Question
I am trying to benchmark Polars but I am stuck on how to replicate the following Pandas expression in Polars.
df['ll_lat'] = (df['lat'] // 0.1 * 0.1).round(1)
df['ll_lon'] = (df['lon'] // 0.1 * 0.1).round(1)
df['temporalBasket'] = df['eventtime'].astype(str).str[:13]
df = df.groupby(['ll_lat', 'll_lon', 'temporalBasket']).agg(strikes=('lat', 'count'))
df
Can someone help me translate and explain how I should be thinking about Polars column creation etc. please?
Here is a df.head() output to make things a little clearer.
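Since the df.head() screenshot is not reproduced here, the Pandas query can be exercised on a tiny made-up frame (the sample values below are assumptions for illustration, not the real data):

```python
import pandas as pd

# Hypothetical sample data; the real dataset is only shown as a screenshot in the post
df = pd.DataFrame({
    'lat': [45.123, 45.155],
    'lon': [12.321, 12.322],
    'eventtime': pd.to_datetime(['2023-04-01 10:20:00', '2023-04-01 10:45:00']),
})

# Floor the coordinates to one decimal place
df['ll_lat'] = (df['lat'] // 0.1 * 0.1).round(1)
df['ll_lon'] = (df['lon'] // 0.1 * 0.1).round(1)

# First 13 characters of the timestamp string, i.e. 'YYYY-MM-DD HH'
df['temporalBasket'] = df['eventtime'].astype(str).str[:13]

# Named aggregation: count rows per (lat, lon, hour) bucket
out = df.groupby(['ll_lat', 'll_lon', 'temporalBasket']).agg(strikes=('lat', 'count'))
```

Both sample rows fall into the same 0.1-degree cell and the same hour, so they collapse into a single group.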
Answer 1
Score: 1
You can do something similar in Polars to what you are doing in Pandas. However, instead of slicing the string, you can use truncate to extract the day and hour. This should be faster, and also easier to read.
For rounding down to the nearest decimal, I did not find a dedicated Polars method, so I kept your logic.
from datetime import datetime

import polars as pl

# Sample data
data = {
    'lat': [45.123, 45.155, 45.171, 45.191, 45.123],
    'lon': [12.321, 12.322, 12.345, 12.366, 12.321],
    'eventtime': [
        datetime(2023, 4, 1, 10, 20),
        datetime(2023, 4, 1, 12, 30),
        datetime(2023, 4, 1, 10, 45),
        datetime(2023, 4, 2, 9, 15),
        datetime(2023, 4, 2, 11, 50),
    ],
}
df_pl = pl.DataFrame(data)

df_pl.groupby(
    (pl.col('lat') // 0.1 * 0.1).alias('ll_lat'),
    (pl.col('lon') // 0.1 * 0.1).alias('ll_lon'),
    pl.col('eventtime').dt.truncate('1h').alias('temporalBasket')
).agg(
    strikes=pl.col('lat').count()
)
Output:
┌────────┬────────┬─────────────────────┬─────────┐
│ ll_lat ┆ ll_lon ┆ temporalBasket ┆ strikes │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ datetime[μs] ┆ u32 │
╞════════╪════════╪═════════════════════╪═════════╡
│ 45.1 ┆ 12.3 ┆ 2023-04-01 12:00:00 ┆ 1 │
│ 45.1 ┆ 12.3 ┆ 2023-04-02 09:00:00 ┆ 1 │
│ 45.1 ┆ 12.3 ┆ 2023-04-01 10:00:00 ┆ 2 │
│ 45.1 ┆ 12.3 ┆ 2023-04-02 11:00:00 ┆ 1 │
└────────┴────────┴─────────────────────┴─────────┘