In python polars filter and aggregate dict of lists
Question
You can speed up the calculation by using the apply function along with json.loads to parse the JSON strings. Here's a more efficient solution:
```python
import pandas as pd
import json

# Assuming you have a DataFrame 'df' with a 'json' column
def calculate_average(row):
    data = json.loads(row)
    x_values = data["x"]
    y_values = data["y"]
    # Keep the y values whose corresponding x satisfies 0 < x < 3
    filtered_y = [y for x, y in zip(x_values, y_values) if 0 < x < 3]
    if filtered_y:
        return sum(filtered_y) / len(filtered_y)
    else:
        return None

df["average_y"] = df["json"].apply(calculate_average)
# Now df contains a new column 'average_y' with the calculated averages
```
This code parses each JSON string with json.loads and computes the average in plain Python, which avoids building an intermediate DataFrame per row and should be cheaper than the initial approach; note that apply is still a per-row Python loop rather than a vectorized operation.
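As a quick check, here is that approach run end-to-end on the sample data from the question (the DataFrame construction is reproduced here so the sketch is self-contained):

```python
import json
import pandas as pd

def calculate_average(row):
    # Parse the JSON string and average y where 0 < x < 3
    data = json.loads(row)
    filtered_y = [y for x, y in zip(data["x"], data["y"]) if 0 < x < 3]
    return sum(filtered_y) / len(filtered_y) if filtered_y else None

df = pd.DataFrame({"json": ['{"x":[0,1,2,3], "y":[10,20,30,40]}'] * 3})
df["average_y"] = df["json"].apply(calculate_average)
print(df["average_y"].tolist())  # [25.0, 25.0, 25.0]
```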
I have got a DataFrame with string representations of JSON:
```python
import polars as pl

df = pl.DataFrame({
    "json": [
        '{"x":[0,1,2,3], "y":[10,20,30,40]}',
        '{"x":[0,1,2,3], "y":[10,20,30,40]}',
        '{"x":[0,1,2,3], "y":[10,20,30,40]}'
    ]
})
```
```
shape: (3, 1)
┌───────────────────────────────────┐
│ json                              │
│ ---                               │
│ str                               │
╞═══════════════════════════════════╡
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
└───────────────────────────────────┘
```
Now I would like to calculate the average of y where x > 0 and x < 3, for each row.
This is my current working solution:
First evaluate the string into a dict, then build a DataFrame from it and filter by x.
```python
import ast

df = df.with_columns([
    pl.col('json').apply(
        lambda x: pl.DataFrame(ast.literal_eval(x))
        .filter((pl.col('x') < 3) & (pl.col('x') > 0))['y']
        .mean()
    )
])
```
```
shape: (3, 1)
┌──────┐
│ json │
│ ---  │
│ f64  │
╞══════╡
│ 25.0 │
│ 25.0 │
│ 25.0 │
└──────┘
```
This works fine, but for large datasets the apply function slows down the process significantly.
Is there a more elegant and faster way of doing it?
Answer 1
Score: 1
JSON strings in a column can be parsed using .str.json_extract(). In this case you get a struct, which you can .unnest:
```python
>>> df.with_columns(pl.col("json").str.json_extract()).unnest("json")
shape: (3, 2)
┌─────────────┬────────────────┐
│ x           ┆ y              │
│ ---         ┆ ---            │
│ list[i64]   ┆ list[i64]      │
╞═════════════╪════════════════╡
│ [0, 1, … 3] ┆ [10, 20, … 40] │
│ [0, 1, … 3] ┆ [10, 20, … 40] │
│ [0, 1, … 3] ┆ [10, 20, … 40] │
└─────────────┴────────────────┘
```
You can then .explode the lists and perform your filter/agg logic:
```python
(df.with_row_count()
   .with_columns(pl.col("json").str.json_extract())
   .unnest("json")
   .explode("x", "y")
   .filter(pl.col("x").is_between(1, 2))
   .groupby("row_nr")
   .agg(pl.mean("y")))
```
```
shape: (3, 2)
┌────────┬──────┐
│ row_nr ┆ y    │
│ ---    ┆ ---  │
│ u32    ┆ f64  │
╞════════╪══════╡
│ 0      ┆ 25.0 │
│ 1      ┆ 25.0 │
│ 2      ┆ 25.0 │
└────────┴──────┘
```
You can also use the List API:
```python
(df.with_columns(pl.col("json").str.json_extract())
   .unnest("json")
   .select(
       pl.col("y").arr.take(
           pl.col("x").arr.eval(pl.element().is_between(1, 2).arg_true())
       ).arr.mean()
   )
)
```