In Python Polars, filter and aggregate a dict of lists

Question

I have got a dataframe with string representations of JSON:

import polars as pl

df = pl.DataFrame({ 
        "json": [
            '{"x":[0,1,2,3], "y":[10,20,30,40]}',
            '{"x":[0,1,2,3], "y":[10,20,30,40]}',
            '{"x":[0,1,2,3], "y":[10,20,30,40]}'
        ] 
    })
shape: (3, 1)
┌───────────────────────────────────┐
│ json                              │
│ ---                               │
│ str                               │
╞═══════════════════════════════════╡
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
└───────────────────────────────────┘

Now I would like to calculate, for each row, the average of y where x > 0 and x < 3.
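
For the sample rows above that keeps the y values at x = 1 and x = 2, so the expected value for every row is 25.0. A quick plain-Python sketch of that target for one row:

# One sample row, worked by hand: keep y where 0 < x < 3, then average.
x = [0, 1, 2, 3]
y = [10, 20, 30, 40]
kept = [yv for xv, yv in zip(x, y) if 0 < xv < 3]  # [20, 30]
print(sum(kept) / len(kept))  # 25.0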


This is my current working solution:

First evaluate the string into a dict, then create a dataframe from it and filter it by x.

import ast

df = df.with_columns([
    pl.col('json').apply(lambda x: pl.DataFrame(ast.literal_eval(x)).filter((pl.col('x') < 3) & (pl.col('x') > 0))['y'].mean())
])
shape: (3, 1)
┌──────┐
│ json │
│ ---  │
│ f64  │
╞══════╡
│ 25.0 │
│ 25.0 │
│ 25.0 │
└──────┘

This works fine, but for large datasets the apply function slows the process down significantly.

Is there a more elegant and faster way of doing it?

Answer 1

Score: 1

JSON strings in a column can be parsed using .str.json_extract()

In this case you get a struct which you can .unnest

>>> df.with_columns(pl.col("json").str.json_extract()).unnest("json")
shape: (3, 2)
┌─────────────┬────────────────┐
│ x           ┆ y              │
│ ---         ┆ ---            │
│ list[i64]   ┆ list[i64]      │
╞═════════════╪════════════════╡
│ [0, 1, … 3] ┆ [10, 20, … 40] │
│ [0, 1, … 3] ┆ [10, 20, … 40] │
│ [0, 1, … 3] ┆ [10, 20, … 40] │
└─────────────┴────────────────┘

You can then .explode the lists and perform your filter/agg logic:

(df.with_row_count()
   .with_columns(pl.col("json").str.json_extract())
   .unnest("json")
   .explode("x", "y")
   .filter(pl.col("x").is_between(1, 2))
   .groupby("row_nr")
   .agg(pl.mean("y")))
shape: (3, 2)
┌────────┬──────┐
│ row_nr ┆ y    │
│ ---    ┆ ---  │
│ u32    ┆ f64  │
╞════════╪══════╡
│ 0      ┆ 25.0 │
│ 1      ┆ 25.0 │
│ 2      ┆ 25.0 │
└────────┴──────┘
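
If you also want the averages attached back to the original frame (as in the question's working solution), one option is to join the aggregated result back on row_nr; a minimal sketch, assuming the same df and Polars version as above:

# Compute the per-row means as above, then join them back onto the original rows.
means = (df.with_row_count()
           .with_columns(pl.col("json").str.json_extract())
           .unnest("json")
           .explode("x", "y")
           .filter(pl.col("x").is_between(1, 2))
           .groupby("row_nr")
           .agg(pl.mean("y").alias("average_y")))

df.with_row_count().join(means, on="row_nr").drop("row_nr")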

You can also use the List API:

(df.with_columns(pl.col("json").str.json_extract())
   .unnest("json")
   .select(
      pl.col("y").arr.take(
         pl.col("x").arr.eval(pl.element().is_between(1, 2).arg_true())
      ).arr.mean()
   )
)
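
Note that newer Polars releases have renamed several of these methods: .str.json_extract() became .str.json_decode(), with_row_count became with_row_index, groupby became group_by, and the list namespace moved from .arr to .list. A sketch of the same explode/filter/aggregate approach, assuming a recent Polars version (1.x):

# Same technique with the renamed methods (assumes Polars >= 1.0).
import polars as pl

df = pl.DataFrame({
    "json": ['{"x":[0,1,2,3], "y":[10,20,30,40]}'] * 3
})

(df.with_row_index("row_nr")                         # was with_row_count()
   .with_columns(pl.col("json").str.json_decode())   # was .str.json_extract()
   .unnest("json")
   .explode("x", "y")
   .filter(pl.col("x").is_between(1, 2))
   .group_by("row_nr", maintain_order=True)          # was groupby("row_nr")
   .agg(pl.mean("y")))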
