In python polars filter and aggregate dict of lists

Question

One way to do the calculation is to parse the JSON strings with json.loads inside an apply call and then compute the filtered average in plain Python:

import pandas as pd
import json

# Assuming you have a DataFrame 'df' with a 'json' column

def calculate_average(row):
    data = json.loads(row)
    x_values = data["x"]
    y_values = data["y"]
    
    # keep the y values whose matching x is strictly between 0 and 3
    filtered_y = [y for x, y in zip(x_values, y_values) if 0 < x < 3]
    
    if filtered_y:
        return sum(filtered_y) / len(filtered_y)
    else:
        return None

df["average_y"] = df["json"].apply(calculate_average)

# Now df contains a new column 'average_y' with the calculated averages

This parses each JSON string and computes the filtered average row by row; it avoids the per-row DataFrame construction of the original solution shown below, but it still relies on a row-wise apply.
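As a quick sanity check, a minimal usage sketch (assuming pandas is installed, calculate_average is defined as above, and the data are the three sample strings from the question below):

import pandas as pd

# three copies of the sample JSON string used in the question
df = pd.DataFrame({"json": ['{"x":[0,1,2,3], "y":[10,20,30,40]}'] * 3})

df["average_y"] = df["json"].apply(calculate_average)
print(df["average_y"].tolist())  # expected: [25.0, 25.0, 25.0]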


I have got a dataframe with a string representation of JSON:

import polars as pl

df = pl.DataFrame({ 
        "json": [
            '{"x":[0,1,2,3], "y":[10,20,30,40]}',
            '{"x":[0,1,2,3], "y":[10,20,30,40]}',
            '{"x":[0,1,2,3], "y":[10,20,30,40]}'
        ] 
    })
shape: (3, 1)
┌───────────────────────────────────┐
│ json                              │
│ ---                               │
│ str                               │
╞═══════════════════════════════════╡
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
│ {"x":[0,1,2,3], "y":[10,20,30,40… │
└───────────────────────────────────┘

Now I would like to calculate the average for y where x > 0 and x < 3 for each row.
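To make the target concrete, the expected value for one of the sample rows can be worked out in plain Python:

x = [0, 1, 2, 3]
y = [10, 20, 30, 40]

# keep the y values whose matching x is strictly between 0 and 3, i.e. x == 1 or x == 2
kept = [yi for xi, yi in zip(x, y) if 0 < xi < 3]   # [20, 30]
print(sum(kept) / len(kept))                        # 25.0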


This is my current working solution:

First evaluate the string -> dict and then create a dataframe, which is filtered by x.

import ast

df = df.with_columns([
    pl.col('json').apply(lambda x: pl.DataFrame(ast.literal_eval(x)).filter((pl.col('x') < 3) & (pl.col('x') > 0))['y'].mean())
])
shape: (3, 1)
┌──────┐
│ json │
│ ---  │
│ f64  │
╞══════╡
│ 25.0 │
│ 25.0 │
│ 25.0 │
└──────┘

This works fine, but for large datasets the apply function slows down the process significantly.

Is there a more elegant and faster way of doing it?

Answer 1

Score: 1

JSON strings in a column can be parsed using .str.json_extract()

In this case you get a struct which you can .unnest

>>> df.with_columns(pl.col("json").str.json_extract()).unnest("json")
shape: (3, 2)
┌─────────────┬────────────────┐
│ x           ┆ y              │
│ ---         ┆ ---            │
│ list[i64]   ┆ list[i64]      │
╞═════════════╪════════════════╡
│ [0, 1, … 3] ┆ [10, 20, … 40] │
│ [0, 1, … 3] ┆ [10, 20, … 40] │
│ [0, 1, … 3] ┆ [10, 20, … 40] │
└─────────────┴────────────────┘
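To look at the intermediate result before unnesting, a minimal sketch (the printed schema varies a bit across polars versions):

# parse only: "json" becomes a single struct column whose two fields,
# "x" and "y", are lists of integers
parsed = df.with_columns(pl.col("json").str.json_extract())
print(parsed.schema)  # 'json' is a Struct with fields x: list[i64] and y: list[i64]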

You can then .explode the lists and perform your filter/agg logic:

(df.with_row_count()
   .with_columns(pl.col("json").str.json_extract())
   .unnest("json")
   .explode("x", "y")
   .filter(pl.col("x").is_between(1, 2))
   .groupby("row_nr")
   .agg(pl.mean("y")))
shape: (3, 2)
┌────────┬──────┐
│ row_nr ┆ y    │
│ ---    ┆ ---  │
│ u32    ┆ f64  │
╞════════╪══════╡
│ 0      ┆ 25.0 │
│ 1      ┆ 25.0 │
│ 2      ┆ 25.0 │
└────────┴──────┘
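If you want the averages back as a column on the original frame (as in the question) rather than as a separate aggregated frame, one option is to join the aggregation back on the row number; the average_y name here is only illustrative:

agg = (df.with_row_count()
         .with_columns(pl.col("json").str.json_extract())
         .unnest("json")
         .explode("x", "y")
         .filter(pl.col("x").is_between(1, 2))
         .groupby("row_nr")
         .agg(pl.mean("y").alias("average_y")))

# rows whose lists were filtered away entirely are missing from agg,
# so the left join leaves their average_y as null
out = df.with_row_count().join(agg, on="row_nr", how="left").drop("row_nr")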

You can also use the List API:

(df.with_columns(pl.col("json").str.json_extract())
   .unnest("json")
   .select(
      pl.col("y").arr.take(
         pl.col("x").arr.eval(pl.element().is_between(1, 2).arg_true())
      ).arr.mean()
   )
)
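A side note on API churn, as an assumption about newer polars releases (check the version you have installed): str.json_extract was renamed to str.json_decode, the list methods moved from the .arr namespace to .list (with take becoming gather), groupby became group_by, and with_row_count became with_row_index. The same List API expression would then look roughly like this:

(df.with_columns(pl.col("json").str.json_decode())
   .unnest("json")
   .select(
      pl.col("y").list.gather(
         pl.col("x").list.eval(pl.element().is_between(1, 2).arg_true())
      ).list.mean()
   )
)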
