Posted: 2023-02-06 13:31:55

Query/Filter a pandas df using a dict of lists

Question

I have a dict d of varying length in the following format:

d = {
  "foo": [
    50,
    100
  ],
  "bar": [
    5,
    10
  ]
}

Where the key is a column name and the value is a two-element list holding the min and max values to filter that column of a dataframe df on. Thus, given the input above, I'd like to filter df.foo between 50 and 100 and df.bar between 5 and 10.

Of course, I could just hard code it like so:

df.loc[(df['foo'] > 50) & (df['foo'] < 100) & (df['bar'] > 5) & (df['bar'] < 10) ...]

etc., but the number of keys (columns to filter on) may vary, and this is just incredibly ugly code. Is there a cleaner/vectorized way to do this?

I am building a Streamlit app where a user can create n min-max filters on a dataframe, and the format listed above is the format Streamlit's slider returns.

Answer 1

Score: 1

IIUC, one way using pandas.Series.between:

# sample data (d is the dict from the question)
import numpy as np
import pandas as pd

np.random.seed(1234)

df = pd.DataFrame({"foo": np.random.random(10) * 100,
                   "bar": np.random.random(10) * 10})

         foo       bar
0  19.151945  3.578173
1  62.210877  5.009951
2  43.772774  6.834629
3  78.535858  7.127020
4  77.997581  3.702508
5  27.259261  5.611962
6  27.646426  5.030832
7  80.187218  0.137684
8  95.813935  7.728266
9  87.593263  8.826412

Code:

new_df = df[np.logical_and.reduce([df[k].between(*v) for k, v in d.items()])]
print(new_df)

Output:

         foo       bar
1  62.210877  5.009951
3  78.535858  7.127020
8  95.813935  7.728266
9  87.593263  8.826412

Validation: works with any number of filters:

df = pd.DataFrame(np.random.random((10, 10)), columns=[*"abcdefghij"])
d = {c: [0.1, 0.9] for c in df}
new_df = df[np.logical_and.reduce([df[k].between(*v) for k, v in d.items()])]
print(new_df)
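One detail worth noting: Series.between includes both endpoints by default, whereas the hard-coded version in the question uses strict > and <. The following is a minimal sketch, not part of the original answer, that wraps the same idea in a helper. The name filter_by_ranges and the small sample frame are hypothetical, and the string values for inclusive ("both", "neither") assume pandas 1.3 or later. Starting from an all-True mask also means an empty dict returns the frame unchanged, which may matter when the user has not created any filters yet.

```python
import pandas as pd


def filter_by_ranges(df, ranges, inclusive="both"):
    """Keep rows where every column named in `ranges` lies within its (min, max) pair."""
    # Start from an all-True mask so an empty `ranges` dict keeps every row.
    mask = pd.Series(True, index=df.index)
    for column, (low, high) in ranges.items():
        mask &= df[column].between(low, high, inclusive=inclusive)
    return df[mask]


# Hypothetical sample data; the last row sits exactly on the bounds.
df = pd.DataFrame({"foo": [19.15, 62.21, 43.77, 78.54, 78.00, 50.0],
                   "bar": [3.58, 5.01, 6.83, 7.13, 3.70, 10.0]})
d = {"foo": [50, 100], "bar": [5, 10]}

inclusive_rows = filter_by_ranges(df, d)                     # endpoints count as matches
strict_rows = filter_by_ranges(df, d, inclusive="neither")   # strict > and <, as in the question
all_rows = filter_by_ranges(df, {})                          # no filters: nothing is dropped
```

Here the boundary row (foo = 50.0, bar = 10.0) is kept with the default inclusive="both" and dropped with inclusive="neither".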

Answer 2

Score: 0

I hope this works for you. I create a DataFrame from d and then join it with another DataFrame to match the values:

import pandas as pd
d = {
  "foo": [
    50,
    100
  ],
  "bar": [
    5,
    10
  ],
  "noto": [
      11,
      30
  ]
}
df_1 = pd.DataFrame(
    {
        # list() turns the dict views into plain sequences
        "keys": list(d.keys()),
        "vals": list(d.values())
    }
)
df_1
df_2 = pd.DataFrame(
    {
        "item": ["foo", "bar", "noto"],
        "price": [65, 7, 33]
    }
)
main_df = df_1.merge(df_2, left_on='keys', right_on="item")
def check_price(x):
    # True when the row's price lies within its [min, max] bounds (inclusive)
    return x['vals'][0] <= x['price'] <= x['vals'][1]
main_df[main_df.apply(check_price, axis=1)]
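Note that this answer checks one price per key in a lookup table rather than filtering the columns of df. If that layout is what you have, the row-wise apply can probably be avoided. The sketch below is an assumption-laden variant, not the answerer's code: it splits the bounds into two hypothetical columns, low and high, and compares them with Series.between, which accepts list-like bounds.

```python
import pandas as pd

d = {"foo": [50, 100], "bar": [5, 10], "noto": [11, 30]}
df_2 = pd.DataFrame({"item": ["foo", "bar", "noto"], "price": [65, 7, 33]})

# One row per key, with the two bounds as separate numeric columns
# ("low" and "high" are names chosen for this sketch).
bounds = pd.DataFrame.from_dict(d, orient="index", columns=["low", "high"])

# Attach each item's bounds, then compare the whole column at once.
merged = df_2.merge(bounds, left_on="item", right_index=True)
in_range = merged[merged["price"].between(merged["low"], merged["high"])]
print(in_range)
```

With the sample values, foo (65) and bar (7) fall inside their ranges, while noto (33) lies above its maximum of 30 and is dropped.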
