Query/filter a pandas DataFrame using a dict of lists

Question

I have a dict d of varying length in the following format:

d = {
  "foo": [
    50,
    100
  ],
  "bar": [
    5,
    10
  ]
}

Each key is a column name, and each value is a two-element list holding the min and max values to filter that column of a dataframe df on. Given the input above, I'd like to keep the rows where df.foo is between 50 and 100 and df.bar is between 5 and 10.

Of course, I could just hard code it like so:

df.loc[(df['foo'] > 50) & (df['foo'] < 100) & (df['bar'] > 5) & (df['bar'] < 10) ...]

etc., but the number of keys (columns to filter on) may vary, and this is also just incredibly ugly code. Is there a cleaner/vectorized way to do this?

I am building a Streamlit app where a user can create n min-max filters on a dataframe, and the format above is what Streamlit's slider returns.

Answer 1

Score: 1

IIUC, one way is to use pandas.Series.between:

# sample data
import numpy as np
import pandas as pd
np.random.seed(1234)

df = pd.DataFrame({"foo": np.random.random(10) * 100,
                   "bar": np.random.random(10) * 10})

         foo       bar
0  19.151945  3.578173
1  62.210877  5.009951
2  43.772774  6.834629
3  78.535858  7.127020
4  77.997581  3.702508
5  27.259261  5.611962
6  27.646426  5.030832
7  80.187218  0.137684
8  95.813935  7.728266
9  87.593263  8.826412

Code:

new_df = df[np.logical_and.reduce([df[k].between(*v) for k, v in d.items()])]
print(new_df)

Output:

         foo       bar
1  62.210877  5.009951
3  78.535858  7.127020
8  95.813935  7.728266
9  87.593263  8.826412

Validation: works with any number of filters:

df = pd.DataFrame(np.random.random((10, 10)), columns=[*"abcdefghij"])
d = {c: [0.1, 0.9] for c in df}
new_df = df[np.logical_and.reduce([df[k].between(*v) for k, v in d.items()])]
print(new_df)
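
The one-liner above could be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the name filter_by_ranges is hypothetical, and the string values for inclusive ("both", "neither") assume pandas 1.3 or later. Note that between includes both endpoints by default, whereas the hard-coded version in the question uses strict inequalities; passing inclusive="neither" should reproduce the strict behaviour. The empty-dict guard is there because reducing an empty list of masks yields a scalar rather than a row mask.

```python
import numpy as np
import pandas as pd


def filter_by_ranges(df, ranges, inclusive="both"):
    """Keep the rows of df that fall inside every (min, max) pair in ranges.

    ranges maps a column name to a two-element list [min, max].
    filter_by_ranges is an illustrative name, not a pandas function.
    """
    if not ranges:
        # Nothing to filter on: return the frame unchanged.
        return df
    masks = [
        df[col].between(lo, hi, inclusive=inclusive)
        for col, (lo, hi) in ranges.items()
    ]
    # AND all per-column masks together into a single boolean array.
    return df[np.logical_and.reduce(masks)]


sample = pd.DataFrame({"foo": [50, 75, 100, 120], "bar": [5, 7, 10, 6]})
d = {"foo": [50, 100], "bar": [5, 10]}

print(filter_by_ranges(sample, d))                       # endpoints included
print(filter_by_ranges(sample, d, inclusive="neither"))  # strict inequalities
```

With the small sample frame, the inclusive call keeps the rows sitting exactly on a boundary, while the strict call drops them.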

Answer 2

Score: 0

I hope this works for your use case. I create a DataFrame from d and then join it with another DataFrame to match the values:

import pandas as pd
d = {
  "foo": [
    50,
    100
  ],
  "bar": [
    5,
    10
  ],
  "noto": [
      11,
      30
  ]
}
df_1 = pd.DataFrame(
    {
        "keys": list(d.keys()),    # materialize the dict views as plain lists
        "vals": list(d.values())
    }
)
df_1
df_2 = pd.DataFrame(
    {
        "item": ["foo", "bar", "noto"],
        "price": [65, 7, 33]
    }
)
main_df = df_1.merge(df_2, left_on='keys', right_on="item")
def check_price(x):
    # True when the row's price lies within its [min, max] range (inclusive)
    return x['vals'][0] <= x['price'] <= x['vals'][1]
main_df[main_df.apply(check_price, axis=1)]
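
The row-wise apply in this answer can probably be avoided by splitting each range into two scalar columns before merging, so that the comparison runs column-wise. This is a sketch under assumptions rather than the answerer's code: the frame names bounds, items, merged and in_range, and the column names lo and hi, are made up for illustration, and the data is the same toy data as above.

```python
import pandas as pd

d = {"foo": [50, 100], "bar": [5, 10], "noto": [11, 30]}

# One row per key, with the range split into two scalar columns.
# "lo" and "hi" are illustrative column names, not from the answer.
bounds = pd.DataFrame(
    [(key, lo, hi) for key, (lo, hi) in d.items()],
    columns=["item", "lo", "hi"],
)

items = pd.DataFrame({"item": ["foo", "bar", "noto"], "price": [65, 7, 33]})

# Attach each item's bounds, then compare whole columns at once.
merged = items.merge(bounds, on="item")
in_range = merged[(merged["price"] >= merged["lo"]) & (merged["price"] <= merged["hi"])]
print(in_range)
```

Like the answer's own code, this operates on a long table with one row per item, which differs from the wide dataframe in the question where each key of d is a column.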

huangapple
  • Posted on February 6, 2023 at 13:31:55
  • Please keep this link when reposting: https://go.coder-hub.com/75357643.html