Query/Filter a pandas df using a dict of lists
Question
I have a dict d of varying length in the following format:
d = {
"foo": [
50,
100
],
"bar": [
5,
10
]
}
The key is a column name and the value is a two-element list holding the min and max values of that column, which I want to use to filter a dataframe df. Thus, given the input above, I'd like to filter df.foo between 50 and 100 and df.bar between 5 and 10.
Of course, I could just hard code it like so:
df.loc[(df['foo'] > 50) & (df['foo'] < 100) & (df['bar'] > 5) & (df['bar'] < 10) ...]
etc., but the number of keys (columns to filter on) may vary, and this is also just incredibly ugly code. Is there a cleaner/vectorized way to do this?
I am building a streamlit app where a user can create n min-max filters on a dataframe, and the format listed above is the format streamlit's slider returns.
Answer 1
Score: 1
IIUC, one way is to use pandas.Series.between:
# sample data
import numpy as np
import pandas as pd
np.random.seed(1234)
df = pd.DataFrame({"foo": np.random.random(10) * 100,
"bar": np.random.random(10) * 10})
foo bar
0 19.151945 3.578173
1 62.210877 5.009951
2 43.772774 6.834629
3 78.535858 7.127020
4 77.997581 3.702508
5 27.259261 5.611962
6 27.646426 5.030832
7 80.187218 0.137684
8 95.813935 7.728266
9 87.593263 8.826412
Code:
new_df = df[np.logical_and.reduce([df[k].between(*v) for k, v in d.items()])]
print(new_df)
Output:
foo bar
1 62.210877 5.009951
3 78.535858 7.127020
8 95.813935 7.728266
9 87.593263 8.826412
Validation: this works for any number of filters:
df = pd.DataFrame(np.random.random((10, 10)), columns=[*"abcdefghij"])
d = {c: [0.1, 0.9] for c in df}
new_df = df[np.logical_and.reduce([df[k].between(*v) for k, v in d.items()])]
print(new_df)
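Note that Series.between includes both endpoints by default, whereas the hard-coded version in the question uses strict > and <. The sketch below wraps the one-liner above in a small helper. The name filter_by_ranges and the toy data are hypothetical, made up for illustration. It assumes pandas 1.3 or later, where between accepts inclusive="both", "neither", "left" or "right". It also guards against an empty dict, since np.logical_and.reduce([]) returns a scalar rather than a row mask.

```python
import numpy as np
import pandas as pd


def filter_by_ranges(df, ranges, inclusive="both"):
    """Keep rows where every column named in `ranges` lies within its (min, max) pair.

    `filter_by_ranges` is a hypothetical helper, not part of pandas.
    """
    if not ranges:
        # No filters selected: reducing an empty list would not give a row mask.
        return df
    masks = [
        df[col].between(low, high, inclusive=inclusive)
        for col, (low, high) in ranges.items()
    ]
    # AND all per-column masks together into a single boolean array.
    return df[np.logical_and.reduce(masks)]


# Toy data chosen so that some values sit exactly on the bounds.
df = pd.DataFrame({"foo": [50, 60, 100, 75], "bar": [5, 7, 8, 10]})
d = {"foo": [50, 100], "bar": [5, 10]}

inclusive_rows = filter_by_ranges(df, d)                    # bounds included
strict_rows = filter_by_ranges(df, d, inclusive="neither")  # bounds excluded
print(inclusive_rows)
print(strict_rows)
```

With sliders, the inclusive default is probably what a user expects. The strict variant is shown only because it matches the comparison operators written in the question.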
Answer 2
Score: 0
I hope this works for you. I create a DataFrame from d and then merge it with another DataFrame to match these values:
import pandas as pd
d = {
"foo": [
50,
100
],
"bar": [
5,
10
],
"noto": [
11,
30
]
}
df_1 = pd.DataFrame(
{
"keys": d.keys(),
"vals": d.values()
}
)
df_1
df_2 = pd.DataFrame(
{
"item": ["foo", "bar", "noto"],
"price": [65, 7, 33]
}
)
main_df = df_1.merge(df_2, left_on='keys', right_on="item")
def check_price(x):
    # True when the row's price lies within its [min, max] pair (inclusive)
    return x['vals'][0] <= x['price'] <= x['vals'][1]
main_df[main_df.apply(check_price, axis=1)]
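The row-wise apply above can likely be avoided. As a rough sketch of the same merge-based idea, the bounds can be split into two numeric columns and compared with Series.between, which accepts Series as bounds and compares element by element. The names bounds, prices, low and high are made up for illustration. The data is the same as in this answer.

```python
import pandas as pd

d = {"foo": [50, 100], "bar": [5, 10], "noto": [11, 30]}

# One row per key, with the two-element list split into "low" and "high" columns.
bounds = (
    pd.DataFrame.from_dict(d, orient="index", columns=["low", "high"])
    .rename_axis("item")
    .reset_index()
)

prices = pd.DataFrame({"item": ["foo", "bar", "noto"], "price": [65, 7, 33]})

merged = bounds.merge(prices, on="item")

# Vectorized, inclusive range check with per-row bounds.
in_range = merged[merged["price"].between(merged["low"], merged["high"])]
print(in_range)
```

Here "noto" is dropped because 33 lies outside [11, 30]. This layout holds one value per key, so it checks a single price per item. It does not filter several columns of one dataframe as the question asks.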