Query/Filter a pandas df using a dict of lists

Question

I have a dict d of varying length in the following format:

    d = {
        "foo": [50, 100],
        "bar": [5, 10],
    }

Each key is a column name, and each value is a two-element list giving the min and max value of that column to filter a dataframe df on. Thus, given the input above, I'd like to filter df.foo between 50 and 100 and df.bar between 5 and 10.

Of course, I could just hard code it like so:

    df.loc[(df['foo'] > 50) & (df['foo'] < 100) & (df['bar'] > 5) & (df['bar'] < 10) ...]

etc., but the number of keys (columns to filter on) may vary, and this is just incredibly ugly code. Is there a cleaner/vectorized way to do this?

I am building a streamlit app where a user can create n min-max filters on a dataframe, and the format listed above is the format streamlit's slider returns.


Answer 1

Score: 1

IIUC, one way using pandas.Series.between:

    # sample
    import numpy as np
    import pandas as pd

    np.random.seed(1234)
    df = pd.DataFrame({"foo": np.random.random(10) * 100,
                       "bar": np.random.random(10) * 10})

             foo       bar
    0  19.151945  3.578173
    1  62.210877  5.009951
    2  43.772774  6.834629
    3  78.535858  7.127020
    4  77.997581  3.702508
    5  27.259261  5.611962
    6  27.646426  5.030832
    7  80.187218  0.137684
    8  95.813935  7.728266
    9  87.593263  8.826412

Code:

    new_df = df[np.logical_and.reduce([df[k].between(*v) for k, v in d.items()])]
    print(new_df)

Output:

             foo       bar
    1  62.210877  5.009951
    3  78.535858  7.127020
    8  95.813935  7.728266
    9  87.593263  8.826412
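One thing worth noting: Series.between includes both endpoints by default, while the hard-coded version in the question uses strict > and <. The sketch below wraps the one-liner in a helper that exposes that choice. The name filter_by_ranges and the toy data are illustrative assumptions, not part of the answer, and the inclusive="neither" string form assumes pandas 1.3 or later.

```python
import numpy as np
import pandas as pd


def filter_by_ranges(df, ranges, inclusive="both"):
    """Keep rows where every column named in `ranges` lies within its (min, max) pair.

    `filter_by_ranges` is a hypothetical helper name used only for this sketch.
    """
    if not ranges:
        # With no filters there is nothing to reduce, so return the frame unchanged.
        return df
    masks = [
        df[col].between(lo, hi, inclusive=inclusive)
        for col, (lo, hi) in ranges.items()
    ]
    # AND all per-column boolean masks together into one row mask.
    return df[np.logical_and.reduce(masks)]


# Toy data chosen so that one row sits exactly on a boundary (foo == 50).
df = pd.DataFrame({"foo": [10, 50, 75, 100], "bar": [5, 6, 7, 20]})
d = {"foo": [50, 100], "bar": [5, 10]}

inclusive_rows = filter_by_ranges(df, d)                    # min <= x <= max
strict_rows = filter_by_ranges(df, d, inclusive="neither")  # min < x < max
print(inclusive_rows)
print(strict_rows)
```

The guard for an empty dict is there because reducing an empty list of masks yields a single scalar rather than a per-row mask, which is not usable for indexing the frame.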

Validation: this works for any number of filters:

    df = pd.DataFrame(np.random.random((10, 10)), columns=list("abcdefghij"))
    d = {c: [0.1, 0.9] for c in df}
    new_df = df[np.logical_and.reduce([df[k].between(*v) for k, v in d.items()])]
    print(new_df)
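As a possible alternative to building one mask per column, the bounds can be placed in a small DataFrame and compared against all filtered columns at once. This is only a sketch of a different technique, not the answer's method; the names bounds, cols and mask, the "lo"/"hi" labels, and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with an extra column ("baz") that is not filtered on.
df = pd.DataFrame({
    "foo": [10, 60, 75, 90],
    "bar": [5, 6, 12, 7],
    "baz": [1, 2, 3, 4],
})
d = {"foo": [50, 100], "bar": [5, 10]}

# One column per filtered column, one row each for the lower and upper bound.
bounds = pd.DataFrame(d, index=["lo", "hi"])
cols = list(d)

# Compare every filtered column against its own bounds (aligned on column names),
# then require that all comparisons hold for a row to be kept.
mask = (
    df[cols].ge(bounds.loc["lo"], axis=1)
    & df[cols].le(bounds.loc["hi"], axis=1)
).all(axis=1)

new_df = df[mask]
print(new_df)
```

Here ge and le give inclusive bounds, matching the default behaviour of Series.between; gt and lt would give the strict comparison used in the question.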

Answer 2

Score: 0

I hope this works for you. I create a DataFrame from d and then join it with another DataFrame to match these values:

    import pandas as pd

    d = {
        "foo": [50, 100],
        "bar": [5, 10],
        "noto": [11, 30],
    }

    # One row per key, with the [min, max] list stored in "vals"
    df_1 = pd.DataFrame(
        {
            "keys": list(d.keys()),
            "vals": list(d.values()),
        }
    )

    df_2 = pd.DataFrame(
        {
            "item": ["foo", "bar", "noto"],
            "price": [65, 7, 33],
        }
    )

    # Attach each item's bounds to its price
    main_df = df_1.merge(df_2, left_on="keys", right_on="item")

    def check_price(x):
        # True when the price lies within [min, max], endpoints included
        return x["price"] >= x["vals"][0] and x["price"] <= x["vals"][1]

    main_df[main_df.apply(check_price, axis=1)]
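The row-wise apply above could probably be replaced by a vectorized comparison if the bounds are split into separate columns first. The following is a sketch under that assumption, reusing the same d and df_2; the names bounds, merged and in_range and the "lo"/"hi" column labels are hypothetical.

```python
import pandas as pd

d = {"foo": [50, 100], "bar": [5, 10], "noto": [11, 30]}
df_2 = pd.DataFrame({"item": ["foo", "bar", "noto"], "price": [65, 7, 33]})

# One row per key, with the lower and upper bound in their own columns.
bounds = (
    pd.DataFrame.from_dict(d, orient="index", columns=["lo", "hi"])
    .rename_axis("item")
    .reset_index()
)

# Attach the bounds to each item, then compare all prices in one vectorized call.
merged = df_2.merge(bounds, on="item")
in_range = merged[merged["price"].between(merged["lo"], merged["hi"])]
print(in_range)
```

With this data, "noto" is dropped because its price of 33 lies above its upper bound of 30.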

huangapple
  • Posted on 2023-02-06 13:31:55
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75357643.html