Using dictionaries to define condition for counting occurrences in pandas dataframe

Question

I'm trying to count occurrences in a DataFrame based on conditions defined in dictionaries.

List of dictionaries:

    [{'value_0': 13, 'column_0': 'region'},
     {'value_0': 7,
      'value_1': 2,
      'column_0': 'age1',
      'column_1': 'sex'},
     {'value_0': 7,
      'value_1': 1,
      'column_0': 'age1',
      'column_1': 'sex'}]

Each dictionary should produce a count from a DataFrame that looks like this:

    +----+--------+------+-----+
    |    | region | age1 | sex |
    +----+--------+------+-----+
    |  0 | NaN    | NaN  | NaN |
    |  1 | NaN    | NaN  | NaN |
    |  2 | NaN    | NaN  | NaN |
    |  3 | 18     | NaN  | NaN |
    |  4 | 24     | 7    | NaN |
    |  5 | NaN    | NaN  | NaN |
    |  6 | 12     | 2    | 2   |
    |  7 | 9      | 3    | 1   |
    |  8 | 21     | 1    | 2   |
    |  9 | NaN    | NaN  | NaN |
    | 10 | 12     | 6    | 2   |
    | 11 | 21     | 6    | 1   |
    | 12 | 4      | 6    | 2   |
    | 13 | 9      | 6    | 2   |
    | 14 | 21     | 6    | 2   |
    | 15 | 12     | 6    | 1   |
    | 16 | NaN    | NaN  | NaN |
    | 17 | 20     | 4    | 2   |
    | 18 | 13     | NaN  | NaN |
    | 19 | 1      | 4    | 2   |
    +----+--------+------+-----+

In reality this DataFrame has many other columns, but I need to count based only on the column/value combinations defined in the dictionaries.

For example, the 3 dictionaries above should generate the following counts from the pandas DataFrame (df):

    count1 = df[df.region == 13].count()
    count2 = df[(df.age1 == 7) & (df.sex == 2)].count()
    count3 = df[(df.age1 == 7) & (df.sex == 1)].count()

I need an iterator that produces a single count value for each dictionary and puts it into a list or a pandas Series.

Answer 1

Score: 3

If I understand you correctly, you are looking for `.query()`:

    import numpy as np
    import pandas as pd

    queries = [{'value_0': 13, 'column_0': 'region'},
               {'value_0': 7,
                'value_1': 2,
                'column_0': 'age1',
                'column_1': 'sex'},
               {'value_0': 7,
                'value_1': 1,
                'column_0': 'age1',
                'column_1': 'sex'}]

    df = pd.DataFrame({
        'age1': [np.nan,np.nan,np.nan,np.nan,7,np.nan,2,3,1,np.nan,6,6,6,6,6,6,np.nan,4,np.nan,4],
        'sex': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2,1,2,np.nan,2,1,2,2,2,1,np.nan,2,np.nan,2],
        'region': [np.nan, np.nan, np.nan, 18, 24, np.nan, 12, 9, 21, np.nan, 12, 21, 4, 9, 21, 12, np.nan, 20, 13, 1],
    })

    for query in queries:
        # Each dictionary holds column_i/value_i pairs, i.e. len(query) // 2 conditions.
        q = []
        for i in range(len(query) // 2):
            q.append('{} == {}'.format(query['column_' + str(i)], query['value_' + str(i)]))
        # Combine the individual conditions into one query string.
        q = ' & '.join(q)
        print('Running query {}'.format(q))
        print(df.query(q).count())

Prints:

    Running query region == 13
    age1      0
    sex       0
    region    1
    dtype: int64
    Running query age1 == 7 & sex == 2
    age1      0
    sex       0
    region    0
    dtype: int64
    Running query age1 == 7 & sex == 1
    age1      0
    sex       0
    region    0
    dtype: int64
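To get one number per dictionary rather than a per-column Series, the same query strings can be evaluated and the matching rows counted with `len`. The following is a minimal sketch, not part of the original answer: the helper name `count_matches` and the small sample frame are assumptions made for illustration, and it assumes all values are numeric.

```python
import numpy as np
import pandas as pd


def count_matches(df, conditions):
    """Return a Series with the number of matching rows per condition dict.

    Hypothetical helper: assumes keys come in 'column_i' / 'value_i' pairs
    and that every value is numeric (strings would need quoting).
    """
    counts = []
    for cond in conditions:
        parts = [
            "{} == {}".format(cond["column_{}".format(i)], cond["value_{}".format(i)])
            for i in range(len(cond) // 2)
        ]
        # len() of the filtered frame is the number of rows that matched.
        counts.append(len(df.query(" & ".join(parts))))
    return pd.Series(counts)


# Small illustrative frame (not the data from the question).
df = pd.DataFrame({
    "region": [13, 13, 12, 13, np.nan],
    "age1": [np.nan, 7, 7, 7, 7],
    "sex": [np.nan, np.nan, 2, 2, 1],
})
queries = [
    {"value_0": 13, "column_0": "region"},
    {"value_0": 7, "value_1": 2, "column_0": "age1", "column_1": "sex"},
    {"value_0": 7, "value_1": 1, "column_0": "age1", "column_1": "sex"},
]
result = count_matches(df, queries)
print(result.tolist())  # [3, 2, 1]
```

Counting rows with `len` sidesteps the question of which column's non-null count to pick from `.count()`.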

EDIT:

If you want to support strings in your queries, you have to quote the values. For example:

    for i in range(len(query) // 2):
        col = query['column_' + str(i)]
        val = query['value_' + str(i)]
        if isinstance(val, str):
            val = '"{}"'.format(val)
        q.append('{} == {}'.format(col, val))
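The quoting step can also be folded into a small function. This is a sketch under assumptions, not the answer's own code: the name `build_query` is hypothetical, and it uses `repr()` to quote strings, which assumes the values are plain Python `int`, `float`, or `str` objects (NumPy scalars may have a different `repr`).

```python
import pandas as pd


def build_query(cond):
    """Build a DataFrame.query() string from one condition dictionary.

    Hypothetical helper: {!r} applies repr(), which quotes string values
    and leaves plain Python numbers unchanged.
    """
    parts = []
    for i in range(len(cond) // 2):
        col = cond["column_{}".format(i)]
        val = cond["value_{}".format(i)]
        parts.append("{} == {!r}".format(col, val))
    return " & ".join(parts)


# Illustrative frame with one string column and one numeric column.
df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", "Oslo"],
    "sex": [1, 2, 2, 2],
})
cond = {"value_0": "Oslo", "value_1": 2, "column_0": "city", "column_1": "sex"}
q = build_query(cond)
print(q)                 # city == 'Oslo' & sex == 2
print(len(df.query(q)))  # 2
```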

EDIT: I tried it with `pd.Int64Dtype()` and it seems to work:

    df['age1'] = df['age1'].astype(pd.Int64Dtype())
    df['sex'] = df['sex'].astype(pd.Int64Dtype())
    df['region'] = df['region'].astype(pd.Int64Dtype())

    print(df.info())

    for query in queries:
        q = []
        for i in range(len(query) // 2):
            q.append('{} == {}'.format(query['column_' + str(i)], query['value_' + str(i)]))
        q = ' & '.join(q)
        print('Running query {}'.format(q))
        print(df.query(q).count())

Prints:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 20 entries, 0 to 19
    Data columns (total 3 columns):
    age1      12 non-null Int64
    sex       11 non-null Int64
    region    14 non-null Int64
    dtypes: Int64(3)
    memory usage: 668.0 bytes
    None
    Running query region == 13
    age1      0
    sex       0
    region    1
    dtype: int64
    Running query age1 == 7 & sex == 2
    age1      0
    sex       0
    region    0
    dtype: int64
    Running query age1 == 7 & sex == 1
    age1      0
    sex       0
    region    0
    dtype: int64

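And if you don't want a Series as the result of each query but a single count value for the given dictionary's conditions, then change:

    df.query(q).count()

to

    df.query(q).count()[col]

in the last line. Here `col` is the column name from the inner loop, as in the string-quoting variant above; since every matched row has a non-null value in that column, its count equals the number of matching rows.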
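A different route to a single count, which avoids building query strings at all, is to combine boolean masks and sum the result. This is a swapped-in technique rather than the answer's method, shown as a minimal sketch: the helper name `count_rows` and the sample data are assumptions for illustration.

```python
from functools import reduce
import operator

import numpy as np
import pandas as pd


def count_rows(df, cond):
    """Count rows where every column_i equals value_i (hypothetical helper)."""
    masks = [
        df[cond["column_{}".format(i)]] == cond["value_{}".format(i)]
        for i in range(len(cond) // 2)
    ]
    # AND all masks together; NaN never compares equal, so rows with
    # missing values in a tested column are excluded.
    return int(reduce(operator.and_, masks).sum())


# Small illustrative frame (not the data from the question).
df = pd.DataFrame({
    "region": [13, 13, 12, 13, np.nan],
    "age1": [np.nan, 7, 7, 7, 7],
    "sex": [np.nan, np.nan, 2, 2, 1],
})
queries = [
    {"value_0": 13, "column_0": "region"},
    {"value_0": 7, "value_1": 2, "column_0": "age1", "column_1": "sex"},
    {"value_0": 7, "value_1": 1, "column_0": "age1", "column_1": "sex"},
]
counts = pd.Series([count_rows(df, cond) for cond in queries])
print(counts.tolist())  # [3, 2, 1]
```

Because values are compared directly, string values need no quoting with this approach.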

huangapple
  • Published on January 3, 2020, 22:38:13
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59580445.html