Using dictionaries to define conditions for counting occurrences in a pandas DataFrame

Question

I'm trying to count occurrences in a DataFrame based on conditions defined in dictionaries.

List of dictionaries:

[{'value_0': 13, 'column_0': 'region'},
 {'value_0': 7,
  'value_1': 2,
  'column_0': 'age1',
  'column_1': 'sex'},
 {'value_0': 7,
  'value_1': 1,
  'column_0': 'age1',
  'column_1': 'sex'}]

Each dictionary should produce a count from a DataFrame that looks like this:

+----+-------+-------+------+
|    | region| age1  | sex  |
+----+-------+-------+------+
|  0 | NaN   | NaN   | NaN  |
|  1 | NaN   | NaN   | NaN  |
|  2 | NaN   | NaN   | NaN  |
|  3 | 18    | NaN   | NaN  |
|  4 | 24    | 7     | NaN  |
|  5 | NaN   | NaN   | NaN  |
|  6 | 12    | 2     | 2    |
|  7 | 9     | 3     | 1    |
|  8 | 21    | 1     | 2    |
|  9 | NaN   | NaN   | NaN  |
| 10 | 12    | 6     | 2    |
| 11 | 21    | 6     | 1    |
| 12 | 4     | 6     | 2    |
| 13 | 9     | 6     | 2    |
| 14 | 21    | 6     | 2    |
| 15 | 12    | 6     | 1    |
| 16 | NaN   | NaN   | NaN  |
| 17 | 20    | 4     | 2    |
| 18 | 13    | NaN   | NaN  |
| 19 | 1     | 4     | 2    |
+----+-------+-------+------+

In reality this DataFrame has many other columns, but I need to count based only on the column/value combinations defined in the dictionaries.

For example, the 3 dictionaries above should generate the following counts from the pandas DataFrame (df):

count1 = df[df.region == 13].count()
count2 = df[(df.age1 == 7) & (df.sex == 2)].count()
count3 = df[(df.age1 == 7) & (df.sex == 1)].count()

I need an iterator that produces a single count for each dictionary and puts it into a list or a pandas Series.

Answer 1

Score: 3

If I understand you correctly, you are looking for .query():

import numpy as np
import pandas as pd

queries = [{'value_0': 13, 'column_0': 'region'},
 {'value_0': 7,
  'value_1': 2,
  'column_0': 'age1',
  'column_1': 'sex'},
 {'value_0': 7,
  'value_1': 1,
  'column_0': 'age1',
  'column_1': 'sex'}]

df = pd.DataFrame({
    'age1': [np.nan,np.nan,np.nan,np.nan,7,np.nan,2,3,1,np.nan,6,6,6,6,6,6,np.nan,4,np.nan,4],
    'sex': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2,1,2,np.nan,2,1,2,2,2,1,np.nan,2,np.nan,2],
    'region': [np.nan, np.nan, np.nan, 18, 24, np.nan, 12, 9, 21, np.nan, 12, 21, 4, 9, 21, 12, np.nan, 20, 13, 1],
})

for query in queries:
    # Each dictionary holds column_i/value_i pairs, so half its length is the number of conditions.
    q = []
    for i in range(len(query) // 2):
        q.append('{} == {}'.format(query['column_' + str(i)], query['value_' + str(i)]))
    # Combine the conditions into a single query string.
    q = ' & '.join(q)
    print('Running query {}'.format(q))
    print(df.query(q).count())

Prints:

Running query region == 13
age1      0
sex       0
region    1
dtype: int64
Running query age1 == 7 & sex == 2
age1      0
sex       0
region    0
dtype: int64
Running query age1 == 7 & sex == 1
age1      0
sex       0
region    0
dtype: int64

EDIT:

If you want to support strings in your queries, you have to quote the values. For example:

for i in range( len(query) // 2 ):
    col = query['column_' + str(i)]
    val = query['value_' + str(i)]
    if isinstance(val, str):
        val = '"{}"'.format(val)
    q.append('{} == {}'.format(col, val))
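The quoting step above could be folded into a small helper. The following is only a sketch: `build_query` is a hypothetical name, it uses `repr()` (via `{!r}`) to quote strings instead of adding double quotes by hand, and it assumes the column names are plain identifiers that `.query()` can parse.

```python
def build_query(cond):
    """Turn {'column_0': ..., 'value_0': ..., ...} into a query string."""
    parts = []
    # Each condition contributes one column_i / value_i pair.
    for i in range(len(cond) // 2):
        col = cond['column_' + str(i)]
        val = cond['value_' + str(i)]
        # repr() adds quotes around strings and leaves numbers as they are.
        parts.append('{} == {!r}'.format(col, val))
    return ' & '.join(parts)


print(build_query({'value_0': 13, 'column_0': 'region'}))
print(build_query({'value_0': 7, 'value_1': 'F',
                   'column_0': 'age1', 'column_1': 'sex'}))
```

The resulting string can then be passed to df.query() in the same way as q in the loop.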

EDIT: I tried it with pd.Int64Dtype() and it seems to work:

df['age1'] = df['age1'].astype(pd.Int64Dtype())
df['sex'] = df['sex'].astype(pd.Int64Dtype())
df['region'] = df['region'].astype(pd.Int64Dtype())

print(df.info())

for query in queries:
    q = []
    for i in range( len(query) // 2 ):
        q.append('{} == {}'.format( query['column_' + str(i)], query['value_' + str(i)]))
    q = ' & '.join(q)
    print('Running query {}'.format(q))
    print( df.query(q).count() )

Prints:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
age1      12 non-null Int64
sex       11 non-null Int64
region    14 non-null Int64
dtypes: Int64(3)
memory usage: 668.0 bytes
None
Running query region == 13
age1      0
sex       0
region    1
dtype: int64
Running query age1 == 7 & sex == 2
age1      0
sex       0
region    0
dtype: int64
Running query age1 == 7 & sex == 1
age1      0
sex       0
region    0
dtype: int64

If you want a single count for each dictionary instead of a Series, then change:

df.query(q).count()

to

df.query(q).count()[col]

in the last line, where col is the name of one of the columns used in the query (as in the first EDIT above).
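To get the counts as a list or a pandas Series, as the question asks, one option is to skip the query strings and combine boolean masks directly. This is a sketch of a different technique from the .query() loop above, not the answer's own method: `count_matches` is a hypothetical name, and the small DataFrame is made-up sample data rather than the data from the question.

```python
import numpy as np
import pandas as pd


def count_matches(df, conditions):
    """Return one row count per condition dictionary as a pandas Series."""
    counts = []
    for cond in conditions:
        # Start with every row selected, then narrow down per column/value pair.
        mask = pd.Series(True, index=df.index)
        for i in range(len(cond) // 2):
            mask &= df[cond['column_' + str(i)]] == cond['value_' + str(i)]
        # Summing a boolean mask counts the rows where it is True.
        counts.append(int(mask.sum()))
    return pd.Series(counts)


# Made-up sample data for illustration only.
df = pd.DataFrame({
    'age1': [np.nan, 7, 7, 6],
    'sex': [np.nan, 2, 1, 2],
    'region': [13, 12, 13, 9],
})
conditions = [
    {'value_0': 13, 'column_0': 'region'},
    {'value_0': 7, 'value_1': 2, 'column_0': 'age1', 'column_1': 'sex'},
    {'value_0': 7, 'value_1': 1, 'column_0': 'age1', 'column_1': 'sex'},
]
print(count_matches(df, conditions).tolist())  # [2, 1, 1]
```

Because the values are compared directly instead of being formatted into a string, string values need no quoting here. Rows with NaN in a compared column never match.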
