Using dictionaries to define conditions for counting occurrences in a pandas DataFrame
Question
I'm trying to count occurrences in a DataFrame based on conditions defined in dictionaries.
List of dictionaries:
[{'value_0': 13, 'column_0': 'region'},
{'value_0': 7,
'value_1': 2,
'column_0': 'age1',
'column_1': 'sex'},
{'value_0': 7,
'value_1': 1,
'column_0': 'age1',
'column_1': 'sex'}]
Each dictionary should produce a count from a DataFrame that looks like this:
+----+-------+-------+------+
| | region| age1 | sex |
+----+-------+-------+------+
| 0 | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | 18 | NaN | NaN |
| 4 | 24 | 7 | NaN |
| 5 | NaN | NaN | NaN |
| 6 | 12 | 2 | 2 |
| 7 | 9 | 3 | 1 |
| 8 | 21 | 1 | 2 |
| 9 | NaN | NaN | NaN |
| 10 | 12 | 6 | 2 |
| 11 | 21 | 6 | 1 |
| 12 | 4 | 6 | 2 |
| 13 | 9 | 6 | 2 |
| 14 | 21 | 6 | 2 |
| 15 | 12 | 6 | 1 |
| 16 | NaN | NaN | NaN |
| 17 | 20 | 4 | 2 |
| 18 | 13 | NaN | NaN |
| 19 | 1 | 4 | 2 |
+----+-------+-------+------+
In reality this DataFrame has many other columns, but I need to count only on the column/value combinations defined in the dictionaries.
For example, the 3 dicts above should generate the following count values from the pandas DataFrame (df):
count1 = df[df.region == 13].count()
count2 = df[(df.age1 == 7) & (df.sex == 2)].count()
count3 = df[(df.age1 == 7) & (df.sex == 1)].count()
I need an iterator that produces a single count value for each dictionary and puts it into a list or a pandas Series.
Answer 1
Score: 3
If I understand you correctly, you are looking for .query():
import numpy as np
import pandas as pd

# Each dictionary pairs column_i with value_i; all pairs must match.
queries = [{'value_0': 13, 'column_0': 'region'},
           {'value_0': 7,
            'value_1': 2,
            'column_0': 'age1',
            'column_1': 'sex'},
           {'value_0': 7,
            'value_1': 1,
            'column_0': 'age1',
            'column_1': 'sex'}]
df = pd.DataFrame({
    'age1': [np.nan,np.nan,np.nan,np.nan,7,np.nan,2,3,1,np.nan,6,6,6,6,6,6,np.nan,4,np.nan,4],
    'sex': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,2,1,2,np.nan,2,1,2,2,2,1,np.nan,2,np.nan,2],
    'region': [np.nan, np.nan, np.nan, 18, 24, np.nan, 12, 9, 21, np.nan, 12, 21, 4, 9, 21, 12, np.nan, 20, 13, 1],
})
for query in queries:
    # Build one "column == value" clause per column_i/value_i pair.
    q = []
    for i in range(len(query) // 2):
        q.append('{} == {}'.format(query['column_' + str(i)], query['value_' + str(i)]))
    # Join the clauses so that every condition has to hold.
    q = ' & '.join(q)
    print('Running query {}'.format(q))
    print(df.query(q).count())
Prints:
Running query region == 13
age1 0
sex 0
region 1
dtype: int64
Running query age1 == 7 & sex == 2
age1 0
sex 0
region 0
dtype: int64
Running query age1 == 7 & sex == 1
age1 0
sex 0
region 0
dtype: int64
EDIT:
If you want to support strings in your queries, you have to quote the values. For example:
for i in range(len(query) // 2):
    col = query['column_' + str(i)]
    val = query['value_' + str(i)]
    if isinstance(val, str):
        # Wrap string values in double quotes so .query() reads them as literals.
        val = '"{}"'.format(val)
    q.append('{} == {}'.format(col, val))
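The quoting step above can be tried end to end with a small sketch. The helper name build_query, the people DataFrame, and its city column are made up for illustration and are not part of the original data. The sketch uses repr() rather than manual double quotes so that quote characters inside a value are escaped, and it assumes the values are plain Python ints, floats, or strings.

```python
import pandas as pd


def build_query(condition):
    """Turn column_i/value_i pairs into a .query() string (hypothetical helper)."""
    parts = []
    for i in range(len(condition) // 2):
        col = condition['column_' + str(i)]
        val = condition['value_' + str(i)]
        if isinstance(val, str):
            # repr() adds the quotes and escapes any quote inside the value.
            val = repr(val)
        parts.append('{} == {}'.format(col, val))
    return ' & '.join(parts)


# Illustrative data only: 'city' is an assumed string column.
people = pd.DataFrame({
    'city': ['Oslo', 'Rome', 'Oslo', 'Rome'],
    'sex': [1, 2, 2, 2],
})
condition = {'value_0': 'Oslo', 'value_1': 2,
             'column_0': 'city', 'column_1': 'sex'}

q = build_query(condition)
matches = people.query(q)
print(q)
print(len(matches))
```

Only the row where both conditions hold is kept, so len(matches) gives the count for that dictionary.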
EDIT: I tried it with pd.Int64Dtype() and it seems to work:
df['age1'] = df['age1'].astype(pd.Int64Dtype())
df['sex'] = df['sex'].astype(pd.Int64Dtype())
df['region'] = df['region'].astype(pd.Int64Dtype())
print(df.info())
for query in queries:
    q = []
    for i in range(len(query) // 2):
        q.append('{} == {}'.format(query['column_' + str(i)], query['value_' + str(i)]))
    q = ' & '.join(q)
    print('Running query {}'.format(q))
    print(df.query(q).count())
Prints:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
age1 12 non-null Int64
sex 11 non-null Int64
region 14 non-null Int64
dtypes: Int64(3)
memory usage: 668.0 bytes
None
Running query region == 13
age1 0
sex 0
region 1
dtype: int64
Running query age1 == 7 & sex == 2
age1 0
sex 0
region 0
dtype: int64
Running query age1 == 7 & sex == 1
age1 0
sex 0
region 0
dtype: int64
And if you don't want a Series as the result of each query but a single count value for the given dictionary's conditions, then change:
df.query(q).count()
to
df.query(q).count()[col]
in the last line (col is the column-name variable from the string-quoting version of the loop).
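To get one number per dictionary collected into a pandas Series, as the question asks, one option is to skip query strings and combine boolean masks directly. This is a minimal sketch rather than the answer's own method: the function name count_matches is hypothetical, and it assumes the float/NaN columns from the example data, where a NaN never compares equal to a value.

```python
import numpy as np
import pandas as pd


def count_matches(df, condition):
    """Count rows where every column_i equals value_i (hypothetical helper)."""
    mask = pd.Series(True, index=df.index)
    for i in range(len(condition) // 2):
        col = condition['column_' + str(i)]
        val = condition['value_' + str(i)]
        # NaN == val evaluates to False, so missing rows drop out of the mask.
        mask &= (df[col] == val)
    return int(mask.sum())


queries = [{'value_0': 13, 'column_0': 'region'},
           {'value_0': 7, 'value_1': 2, 'column_0': 'age1', 'column_1': 'sex'},
           {'value_0': 7, 'value_1': 1, 'column_0': 'age1', 'column_1': 'sex'}]

df = pd.DataFrame({
    'age1': [np.nan, np.nan, np.nan, np.nan, 7, np.nan, 2, 3, 1, np.nan,
             6, 6, 6, 6, 6, 6, np.nan, 4, np.nan, 4],
    'sex': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 2, 1, 2, np.nan,
            2, 1, 2, 2, 2, 1, np.nan, 2, np.nan, 2],
    'region': [np.nan, np.nan, np.nan, 18, 24, np.nan, 12, 9, 21, np.nan,
               12, 21, 4, 9, 21, 12, np.nan, 20, 13, 1],
})

# One count per dictionary, in the same order as the list of dictionaries.
counts = pd.Series([count_matches(df, query) for query in queries])
print(counts)
```

Because the values are compared as Python objects instead of being pasted into a query string, string values need no quoting with this approach.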