pandas frequency table with missing values
Question
1: https://i.stack.imgur.com/Asurs.png
2: https://i.stack.imgur.com/uYdIP.png
I am trying to count the number of identical rows in a pandas DataFrame (i.e. a frequency table), which I then use to calculate the k-anonymity of a dataset.
I have a special requirement for how missing values are counted: a missing value should count towards all other classes, since the missing value could be any value. In addition, the count for a record that has missing values is the number of possible combinations given its missing values. Values should be treated as categorical.
Given such a DataFrame, the count (denoted f_k below) should look like
With pandas value_counts, I get
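import numpy as np
import pandas as pd

d = {
    'key1': [1, 1, 2, np.nan],
    'key2': [1, 1, 1, 1],
    'key3': [3, np.nan, 3, np.nan]
}
df = pd.DataFrame(data=d)
# Nullable Int64 keeps the missing values; every column is then made categorical.
df["key1"] = df["key1"].astype("Int64").astype('category')
df["key2"] = df["key2"].astype('Int64').astype('category')
df["key3"] = df["key3"].astype('Int64').astype('category')
# Parentheses allow the method chain to span several lines.
(
    df
    .value_counts(dropna=False)
    .reset_index()
)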
Any idea how to achieve this in pandas?
Answer 1
Score: 1
This works, but it is slow because it loops over every row:
import pandas as pd
import numpy as np

data = {
    'key1': [1, 1, 2, np.nan],
    'key2': [1, 1, 1, 1],
    'key3': [3, np.nan, 3, np.nan]
}
df = pd.DataFrame(data)

fk_lst = []
for index, row in df.iterrows():
    # Compare only on the columns where the current row has a value.
    non_nan_columns = row[row.notna()].index.tolist()
    sub = df[non_nan_columns].copy()
    # A missing value in another row could be anything, so let it match this row.
    for col in sub.columns:
        sub[col] = sub[col].fillna(row[col])
    # Count the rows that now equal the current row on the compared columns.
    count = int((sub == row[non_nan_columns]).all(axis=1).sum())
    fk_lst.append(count)

df['f_k'] = fk_lst
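If the row loop is too slow, one possible vectorized variant is sketched below. It is not part of the original answer: the function name `compatible_counts` is made up for illustration, and it assumes the counting rule described in the question, namely that two rows match when they agree in every column where both have a value. It builds an n × n × columns boolean array, so memory grows quadratically with the number of rows and it is only suitable for moderately sized data.

```python
import numpy as np
import pandas as pd


def compatible_counts(df: pd.DataFrame) -> pd.Series:
    """Count, for each row, the rows that agree with it wherever both have a value."""
    # Integer-encode every column; pd.factorize marks missing values with -1.
    codes = np.column_stack([pd.factorize(df[col])[0] for col in df.columns])
    missing = codes == -1
    # equal[i, j, c]: rows i and j hold the same value in column c.
    equal = codes[:, None, :] == codes[None, :, :]
    # A missing value on either side is treated as a match.
    either_missing = missing[:, None, :] | missing[None, :, :]
    compatible = (equal | either_missing).all(axis=2)
    return pd.Series(compatible.sum(axis=1), index=df.index, name="f_k")


df = pd.DataFrame({
    "key1": [1, 1, 2, np.nan],
    "key2": [1, 1, 1, 1],
    "key3": [3, np.nan, 3, np.nan],
})
f_k = compatible_counts(df)
print(f_k.tolist())    # [3, 3, 2, 4]
print(int(f_k.min()))  # 2
```

Taking the minimum of f_k as k follows the usual reading of k-anonymity as the size of the smallest group of indistinguishable records, here under the missing-value rule above.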