pandas frequency table with missing values

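Question

I am trying to count identical rows in a pandas DataFrame (i.e. build a frequency table) in order to calculate the k-anonymity of a dataset.

I have a special requirement for missing values: a missing value should count towards all other classes, since it "could" be any value. In addition, the count for a record that itself has missing values is the number of possible combinations with respect to those missing values. All values should be treated as categorical.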
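One way to read this rule, sketched in plain Python: two records are compatible if, in every column, the values are equal or at least one of them is missing, and f_k for a record is the number of records (itself included) it is compatible with. The helper name `compatible` is made up for illustration, and this is only one interpretation of the requirement; it has not been checked against the expected table in the image.

```python
import math


def compatible(row_a, row_b):
    """Return True if two records could describe the same combination.

    A missing value (None or NaN) acts as a wildcard that matches anything.
    """
    def missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return all(missing(a) or missing(b) or a == b for a, b in zip(row_a, row_b))


nan = float("nan")
# Same example data as in the question, one tuple per row (key1, key2, key3)
records = [(1, 1, 3), (1, 1, nan), (2, 1, 3), (nan, 1, nan)]

# f_k of a record = how many records (including itself) are compatible with it
f_k = [sum(compatible(rec, other) for other in records) for rec in records]
print(f_k)  # [3, 3, 2, 4]
```

Under this reading, the last record (missing key1 and key3) matches every row, while the third record only matches itself and the last one.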

Given such a DataFrame, the count (denoted f_k below) should look like this:

[image 1]

With pandas value_counts, I get:

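import numpy as np
import pandas as pd

d = {
    'key1': [1, 1, 2, np.nan],
    'key2': [1, 1, 1, 1],
    'key3': [3, np.nan, 3, np.nan]
}

df = pd.DataFrame(data=d)
# Treat every column as a nullable-integer categorical
df["key1"] = df["key1"].astype('Int64').astype('category')
df["key2"] = df["key2"].astype('Int64').astype('category')
df["key3"] = df["key3"].astype('Int64').astype('category')

# The parentheses let the method chain span several lines
(
    df
    .value_counts(dropna=False)
    .reset_index()
)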

[image 2]

Any idea how to achieve this in pandas?

Image 1: https://i.stack.imgur.com/Asurs.png
Image 2: https://i.stack.imgur.com/uYdIP.png

Answer 1

Score: 1

This works, but it is slow:

import pandas as pd
import numpy as np


data = {
    'key1': [1, 1, 2, np.nan],
    'key2': [1, 1, 1, 1],
    'key3': [3, np.nan, 3, np.nan]
}
df = pd.DataFrame(data)

fk_lst = []
for index, row in df.iterrows():
    # Only compare on the columns where the current row has a value
    non_nan_columns = row[row.notna()].index.tolist()
    sub = df[non_nan_columns].copy()
    # A missing value in another row could be anything,
    # so fill it with the current row's value
    for col in sub.columns:
        sub[col] = sub[col].fillna(row[col])
    # Count the rows that now equal the current row
    # (not simply the most frequent combination)
    count = int(sub.eq(row[non_nan_columns]).all(axis=1).sum())
    fk_lst.append(count)

df['f_k'] = fk_lst
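
If the row-by-row loop is too slow, the same count can probably be computed in one shot with NumPy broadcasting. The sketch below is not from the original answer. It assumes the key columns are numeric with NaN marking missing values (as in the answer's DataFrame, not the categorical version from the question), and the function name `wildcard_counts` is made up for illustration. It builds an n × n × k comparison array, so memory grows quadratically with the number of rows.

```python
import numpy as np
import pandas as pd


def wildcard_counts(frame):
    """For each row, count the rows that match it when NaN matches anything."""
    values = frame.to_numpy(dtype=float)              # shape (n, k)
    missing = np.isnan(values)
    # Compare every row with every other row, column by column: shape (n, n, k)
    equal = values[:, None, :] == values[None, :, :]
    wildcard = missing[:, None, :] | missing[None, :, :]
    # Two rows are compatible if every column is equal or missing on either side
    compatible = (equal | wildcard).all(axis=2)       # shape (n, n)
    return compatible.sum(axis=1)


df = pd.DataFrame({
    'key1': [1, 1, 2, np.nan],
    'key2': [1, 1, 1, 1],
    'key3': [3, np.nan, 3, np.nan],
})
df['f_k'] = wildcard_counts(df[['key1', 'key2', 'key3']])
print(df['f_k'].tolist())  # [3, 3, 2, 4]
```

For large datasets it may be worth deduplicating the rows first and weighting each unique row by its frequency, so the pairwise comparison runs over distinct combinations only.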

huangapple
  • Posted on July 12, 2023 at 22:20:28
  • Link to this post: https://go.coder-hub.com/76671589.html
