July 12, 2023, 22:20:28

pandas frequency table with missing values

Question

I am trying to count the number of equal rows in a pandas DataFrame (that is, build a frequency table), which is then used to calculate the k-anonymity of a dataset.

I have a special requirement for counting missing values: a missing value should count towards all other classes, since it "could" be any value. In addition, the count for a record that contains missing values is the number of possible combinations with respect to those missing values. All values should be treated as categorical.

Given a DataFrame like the one built below, the count (denoted f_k) should look like the table in image 1. With pandas value_counts, I get the result in image 2:

import numpy as np
import pandas as pd

d = {
    'key1': [1, 1, 2, np.nan],
    'key2': [1, 1, 1, 1],
    'key3': [3, np.nan, 3, np.nan],
}
df = pd.DataFrame(data=d)
# Nullable Int64 keeps the integers intact next to missing values;
# every column is then treated as categorical.
df["key1"] = df["key1"].astype("Int64").astype("category")
df["key2"] = df["key2"].astype("Int64").astype("category")
df["key3"] = df["key3"].astype("Int64").astype("category")
# Plain frequency table: missing values form their own class here.
(
    df
    .value_counts(dropna=False)
    .reset_index()
)

Any idea how to achieve this in pandas?

1: https://i.stack.imgur.com/Asurs.png
2: https://i.stack.imgur.com/uYdIP.png
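One way to pin down the counting rule is to write it as a pairwise test between two records. The sketch below is only one reading of the requirement, not something confirmed by the question: two records are considered compatible when every column either agrees or is missing on at least one side, and f_k for a record is the number of records (itself included) it is compatible with. The helper name `compatible` is hypothetical.

```python
import math

def compatible(a, b):
    """Return True if two records could be equal, treating NaN as a wildcard."""
    return all(
        math.isnan(x) or math.isnan(y) or x == y
        for x, y in zip(a, b)
    )

nan = float("nan")
# Same sample data as in the question, one tuple per row.
records = [
    (1, 1, 3),
    (1, 1, nan),
    (2, 1, 3),
    (nan, 1, nan),
]

# f_k for a record = number of records (itself included) it is compatible with.
f_k = [sum(compatible(r, other) for other in records) for r in records]
print(f_k)  # [3, 3, 2, 4]
```

Under this reading, the last record matches every row because both of its key columns are missing, while the record (2, 1, 3) matches only itself and the last record.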

Answer 1

Score: 1

This works, but it is slow because it loops over every row:

import numpy as np
import pandas as pd

data = {
    'key1': [1, 1, 2, np.nan],
    'key2': [1, 1, 1, 1],
    'key3': [3, np.nan, 3, np.nan],
}
df = pd.DataFrame(data)

fk_lst = []
for _, row in df.iterrows():
    # Compare only on the columns where the current row has a value;
    # its own missing values are allowed to match anything.
    known = row.dropna()
    # A missing value in another row "could" be this row's value, so fill it in.
    filled = df[known.index].fillna(known)
    # Count the rows that equal the current row on all of its known columns.
    count = int(filled.eq(known).all(axis=1).sum())
    fk_lst.append(count)

df['f_k'] = fk_lst
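If the row loop is too slow, the same wildcard comparison can be written with NumPy broadcasting. This is a minimal sketch rather than part of the answer above, and it rests on a few assumptions: the function name `wildcard_counts` is hypothetical, all columns are numeric with NaN marking missing values, and the frame is small enough that an n × n × m boolean array fits in memory.

```python
import numpy as np
import pandas as pd

def wildcard_counts(df: pd.DataFrame) -> pd.Series:
    """Count, for each row, the rows it could equal when NaN matches anything."""
    values = df.to_numpy(dtype=float)   # shape (n, m); NaN marks missing
    a = values[:, None, :]              # shape (n, 1, m)
    b = values[None, :, :]              # shape (1, n, m)
    # Cell-wise match: values are equal, or either side is missing.
    match = (a == b) | np.isnan(a) | np.isnan(b)   # shape (n, n, m)
    # Two rows are compatible when every column matches.
    compatible = match.all(axis=2)                 # shape (n, n)
    return pd.Series(compatible.sum(axis=1), index=df.index, name="f_k")

df = pd.DataFrame({
    "key1": [1, 1, 2, np.nan],
    "key2": [1, 1, 1, 1],
    "key3": [3, np.nan, 3, np.nan],
})
f_k = wildcard_counts(df)
# k-anonymity of the dataset is the smallest class size.
k = int(f_k.min())
print(f_k.tolist(), k)  # [3, 3, 2, 4] 2
```

For non-numeric categories, one option is to map each column to integer codes first and convert the missing-value marker to NaN before calling the function.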
