Question

I have a huge sparse dataset in a DataFrame and have been using df.to_sparse, but it will be deprecated soon, so I want to switch to pd.Series(pd.SparseArray()). I'm not sure how to do that for an entire DataFrame.

My final DataFrame has 100K rows and 49K columns, so I need an automated way.

Answer 1

Score: 1

You could try something like this:

dtype = {key: pd.SparseDtype(df.dtypes[key].type, fill_value=df[key].value_counts().idxmax()) for key in df.dtypes.keys()}

df = df.astype(dtype)

Then check the density with df.sparse.density.

This converts each column to a sparse dtype, using the column's most frequent value as the fill value. Note the use of idxmax() rather than argmax(): in current pandas, argmax() returns the integer position of the largest count (always 0 for the sorted output of value_counts()), not the value itself.

(Not sure if it's the best approach, though.)
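To make the approach above concrete, here is a minimal sketch on toy data. The frame `dense`, its column names, and the helper `to_sparse_frame` are hypothetical and only stand in for the real 100K x 49K DataFrame. The sketch assumes every column has at least one non-null value, since `value_counts()` ignores NaN and `idxmax()` raises on an empty result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the large frame: about 95% zeros.
rng = np.random.default_rng(0)
dense = pd.DataFrame(
    rng.choice([0.0, 1.5], size=(1000, 20), p=[0.95, 0.05]),
    columns=[f"c{i}" for i in range(20)],
)


def to_sparse_frame(df):
    """Convert every column to a sparse dtype.

    Each column keeps its original dtype and uses its most frequent
    value as the fill value (assumes no column is entirely NaN).
    """
    dtypes = {
        col: pd.SparseDtype(df[col].dtype, fill_value=df[col].value_counts().idxmax())
        for col in df.columns
    }
    return df.astype(dtypes)


sparse = to_sparse_frame(dense)

# Fraction of stored (non-fill) values, averaged over columns.
density = sparse.sparse.density

# Compare the memory footprint before and after the conversion.
dense_bytes = dense.memory_usage().sum()
sparse_bytes = sparse.memory_usage().sum()
print(f"density={density:.3f}, dense={dense_bytes} B, sparse={sparse_bytes} B")
```

If the fill value is known in advance (for example 0 for count data), passing it directly instead of computing `value_counts()` per column would likely be faster on a frame with tens of thousands of columns. `sparse.sparse.to_dense()` converts back when a dense frame is needed.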
