June 26, 2023, 15:52:24

heatmap for large dataframe after grouping

Question

I have a huge data frame that looks something like the sample below. I would like to group by the column 'sscinames', because it contains repeated entries such as 'Acidobacteria bacterium', summing the counts for each sample, and then plot a heatmap. The heatmap should show only the top 20 'sscinames' by sample count. Any help would be appreciated.

sscinames	S3_Day90_P3	S3_Day60_P3	S3_Day0_P1	S3_Day60_P1	S3_Day90_P1
    Thermoplasmata archaeon	4	0	41	1	5
    Planctomycetes bacterium	5	3	0	1	40
    Acidobacteria bacterium	6	15	0	8	13
    Trebonia kvetii	0	0	16	1	7
    Nonomuraea sp. RK-328	24	4	4	1	2
    Nitrospirae bacterium	3	1	4	1	2
    Acidobacteria bacterium	11	11	0	9	27

Answer 1

Score: 3

First aggregate with a sum per group, then select the top N groups by row total using DataFrame.reindex with Series.nlargest:

import seaborn as sns

N = 3  # the sample has only a few names; set N = 20 for the full data
df1 = df.groupby('sscinames').sum()
out = df1.reindex(df1.sum(axis=1).nlargest(N).index)
print(out)

                          S3_Day90_P3  S3_Day60_P3  S3_Day0_P1  S3_Day60_P1  \
sscinames                                                                     
Acidobacteria bacterium            17           26           0           17   
Thermoplasmata archaeon             4            0          41            1   
Planctomycetes bacterium            5            3           0            1   

                          S3_Day90_P1  
sscinames                              
Acidobacteria bacterium            40  
Thermoplasmata archaeon             5  
Planctomycetes bacterium           40  

Finally, plot the heatmap with seaborn.heatmap:

sns.heatmap(out, annot=True)
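
For a version that runs on its own, here is a minimal sketch that rebuilds the sample table from the question and wraps the same groupby, nlargest, and reindex steps in a helper. The function name top_n_by_total is hypothetical (it is not part of pandas), and the sketch assumes every column other than 'sscinames' is numeric.

```python
import pandas as pd

# Sample rows copied from the question; replace with your own data.
df = pd.DataFrame(
    [
        ["Thermoplasmata archaeon", 4, 0, 41, 1, 5],
        ["Planctomycetes bacterium", 5, 3, 0, 1, 40],
        ["Acidobacteria bacterium", 6, 15, 0, 8, 13],
        ["Trebonia kvetii", 0, 0, 16, 1, 7],
        ["Nonomuraea sp. RK-328", 24, 4, 4, 1, 2],
        ["Nitrospirae bacterium", 3, 1, 4, 1, 2],
        ["Acidobacteria bacterium", 11, 11, 0, 9, 27],
    ],
    columns=["sscinames", "S3_Day90_P3", "S3_Day60_P3",
             "S3_Day0_P1", "S3_Day60_P1", "S3_Day90_P1"],
)


def top_n_by_total(frame, key, n):
    """Sum rows that share `key`, then keep the n groups with the largest row total.

    Hypothetical helper for illustration; assumes all non-key columns are numeric.
    """
    grouped = frame.groupby(key).sum()
    totals = grouped.sum(axis=1)  # one total per group, across all samples
    return grouped.reindex(totals.nlargest(n).index)


out = top_n_by_total(df, "sscinames", 3)
print(out)
```

The returned frame is indexed by 'sscinames', so it can be passed to sns.heatmap directly and the names become the row labels.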

Answer 2

Score: 1

This requires a few steps, but it is a perfect task for pandas and seaborn. I commented the example below to give you an idea of what is happening there.

import pandas as pd
import seaborn as sns
# This is just to create a dataframe from your table, replace with importing yours
df1 = pd.DataFrame([["Thermoplasmata archaeon",  4,  0, 41, 1,  5],
                    ["Planctomycetes bacterium", 5,  3,  0, 1, 40],
                    ["Acidobacteria bacterium",  6, 15,  0, 8, 13],
                    ["Trebonia kvetii",          0,  0, 16, 1,  7],
                    ["Nonomuraea sp. RK-328",   24,  4,  4, 1,  2],
                    ["Nitrospirae bacterium",    3,  1,  4, 1,  2],
                    ["Acidobacteria bacterium", 11, 11,  0, 9, 27]])
df1.columns = ["sscinames", "S3_Day90_P3", "S3_Day60_P3",  "S3_Day0_P1", "S3_Day60_P1", "S3_Day90_P1"]
# Create column with total sample count
df2 = pd.DataFrame(df1.iloc[:,1:].T.sum()) # This sums (along rows, hence the .T) all samples for each row
df2.columns = ["Total Samples"]
# ...and merge with your data frame to add the new column (axis=1)
df = pd.concat([df1, df2], axis=1)
# Now, turn the first column into a pandas index (then seaborn uses it immediately to label the axes)
df = df.set_index(['sscinames'])
# You can now sort your dataframe by the new column and proceed with the top entries
df = df.sort_values(by="Total Samples", ascending=False)
df_slice2plot = df.iloc[:5, :] # This takes the top five rows and all the columns, change as needed
# Now, on to plotting
sns.heatmap(df_slice2plot.drop(columns="Total Samples")) # Plot only the sample columns (the labels are now the index, so just leave out the total)

This is the result I get:

[Image: a heatmap of the example data]
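
Note that the snippet in this answer ranks the rows as they are, so a name that occurs more than once (such as 'Acidobacteria bacterium' in the sample) takes up two rows in the top five. A minimal sketch, assuming you want one row per name as the question asks, groups first and then follows the same total, sort, and slice steps. The variable names grouped, ranked, top, and plot_data are illustrative only.

```python
import pandas as pd

# Sample rows copied from the question; replace with your own data.
df1 = pd.DataFrame(
    [
        ["Thermoplasmata archaeon", 4, 0, 41, 1, 5],
        ["Planctomycetes bacterium", 5, 3, 0, 1, 40],
        ["Acidobacteria bacterium", 6, 15, 0, 8, 13],
        ["Trebonia kvetii", 0, 0, 16, 1, 7],
        ["Nonomuraea sp. RK-328", 24, 4, 4, 1, 2],
        ["Nitrospirae bacterium", 3, 1, 4, 1, 2],
        ["Acidobacteria bacterium", 11, 11, 0, 9, 27],
    ],
    columns=["sscinames", "S3_Day90_P3", "S3_Day60_P3",
             "S3_Day0_P1", "S3_Day60_P1", "S3_Day90_P1"],
)

# Merge rows that share a name, so each organism appears once.
grouped = df1.groupby("sscinames").sum()

# Add the per-row total, sort by it, and keep the top five rows.
ranked = grouped.assign(**{"Total Samples": grouped.sum(axis=1)})
ranked = ranked.sort_values(by="Total Samples", ascending=False)
top = ranked.head(5)

# Leave the helper column out of the data that goes into the heatmap.
plot_data = top.drop(columns="Total Samples")
print(plot_data)
```

plot_data can then be handed to sns.heatmap in place of the sliced frame used above.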

