Heatmap for a large DataFrame after grouping

Question

I have a large DataFrame that looks like the sample below. I would like to group by the column 'sscinames', because it contains repeated entries such as 'Acidobacteria bacterium', summing the counts for each sample, and then plot a heatmap. The heatmap should show only the top 20 'sscinames' by sample count. Any help would be appreciated.

    sscinames                 S3_Day90_P3  S3_Day60_P3  S3_Day0_P1  S3_Day60_P1  S3_Day90_P1
    Thermoplasmata archaeon             4            0          41            1            5
    Planctomycetes bacterium            5            3           0            1           40
    Acidobacteria bacterium             6           15           0            8           13
    Trebonia kvetii                     0            0          16            1            7
    Nonomuraea sp. RK-328              24            4           4            1            2
    Nitrospirae bacterium               3            1           4            1            2
    Acidobacteria bacterium            11           11           0            9           27
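To try the grouping on something concrete, the sample rows above can be typed into a small DataFrame. This is only a sketch: the variable names `df` and `repeated` are chosen here for illustration, and the real data would normally be read from a file instead. It also lists which names occur more than once, since those are the rows a group-by would merge.

```python
import pandas as pd

# Sample rows copied from the question; the real DataFrame is much larger.
df = pd.DataFrame(
    [
        ["Thermoplasmata archaeon", 4, 0, 41, 1, 5],
        ["Planctomycetes bacterium", 5, 3, 0, 1, 40],
        ["Acidobacteria bacterium", 6, 15, 0, 8, 13],
        ["Trebonia kvetii", 0, 0, 16, 1, 7],
        ["Nonomuraea sp. RK-328", 24, 4, 4, 1, 2],
        ["Nitrospirae bacterium", 3, 1, 4, 1, 2],
        ["Acidobacteria bacterium", 11, 11, 0, 9, 27],
    ],
    columns=["sscinames", "S3_Day90_P3", "S3_Day60_P3",
             "S3_Day0_P1", "S3_Day60_P1", "S3_Day90_P1"],
)

# Names that appear on more than one row are the ones groupby will combine.
repeated = df.loc[df["sscinames"].duplicated(keep=False), "sscinames"].unique().tolist()
print(repeated)
```

In this sample only 'Acidobacteria bacterium' is repeated, so grouping reduces the seven rows to six.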

Answer 1

Score: 3

First aggregate with sum, then select the top N groups by their row totals using DataFrame.reindex with Series.nlargest:

    N = 3  # use 20 for the full data set
    df1 = df.groupby('sscinames').sum()
    out = df1.reindex(df1.sum(axis=1).nlargest(N).index)
    print(out)

Output:

                              S3_Day90_P3  S3_Day60_P3  S3_Day0_P1  S3_Day60_P1  S3_Day90_P1
    sscinames
    Acidobacteria bacterium            17           26           0           17           40
    Thermoplasmata archaeon             4            0          41            1            5
    Planctomycetes bacterium            5            3           0            1           40

Finally, plot the heatmap with seaborn.heatmap:

    import seaborn as sns
    sns.heatmap(out, annot=True)
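The group, sum, and top-N steps can also be wrapped in a small helper. The sketch below is one possible packaging, not part of the answer itself: the function name `top_n_by_total` and its parameters are hypothetical, and it assumes every column other than the key holds numeric counts.

```python
import pandas as pd


def top_n_by_total(df: pd.DataFrame, key: str, n: int) -> pd.DataFrame:
    """Sum rows that share the same `key`, then keep the n groups with the largest total."""
    grouped = df.groupby(key).sum(numeric_only=True)
    totals = grouped.sum(axis=1)  # one total per group, across all sample columns
    return grouped.loc[totals.nlargest(n).index]


# Sample rows from the question (assumed layout: one name column, five count columns).
df = pd.DataFrame(
    [
        ["Thermoplasmata archaeon", 4, 0, 41, 1, 5],
        ["Planctomycetes bacterium", 5, 3, 0, 1, 40],
        ["Acidobacteria bacterium", 6, 15, 0, 8, 13],
        ["Trebonia kvetii", 0, 0, 16, 1, 7],
        ["Nonomuraea sp. RK-328", 24, 4, 4, 1, 2],
        ["Nitrospirae bacterium", 3, 1, 4, 1, 2],
        ["Acidobacteria bacterium", 11, 11, 0, 9, 27],
    ],
    columns=["sscinames", "S3_Day90_P3", "S3_Day60_P3",
             "S3_Day0_P1", "S3_Day60_P1", "S3_Day90_P1"],
)

out = top_n_by_total(df, "sscinames", 3)
print(out.index.tolist())
```

The returned frame keeps 'sscinames' as its index, so it can be passed straight to `sns.heatmap(out, annot=True)`; for the real data, `n` would be 20.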

Answer 2

Score: 1

This requires a few steps, but it is a perfect task for pandas and seaborn. I commented the example below to give you an idea of what is happening there.

    import pandas as pd
    import seaborn as sns

    # This is just to create a dataframe from your table, replace with importing yours
    df1 = pd.DataFrame([["Thermoplasmata archaeon", 4, 0, 41, 1, 5],
                        ["Planctomycetes bacterium", 5, 3, 0, 1, 40],
                        ["Acidobacteria bacterium", 6, 15, 0, 8, 13],
                        ["Trebonia kvetii", 0, 0, 16, 1, 7],
                        ["Nonomuraea sp. RK-328", 24, 4, 4, 1, 2],
                        ["Nitrospirae bacterium", 3, 1, 4, 1, 2],
                        ["Acidobacteria bacterium", 11, 11, 0, 9, 27]])
    df1.columns = ["sscinames", "S3_Day90_P3", "S3_Day60_P3", "S3_Day0_P1", "S3_Day60_P1", "S3_Day90_P1"]

    # Sum the rows that share the same name; 'sscinames' becomes the index,
    # which seaborn then uses to label the y-axis
    df = df1.groupby("sscinames").sum()

    # Add a column with the total count across all samples for each name
    df["Total Samples"] = df.sum(axis=1)

    # You can now sort your dataframe by the new column and proceed with the top entries
    df = df.sort_values(by="Total Samples", ascending=False)
    df_slice2plot = df.iloc[:5, :]  # This takes the top five rows and all the columns, change as needed

    # Now, on to plotting: keep only the sample columns, leaving out the helper "Total Samples" column
    sns.heatmap(df_slice2plot.drop(columns="Total Samples"))

This is the result I get:

[Figure: a heatmap of the example data]
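Sorting by a total and taking the first rows, as in this answer, should pick the same names as `Series.nlargest` from the first answer when the totals have no ties. The sketch below checks that on the sample rows; the variable names `by_sort` and `by_nlargest` are illustrative, and the comparison is only a sanity check, not a statement about how either answer must be written.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame(
    [
        ["Thermoplasmata archaeon", 4, 0, 41, 1, 5],
        ["Planctomycetes bacterium", 5, 3, 0, 1, 40],
        ["Acidobacteria bacterium", 6, 15, 0, 8, 13],
        ["Trebonia kvetii", 0, 0, 16, 1, 7],
        ["Nonomuraea sp. RK-328", 24, 4, 4, 1, 2],
        ["Nitrospirae bacterium", 3, 1, 4, 1, 2],
        ["Acidobacteria bacterium", 11, 11, 0, 9, 27],
    ],
    columns=["sscinames", "S3_Day90_P3", "S3_Day60_P3",
             "S3_Day0_P1", "S3_Day60_P1", "S3_Day90_P1"],
)

# Merge repeated names first, then compute one total per name.
totals = df.groupby("sscinames").sum().sum(axis=1)

n = 3
# Route 1: sort the totals in descending order and take the first n names.
by_sort = totals.sort_values(ascending=False).head(n).index.tolist()
# Route 2: ask pandas for the n largest totals directly.
by_nlargest = totals.nlargest(n).index.tolist()

print(by_sort == by_nlargest)
```

If several names share the same total, the two routes may order or choose the tied names differently, so it is worth deciding how ties should be handled before plotting the top 20.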

huangapple
  • Published on June 26, 2023 at 15:52:24
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76554612.html