如何合并两个数据框并进行数据透视。

huangapple go评论102阅读模式
英文:

How to join 2 dataframes and pivot them

问题

如何合并两个数据框并对它们进行透视

我有一个名为dfICU的数据框,其中包含医院中ICU单位的列表

  1. ICU
  2. A1
  3. A2
  4. A3
  5. B1
  6. B2Closed
  7. B2Covid
  8. B7
  9. C1West
  10. C2South
  11. C3
  12. P53Child

另一个数据框dfPts包含患者信息

  1. PtsID VisitID ICU Frequency
  2. 934 15 A3 4
  3. 934 15 C2South 2
  4. 934 62 B2Covid 5
  5. 934 62 A2 6
  6. 882 35 C2South 7
  7. 882 35 C3 2
  8. 882 35 A2 9
  9. 882 91 P53Child 5
  10. 105 44 C2South 2
  11. 105 80 B7 8

我试图将它们合并成一个单一的透视数据框,如果dfPts中不存在该ICU单位,则显示0频率

类似于这样

  1. PtsID VisitID A1 A2 A3 B1 B2Closed B2Covid B7 C1West C2South C3 .... P53Child
  2. 934 15 0 0 4 0 0 0 0 0 2 0 0
  3. 934 62 0 6 0 0 0 5 0 0 0 0 0
  4. 882 35 0 0 0 0 0 0 0 0 7 2 0
  5. 882 91 0 0 0 0 0 0 0 0 0 0 5
  6. 105 44 0 0 0 0 0 0 0 0 2 0 0
  7. 105 80 0 0 0 0 0 0 8 0 0 0 0

我首先对dfPts进行透视,但这并没有添加所有dfICU中的ICU单位,因为某些ICUs对于所有患者都为0

到目前为止,我已经做了以下工作,但之后不知道该怎么做

  1. df = dfPts.set_index(['PtsID','VisitID']).pivot(columns='ICU')['Frequency']
  2. df[np.isnan(df)] = 0

如何实现这一目标?

英文:

how to join 2 dataframes and pivot them

I have this dfICU dataframe that has list of ICU units in a hospital

  1. ICU
  2. A1
  3. A2
  4. A3
  5. B1
  6. B2Closed
  7. B2Covid
  8. B7
  9. C1West
  10. C2South
  11. C3
  12. .
  13. .
  14. .
  15. P53Child

the other dataframe dfPts has Patients info

  1. PtsID VisitID ICU Frequency
  2. 934 15 A3 4
  3. 934 15 C2South 2
  4. 934 62 B2Covid 5
  5. 934 62 A2 6
  6. 882 35 C2South 7
  7. 882 35 C3 2
  8. 882 35 A2 9
  9. 882 91 P53Child 5
  10. 105 44 C2South 2
  11. 105 80 B7 8

I am trying to put them both in a single pivoted dataframe so if the ICU unit does not exit in the dfPts it shows 0 Frequency

Something like this

  1. PtsID VisitID A1 A2 A3 B1 B2Closed B2Covid B7 C1West C2South C3 .... P53Child
  2. 934 15 0 0 4 0 0 0 0 0 2 0 0
  3. 934 62 0 6 0 0 0 5 0 0 0 0 0
  4. 882 35 0 0 0 0 0 0 0 0 7 2 0
  5. 882 91 0 0 0 0 0 0 0 0 0 0 5
  6. 105 44 0 0 0 0 0 0 0 0 2 0 0
  7. 105 80 0 0 0 0 0 0 8 0 0 0 0

I start by pivoting the dfPts but that did not add all ICU units in dfICU because some ICUs are 0 for all patients

here is what i have done so far and did not know what to do after

  1. df = dfPts.set_index(['PtsID','VisitID']).pivot(columns='ICU')['Frequency']
  2. df[np.isnan(df)] = 0

How to do that?

答案1

得分: 2

将数据框进行透视,然后重新索引以确保所有的ICU都出现在列标题中,然后用0填充缺失的ICU的值。

结果

  1. ICU PtsID VisitID A1 A2 A3 B1 B2Closed B2Covid B7 C1West C2South C3 P53Child
  2. 0 105 44 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0
  3. 1 105 80 0.0 0.0 0.0 0.0 0.0 0.0 8.0 0.0 0.0 0.0 0.0
  4. 2 882 35 0.0 9.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0 2.0 0.0
  5. 3 882 91 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0
  6. 4 934 15 0.0 0.0 4.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0
  7. 5 934 62 0.0 6.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0
英文:

Pivot the dataframe then reindex to ensure all the icus are present in the column headers, then fill the values in missing icus with 0

  1. icus = dfICU['ICU'].unique()
  2. (
  3. dfPts
  4. .pivot(index=['PtsID', 'VisitID'], columns='ICU', values='Frequency')
  5. .reindex(columns=icus)
  6. .fillna(0).reset_index()
  7. )

Result

  1. ICU PtsID VisitID A1 A2 A3 B1 B2Closed B2Covid B7 C1West C2South C3 P53Child
  2. 0 105 44 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0
  3. 1 105 80 0.0 0.0 0.0 0.0 0.0 0.0 8.0 0.0 0.0 0.0 0.0
  4. 2 882 35 0.0 9.0 0.0 0.0 0.0 0.0 0.0 0.0 7.0 2.0 0.0
  5. 3 882 91 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0
  6. 4 934 15 0.0 0.0 4.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0
  7. 5 934 62 0.0 6.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0

答案2

得分: 1

由于dfICU只包含已经存在于dfPtsICU列中的值的列表,因此您只需要使用dfPts来创建您想要的数据透视表。

在创建数据透视表之前,请确保Frequency是数值类型:

  1. # 获取列的信息
  2. dfPts.info()

然后创建数据透视表:

  1. # 创建数据透视表
  2. pivot_table = pd.pivot_table(dfPts, index=['PtsID', 'VisitID'],
  3. columns='ICU', values='Frequency',
  4. aggfunc='sum') # 您可以更改聚合函数
  5. #.reset_index() # 如果您想要索引作为列,请取消注释
  6. # 检查数据透视表
  7. pivot_table.head(10)

如果发现dfICU中的一些ICU在数据透视表的列中不存在,您可以像这样添加它们并将值设置为0:

  1. # 创建缺失列的列表
  2. ICU_list = list(dfICU.ICU.unique())
  3. missing_cols = [col for col in ICU_list
  4. if col not in pivot_table.columns]
  5. # 将缺失的列添加到数据透视表并设置值为0
  6. for col in missing_cols:
  7. pivot_table[col] = 0
  8. # 检查数据透视表
  9. pivot_table.head()
英文:

Since dfICU only contains a list of values that are already present in the column ICU in dfPts, you only need to use dfPts for the pivot table you want.

Before creating the pivot table, make sure Frequency is numeric:

  1. # get info of columns
  2. dfPts.info()

Then create pivot table:

  1. # create a pivot table
  2. pivot_table = (pd.pivot_table(dfPts, index=['PtsID','VisitID'],
  3. columns='ICU', values='Frequency',
  4. aggfunc='sum') # you can change the aggregation function
  5. #.reset_index() # uncomment if you want indices as columns
  6. )
  7. # check pivot table
  8. pivot_table.head(10)

If you find that some ICUs from dfICU are missing in the pivot table columns, you can add them with 0 values like this:

  1. # create list of missing columns
  2. ICU_list = list(dfICU.ICU.unique())
  3. missing_cols = [col for col in ICU_list
  4. if col not in pivot_table.columns]
  5. # add missing columns to pivot table
  6. for col in missing_cols:
  7. pivot_table[col] = 0
  8. # check pivot table
  9. pivot_table.head()

答案3

得分: 1

另一种可能的解决方案:

  1. (pd.concat([dfPts, dfICU])
  2. .pivot(index=['PtsID', 'VisitID'], columns='ICU', values='Frequency')
  3. .dropna(how='all').reset_index().fillna(0))

输出:

  1. ICU PtsID VisitID A1 A2 A3 B1 B2Closed B2Covid B7 C1West \
  2. 0 105.0 44.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
  3. 1 105.0 80.0 0.0 0.0 0.0 0.0 0.0 0.0 8.0 0.0
  4. 2 882.0 35.0 0.0 9.0 0.0 0.0 0.0 0.0 0.0 0.0
  5. 3 882.0 91.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
  6. 4 934.0 15.0 0.0 0.0 4.0 0.0 0.0 0.0 0.0 0.0
  7. 5 934.0 62.0 0.0 6.0 0.0 0.0 0.0 5.0 0.0 0.0
  8. ICU C2South C3 P53Child
  9. 0 2.0 0.0 0.0
  10. 1 0.0 0.0 0.0
  11. 2 7.0 2.0 0.0
  12. 3 0.0 0.0 5.0
  13. 4 2.0 0.0 0.0
  14. 5 0.0 0.0 0.0
英文:

Another possible solution:

  1. (pd.concat([dfPts, dfICU])
  2. .pivot(index=['PtsID', 'VisitID'], columns='ICU', values='Frequency')
  3. .dropna(how='all').reset_index().fillna(0))

Output:

  1. ICU PtsID VisitID A1 A2 A3 B1 B2Closed B2Covid B7 C1West \
  2. 0 105.0 44.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
  3. 1 105.0 80.0 0.0 0.0 0.0 0.0 0.0 0.0 8.0 0.0
  4. 2 882.0 35.0 0.0 9.0 0.0 0.0 0.0 0.0 0.0 0.0
  5. 3 882.0 91.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
  6. 4 934.0 15.0 0.0 0.0 4.0 0.0 0.0 0.0 0.0 0.0
  7. 5 934.0 62.0 0.0 6.0 0.0 0.0 0.0 5.0 0.0 0.0
  8. ICU C2South C3 P53Child
  9. 0 2.0 0.0 0.0
  10. 1 0.0 0.0 0.0
  11. 2 7.0 2.0 0.0
  12. 3 0.0 0.0 5.0
  13. 4 2.0 0.0 0.0
  14. 5 0.0 0.0 0.0

huangapple
  • 本文由 发表于 2023年7月10日 12:18:07
  • 转载请务必保留本文链接:https://go.coder-hub.com/76650659.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定