Problem concatenating 3 datasets with the same column names: values should be replaced for the same key columns

Question

I am trying to concatenate 3 datasets that share the same 4 column names, but the result is not what I expect. The datasets look like this:

**Dataset A**

| Col_A | Col_B | Col_C | Col_D |
|-------|-------|-------|-------|
| a10   | b10   |       |       |
| a11   | b11   |       |       |
| a12   | b12   | 10    | 11    |
| a13   | b13   |       |       |

**Dataset B**

| Col_A | Col_B | Col_C | Col_D |
|-------|-------|-------|-------|
| a10   | b10   | 12    | 13    |
| a11   | b11   | 15    | 16    |
| a12   | b12   |       |       |
| a13   | b13   |       |       |

**Dataset C**

| Col_A | Col_B | Col_C | Col_D |
|-------|-------|-------|-------|
| a10   | b10   |       |       |
| a11   | b11   |       |       |
| a12   | b12   | 17    | 18    |
| a13   | b13   | 20    | 21    |

After concatenating the datasets in sequence (A, then B, then C), the result I need should look like this:

**Final dataset**

| Col_A | Col_B | Col_C | Col_D |
|-------|-------|-------|-------|
| a10   | b10   | 12    | 13    |
| a11   | b11   | 15    | 16    |
| a12   | b12   | 17    | 18    |
| a13   | b13   | 20    | 21    |

I tried the normal concat() method, but I cannot work out how to have later values replace earlier ones for the same key when the datasets are concatenated in sequence.

    concat_all = pd.concat([dfA, dfB, dfC], axis=0)
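For context, here is a minimal sketch of what plain concatenation along `axis=0` does. It stacks rows and does not merge rows that share a key, which is why no values get replaced. The one-row frames `left` and `right` are hypothetical stand-ins, not the datasets from the question.

```python
import pandas as pd

# Hypothetical one-row frames that share the key value "a10".
left = pd.DataFrame({"Col_A": ["a10"], "Col_C": [""]})
right = pd.DataFrame({"Col_A": ["a10"], "Col_C": ["12"]})

# axis=0 appends the rows of `right` below the rows of `left`.
stacked = pd.concat([left, right], axis=0)

print(len(stacked))            # 2: both rows are kept
print(stacked.index.tolist())  # [0, 0]: the original row labels are kept too
```

Collapsing the stacked rows to one row per key therefore needs a separate step after `concat`, such as a group-by.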

Recreating the datasets:

    import pandas as pd

    dfA = pd.DataFrame({
        'Col_A': ['a10', 'a11', 'a12', 'a13'],
        'Col_B': ['b10', 'b11', 'b12', 'b13'],
        'Col_C': ['', '', '10', ''],
        'Col_D': ['', '', '11', '']
    })
    dfB = pd.DataFrame({
        'Col_A': ['a10', 'a11', 'a12', 'a13'],
        'Col_B': ['b10', 'b11', 'b12', 'b13'],
        'Col_C': ['12', '15', '', ''],
        'Col_D': ['13', '16', '', '']
    })
    dfC = pd.DataFrame({
        'Col_A': ['a10', 'a11', 'a12', 'a13'],
        'Col_B': ['b10', 'b11', 'b12', 'b13'],
        'Col_C': ['', '', '17', '20'],
        'Col_D': ['', '', '18', '21']
    })

Any leads would be appreciated.

Answer 1

Score: 0

You can do something like this:

    df = (pd.concat([dfA, dfB, dfC], axis=0)
            .query('Col_C != ""')
            .groupby(['Col_A', 'Col_B'])
            .agg({'Col_C': 'last', 'Col_D': 'last'}))

Output:

                Col_C Col_D
    Col_A Col_B
    a10   b10      12    13
    a11   b11      15    16
    a12   b12      17    18
    a13   b13      20    21
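A possible variation on this answer, offered as a sketch rather than a definitive fix: the `query` step filters on `Col_C` only, so a row with an empty `Col_C` but a filled `Col_D` would be dropped. Assuming empty strings always mean "missing", they can be converted to `NaN` first, and `GroupBy.last()` then picks the last non-missing value in each column independently. The names `stacked` and `final` are illustrative only.

```python
import numpy as np
import pandas as pd

dfA = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['', '', '10', ''],
    'Col_D': ['', '', '11', ''],
})
dfB = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['12', '15', '', ''],
    'Col_D': ['13', '16', '', ''],
})
dfC = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['', '', '17', '20'],
    'Col_D': ['', '', '18', '21'],
})

# Stack the frames in priority order: later frames should win.
stacked = pd.concat([dfA, dfB, dfC], ignore_index=True)

# Assumption: an empty string means "no value", so mark it as missing.
stacked = stacked.replace('', np.nan)

# last() skips missing values, so each column keeps its last real value
# per (Col_A, Col_B) key. as_index=False keeps the keys as ordinary columns.
final = stacked.groupby(['Col_A', 'Col_B'], as_index=False).last()

print(final)
```

With the sample data this yields one row per key with `Col_C` values 12, 15, 17, 20 and `Col_D` values 13, 16, 18, 21, all still stored as strings. The keys stay as regular columns instead of becoming a MultiIndex.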

huangapple
  • Published on July 13, 2023, 23:22:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76681033.html