Problem concatenating 3 datasets with the same column names: values need to be replaced for the same key columns

Question

I am trying to concatenate 3 datasets that share the same 4 column names, but the result is not what I expect. The datasets look like this:

**dataset A**

| Col_A | Col_B | Col_C | Col_D |
|-------|-------|-------|-------|
| a10   | b10   |       |       |
| a11   | b11   |       |       |
| a12   | b12   | 10    | 11    |
| a13   | b13   |       |       |

**dataset B**

| Col_A | Col_B | Col_C | Col_D |
|-------|-------|-------|-------|
| a10   | b10   | 12    | 13    |
| a11   | b11   | 15    | 16    |
| a12   | b12   |       |       |
| a13   | b13   |       |       |

**dataset C**

| Col_A | Col_B | Col_C | Col_D |
|-------|-------|-------|-------|
| a10   | b10   |       |       |
| a11   | b11   |       |       |
| a12   | b12   | 17    | 18    |
| a13   | b13   | 20    | 21    |

The result I need after concatenating the datasets in sequence is:

**dataset final**

| Col_A | Col_B | Col_C | Col_D |
|-------|-------|-------|-------|
| a10   | b10   | 12    | 13    |
| a11   | b11   | 15    | 16    |
| a12   | b12   | 17    | 18    |
| a13   | b13   | 20    | 21    |

I tried the normal concat() method, but I cannot work out how to make values from later datasets replace earlier ones for the same key when the datasets are concatenated in sequence.

concat_all = pd.concat([dfA, dfB, dfC], axis=0)

Recreating the datasets:

import pandas as pd

dfA = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['', '', '10', ''],
    'Col_D': ['', '', '11', '']
})

dfB = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['12', '15', '', ''],
    'Col_D': ['13', '16', '', '']
})

dfC = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['', '', '17', '20'],
    'Col_D': ['', '', '18', '21']
})
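
For reference, here is a minimal sketch of what plain concat does with these frames. It only stacks the rows, so each (Col_A, Col_B) key appears three times instead of being merged into a single row. The frame names follow the recreation code above; `dup_count` is a hypothetical name used only for this check.

```python
import pandas as pd

dfA = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['', '', '10', ''],
    'Col_D': ['', '', '11', '']
})

dfB = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['12', '15', '', ''],
    'Col_D': ['13', '16', '', '']
})

dfC = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['', '', '17', '20'],
    'Col_D': ['', '', '18', '21']
})

# concat along axis=0 appends rows; it does not merge rows that share a key
concat_all = pd.concat([dfA, dfB, dfC], axis=0)
print(concat_all.shape)  # (12, 4): 3 frames x 4 rows

# Count rows whose (Col_A, Col_B) key already appeared earlier in the stack
dup_count = int(concat_all.duplicated(subset=['Col_A', 'Col_B']).sum())
print(dup_count)  # 8: every key beyond its first occurrence
```

So some extra step is needed after the concat to collapse the repeated keys into one row each.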

Any leads would be appreciated.

Answer 1

Score: 0

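You can concatenate the frames, drop the rows where Col_C is empty, and take the last remaining row for each (Col_A, Col_B) key. Note that this filters on Col_C only, so it assumes Col_D is empty exactly when Col_C is:

df = (pd.concat([dfA, dfB, dfC], axis=0)           # stack the frames in sequence
        .query('Col_C != ""')                      # drop rows with an empty Col_C
        .groupby(['Col_A', 'Col_B'])               # one group per key
        .agg({'Col_C': 'last', 'Col_D': 'last'}))  # keep the values from the latest frame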

            Col_C Col_D
Col_A Col_B            
a10   b10      12    13
a11   b11      15    16
a12   b12      17    18
a13   b13      20    21
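
A possible variation, offered as a sketch under the assumption that missing values are stored as empty strings (as in the recreation code): convert the empty strings to NaN first and rely on `GroupBy.last()`, which returns the last non-null value in each column. That treats Col_C and Col_D independently instead of filtering on Col_C alone, and `as_index=False` keeps Col_A and Col_B as ordinary columns. The names `key_cols`, `stacked`, and `final` are hypothetical.

```python
import pandas as pd

dfA = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['', '', '10', ''],
    'Col_D': ['', '', '11', '']
})

dfB = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['12', '15', '', ''],
    'Col_D': ['13', '16', '', '']
})

dfC = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['', '', '17', '20'],
    'Col_D': ['', '', '18', '21']
})

key_cols = ['Col_A', 'Col_B']

# Stack the frames in sequence; later frames end up lower in the stack
stacked = pd.concat([dfA, dfB, dfC], ignore_index=True)

# Turn empty strings into NaN so they count as missing
stacked = stacked.mask(stacked.eq(''))

# For each key, take the last non-null value of every remaining column
final = stacked.groupby(key_cols, as_index=False).last()
print(final)
```

The values stay strings here, as in the original frames; convert them with `astype` afterwards if numbers are needed.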

huangapple
  • Published on July 13, 2023 at 23:22:38
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76681033.html