Problem concatenating 3 datasets with the same column names, with values replaced for the same key columns
Question
I am trying to concatenate three datasets that share the same four column names, but the result is not what I expect. The datasets look like this:
**Dataset A**

| Col_A | Col_B | Col_C | Col_D |
|-------|-------|-------|-------|
| a10   | b10   |       |       |
| a11   | b11   |       |       |
| a12   | b12   | 10    | 11    |
| a13   | b13   |       |       |

**Dataset B**

| Col_A | Col_B | Col_C | Col_D |
|-------|-------|-------|-------|
| a10   | b10   | 12    | 13    |
| a11   | b11   | 15    | 16    |
| a12   | b12   |       |       |
| a13   | b13   |       |       |

**Dataset C**

| Col_A | Col_B | Col_C | Col_D |
|-------|-------|-------|-------|
| a10   | b10   |       |       |
| a11   | b11   |       |       |
| a12   | b12   | 17    | 18    |
| a13   | b13   | 20    | 21    |

After concatenating the datasets in sequence, the result I need should look like this:

**Final dataset**

| Col_A | Col_B | Col_C | Col_D |
|-------|-------|-------|-------|
| a10   | b10   | 12    | 13    |
| a11   | b11   | 15    | 16    |
| a12   | b12   | 17    | 18    |
| a13   | b13   | 20    | 21    |

I tried the plain `concat()` method, but I can't work out how to make values from later datasets replace those from earlier ones for the same key when concatenating in sequence:

    concat_all = pd.concat([dfA, dfB, dfC], axis=0)

Code to recreate the datasets:

    import pandas as pd

    dfA = pd.DataFrame({
        'Col_A': ['a10', 'a11', 'a12', 'a13'],
        'Col_B': ['b10', 'b11', 'b12', 'b13'],
        'Col_C': ['', '', '10', ''],
        'Col_D': ['', '', '11', '']
    })
    dfB = pd.DataFrame({
        'Col_A': ['a10', 'a11', 'a12', 'a13'],
        'Col_B': ['b10', 'b11', 'b12', 'b13'],
        'Col_C': ['12', '15', '', ''],
        'Col_D': ['13', '16', '', '']
    })
    dfC = pd.DataFrame({
        'Col_A': ['a10', 'a11', 'a12', 'a13'],
        'Col_B': ['b10', 'b11', 'b12', 'b13'],
        'Col_C': ['', '', '17', '20'],
        'Col_D': ['', '', '18', '21']
    })

Any leads would be appreciated.
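
To make the problem concrete, here is a minimal sketch of what the plain `concat()` call returns for the sample data. With `axis=0` the frames are stacked row by row and nothing is matched on `Col_A`/`Col_B`, so every key shows up three times. The `make_frame` helper and the `keys` dict are hypothetical names used only to keep the snippet short and self-contained.

```python
import pandas as pd

# Key columns shared by all three datasets.
keys = {
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
}

def make_frame(col_c, col_d):
    # Hypothetical helper: same keys, different value columns.
    return pd.DataFrame({**keys, 'Col_C': col_c, 'Col_D': col_d})

dfA = make_frame(['', '', '10', ''], ['', '', '11', ''])
dfB = make_frame(['12', '15', '', ''], ['13', '16', '', ''])
dfC = make_frame(['', '', '17', '20'], ['', '', '18', '21'])

# axis=0 stacks the frames vertically; rows are never merged by key.
concat_all = pd.concat([dfA, dfB, dfC], axis=0)

print(concat_all.shape)  # (12, 4)

# Rows whose (Col_A, Col_B) pair already appeared earlier in the stack.
print(concat_all.duplicated(['Col_A', 'Col_B']).sum())  # 8
```

So the stacked frame still has to be collapsed to one row per key, which is the step `concat()` does not perform on its own.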
Answer 1
Score: 0
You can do something like this:

    df = (pd.concat([dfA, dfB, dfC], axis=0)
            .query('Col_C != ""')
            .groupby(['Col_A', 'Col_B'])
            .agg({'Col_C': 'last', 'Col_D': 'last'}))

This gives:

                Col_C Col_D
    Col_A Col_B
    a10   b10      12    13
    a11   b11      15    16
    a12   b12      17    18
    a13   b13      20    21
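
A possible variation, offered as a sketch rather than as part of the original answer: the `query` above filters on `Col_C` only, so a row with an empty `Col_C` but a filled `Col_D` would be dropped. One way around that, assuming empty strings always mean "no value", is to turn them into missing values and rely on `GroupBy.last()`, which by default picks the last non-null entry per column. `as_index=False` keeps `Col_A` and `Col_B` as regular columns, matching the layout of the desired table. The names `stacked` and `final` are illustrative.

```python
import pandas as pd

dfA = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['', '', '10', ''],
    'Col_D': ['', '', '11', ''],
})
dfB = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['12', '15', '', ''],
    'Col_D': ['13', '16', '', ''],
})
dfC = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['', '', '17', '20'],
    'Col_D': ['', '', '18', '21'],
})

# Stack in the order A, B, C so that later frames come last within each key.
stacked = pd.concat([dfA, dfB, dfC], ignore_index=True)

# Assumption: an empty string means "no value"; mask() replaces those cells with NaN.
stacked = stacked.mask(stacked == '')

# last() skips missing values column by column, so the most recent
# non-empty value wins for each (Col_A, Col_B) key.
final = stacked.groupby(['Col_A', 'Col_B'], as_index=False).last()

print(final)
```

For the sample data this yields the four rows of the desired final dataset. The values remain strings; converting them with `astype(int)` is a separate step if numbers are needed.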