How to fill Null values in a Pandas DataFrame based on a value from a different column?
Question
I'm currently working on a portfolio project, and I have a large DataFrame with (among many other columns) two columns: one for the "neighbourhood group" and one for the "neighbourhood". The first column contains some NaN values, and I need to work out how to fill them with the respective city area.
Example:
resident  neighbourhood group  neighbourhood
John      Brooklyn             Clinton Hill
Mae       Brooklyn             Clinton Hill
Richard   Manhattan            East Harlem
Clark     Manhattan            Upper West Side
...
Claire    NaN                  Clinton Hill
Susan     NaN                  East Harlem
This is just a simple example. I have hundreds of these cases, far too many to change manually. Any idea how to assign the proper area to the right neighbourhood?
I've tried, to no avail, to build different functions to fix the issue, but I can't wrap my head around it.
Answer 1
Score: 1
I recommend building a reference dictionary that maps each neighbourhood to its neighbourhood group. Assuming this is the original dataframe:
import pandas as pd
data = {'resident': {0: 'John', 1: 'Mae', 2: 'Richard', 3: 'Clark', 4: 'Claire', 5: 'Susan'}, 'neighbourhoodgroup': {0: 'Brooklyn', 1: 'Brooklyn', 2: 'Manhattan', 3: 'Manhattan', 4: None, 5: None}, 'neighbourhood': {0: 'ClintonHill', 1: 'ClintonHill', 2: 'EastHarlem', 3: 'UpperWestSide', 4: 'ClintonHill', 5: 'EastHarlem'}}
df = pd.DataFrame(data)
'''
resident neighbourhoodgroup neighbourhood
0 John Brooklyn ClintonHill
1 Mae Brooklyn ClintonHill
2 Richard Manhattan EastHarlem
3 Clark Manhattan UpperWestSide
4 Claire None ClintonHill
5 Susan None EastHarlem
'''
First, create a reference dictionary, reference, with keys from the 'neighbourhood' column and values from the 'neighbourhoodgroup' column.
# Keep only rows where both columns are known, one row per (group, neighbourhood) pair
cols = ['neighbourhoodgroup', 'neighbourhood']
df_ref = df.dropna(subset=cols).drop_duplicates(cols)
reference = {}
for k, v in zip(df_ref.neighbourhood, df_ref.neighbourhoodgroup):
    reference[k] = v
'''
{'ClintonHill': 'Brooklyn',
'EastHarlem': 'Manhattan',
'UpperWestSide': 'Manhattan'}
'''
Next, apply the reference dictionary to the dataframe:
df['result'] = df.neighbourhood.apply(lambda x: reference[x])
print(df)
'''
resident neighbourhoodgroup neighbourhood result
0 John Brooklyn ClintonHill Brooklyn
1 Mae Brooklyn ClintonHill Brooklyn
2 Richard Manhattan EastHarlem Manhattan
3 Clark Manhattan UpperWestSide Manhattan
4 Claire None ClintonHill Brooklyn
5 Susan None EastHarlem Manhattan
'''
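The result above lands in a separate column. If the goal is to fill the gaps in neighbourhoodgroup itself, a minimal sketch (assuming the same column names as above) could combine Series.map with fillna. Unlike reference[x], map yields NaN for a neighbourhood that never appears with a known group instead of raising a KeyError. The 'Zed' / 'Astoria' row is a hypothetical addition, included only to show that case.

```python
import pandas as pd

df = pd.DataFrame({
    'resident': ['John', 'Mae', 'Richard', 'Clark', 'Claire', 'Susan', 'Zed'],
    'neighbourhoodgroup': ['Brooklyn', 'Brooklyn', 'Manhattan', 'Manhattan',
                           None, None, None],
    'neighbourhood': ['ClintonHill', 'ClintonHill', 'EastHarlem', 'UpperWestSide',
                      'ClintonHill', 'EastHarlem', 'Astoria'],  # 'Astoria' is hypothetical
})

# Build the lookup only from rows where the group is known
known = df.dropna(subset=['neighbourhoodgroup'])
reference = dict(zip(known['neighbourhood'], known['neighbourhoodgroup']))

# map() returns NaN for neighbourhoods missing from the lookup,
# and fillna() only touches rows where the group is currently missing
df['neighbourhoodgroup'] = df['neighbourhoodgroup'].fillna(
    df['neighbourhood'].map(reference)
)
print(df)
```

Rows whose neighbourhood has no known group stay missing, so they can be inspected afterwards with df[df['neighbourhoodgroup'].isna()].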
Answer 2
Score: 1
resident neighbourhoodgroup neighbourhood
0 John Brooklyn ClintonHill
1 Mae Brooklyn ClintonHill
2 Richard Manhattan EastHarlem
3 Clark Manhattan UpperWestSide
4 Claire <NA> ClintonHill
5 Susan <NA> EastHarlem
We can make a lookup from neighbourhood to neighbourhoodgroup by dropping duplicates and NaNs, and converting to a dictionary:
lookup_df = df[['neighbourhoodgroup','neighbourhood']].drop_duplicates().dropna()
lookup_dict = {x.neighbourhood: x.neighbourhoodgroup for x in lookup_df.itertuples()}
{'ClintonHill': 'Brooklyn', 'EastHarlem': 'Manhattan', 'UpperWestSide': 'Manhattan'}
Then we match the neighbourhoods to the neighbourhoodgroups using Series.replace():
neighbourhood_fill = df.neighbourhood.replace(lookup_dict)
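Then we replace the NaNs in the neighbourhoodgroup column with our new fill values, using fillna:
# Assign the result back; fillna(inplace=True) on an attribute-accessed column may not update df in recent pandas versions
df['neighbourhoodgroup'] = df['neighbourhoodgroup'].fillna(neighbourhood_fill)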
resident neighbourhoodgroup neighbourhood
0 John Brooklyn ClintonHill
1 Mae Brooklyn ClintonHill
2 Richard Manhattan EastHarlem
3 Clark Manhattan UpperWestSide
4 Claire Brooklyn ClintonHill
5 Susan Manhattan EastHarlem
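The same fill can probably be done without an intermediate dictionary. The sketch below is a different technique from the one in this answer: it groups by neighbourhood and broadcasts the first non-null group back to every row with groupby(...).transform('first'), then fills only the missing entries. It assumes the column names used above and that each neighbourhood belongs to a single group.

```python
import pandas as pd

df = pd.DataFrame({
    'resident': ['John', 'Mae', 'Richard', 'Clark', 'Claire', 'Susan'],
    'neighbourhoodgroup': ['Brooklyn', 'Brooklyn', 'Manhattan', 'Manhattan', None, None],
    'neighbourhood': ['ClintonHill', 'ClintonHill', 'EastHarlem', 'UpperWestSide',
                      'ClintonHill', 'EastHarlem'],
})

# For each neighbourhood, take the first non-null group and
# broadcast it back to every row of that neighbourhood
first_known = df.groupby('neighbourhood')['neighbourhoodgroup'].transform('first')

# Fill only the rows where the group is missing
df['neighbourhoodgroup'] = df['neighbourhoodgroup'].fillna(first_known)
print(df)
```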
Answer 3
Score: 0
Another possible solution:
d = df.iloc[:, 1:].dropna().drop_duplicates()
(df.update(
    df['neighbourhood'][df['neighbourhood group'].isna()]
    .map(dict(zip(d.iloc[:, 1], d.iloc[:, 0])))
    .rename('neighbourhood group')))
Output:
  resident  neighbourhood group  neighbourhood
0 John      Brooklyn             Clinton Hill
1 Mae       Brooklyn             Clinton Hill
2 Richard   Manhattan            East Harlem
3 Clark     Manhattan            Upper West Side
4 Claire    Brooklyn             Clinton Hill
5 Susan     Manhattan            East Harlem
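All of the approaches here assume that each neighbourhood belongs to exactly one neighbourhood group. On a large real dataset it may be worth checking that before filling, since a dictionary built with zip silently keeps only the last group seen for a repeated key. A minimal sketch of such a check follows; the 'Queens' row is a hypothetical conflict added purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'neighbourhood group': ['Brooklyn', 'Brooklyn', 'Manhattan', 'Queens', None],
    'neighbourhood': ['Clinton Hill', 'Clinton Hill', 'East Harlem',
                      'Clinton Hill', 'East Harlem'],  # 'Queens' row is a made-up conflict
})

# Count the distinct known groups recorded for each neighbourhood
groups_per_neighbourhood = (
    df.dropna(subset=['neighbourhood group'])
      .groupby('neighbourhood')['neighbourhood group']
      .nunique()
)

# Neighbourhoods linked to more than one group need manual review before filling
ambiguous = groups_per_neighbourhood[groups_per_neighbourhood > 1]
print(ambiguous)
```

If ambiguous is empty, the neighbourhood-to-group mapping is one-to-one in the known rows and any of the fills above should give the same result.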