How to fill Null values in a Pandas DataFrame based on a value from a different column?

Question

I'm working on a portfolio project and have a large DataFrame with (among many other columns) two columns: one for the "neighbourhood group" and one for the "neighbourhood". The first column contains some NaN values, and I need to fill them with the respective city area.

Example:

resident   neighbourhood group    neighbourhood
John       Brooklyn               Clinton Hill
Mae        Brooklyn               Clinton Hill
Richard    Manhattan              East Harlem
Clark      Manhattan              Upper West Side
...
Claire          NaN               Clinton Hill
Susan           NaN               East Harlem

This is just a simple example. I have hundreds of these cases, far too many to change manually. Any idea how to assign the proper area to the right neighbourhood?

I've tried building different functions to fix the issue, to no avail; I can't wrap my head around it.

Answer 1

Score: 1

I recommend building a reference dictionary that maps each neighbourhood to its neighbourhood group. Assuming this is the original dataframe:

import pandas as pd

data = {'resident': {0: 'John', 1: 'Mae', 2: 'Richard', 3: 'Clark', 4: 'Claire', 5: 'Susan'}, 'neighbourhoodgroup': {0: 'Brooklyn', 1: 'Brooklyn', 2: 'Manhattan', 3: 'Manhattan', 4: None, 5: None}, 'neighbourhood': {0: 'ClintonHill', 1: 'ClintonHill', 2: 'EastHarlem', 3: 'UpperWestSide', 4: 'ClintonHill', 5: 'EastHarlem'}}
df = pd.DataFrame(data)

'''
  resident neighbourhoodgroup  neighbourhood
0     John           Brooklyn    ClintonHill
1      Mae           Brooklyn    ClintonHill
2  Richard          Manhattan     EastHarlem
3    Clark          Manhattan  UpperWestSide
4   Claire               None    ClintonHill
5    Susan               None     EastHarlem
'''

First, create a dictionary named reference, with keys from the 'neighbourhood' column and values from the 'neighbourhoodgroup' column:

# Keep only rows whose group is known (other columns may still contain NaN)
df_ref = df.dropna(subset=['neighbourhoodgroup']).drop_duplicates(['neighbourhoodgroup', 'neighbourhood'])
# Map each neighbourhood to its neighbourhood group
reference = dict(zip(df_ref.neighbourhood, df_ref.neighbourhoodgroup))

'''
{'ClintonHill': 'Brooklyn',
 'EastHarlem': 'Manhattan',
 'UpperWestSide': 'Manhattan'}
'''
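Building a dictionary this way silently keeps only one group when the same neighbourhood name appears under two different groups. A quick check before trusting the mapping might look like the sketch below. The small frame and the conflicting 'Queens' row are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical data: 'EastHarlem' deliberately appears under two groups
df = pd.DataFrame({
    'neighbourhoodgroup': ['Brooklyn', 'Brooklyn', 'Manhattan', 'Queens', None],
    'neighbourhood': ['ClintonHill', 'ClintonHill', 'EastHarlem', 'EastHarlem', 'ClintonHill'],
})

# Count distinct non-null groups per neighbourhood (nunique skips NaN by default)
groups_per_neighbourhood = df.groupby('neighbourhood')['neighbourhoodgroup'].nunique()

# Neighbourhoods that map to more than one group need manual review
ambiguous = groups_per_neighbourhood[groups_per_neighbourhood > 1].index.tolist()
print(ambiguous)
```

If the list is empty, every neighbourhood maps to exactly one group and the dictionary is safe to use.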

Next, apply the reference dictionary to the dataframe. Using map leaves NaN for any neighbourhood that is not in the dictionary instead of raising a KeyError:

df['result'] = df.neighbourhood.map(reference)
print(df)

'''
  resident neighbourhoodgroup  neighbourhood     result
0     John           Brooklyn    ClintonHill   Brooklyn
1      Mae           Brooklyn    ClintonHill   Brooklyn
2  Richard          Manhattan     EastHarlem  Manhattan
3    Clark          Manhattan  UpperWestSide  Manhattan
4   Claire               None    ClintonHill   Brooklyn
5    Susan               None     EastHarlem  Manhattan
'''
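If the goal is to fill the gaps in the original column rather than add a separate result column, the same dictionary can be combined with fillna. This is a minimal sketch that assumes the column names from the example above. The extra 'Zed' / 'Unknownville' row is a made-up value, added only to show that a neighbourhood missing from the dictionary stays empty.

```python
import pandas as pd

df = pd.DataFrame({
    'resident': ['John', 'Mae', 'Richard', 'Clark', 'Claire', 'Susan', 'Zed'],
    'neighbourhoodgroup': ['Brooklyn', 'Brooklyn', 'Manhattan', 'Manhattan', None, None, None],
    'neighbourhood': ['ClintonHill', 'ClintonHill', 'EastHarlem', 'UpperWestSide',
                      'ClintonHill', 'EastHarlem', 'Unknownville'],  # last value is hypothetical
})

# Build the neighbourhood -> group dictionary from rows where the group is known
known = df.dropna(subset=['neighbourhoodgroup'])
reference = dict(zip(known['neighbourhood'], known['neighbourhoodgroup']))

# Fill only the missing groups; existing values are left untouched
df['neighbourhoodgroup'] = df['neighbourhoodgroup'].fillna(df['neighbourhood'].map(reference))

filled = df['neighbourhoodgroup'].tolist()
print(df)
```

Assigning the result back to the column avoids calling fillna with inplace=True on a selected column, which newer pandas versions warn about.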

Answer 2

Score: 1

  resident neighbourhoodgroup  neighbourhood
0     John           Brooklyn    ClintonHill
1      Mae           Brooklyn    ClintonHill
2  Richard          Manhattan     EastHarlem
3    Clark          Manhattan  UpperWestSide
4   Claire               <NA>    ClintonHill
5    Susan               <NA>     EastHarlem

We can make a lookup from neighbourhood to neighbourhoodgroup by dropping duplicates and NaNs, and converting to a dictionary:

lookup_df = df[['neighbourhoodgroup', 'neighbourhood']].drop_duplicates().dropna()
lookup_dict = {x.neighbourhood: x.neighbourhoodgroup for x in lookup_df.itertuples()}

{'ClintonHill': 'Brooklyn', 'EastHarlem': 'Manhattan', 'UpperWestSide': 'Manhattan'}

Then we match the neighbourhoods to the neighbourhoodgroups using Series.replace():

neighbourhood_fill = df.neighbourhood.replace(lookup_dict)

Then we replace the NaNs in the neighbourhoodgroup column with our new fill values, using fillna:

df['neighbourhoodgroup'] = df['neighbourhoodgroup'].fillna(neighbourhood_fill)

  resident neighbourhoodgroup  neighbourhood
0     John           Brooklyn    ClintonHill
1      Mae           Brooklyn    ClintonHill
2  Richard          Manhattan     EastHarlem
3    Clark          Manhattan  UpperWestSide
4   Claire           Brooklyn    ClintonHill
5    Susan          Manhattan     EastHarlem
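One caveat about the replace step above: replace leaves any value that is not in the dictionary unchanged, so a neighbourhood with no known group would end up copied into the group column. Series.map returns NaN in that case instead. The sketch below compares the two on a tiny made-up Series; 'Unknownville' is a hypothetical value that is not in the lookup.

```python
import pandas as pd

lookup_dict = {'ClintonHill': 'Brooklyn', 'EastHarlem': 'Manhattan'}
neighbourhood = pd.Series(['ClintonHill', 'EastHarlem', 'Unknownville'])

# replace: unmatched values pass through unchanged
via_replace = neighbourhood.replace(lookup_dict)

# map: unmatched values become NaN
via_map = neighbourhood.map(lookup_dict)

print(via_replace.tolist())
print(via_map.tolist())
```

When every neighbourhood with a missing group also appears in a complete row, both approaches give the same result.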

Answer 3

Score: 0

Another possible solution:

d = df.iloc[:, 1:].dropna().drop_duplicates()

(df.update(
    df['neighbourhood'][df['neighbourhood group'].isna()]
    .map(dict(zip(d.iloc[:,1], d.iloc[:,0])))
    .rename('neighbourhood group')))

Output:

  resident neighbourhood group    neighbourhood
0     John            Brooklyn     Clinton Hill
1      Mae            Brooklyn     Clinton Hill
2  Richard           Manhattan      East Harlem
3    Clark           Manhattan  Upper West Side
4   Claire            Brooklyn     Clinton Hill
5    Susan           Manhattan      East Harlem
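The chained expression above is compact, so it may help to see the same idea unpacked into named steps. This is a sketch that assumes the three-column layout from the question; the intermediate names (pairs, mapping, missing, patch) are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'resident': ['John', 'Mae', 'Richard', 'Clark', 'Claire', 'Susan'],
    'neighbourhood group': ['Brooklyn', 'Brooklyn', 'Manhattan', 'Manhattan', None, None],
    'neighbourhood': ['Clinton Hill', 'Clinton Hill', 'East Harlem', 'Upper West Side',
                      'Clinton Hill', 'East Harlem'],
})

# 1. Unique (group, neighbourhood) pairs taken from the complete rows
pairs = df[['neighbourhood group', 'neighbourhood']].dropna().drop_duplicates()

# 2. Dictionary from neighbourhood to group
mapping = dict(zip(pairs['neighbourhood'], pairs['neighbourhood group']))

# 3. Neighbourhoods of the rows whose group is missing (original index is kept)
missing = df.loc[df['neighbourhood group'].isna(), 'neighbourhood']

# 4. Look up the group and name the Series after the target column
patch = missing.map(mapping).rename('neighbourhood group')

# 5. update aligns on index and on the Series name, and modifies df in place
df.update(patch)
print(df)
```

Selecting the columns by name rather than by position keeps the code working when the dataframe has many other columns, as the question says it does.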

huangapple
  • Posted on July 20, 2023 at 11:51:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76726560.html