2023-07-20 11:51:41

How to fill Null values in a Pandas DataFrame based on a value from a different column?

Question

I'm currently working on a portfolio project and have a large DataFrame with (among many other columns) two columns: one for the "neighbourhood group" and one for the "neighbourhood". The first column contains some NaN values, and I need to fill them with the respective city area.

Example:

resident   neighbourhood group    neighbourhood
John       Brooklyn               Clinton Hill
Mae        Brooklyn               Clinton Hill
Richard    Manhattan              East Harlem
Clark      Manhattan              Upper West Side
...
Claire          NaN               Clinton Hill
Susan           NaN               East Harlem

This is just a simple example. I have hundreds of these cases, far too many to change manually. Any idea how to assign the proper area to the right neighbourhood?

I've tried, to no avail, to build different functions to fix the issue, but I can't wrap my head around it.

Answer 1

Score: 1

I recommend building a reference dictionary that maps each neighbourhood to its neighbourhood group. Assume this is the original DataFrame:

import pandas as pd
data = {'resident': {0: 'John', 1: 'Mae', 2: 'Richard', 3: 'Clark', 4: 'Claire', 5: 'Susan'}, 'neighbourhoodgroup': {0: 'Brooklyn', 1: 'Brooklyn', 2: 'Manhattan', 3: 'Manhattan', 4: None, 5: None}, 'neighbourhood': {0: 'ClintonHill', 1: 'ClintonHill', 2: 'EastHarlem', 3: 'UpperWestSide', 4: 'ClintonHill', 5: 'EastHarlem'}}
df = pd.DataFrame(data)
'''
  resident neighbourhoodgroup  neighbourhood
0     John           Brooklyn    ClintonHill
1      Mae           Brooklyn    ClintonHill
2  Richard          Manhattan     EastHarlem
3    Clark          Manhattan  UpperWestSide
4   Claire               None    ClintonHill
5    Susan               None     EastHarlem
'''

First, create a reference dictionary, reference, with keys from the 'neighbourhood' column and values from the 'neighbourhoodgroup' column.

# Keep only rows with a known group; subset= avoids dropping rows that have NaN in unrelated columns
df_ref = df.dropna(subset=['neighbourhoodgroup']).drop_duplicates(['neighbourhoodgroup', 'neighbourhood'])
# Map each neighbourhood (key) to its neighbourhood group (value)
reference = dict(zip(df_ref.neighbourhood, df_ref.neighbourhoodgroup))
'''
{'ClintonHill': 'Brooklyn',
 'EastHarlem': 'Manhattan',
 'UpperWestSide': 'Manhattan'}
'''
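A dictionary can hold only one group per neighbourhood, so if the real data lists the same neighbourhood under two different groups, the later pair silently overwrites the earlier one. As a minimal sketch of how this could be checked before building the lookup (the conflicting "Queens" row below is made up purely for illustration and is not part of the original data):

```python
import pandas as pd

# Small frame with one deliberately conflicting (hypothetical) row:
# EastHarlem appears under both Manhattan and Queens.
df = pd.DataFrame({
    "neighbourhoodgroup": ["Brooklyn", "Brooklyn", "Manhattan", "Queens", None],
    "neighbourhood": ["ClintonHill", "ClintonHill", "EastHarlem", "EastHarlem", "ClintonHill"],
})

# Count the distinct known groups per neighbourhood (nunique ignores NaN by default).
groups_per_neighbourhood = df.groupby("neighbourhood")["neighbourhoodgroup"].nunique()

# Neighbourhoods listed under more than one group would make the lookup ambiguous.
ambiguous = groups_per_neighbourhood[groups_per_neighbourhood > 1].index.tolist()
print(ambiguous)  # ['EastHarlem']
```

If the list is empty, each neighbourhood maps to exactly one group and the dictionary is safe to use.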

Next, apply the reference dictionary to the DataFrame:

df['result'] = df.neighbourhood.apply(lambda x: reference[x])
print(df)
'''
  resident neighbourhoodgroup  neighbourhood     result
0     John           Brooklyn    ClintonHill   Brooklyn
1      Mae           Brooklyn    ClintonHill   Brooklyn
2  Richard          Manhattan     EastHarlem  Manhattan
3    Clark          Manhattan  UpperWestSide  Manhattan
4   Claire               None    ClintonHill   Brooklyn
5    Susan               None     EastHarlem  Manhattan
'''
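The approach above writes the looked-up group into a separate 'result' column. If the goal is to fill only the missing entries of the original column, one possible variation (a sketch that reuses the sample data from this answer, not the answer author's code) is to combine Series.map with fillna. Unlike reference[x], map returns NaN for a neighbourhood that is missing from the dictionary instead of raising a KeyError:

```python
import pandas as pd

df = pd.DataFrame({
    "resident": ["John", "Mae", "Richard", "Clark", "Claire", "Susan"],
    "neighbourhoodgroup": ["Brooklyn", "Brooklyn", "Manhattan", "Manhattan", None, None],
    "neighbourhood": ["ClintonHill", "ClintonHill", "EastHarlem",
                      "UpperWestSide", "ClintonHill", "EastHarlem"],
})

# Build the lookup only from rows whose group is known.
known = df.dropna(subset=["neighbourhoodgroup"])
reference = dict(zip(known["neighbourhood"], known["neighbourhoodgroup"]))

# Look up a group for every row, then use it only where the original value is missing.
looked_up = df["neighbourhood"].map(reference)
df["neighbourhoodgroup"] = df["neighbourhoodgroup"].fillna(looked_up)
print(df)
```

Existing values are left untouched because fillna only writes into positions that are NaN.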

Answer 2

Score: 1

  resident neighbourhoodgroup  neighbourhood
0     John           Brooklyn    ClintonHill
1      Mae           Brooklyn    ClintonHill
2  Richard          Manhattan     EastHarlem
3    Clark          Manhattan  UpperWestSide
4   Claire               <NA>    ClintonHill
5    Susan               <NA>     EastHarlem

We can make a lookup from neighbourhood to neighbourhoodgroup by dropping duplicates and NaNs, and converting to a dictionary:

lookup_df = df[['neighbourhoodgroup', 'neighbourhood']].drop_duplicates().dropna()
lookup_dict = {x.neighbourhood: x.neighbourhoodgroup for x in lookup_df.itertuples()}
{'ClintonHill': 'Brooklyn', 'EastHarlem': 'Manhattan', 'UpperWestSide': 'Manhattan'}

Then we map the neighbourhoods to their neighbourhood groups using Series.replace() (df.neighbourhood is a Series):

neighbourhood_fill = df.neighbourhood.replace(lookup_dict)

Then we replace the NaNs in the neighbourhoodgroup column with our new fill values, using fillna:

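# Assign the result back; inplace=True on a selected column triggers a chained-assignment warning in recent pandas and may not update df
df['neighbourhoodgroup'] = df['neighbourhoodgroup'].fillna(neighbourhood_fill)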
  resident neighbourhoodgroup  neighbourhood
0     John           Brooklyn    ClintonHill
1      Mae           Brooklyn    ClintonHill
2  Richard          Manhattan     EastHarlem
3    Clark          Manhattan  UpperWestSide
4   Claire           Brooklyn    ClintonHill
5    Susan          Manhattan     EastHarlem
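One behaviour of this approach is worth knowing: replace() leaves any value that is not a key of the dictionary unchanged, so a neighbourhood with no known group would be filled with its own name. The sketch below illustrates the difference from Series.map, which yields NaN for unknown keys. The "Astoria" row is a hypothetical example added here for illustration; it is not in the original data.

```python
import pandas as pd

# "Astoria" is a made-up neighbourhood with no known group anywhere in the data.
df = pd.DataFrame({
    "neighbourhoodgroup": ["Brooklyn", None, None],
    "neighbourhood": ["ClintonHill", "ClintonHill", "Astoria"],
})
lookup_dict = {"ClintonHill": "Brooklyn"}

# replace() passes unmatched values through, so "Astoria" becomes the fill value.
with_replace = df["neighbourhoodgroup"].fillna(df["neighbourhood"].replace(lookup_dict))

# map() gives NaN for unmatched values, so the gap stays visibly missing.
with_map = df["neighbourhoodgroup"].fillna(df["neighbourhood"].map(lookup_dict))

print(with_replace.tolist())  # ['Brooklyn', 'Brooklyn', 'Astoria']
print(with_map.isna().tolist())  # [False, False, True]
```

Which behaviour is preferable depends on the data; with map, the remaining gaps can still be found afterwards with isna().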

Answer 3

Score: 0

Another possible solution:

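# Known (group, neighbourhood) pairs: all columns after 'resident', without NaN rows or duplicates
# (this positional slicing assumes the frame holds only the three columns shown)
d = df.iloc[:, 1:].dropna().drop_duplicates()
# For rows with a missing group, map the neighbourhood to its group and write the values back in place
(df.update(
    df['neighbourhood'][df['neighbourhood group'].isna()]
    .map(dict(zip(d.iloc[:,1], d.iloc[:,0])))
    .rename('neighbourhood group')))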

Output:

  resident neighbourhood group    neighbourhood
0     John            Brooklyn     Clinton Hill
1      Mae            Brooklyn     Clinton Hill
2  Richard           Manhattan      East Harlem
3    Clark           Manhattan  Upper West Side
4   Claire            Brooklyn     Clinton Hill
5    Susan           Manhattan      East Harlem
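All three answers build an explicit dictionary. As an alternative sketch that skips the dictionary (this is not taken from any of the answers and assumes, like them, that each neighbourhood belongs to exactly one group), groupby with transform("first") can broadcast the first non-null group of each neighbourhood to all of its rows:

```python
import pandas as pd

df = pd.DataFrame({
    "resident": ["John", "Mae", "Richard", "Clark", "Claire", "Susan"],
    "neighbourhood group": ["Brooklyn", "Brooklyn", "Manhattan", "Manhattan", None, None],
    "neighbourhood": ["Clinton Hill", "Clinton Hill", "East Harlem",
                      "Upper West Side", "Clinton Hill", "East Harlem"],
})

# For every row, take the first non-null group recorded for its neighbourhood.
first_known = df.groupby("neighbourhood")["neighbourhood group"].transform("first")

# Fill only the missing entries; existing values are left as they are.
df["neighbourhood group"] = df["neighbourhood group"].fillna(first_known)
print(df)
```

A neighbourhood whose rows are all missing a group has no first value to borrow, so those rows remain NaN.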
