How to fill in a dataframe column based on another column of the same dataframe using a dictionary
Question
I am working with a dataframe in Python that has a 'Country Name' column and a 'Region Name' column. 'Country Name' has no NaNs, while 'Region Name' does.
I have created a dictionary:
dict = {
    "Central and Eastern Europe": [
        "Albania",
        "Bosnia and Herzegovina",
        "Bulgaria",
        "Croatia",
        "Czech Republic",
        "Estonia",
        "Hungary",
        "Kosovo",
        "Latvia",
        "Lithuania",
        "Montenegro",
        "North Macedonia",
        "Poland",
        "Romania",
        "Serbia",
        "Slovakia",
        "Slovenia",
    ],
    "East Asia": [
        "China",
        "Hong Kong S.A.R. of China",
        "Japan",
        "Mongolia",
        "South Korea",
        "Taiwan Province of China",
    ],
    ...
}
I want to fill the NaNs in the 'Region Name' column based on the 'Country Name' in the same row, using this dictionary.
Can you provide me with a solution?
I have tried the following line of code but it didn't work:
df.loc[df['Country name'].isnull(), 'Country name'] = df['Regional indicator'].map(dict)
Answer 1
Score: 1
If I were you, I would turn the lookup dict into an inverted index: map each country name to its region, instead of the other way around, and then apply that map to the NaNs only.
A single lookup per row is much faster than scanning every region's list to check whether the country is in it.
import pandas as pd
# Inverted lookup: country name -> region
country = {"Egypt": "Africa", "Libya": "Africa", "China": "Asia"}
df = pd.DataFrame({
    'Country Name': ['Albania', 'Japan', 'United States', 'China'],
    'Region Name': ['Central and Eastern Europe', 'East Asia', pd.NA, pd.NA]
})
# Fill only the missing regions, looking each one up by the row's country
df['Region Name'] = df['Region Name'].fillna(df['Country Name'].map(country))
print(df)
Before the fillna line:
    Country Name                 Region Name
0        Albania  Central and Eastern Europe
1          Japan                   East Asia
2  United States                        <NA>
3          China                        <NA>
After it, you can see that China was mapped to Asia:
    Country Name                 Region Name
0        Albania  Central and Eastern Europe
1          Japan                   East Asia
2  United States                         NaN
3          China                        Asia
Any country that is missing from the country-to-region map is left as NaN.
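To see which rows the lookup could not fill, a minimal sketch follows. It reuses the column names from the example above; the names `country_to_region` and `unmapped` are illustrative and not part of the original answer.

```python
import pandas as pd

# Hypothetical country -> region lookup (illustrative values only)
country_to_region = {"Egypt": "Africa", "Libya": "Africa", "China": "Asia"}

df = pd.DataFrame({
    "Country Name": ["Albania", "Japan", "United States", "China"],
    "Region Name": ["Central and Eastern Europe", "East Asia", None, None],
})

# Fill missing regions from the lookup; existing values are left untouched
df["Region Name"] = df["Region Name"].fillna(
    df["Country Name"].map(country_to_region)
)

# Countries whose region is still missing after the fill
unmapped = df.loc[df["Region Name"].isna(), "Country Name"].tolist()
print(unmapped)
```

Listing the leftovers this way makes it easier to spot spelling mismatches between the dataframe and the lookup keys.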
Answer 2
Score: 0
The dict you have is not directly usable for mapping. As @MinaAshraf correctly said, you need to invert it so that each country maps to its region. Here is one way to do this (also, please do not shadow the built-in name dict):
dct = {
    "Central and Eastern Europe": [
        "Albania",
        "Bosnia and Herzegovina",
        # ...
    ],
    "East Asia": [
        "China",
        "Japan",
        # ...
    ],
    # ...
}
# Invert: one {country: region} entry per country in each region's list
revdct = {c: r for r, lst in dct.items() for c in lst}
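One caveat with the comprehension above: if a country is listed under two regions, the later region silently wins. Below is a minimal sketch of a stricter inversion, assuming a duplicate should be treated as an error. The function name `invert_regions` and the shortened sample data are hypothetical.

```python
def invert_regions(region_to_countries):
    """Turn {region: [countries]} into {country: region}, rejecting duplicates."""
    country_to_region = {}
    for region, countries in region_to_countries.items():
        for country in countries:
            if country in country_to_region:
                raise ValueError(
                    f"{country!r} is listed under both "
                    f"{country_to_region[country]!r} and {region!r}"
                )
            country_to_region[country] = region
    return country_to_region


# Shortened sample of the dictionary from the question
sample_dct = {
    "Central and Eastern Europe": ["Albania", "Bulgaria"],
    "East Asia": ["China", "Japan"],
}
checked_revdct = invert_regions(sample_dct)
print(checked_revdct["Japan"])
```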
Now, there are several ways to fill in the missing values. A simple one is as follows. But first, let's write a reproducible example:
import pandas as pd
nan = float('NaN')
df = pd.DataFrame({
    'Country Name': ['Albania', 'Japan'],
    'Region Name': [nan, nan],
})
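Now, with the country set as the index, fillna can take revdct directly and look up each missing value by its index label:
newdf = df.set_index('Country Name')['Region Name'].fillna(revdct).reset_index()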
>>> newdf
  Country Name                 Region Name
0      Albania  Central and Eastern Europe
1        Japan                   East Asia
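An alternative sketch that keeps the original index and any other columns in place: map the country column through the inverted dictionary and use the result only where the region is missing. The sample data and the custom index below are illustrative, and one region is pre-filled to show that existing values are not overwritten.

```python
import pandas as pd

# Inverted lookup (country -> region), shortened for illustration
revdct = {
    "Albania": "Central and Eastern Europe",
    "Japan": "East Asia",
    "China": "East Asia",
}

df = pd.DataFrame(
    {
        "Country Name": ["Albania", "Japan", "China"],
        "Region Name": [None, "East Asia", None],
    },
    index=[10, 20, 30],
)

# map() builds a region Series aligned to df's index;
# fillna() takes values from it only where 'Region Name' is missing
df["Region Name"] = df["Region Name"].fillna(df["Country Name"].map(revdct))
print(df)
```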