How to fill in a dataframe column based on another column of the same dataframe using a dictionary

Question

I am working with a dataframe in Python that has a 'Country Name' column and a 'Region Name' column. 'Country Name' has no NaNs, while 'Region Name' does.

I have created a dictionary:

dict = {
    "Central and Eastern Europe": [
        "Albania",
        "Bosnia and Herzegovina",
        "Bulgaria",
        "Croatia",
        "Czech Republic",
        "Estonia",
        "Hungary",
        "Kosovo",
        "Latvia",
        "Lithuania",
        "Montenegro",
        "North Macedonia",
        "Poland",
        "Romania",
        "Serbia",
        "Slovakia",
        "Slovenia",
    ],
    "East Asia": [
        "China",
        "Hong Kong S.A.R. of China",
        "Japan",
        "Mongolia",
        "South Korea",
        "Taiwan Province of China",
    ],
    ...
}

I want to fill the NaNs in the 'Region Name' column based on the 'Country Name' in the same row, using this dictionary.

Can you provide me with a solution?

I have tried the following line of code, but it didn't work:

df.loc[df['Country name'].isnull(), 'Country name'] = df['Regional indicator'].map(dict)

Answer 1

Score: 1

If I were you, I would turn the lookup dict into an inverted index: map each country name to its region instead of the other way around, then apply that map to the NaNs only. This is much faster than scanning every region to check whether the country belongs to it.

import pandas as pd

# Inverted lookup: country -> region
country = {"Egypt": "Africa", "Libya": "Africa", "China": "Asia"}
df = pd.DataFrame({
    'Country Name': ['Albania', 'Japan', 'United States', 'China'],
    'Region Name': ['Central and Eastern Europe', 'East Asia', pd.NA, pd.NA]
})
# Fill only the missing regions, looking each one up by country name
df['Region Name'] = df['Region Name'].fillna(df['Country Name'].map(country))
print(df)

Before the fillna line:

    Country Name                 Region Name
0        Albania  Central and Eastern Europe
1          Japan                   East Asia
2  United States                        <NA>
3          China                        <NA>

After it, you can see that China was mapped to Asia:

    Country Name                 Region Name
0        Albania  Central and Eastern Europe
1          Japan                   East Asia
2  United States                         NaN
3          China                        Asia

Any country that is missing from the country-to-region map is left as NaN.
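
To see which countries were left unfilled, one option is to select the rows where 'Region Name' is still null after the fill. This is a minimal sketch, assuming the same column names as above; the mapping country_to_region and the variable unmapped are illustrative names, not part of the original answer.

```python
import pandas as pd

# Illustrative country -> region mapping (assumed, deliberately incomplete)
country_to_region = {"China": "Asia", "Egypt": "Africa"}

df = pd.DataFrame({
    "Country Name": ["Albania", "Japan", "United States", "China"],
    "Region Name": ["Central and Eastern Europe", "East Asia", None, None],
})

# Fill only the missing regions from the mapping
df["Region Name"] = df["Region Name"].fillna(df["Country Name"].map(country_to_region))

# Countries whose region is still missing, i.e. absent from the mapping
unmapped = df.loc[df["Region Name"].isna(), "Country Name"].unique().tolist()
print(unmapped)
```

Any names printed here would need to be added to the mapping (or have their spelling reconciled with the dataframe) before the column can be filled completely.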

Answer 2

Score: 0

The dict you have is not directly usable for mapping. As @MinaAshraf correctly said, you need to invert it. Here is one way to do this (also, please do not shadow the built-in name dict):

dct = {
    "Central and Eastern Europe": [
        "Albania",
        "Bosnia and Herzegovina",
        # ...
    ],
    "East Asia": [
        "China",
        "Japan",
        # ...
    ],
    # ...
}

revdct = {c: r for r, lst in dct.items() for c in lst}
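
One caveat with a dict comprehension like this: if a country is listed under more than one region, the later entry silently overwrites the earlier one. A possible check, sketched here with a hypothetical two-region dict (the "Asia Pacific" region and the duplicate entry are made up for illustration), is to collect all regions per country before inverting:

```python
from collections import defaultdict

# Hypothetical region -> countries dict with one deliberate duplicate
dct = {
    "East Asia": ["China", "Japan"],
    "Asia Pacific": ["Japan", "Australia"],
}

# Collect every region each country is listed under
regions_by_country = defaultdict(list)
for region, countries in dct.items():
    for c in countries:
        regions_by_country[c].append(region)

# Countries that appear under more than one region
duplicates = {c: rs for c, rs in regions_by_country.items() if len(rs) > 1}
print(duplicates)
```

If duplicates is empty, the inversion loses no information.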

Now, there are several ways to fill in the missing values; a simple one follows. But first, let's write a reproducible example:

import pandas as pd

nan = float('NaN')
df = pd.DataFrame({
    'Country Name': ['Albania', 'Japan'],
    'Region Name': [nan, nan],
})

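Then fill the missing values, using the country names as the index so that fillna can look each one up in revdct: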

newdf = df.set_index('Country Name')['Region Name'].fillna(revdct).reset_index()

>>> newdf
  Country Name                 Region Name
0      Albania  Central and Eastern Europe
1        Japan                   East Asia

Posted by huangapple on 2023-03-03 23:57:46
Source: https://go.coder-hub.com/75629290.html