March 3, 2023 23:57:46

How to fill in a dataframe column based on another column of the same dataframe using a dictionary

Question

I am working with a pandas dataframe in Python that has a 'Country Name' column and a 'Region Name' column. 'Country Name' has no NaNs, while 'Region Name' does.

I have created a dictionary:

dict = {
    "Central and Eastern Europe": [
        "Albania",
        "Bosnia and Herzegovina",
        "Bulgaria",
        "Croatia",
        "Czech Republic",
        "Estonia",
        "Hungary",
        "Kosovo",
        "Latvia",
        "Lithuania",
        "Montenegro",
        "North Macedonia",
        "Poland",
        "Romania",
        "Serbia",
        "Slovakia",
        "Slovenia",
    ],
    "East Asia": [
        "China",
        "Hong Kong S.A.R. of China",
        "Japan",
        "Mongolia",
        "South Korea",
        "Taiwan Province of China",
    ],
    ...
}

I want to fill the NaNs in the 'Region Name' column based on the 'Country Name' of the same row, using this dictionary.

Can you suggest a solution?

I tried the following line of code, but it didn't work:

df.loc[df['Country name'].isnull(), 'Country name'] = df['Regional indicator'].map(dict)

Answer 1

Score: 1

If I were you, I would turn the backup lookup dict into an inverted index: map country names to regions rather than regions to countries, then apply that map to the NaNs only. This is much faster than scanning every region to check whether the country belongs to it.


import pandas as pd

# Country -> region lookup (the inverted form of a region -> countries dict)
country = {"Egypt": "Africa", "Libya": "Africa", "China": "Asia"}
df = pd.DataFrame({
    'Country Name': ['Albania', 'Japan', 'United States', 'China'],
    'Region Name': ['Central and Eastern Europe', 'East Asia', pd.NA, pd.NA]
})
# Fill only the missing regions, looking each one up by the row's country
df['Region Name'] = df['Region Name'].fillna(df['Country Name'].map(country))
print(df)

Before the fillna line:

    Country Name                 Region Name
0        Albania  Central and Eastern Europe
1          Japan                   East Asia
2  United States                        <NA>
3          China                        <NA>

After it, you can see that China was mapped to Asia:

    Country Name                 Region Name
0        Albania  Central and Eastern Europe
1          Japan                   East Asia
2  United States                         NaN
3          China                        Asia

Any country that is not in the country-to-region map is left as NaN.
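
Since countries missing from the map stay NaN, it can help to list them after the fill so the lookup can be extended. The following is a minimal sketch of that check, not part of the original answer; the name country_to_region and its contents are illustrative assumptions.

```python
import pandas as pd

# Hypothetical country -> region lookup (illustrative values only).
country_to_region = {"Egypt": "Africa", "Libya": "Africa", "China": "Asia"}

df = pd.DataFrame({
    "Country Name": ["Albania", "Japan", "United States", "China"],
    "Region Name": ["Central and Eastern Europe", "East Asia", None, None],
})

# Fill only the missing regions from the lookup.
df["Region Name"] = df["Region Name"].fillna(
    df["Country Name"].map(country_to_region)
)

# Countries whose region is still missing need a new entry in the lookup.
unmapped = df.loc[df["Region Name"].isna(), "Country Name"].unique().tolist()
print(unmapped)  # ['United States']
```

An empty list here would mean every row ended up with a region.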

Answer 2

Score: 0

The dict you have is not directly usable for mapping. As @MinaAshraf correctly said, you need to invert it. Here is one way to do so (also, please do not shadow the built-in name dict):

dct = {
    "Central and Eastern Europe": [
        "Albania",
        "Bosnia and Herzegovina",
        # ...
    ],
    "East Asia": [
        "China",
        "Japan",
        # ...
    ],
    # ...
}

revdct = {c: r for r, lst in dct.items() for c in lst}
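
One caveat with a plain dict comprehension: if a country appears under two regions, the later region silently overwrites the earlier one. Whether the asker's data has such duplicates is unknown, so the following is only a sketch of a stricter inversion; the helper name invert_checked and the sample dict (with a deliberate duplicate) are hypothetical.

```python
from collections import defaultdict

# Hypothetical region -> countries dict with a deliberate duplicate ("Japan").
dct = {
    "East Asia": ["China", "Japan"],
    "Asia-Pacific": ["Japan", "Australia"],
}

def invert_checked(region_to_countries):
    """Invert region -> [countries] into country -> region, rejecting duplicates."""
    seen = defaultdict(list)
    for region, countries in region_to_countries.items():
        for c in countries:
            seen[c].append(region)
    # Any country recorded under more than one region is ambiguous.
    dupes = {c: rs for c, rs in seen.items() if len(rs) > 1}
    if dupes:
        raise ValueError(f"countries listed under several regions: {dupes}")
    return {c: rs[0] for c, rs in seen.items()}

try:
    invert_checked(dct)
except ValueError as err:
    print(err)
```

When no country is repeated, the function returns the same mapping as the comprehension.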

There are several ways to fill in the missing values; a simple one follows. First, a reproducible example:

import pandas as pd

nan = float('NaN')
df = pd.DataFrame({
    'Country Name': ['Albania', 'Japan'],
    'Region Name': [nan, nan],
})

Now:

# Index by country so that fillna can look up each missing region by label
newdf = df.set_index('Country Name')['Region Name'].fillna(revdct).reset_index()

>>> newdf
  Country Name                 Region Name
0      Albania  Central and Eastern Europe
1        Japan                   East Asia
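
Another of the several ways, sketched here under the same assumptions, keeps the shape of the line in the question but puts the mask on the region column and maps the country column through the inverted dict. The small revdct and the sample rows below are illustrative stand-ins, not the asker's data.

```python
import pandas as pd

# Illustrative stand-in for the inverted (country -> region) dict.
revdct = {"Albania": "Central and Eastern Europe", "Japan": "East Asia"}

df = pd.DataFrame({
    "Country Name": ["Albania", "Japan", "China"],
    "Region Name": [None, "East Asia", None],
})

# Select the rows whose region is missing, then assign the mapped countries.
missing = df["Region Name"].isna()
df.loc[missing, "Region Name"] = df.loc[missing, "Country Name"].map(revdct)
print(df)
```

Rows that already have a region are left untouched, and a country absent from revdct (China in this sample) stays missing.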
