June 5, 2023 21:43:46

How to get missing percentage values when scraping a Wikipedia table with Pandas read_html?

Question

When I import a table from a web page into Python, the column Population.1 shows as NaN, even though it is not NaN on the original web page.

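import pandas as pd
import requests

pop_url = (
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
)
r = requests.get(pop_url)
# Parse every table on the page, using the first row of each as the header
wiki_tables = pd.read_html(r.text, header=0)
len(wiki_tables)
# The population table is the second table on the page
cont_pop = wiki_tables[1]
cont_pop.head()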

Answer 1

Score: 1

Here is one way to do it with Beautiful Soup:

import pandas as pd
import requests
from bs4 import BeautifulSoup
pop_url = (
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
)
r = requests.get(pop_url)
# Import table and remove first row (duplicated header)
cont_pop = pd.read_html(r.text, header=0)[1].drop(index=0)

Then:

# Find and add missing values
raw = BeautifulSoup(r.content, "html.parser").find_all("table")
rows = raw[1].text.split("\n")[14:]
rows = [rows[i : i + 11][1:] for i in range(0, len(rows), 11)][:-1]
cont_pop["Population.1"] = [value for row in rows for value in row if "%" in value]

Finally:

print(cont_pop)
# Output
    Rank                 Country / Dependency  Population Population.1   
1      –                                World  8035105000         100%  \
2      1                                China  1411750000        17.6%   
3      2                                India  1392329000        17.3%   
4      3                        United States   334869000        4.17%   
5      4                            Indonesia   277749853        3.46%   
..   ...                                  ...         ...          ...   
238    –                Tokelau (New Zealand)        1647           0%   
239    –                                 Niue        1549           0%   
240  195                         Vatican City         825           0%   
241    –  Cocos (Keeling) Islands (Australia)         593           0%   
242    –    Pitcairn Islands (United Kingdom)          47           0%   
            Date Source (official or from the United Nations) Notes  
1    10 Jun 2023                             UN projection[3]   NaN  
2    31 Dec 2022                         Official estimate[4]   [b]  
3     1 Mar 2023                       Official projection[5]   [c]  
4    10 Jun 2023                 National population clock[7]   [d]  
5    31 Dec 2022                         Official estimate[8]   NaN  
..           ...                                          ...   ...  
238   1 Jan 2019                            2019 Census [211]   NaN  
239   1 Jul 2021               National annual projection[96]   NaN  
240   1 Feb 2019               Monthly national estimate[212]  [af]  
241  30 Jun 2020                             2021 Census[213]   NaN  
242   1 Jul 2021                       Official estimate[214]   NaN  
[242 rows x 7 columns]
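
The snippet above relies on fixed offsets (14 and 11) into the table's text, which may stop working if the page layout changes. As a possible alternative, here is a minimal sketch that walks the table row by row and takes the first cell containing "%". The SAMPLE_HTML string and the extract_percentages helper are hypothetical stand-ins so the example runs offline; with the real page you would pass r.text and table_index=1, assuming the table is still the second one on the page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Country / Dependency</th><th>Population</th><th>% of world</th></tr>
  <tr><td>1</td><td>China</td><td>1,411,750,000</td><td>17.6%</td></tr>
  <tr><td>2</td><td>India</td><td>1,392,329,000</td><td>17.3%</td></tr>
  <tr><td>3</td><td>United States</td><td>334,869,000</td><td>4.17%</td></tr>
</table>
"""


def extract_percentages(html, table_index=0):
    """Return the first cell containing '%' from each data row of a table."""
    table = BeautifulSoup(html, "html.parser").find_all("table")[table_index]
    percentages = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all("td")]
        if not cells:
            continue  # rows made only of <th> cells are headers
        # Append None when a row has no percentage, so rows stay aligned.
        match = next((text for text in cells if "%" in text), None)
        percentages.append(match)
    return percentages


percentages = extract_percentages(SAMPLE_HTML)
df = pd.DataFrame({"Country": ["China", "India", "United States"]})
df["Population.1"] = percentages
# Optional: convert "17.6%" style strings to floats for arithmetic.
df["Population.1 (float)"] = df["Population.1"].str.rstrip("%").astype(float)
print(df)
```

Before assigning the result to cont_pop["Population.1"], it is worth checking that the list has the same length as the DataFrame, since extra header or footer rows in the real table could shift the alignment.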
