How to get missing percentage values when scraping a Wikipedia table with Pandas read_html?

Question

When I import a table from a web page into Python, the column Population.1 shows as NaN, even though it is not NaN on the original page.

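import pandas as pd
import requests

pop_url = (
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
)

r = requests.get(pop_url)

# Parse every <table> on the page, using the first row of each as the header
wiki_tables = pd.read_html(r.text, header=0)

len(wiki_tables)

# Select the second table on the page
cont_pop = wiki_tables[1]

cont_pop.head()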
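
Before patching the column, it can help to confirm that it is entirely missing rather than partly parsed. The following is a minimal sketch on a hypothetical stand-in DataFrame (the values and the name `missing_share` are illustrative, not taken from the real page):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed table; the real cont_pop has more columns
df = pd.DataFrame(
    {
        "Population": [1411750000, 1392329000, 334869000],
        "Population.1": [np.nan, np.nan, np.nan],
    }
)

# Fraction of missing values per column; 1.0 means no value was parsed at all
missing_share = df.isna().mean()
print(missing_share)
```

Running the same check on `cont_pop` would show at a glance which columns need to be filled in from the raw HTML.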

Answer 1

Score: 1


Here is one way to do it with Beautiful Soup:

import pandas as pd
import requests
from bs4 import BeautifulSoup

pop_url = (
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
)
r = requests.get(pop_url)

# Import table and remove first row (duplicated header)
cont_pop = pd.read_html(r.text, header=0)[1].drop(index=0)

Then:

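# Find and add the missing values
raw = BeautifulSoup(r.content, "html.parser").find_all("table")
# Split the second table's text into lines and skip the first 14
# (assumed here to be header text)
rows = raw[1].text.split("\n")[14:]
# Group the lines into chunks of 11 (assumed to be one table row each),
# drop the first element of each chunk, and discard the final chunk
rows = [rows[i : i + 11][1:] for i in range(0, len(rows), 11)][:-1]
cont_pop["Population.1"] = [value for row in rows for value in row if "%" in value]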
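
The offsets 14 and 11 in the snippet above are tied to the page layout at the time of writing, so they may stop working if the table changes. A possible alternative is to walk the table row by row and pick the percentage cell from each row. This is only a sketch: `SAMPLE_HTML` is a hypothetical miniature of the table, `percent_cells` is an illustrative helper name, and the real Wikipedia markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page's table; the real markup may differ
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Population</th><th>% of world</th></tr>
  <tr><td>China</td><td>1,411,750,000</td><td>17.6%</td></tr>
  <tr><td>India</td><td>1,392,329,000</td><td>17.3%</td></tr>
  <tr><td>Niue</td><td>1,549</td><td>0%</td></tr>
</table>
"""


def percent_cells(html, table_index=0):
    """Return, for each data row, the text of its first cell containing '%'."""
    table = BeautifulSoup(html, "html.parser").find_all("table")[table_index]
    values = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if not cells:  # rows made only of <th> cells are treated as headers
            continue
        # Use None when a row has no percentage so the list stays row-aligned
        values.append(next((c for c in cells if "%" in c), None))
    return values


percentages = percent_cells(SAMPLE_HTML)
print(percentages)
```

With the real page, one would presumably call `percent_cells(r.text, table_index=1)` and check that the length of the result matches `len(cont_pop)` before assigning it to `cont_pop["Population.1"]`.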

Finally:

print(cont_pop)
# Output

    Rank                 Country / Dependency  Population Population.1   
1                                      World  8035105000         100%  \
2      1                                China  1411750000        17.6%   
3      2                                India  1392329000        17.3%   
4      3                        United States   334869000        4.17%   
5      4                            Indonesia   277749853        3.46%   
..   ...                                  ...         ...          ...   
238                    Tokelau (New Zealand)        1647           0%   
239                                     Niue        1549           0%   
240  195                         Vatican City         825           0%   
241      Cocos (Keeling) Islands (Australia)         593           0%   
242        Pitcairn Islands (United Kingdom)          47           0%   

            Date Source (official or from the United Nations) Notes  
1    10 Jun 2023                             UN projection[3]   NaN  
2    31 Dec 2022                         Official estimate[4]   [b]  
3     1 Mar 2023                       Official projection[5]   [c]  
4    10 Jun 2023                 National population clock[7]   [d]  
5    31 Dec 2022                         Official estimate[8]   NaN  
..           ...                                          ...   ...  
238   1 Jan 2019                            2019 Census [211]   NaN  
239   1 Jul 2021               National annual projection[96]   NaN  
240   1 Feb 2019               Monthly national estimate[212]  [af]  
241  30 Jun 2020                             2021 Census[213]   NaN  
242   1 Jul 2021                       Official estimate[214]   NaN  

[242 rows x 7 columns]
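
The recovered values are strings such as "17.6%", so they cannot be used in arithmetic directly. Below is a small sketch of converting them to floats, using a hypothetical stand-in for a few rows of `cont_pop`; the column name `pct_of_world` is an illustrative choice, not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of cont_pop
df = pd.DataFrame(
    {
        "Country / Dependency": ["World", "China", "Niue"],
        "Population.1": ["100%", "17.6%", "0%"],
    }
)

# Strip the trailing percent sign and convert to float; errors="coerce"
# turns anything unparseable into NaN instead of raising an exception
df["pct_of_world"] = pd.to_numeric(
    df["Population.1"].str.rstrip("%"), errors="coerce"
)
print(df["pct_of_world"].tolist())
```

Keeping the original string column alongside the numeric one makes it easy to spot rows where the conversion produced NaN.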

huangapple
  • Posted on June 5, 2023, 21:43:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76407049.html