How to get missing percentage values when scraping a Wikipedia table with pandas read_html?
Question
When I import a table from a web page into Python with pandas read_html, the Population.1 column shows as NaN, even though it is not NaN on the original web page.
import pandas as pd
import requests

pop_url = (
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
)
r = requests.get(pop_url)
# Parse every table on the page, using the first row of each as the header
wiki_tables = pd.read_html(r.text, header=0)
len(wiki_tables)
cont_pop = wiki_tables[1]
cont_pop.head()
Answer 1
Score: 1
Here is one way to do it with Beautiful Soup:
import pandas as pd
import requests
from bs4 import BeautifulSoup
pop_url = (
    "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
)
r = requests.get(pop_url)
# Import table and remove first row (duplicated header)
cont_pop = pd.read_html(r.text, header=0)[1].drop(index=0)
Then:
# Find and add missing values
raw = BeautifulSoup(r.content, "html.parser").find_all("table")
rows = raw[1].text.split("\n")[14:]
rows = [rows[i : i + 11][1:] for i in range(0, len(rows), 11)][:-1]
cont_pop["Population.1"] = [value for row in rows for value in row if "%" in value]
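The slicing above relies on hard-coded offsets (14 and 11) into the table's text, so it may break if the page layout changes. A minimal sketch of a cell-based variant, assuming each data row keeps its percentage in a `<td>` cell; `percent_cells` and `sample_html` are hypothetical names introduced for illustration, and the sample markup is a stand-in, not the real Wikipedia page:

```python
from bs4 import BeautifulSoup


def percent_cells(html, table_index=0):
    """Return the first <td> text containing '%' from each row of one table.

    Rows without such a cell (for example header rows) are skipped.
    """
    tables = BeautifulSoup(html, "html.parser").find_all("table")
    values = []
    for tr in tables[table_index].find_all("tr"):
        # Only <td> cells are inspected, so header rows made of <th> are ignored.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        match = next((c for c in cells if "%" in c), None)
        if match is not None:
            values.append(match)
    return values


# Tiny stand-in for the page markup (hypothetical sample data layout).
sample_html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Population</th><th>% of world</th></tr>
  <tr><td>1</td><td>China</td><td>1411750000</td><td>17.6%</td></tr>
  <tr><td>2</td><td>India</td><td>1392329000</td><td>17.3%</td></tr>
</table>
"""

print(percent_cells(sample_html))
```

On the real page, the result could be assigned to cont_pop["Population.1"] only if its length matches the number of rows in cont_pop; that is something to check, not something this sketch guarantees.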
Finally:
print(cont_pop)
# Output
     Rank                 Country / Dependency  Population  Population.1  \
1       –                                World  8035105000          100%
2       1                                China  1411750000         17.6%
3       2                                India  1392329000         17.3%
4       3                        United States   334869000         4.17%
5       4                            Indonesia   277749853         3.46%
..    ...                                  ...         ...           ...
238     –                Tokelau (New Zealand)        1647            0%
239     –                                 Niue        1549            0%
240   195                         Vatican City         825            0%
241     –  Cocos (Keeling) Islands (Australia)         593            0%
242     –    Pitcairn Islands (United Kingdom)          47            0%

            Date  Source (official or from the United Nations)  Notes
1    10 Jun 2023                              UN projection[3]    NaN
2    31 Dec 2022                          Official estimate[4]    [b]
3     1 Mar 2023                        Official projection[5]    [c]
4    10 Jun 2023                  National population clock[7]    [d]
5    31 Dec 2022                          Official estimate[8]    NaN
..           ...                                           ...    ...
238   1 Jan 2019                             2019 Census [211]    NaN
239   1 Jul 2021                National annual projection[96]    NaN
240   1 Feb 2019                Monthly national estimate[212]   [af]
241  30 Jun 2020                              2021 Census[213]    NaN
242   1 Jul 2021                        Official estimate[214]    NaN

[242 rows x 7 columns]
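The recovered Population.1 values are strings such as "17.6%". If numeric values are needed, one possible follow-up is to strip the percent sign and convert to float. This is a sketch on a small hypothetical stand-in Series (`pct`), not on the real column:

```python
import pandas as pd

# Hypothetical stand-in for cont_pop["Population.1"]
pct = pd.Series(["100%", "17.6%", "4.17%", "0%"], name="Population.1")

# Remove the trailing "%" and convert the remaining text to float
pct_num = pct.str.rstrip("%").astype(float)
print(pct_num.tolist())
```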