How to get missing percentage values when scraping a Wikipedia table with Pandas read_html?

Question

When I import a table from a webpage into Python, the column Population.1 shows as NaN, although it is not NaN on the original webpage.

    import pandas as pd
    import requests

    pop_url = (
        "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    )
    r = requests.get(pop_url)
    wiki_tables = pd.read_html(r.text, header=0)
    len(wiki_tables)
    cont_pop = wiki_tables[1]
    cont_pop.head()

Answer 1

Score: 1

Here is one way to do it with Beautiful Soup:

    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    pop_url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    r = requests.get(pop_url)

    # Import the table and remove the first row (duplicated header).
    # StringIO avoids the deprecation warning newer pandas versions emit for literal HTML strings.
    cont_pop = pd.read_html(StringIO(r.text), header=0)[1].drop(index=0)

Then:

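    # Find and add the missing values.
    # Note: the offsets 14 and 11 depend on the page layout at the time of writing.
    raw = BeautifulSoup(r.content, "html.parser").find_all("table")
    # Split the second table's text into lines and skip the first 14, which precede the data rows
    rows = raw[1].text.split("\n")[14:]
    # Group the lines into chunks of 11, drop the first element of each chunk,
    # and discard the final chunk
    rows = [rows[i : i + 11][1:] for i in range(0, len(rows), 11)][:-1]
    # Keep every cell that contains a percent sign
    cont_pop["Population.1"] = [value for row in rows for value in row if "%" in value]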
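
The slicing above relies on fixed offsets (14 and 11) that reflect the page layout at the time, so it may stop working if the table changes. As a minimal, dependency-free sketch of the same idea (pick out the cell containing "%" in each row), the following uses the standard library's html.parser in place of Beautiful Soup. SAMPLE_HTML, CellCollector and percent_cells are hypothetical names, and the sample markup is a simplified stand-in, not the real Wikipedia table.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the Wikipedia table markup.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Population</th><th>% of world</th></tr>
  <tr><td>China</td><td>1,411,750,000</td><td>17.6%</td></tr>
  <tr><td>India</td><td>1,392,329,000</td><td>17.3%</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []  # list of rows, each a list of cell strings
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data


def percent_cells(html):
    """Return, for each data row, the first cell containing '%' (or None)."""
    parser = CellCollector()
    parser.feed(html)
    result = []
    for row in parser.rows[1:]:  # skip the header row
        match = next((cell.strip() for cell in row if "%" in cell), None)
        result.append(match)
    return result


percentages = percent_cells(SAMPLE_HTML)
print(percentages)
```

On the real page, the markup of the relevant table would be fed in instead of SAMPLE_HTML; that step is left out here because the live markup is not reproduced in this post.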

Finally:

    print(cont_pop)

    # Output
         Rank                 Country / Dependency  Population  Population.1
    1                                        World  8035105000          100%  \
    2       1                                China  1411750000         17.6%
    3       2                                India  1392329000         17.3%
    4       3                        United States   334869000         4.17%
    5       4                            Indonesia   277749853         3.46%
    ..    ...                                  ...         ...           ...
    238                      Tokelau (New Zealand)        1647            0%
    239                                       Niue        1549            0%
    240   195                         Vatican City         825            0%
    241        Cocos (Keeling) Islands (Australia)         593            0%
    242          Pitcairn Islands (United Kingdom)          47            0%

                Date  Source (official or from the United Nations)  Notes
    1    10 Jun 2023                              UN projection[3]    NaN
    2    31 Dec 2022                          Official estimate[4]    [b]
    3     1 Mar 2023                        Official projection[5]    [c]
    4    10 Jun 2023                  National population clock[7]    [d]
    5    31 Dec 2022                          Official estimate[8]    NaN
    ..           ...                                           ...    ...
    238   1 Jan 2019                             2019 Census [211]    NaN
    239   1 Jul 2021                National annual projection[96]    NaN
    240   1 Feb 2019                Monthly national estimate[212]   [af]
    241  30 Jun 2020                              2021 Census[213]    NaN
    242   1 Jul 2021                        Official estimate[214]    NaN

    [242 rows x 7 columns]
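
The restored Population.1 values are strings such as "17.6%". If numeric values are needed, a small helper can strip the percent sign. This is only a sketch: percent_to_float is a hypothetical name, not part of pandas, and the sample values are copied from the output above.

```python
def percent_to_float(text):
    """Convert a string such as '17.6%' to 17.6; return None if it cannot be parsed."""
    cleaned = text.strip().rstrip("%").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


values = ["100%", "17.6%", "4.17%", "0%"]
numbers = [percent_to_float(v) for v in values]
print(numbers)
```

With pandas, a similar result should be obtainable with cont_pop["Population.1"].str.rstrip("%").astype(float), assuming every value in the column is a string ending in "%".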

huangapple
  • Published on June 5, 2023, 21:43:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76407049.html