How to extract a table from a website using Python 3

Question

I want to get/export the table from https://www.ethernodes.org/nodes
to a txt file so that I can access it from a bash script.

OpenAI helped me with this Python 3 code, but it returns nothing:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.ethernodes.org/nodes?page=8'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

host_ips = []
node_list = soup.find('ul', class_='nodes-list')
if node_list is not None:
    for li in node_list.find_all('li'):
        host_ip = li.find('div', class_='node-host').text.strip()
        host_ips.append(host_ip)
print(host_ips)
```

Answer 1

Score: 1
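A likely reason the BeautifulSoup attempt comes back empty: the node table is rendered client-side, so the static HTML that `requests` downloads contains no `ul.nodes-list` element and `soup.find` yields `None`. The page instead pulls its rows from a JSON endpoint, which can be queried directly, as done below. As a toy illustration of the failure mode (the HTML string here is a made-up stand-in, not the real page):

```python
from bs4 import BeautifulSoup

# Stand-in for the static markup requests receives: the table rows are
# injected later by JavaScript, so they are absent from the raw HTML.
static_html = "<html><body><div id='app'></div></body></html>"
soup = BeautifulSoup(static_html, "html.parser")

node_list = soup.find("ul", class_="nodes-list")
print(node_list)  # → None, so there is nothing to iterate over
```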

Here's how you can get the data and dump it to a .csv file:

```python
import time

import pandas as pd
import requests

url = "https://www.ethernodes.org/data?"
payload = {
    "draw": "2",
    "columns[0][data]": "id",
    "columns[0][name]": "",
    "columns[0][searchable]": "true",
    "columns[0][orderable]": "true",
    "columns[0][search][value]": "",
    "columns[0][search][regex]": "false",
    "columns[1][data]": "host",
    "columns[1][name]": "",
    "columns[1][searchable]": "true",
    "columns[1][orderable]": "true",
    "columns[1][search][value]": "",
    "columns[1][search][regex]": "false",
    "columns[2][data]": "isp",
    "columns[2][name]": "",
    "columns[2][searchable]": "true",
    "columns[2][orderable]": "true",
    "columns[2][search][value]": "",
    "columns[2][search][regex]": "false",
    "columns[3][data]": "country",
    "columns[3][name]": "",
    "columns[3][searchable]": "true",
    "columns[3][orderable]": "true",
    "columns[3][search][value]": "",
    "columns[3][search][regex]": "false",
    "columns[4][data]": "client",
    "columns[4][name]": "",
    "columns[4][searchable]": "true",
    "columns[4][orderable]": "true",
    "columns[4][search][value]": "",
    "columns[4][search][regex]": "false",
    "columns[5][data]": "clientVersion",
    "columns[5][name]": "",
    "columns[5][searchable]": "true",
    "columns[5][orderable]": "true",
    "columns[5][search][value]": "",
    "columns[5][search][regex]": "false",
    "columns[6][data]": "os",
    "columns[6][name]": "",
    "columns[6][searchable]": "true",
    "columns[6][orderable]": "true",
    "columns[6][search][value]": "",
    "columns[6][search][regex]": "false",
    "columns[7][data]": "lastUpdate",
    "columns[7][name]": "",
    "columns[7][searchable]": "true",
    "columns[7][orderable]": "true",
    "columns[7][search][value]": "",
    "columns[7][search][regex]": "false",
    "columns[8][data]": "inSync",
    "columns[8][name]": "",
    "columns[8][searchable]": "true",
    "columns[8][orderable]": "true",
    "columns[8][search][value]": "",
    "columns[8][search][regex]": "false",
    "order[0][column]": "0",
    "order[0][dir]": "asc",
    "start": "0",
    "length": "100",
    "search[value]": "",
    "search[regex]": "false",
    "_": time.time(),
}
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
}

data = requests.get(url, headers=headers, params=payload).json()["data"]
df = pd.DataFrame(data)
df.to_csv("nodes.csv", index=False)
```
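The request above fetches only the first 100 rows. If you need the whole table, a sketch like the following could page through the endpoint by advancing the DataTables `start` parameter. This assumes the server honors `start`/`length` the way a standard DataTables server-side source does, and it reuses the `url`, `payload`, and `headers` defined above:

```python
import time

import pandas as pd
import requests


def page_params(payload, start, length=100):
    """Copy the base payload and point it at one page of results."""
    params = dict(payload)
    params.update({"start": str(start), "length": str(length), "_": time.time()})
    return params


def fetch_all(url, payload, headers, page_size=100, max_pages=50):
    """Keep requesting pages until the endpoint returns an empty batch."""
    rows, start = [], 0
    for _ in range(max_pages):  # hard cap as a safety net
        resp = requests.get(url, headers=headers,
                            params=page_params(payload, start, page_size))
        batch = resp.json().get("data", [])
        if not batch:
            break  # past the last page
        rows.extend(batch)
        start += page_size
    return pd.DataFrame(rows)
```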

Output:

(screenshot of the resulting nodes.csv table omitted)

If all you need is the host IPs, add the following:

```python
hosts = df["host"].values
with open("hosts.txt", "w") as f:
    f.write("\n".join(hosts))
```

Then you can run:

```shell
$ cat hosts.txt
```
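Since the goal was to consume the list from a bash script, one minimal way to loop over hosts.txt is sketched below. The `printf` line just fabricates sample data so the snippet runs on its own; in practice the file comes from the Python step above.

```shell
# Sample data standing in for the file written by the Python step.
printf '1.2.3.4\n5.6.7.8\n' > hosts.txt

# Process each host line by line, e.g. to probe reachability.
while IFS= read -r host; do
    echo "checking $host"   # swap in something like: ping -c 1 "$host"
done < hosts.txt
```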

huangapple
  • Published on 2023-04-19 16:59:04
  • Please keep this link when reposting: https://go.coder-hub.com/76052595.html