How to extract a table from a website using Python 3


Question

I want to get/export the table from https://www.ethernodes.org/nodes to a txt file so I can access it from a bash script.

OpenAI helped me with this Python 3 code, but it gets nothing:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ethernodes.org/nodes?page=8'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

host_ips = []
node_list = soup.find('ul', class_='nodes-list')
if node_list is not None:
    for li in node_list.find_all('li'):
        host_ip = li.find('div', class_='node-host').text.strip()
        host_ips.append(host_ip)

print(host_ips)
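
The reason this prints an empty list is that the table on /nodes is filled in by an AJAX request after the page loads (see the answer below), so the elements the script looks for are not in the HTML that requests receives. A minimal check, assuming only requests and BeautifulSoup are installed:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.ethernodes.org/nodes?page=8')
soup = BeautifulSoup(response.text, 'html.parser')

# These lookups come back empty because the node table is rendered
# client-side rather than served in the HTML, which is why the
# original script prints [].
print(soup.find('ul', class_='nodes-list'))
print(soup.find_all('div', class_='node-host'))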

Answer 1

Score: 1

Here's how you can get the data and dump it to a .csv file:

import time
import pandas as pd
import requests

url = "https://www.ethernodes.org/data?"

payload = {
    "draw": "2",
    "columns[0][data]": "id",
    "columns[0][name]": "",
    "columns[0][searchable]": "true",
    "columns[0][orderable]": "true",
    "columns[0][search][value]": "",
    "columns[0][search][regex]": "false",
    "columns[1][data]": "host",
    "columns[1][name]": "",
    "columns[1][searchable]": "true",
    "columns[1][orderable]": "true",
    "columns[1][search][value]": "",
    "columns[1][search][regex]": "false",
    "columns[2][data]": "isp",
    "columns[2][name]": "",
    "columns[2][searchable]": "true",
    "columns[2][orderable]": "true",
    "columns[2][search][value]": "",
    "columns[2][search][regex]": "false",
    "columns[3][data]": "country",
    "columns[3][name]": "",
    "columns[3][searchable]": "true",
    "columns[3][orderable]": "true",
    "columns[3][search][value]": "",
    "columns[3][search][regex]": "false",
    "columns[4][data]": "client",
    "columns[4][name]": "",
    "columns[4][searchable]": "true",
    "columns[4][orderable]": "true",
    "columns[4][search][value]": "",
    "columns[4][search][regex]": "false",
    "columns[5][data]": "clientVersion",
    "columns[5][name]": "",
    "columns[5][searchable]": "true",
    "columns[5][orderable]": "true",
    "columns[5][search][value]": "",
    "columns[5][search][regex]": "false",
    "columns[6][data]": "os",
    "columns[6][name]": "",
    "columns[6][searchable]": "true",
    "columns[6][orderable]": "true",
    "columns[6][search][value]": "",
    "columns[6][search][regex]": "false",
    "columns[7][data]": "lastUpdate",
    "columns[7][name]": "",
    "columns[7][searchable]": "true",
    "columns[7][orderable]": "true",
    "columns[7][search][value]": "",
    "columns[7][search][regex]": "false",
    "columns[8][data]": "inSync",
    "columns[8][name]": "",
    "columns[8][searchable]": "true",
    "columns[8][orderable]": "true",
    "columns[8][search][value]": "",
    "columns[8][search][regex]": "false",
    "order[0][column]": "0",
    "order[0][dir]": "asc",
    "start": "0",
    "length": "100",
    "search[value]": "",
    "search[regex]": "false",
    "_": time.time()
}

# Headers marking the request as the same XHR the browser sends
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
}

# Fetch the JSON, keep the "data" rows, and dump them to a CSV file
data = requests.get(url, headers=headers, params=payload).json()["data"]
df = pd.DataFrame(data)
df.to_csv("nodes.csv", index=False)
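
The request above returns at most the first 100 rows ("length": "100"). If you need the whole table, a sketch along the following lines should work; it reuses the url, payload and headers defined above and assumes the endpoint follows the usual DataTables convention of reporting the total row count in recordsTotal:

import time

import pandas as pd
import requests

frames = []
start, length = 0, 100
while True:
    payload["start"] = str(start)
    payload["length"] = str(length)
    payload["_"] = time.time()  # fresh cache-buster for each request
    resp = requests.get(url, headers=headers, params=payload).json()
    rows = resp["data"]
    if not rows:  # no more pages
        break
    frames.append(pd.DataFrame(rows))
    start += length
    total = int(resp.get("recordsTotal") or 0)
    if total and start >= total:  # stop once the reported total is reached
        break
    time.sleep(1)  # be gentle with the server

pd.concat(frames, ignore_index=True).to_csv("nodes_all.csv", index=False)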

Output: [screenshot of the resulting nodes.csv table]

And if all you need is the host IPs, add this:

hosts = df["host"].values
with open("hosts.txt", "w") as f:
    f.write("\n".join(hosts))

Then you can run:

$ cat hosts.txt
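
Since hosts.txt has one host per line, a bash script can also loop over the entries directly, for example with a standard while read -r host; do ...; done < hosts.txt loop.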