Website scraping not working properly for Medics

Question

I developed this script to collect information on surgeons (from a public source) and put it into a spreadsheet.

The script automates extracting plastic surgeon information from a website:
It cycles through a list of states.
It selects a state in the website's form and runs a search.
It collects the information for the surgeons found on each page, then moves to the next page (if any).
It stores the data in an Excel file.
It adjusts the column widths in the Excel file for readability (see the sketch after this list).
It repeats the process for all states and ends the script.
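
The column-width step is not actually implemented in the script below (its openpyxl imports go unused); a minimal sketch of what it could look like, assuming the same cirurgioes.xlsx file:

from openpyxl import load_workbook
from openpyxl.utils import get_column_letter

wb = load_workbook("cirurgioes.xlsx")
ws = wb.active
for idx, column_cells in enumerate(ws.columns, start=1):
    # Width heuristic: longest cell value in the column plus some padding
    longest = max((len(str(c.value)) for c in column_cells if c.value is not None), default=0)
    ws.column_dimensions[get_column_letter(idx)].width = longest + 2
wb.save("cirurgioes.xlsx")
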
The problem is that it hits a small error and I can't find where it is.

It starts by scanning and collecting data from the first state (whose results have no forward button). So far so good. It then switches to the next state normally and keeps collecting data. The problem comes when it finishes collecting the data from the first page of that next state: it advances to the next page but does not collect anything from it. Worse, it hangs there and drops back to the IDE with this error:

info_content = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'cirurgiao-info'))).text

I need it to keep collecting data from every page. I've spent five hours trying to find the error.
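
A plausible cause (an assumption on my part, not confirmed in the question): after next_button.click() the loop re-reads the result links before the new page has replaced the old one, so the later waits time out against stale content. A common Selenium pattern is to wait for an element of the old page to go stale before re-querying; a minimal sketch against the script below, where old_first_link is an illustrative name:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Grab an element from the current results page before paginating
old_first_link = driver.find_elements(By.XPATH, '//a[@href="#0"]')[0]
next_button.click()
# Wait for the old element to detach from the DOM, then re-query the fresh page
WebDriverWait(driver, 10).until(EC.staleness_of(old_first_link))
links = driver.find_elements(By.XPATH, '//a[@href="#0"]')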

If anyone can help, I'd be very grateful. Here's the script:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import time
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils import get_column_letter

driver = webdriver.Chrome()
url = "http://www2.cirurgiaplastica.org.br/encontre-um-cirurgiao/#busca-cirurgiao"
driver.get(url)
time.sleep(2)

estados = ["AC", "AL", "AP", "AM", "BA", "CE", "DF", "ES", "GO", "MA", "MT", "MS", "MG", "PA", "PB", "PR", "PE", "PI", "RJ", "RN", "RS", "RO", "RR", "SC", "SP", "SE", "TO"]

try:
    data = pd.read_excel("cirurgioes.xlsx")
except FileNotFoundError:
    data = pd.DataFrame(columns=["Nome", "Email", "Cidade/Estado", "CRM", "Telefone", "Endereço"])

for estado in estados:

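    # Select the state in the UF dropdown and run the search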
    select_element = driver.find_element(By.NAME, 'cirurgiao_uf')
    select_element.click()
    select_state = driver.find_element(By.XPATH, f'//option[@value="{estado}"]')
    select_state.click()

    search_button = driver.find_element(By.ID, 'cirurgiao_submit')
    search_button.click()

    time.sleep(2)

    count = 0
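    # Paginate through this state's result pages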
    while True:
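        # Each surgeon result is an <a href="#0"> link that opens a details view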
        num_links = len(driver.find_elements(By.XPATH, '//a[@href="#0"]'))

        for i in range(num_links):
            links = driver.find_elements(By.XPATH, '//a[@href="#0"]')
            links[i].click()
            time.sleep(2)

            name = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'h3'))).text
            info_content = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'cirurgiao-info'))).text

            info_lines = info_content.split('\n')

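            # The info block is positional: line 1 is the city/state, line 3 the e-mail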
            email = info_lines[3]
            state = info_lines[1]
            crm = ""
            phone = ""
            address = ""

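            # Scan the remaining lines for the CRM registration, an address, and a commercial phone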
            for line in info_lines[4:]:
                if re.match(r'^\d+\.?\d*\/[A-Z]{2}$', line):
                    crm = line
                elif line.startswith(("Rua ", "Avenida ", "RUA ", "AVENIDA ")):
                    address = line
                elif line.startswith("Comercial: (55) "):
                    phone = line.replace("Comercial: (55) ", "")

            data = data.append({"Nome": name, "Email": email, "Cidade/Estado": state, "CRM": crm, "Telefone": phone, "Endereço": address}, ignore_index=True)
            count += 1

            if count % 10 == 0:
                data.to_excel("cirurgioes.xlsx", index=False)

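            # Press ESC to close the details view so the next result is clickable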
            body = driver.find_element(By.TAG_NAME, 'body')
            body.send_keys(Keys.ESCAPE)
            time.sleep(2)

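        # The last pagination link is assumed to be the "next page" button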
        next_buttons = driver.find_elements(By.CSS_SELECTOR, 'a.cirurgiao-pagination-link')
        if len(next_buttons) == 0:
            break
        next_button = next_buttons[-1]
        next_button.click()
        time.sleep(4)

    print(f"Estado {estado} salvo com sucesso.")
    data.to_excel("cirurgioes.xlsx", index=False)

driver.quit()

Answer 1

Score: 0

I suggest using their Ajax API to get the information. It seems that a single Ajax call loads all the doctors for a state:

import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

home_url = 'http://www2.cirurgiaplastica.org.br/encontre-um-cirurgiao/#busca-cirurgiao'
api_url = 'https://icase.sbcp.itarget.com.br/api/localiza-profissional/?format=json&nome=&categoria_profissional_id=&uf_descricao={state}&pag=1'

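# Send a browser-like User-Agent; some endpoints refuse requests without one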
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}

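# Read the available state codes from the search form's UF <select>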
soup = BeautifulSoup(requests.get(home_url).content, 'html.parser')
states = [o['value'] for o in soup.select('#cirurgiao_uf option')[1:]]

all_data = []
for s in states:
    print(f'State {s}...')
    all_data.extend(requests.get(api_url.format(state=s), headers=headers).json()['data'])

df = pd.DataFrame(all_data)

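# 'telefone' arrives as a JSON string; parse it and expand one row per phone number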
m = df['telefone'].isna()
df['telefone'] = df.loc[~m, 'telefone'].apply(json.loads)
df = df.explode('telefone')
df['telefone'] = df['telefone'].str['fone']

print(df.head())

This prints the df with the data.
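
If you also want the spreadsheet from the question, the frame can be saved the same way the original script does:

# Persist the collected data under the file name the original script uses
df.to_excel("cirurgioes.xlsx", index=False)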

