Website scraping not working properly for Medics
Question

I developed this script to collect information on surgeons (from a public source) and put it into a spreadsheet.
The script automates the extraction of plastic-surgeon information from a website. It:
Cycles through a list of states.
Selects a state in the website's form and runs a search.
Collects the information of the surgeons found on each page, then moves to the next page (if any).
Stores the data in an Excel file.
Adjusts the column widths in the Excel file for readability (see the sketch after this list).
Repeats the process for all states and ends the script.
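That column-width step does not actually appear in the posted script (the openpyxl imports below go unused); a minimal sketch of what it could look like, assuming the default worksheet and the same file name:

from openpyxl import load_workbook
from openpyxl.utils import get_column_letter

# Sketch only: widen each column to fit its longest cell value, plus padding.
wb = load_workbook("cirurgioes.xlsx")
ws = wb.active
for col_idx, column in enumerate(ws.columns, start=1):
    longest = max(len(str(cell.value or "")) for cell in column)
    ws.column_dimensions[get_column_letter(col_idx)].width = longest + 2
wb.save("cirurgioes.xlsx")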
The problem is that it hits a small error and I cannot find where.
It starts by scanning and collecting the data from the first state (which has no forward button). So far so good. Then it switches to the next state normally and keeps collecting data. The problem comes when it finishes collecting the data from the first page of that next state: it advances to the next page but does not collect anything from it. Instead, it hangs there and drops back to the IDE with this error:
info_content = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'cirurgiao-info'))).text
I need it to keep collecting data from all the pages. I have been trying to find the error for five hours now.
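One way to see where it stalls without aborting the run is to wrap the failing wait in a small helper that returns None on timeout (a sketch; get_info_text is a made-up name):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def get_info_text(driver, timeout=10):
    # Returns the popup text, or None when the panel never shows up.
    try:
        return WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'cirurgiao-info'))
        ).text
    except TimeoutException:
        return None

A None result then points at the exact state and page where the popup stopped appearing.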
If anyone can help, I would be very grateful. Here is the script:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import time
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils import get_column_letter

driver = webdriver.Chrome()

url = "http://www2.cirurgiaplastica.org.br/encontre-um-cirurgiao/#busca-cirurgiao"
driver.get(url)
time.sleep(2)

estados = ["AC", "AL", "AP", "AM", "BA", "CE", "DF", "ES", "GO", "MA", "MT", "MS", "MG", "PA", "PB", "PR", "PE", "PI", "RJ", "RN", "RS", "RO", "RR", "SC", "SP", "SE", "TO"]

# Resume from an existing spreadsheet if there is one.
try:
    data = pd.read_excel("cirurgioes.xlsx")
except FileNotFoundError:
    data = pd.DataFrame(columns=["Nome", "Email", "Cidade/Estado", "CRM", "Telefone", "Endereço"])

for estado in estados:
    # Pick the state in the search form and run the search.
    select_element = driver.find_element(By.NAME, 'cirurgiao_uf')
    select_element.click()
    select_state = driver.find_element(By.XPATH, f'//option[@value="{estado}"]')
    select_state.click()
    search_button = driver.find_element(By.ID, 'cirurgiao_submit')
    search_button.click()
    time.sleep(2)

    count = 0
    while True:
        # Each surgeon entry is a link with href="#0" that opens a popup.
        num_links = len(driver.find_elements(By.XPATH, '//a[@href="#0"]'))
        for i in range(num_links):
            links = driver.find_elements(By.XPATH, '//a[@href="#0"]')
            links[i].click()
            time.sleep(2)
            name = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'h3'))).text
            info_content = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'cirurgiao-info'))).text
            info_lines = info_content.split('\n')
            email = info_lines[3]
            state = info_lines[1]
            crm = ""
            phone = ""
            address = ""
            for line in info_lines[4:]:
                if re.match(r'^\d+\.?\d*\/[A-Z]{2}$', line):
                    crm = line
                elif line.startswith(("Rua ", "Avenida ", "RUA ", "AVENIDA ")):
                    address = line
                elif line.startswith("Comercial: (55) "):
                    phone = line.replace("Comercial: (55) ", "")
            data = data.append({"Nome": name, "Email": email, "Cidade/Estado": state, "CRM": crm, "Telefone": phone, "Endereço": address}, ignore_index=True)
            count += 1
            # Checkpoint the spreadsheet every 10 records.
            if count % 10 == 0:
                data.to_excel("cirurgioes.xlsx", index=False)
            # Close the popup before clicking the next link.
            body = driver.find_element(By.TAG_NAME, 'body')
            body.send_keys(Keys.ESCAPE)
            time.sleep(2)
        # Advance to the next results page, if there is one.
        next_buttons = driver.find_elements(By.CSS_SELECTOR, 'a.cirurgiao-pagination-link')
        if len(next_buttons) == 0:
            break
        next_button = next_buttons[-1]
        next_button.click()
        time.sleep(4)
    print(f"Estado {estado} salvo com sucesso.")
    data.to_excel("cirurgioes.xlsx", index=False)

driver.quit()
Answer 1
Score: 0

I suggest using their Ajax API to get the information. It seems that a single Ajax call loads all the doctors from a state:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

home_url = 'http://www2.cirurgiaplastica.org.br/encontre-um-cirurgiao/#busca-cirurgiao'
api_url = 'https://icase.sbcp.itarget.com.br/api/localiza-profissional/?format=json&nome=&categoria_profissional_id=&uf_descricao={state}&pag=1'

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'}

# Pull the list of state codes straight from the search form's <select>.
soup = BeautifulSoup(requests.get(home_url).content, 'html.parser')
states = [o['value'] for o in soup.select('#cirurgiao_uf option')[1:]]

all_data = []
for s in states:
    print(f'State {s}...')
    all_data.extend(requests.get(api_url.format(state=s), headers=headers).json()['data'])

df = pd.DataFrame(all_data)

# 'telefone' arrives as a JSON string holding a list of entries; decode it,
# give each entry its own row, then keep each entry's 'fone' field.
m = df['telefone'].isna()
df['telefone'] = df.loc[~m, 'telefone'].apply(json.loads)
df = df.explode('telefone')
df['telefone'] = df['telefone'].str['fone']

print(df.head())
This prints the df with the data.
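To end up with the same spreadsheet the original script was building, the frame can be written out with pandas (openpyxl, which the question already imports, handles the .xlsx output):

# Persist the collected records to the same file name the question used.
df.to_excel("cirurgioes.xlsx", index=False)

The query string also carries a pag parameter, so if a state ever did come back paginated, one could walk the pages until the payload runs dry. A sketch under the assumption that an empty 'data' list marks the end (fetch_state is a made-up name):

def fetch_state(state):
    # Assumption: the API pages via 'pag' and an empty 'data' list ends the listing.
    base = 'https://icase.sbcp.itarget.com.br/api/localiza-profissional/'
    rows, page = [], 1
    while True:
        params = {'format': 'json', 'nome': '', 'categoria_profissional_id': '',
                  'uf_descricao': state, 'pag': page}
        batch = requests.get(base, params=params, headers=headers).json().get('data', [])
        if not batch:
            break
        rows.extend(batch)
        page += 1
    return rows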