Python selenium not taking tables, please review
Question
Below is the main code I have written; the website is https://www.zaubacorp.com/company-list
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd

url = 'https://www.zaubacorp.com/company-list'

# Set up Selenium options
options = Options()
options.add_argument('--headless')

# Create a new instance of the Chrome driver
driver = webdriver.Chrome(options=options)

# Navigate to the webpage
driver.get(url)

# Wait for the page to load
driver.implicitly_wait(10)

# Find all table elements on the page using the 'tag_name' locator strategy
tables = driver.find_elements('tag name', 'table')

# Iterate through the tables to find the one you need
table = None
for t in tables:
    if 'list-group-item' in t.get_attribute('class'):
        table = t
        break

if table:
    # Extract the table data
    data = []
    for row in table.find_elements('tag name', 'tr'):
        rowData = []
        for cell in row.find_elements('tag name', 'td'):
            rowData.append(cell.text)
        data.append(rowData)

    # Store the table data in a DataFrame
    results = pd.DataFrame(data)

    # Print the results
    print(results)
else:
    print('Table not found.')

# Close the Selenium driver
driver.quit()
The above code is not working to get the details of the table; I am not even looping through the other pages yet. Please check and let me know where I am wrong.
Answer 1
Score: 1
Any reason you're using Selenium? You can just have pandas parse the tables. It will take a while to go through all the pages, though.
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re

url = 'https://www.zaubacorp.com/company-list/p-1-company.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the '>>' (last page) pagination link and pull the final page number out of its href
last_page = soup.find_all('a', text=lambda text: text and '>>' in text)[0]['href']
match = int(re.search(r'p-(\d+)', last_page).group(1))

# Let pandas parse the first table on every page, collecting one DataFrame per page
dfs = []
tot = match
for page in range(1, match + 1):
    url = f'https://www.zaubacorp.com/company-list/p-{page}-company.html'
    print(f'Page: {page} of {tot}')
    dfs.append(pd.read_html(url)[0])

# Stack all the per-page DataFrames into one
df = pd.concat(dfs)
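
If you want to keep what was scraped, a short follow-up sketch; the companies.csv filename is just an example, not part of the original answer:

# Re-concatenate with a clean 0..n-1 index and write the result to disk
df = pd.concat(dfs, ignore_index=True)
df.to_csv('companies.csv', index=False)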
Answer 2
Score: 0
You made a small mistake: find_elements does not take two strings, but a By option and a string:
from selenium.webdriver.common.by import By

tables = driver.find_elements(By.TAG_NAME, 'table')
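
For completeness, a minimal sketch of the corrected lookup in context, assuming the same headless Chrome setup the question uses; the print line is only illustrative:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://www.zaubacorp.com/company-list')
driver.implicitly_wait(10)

# Locate every table element via the By constant instead of a bare string
tables = driver.find_elements(By.TAG_NAME, 'table')
print(f'Found {len(tables)} table(s)')

driver.quit()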