Unwanted result when web scraping

# Question

I want to scrape data from the page that opens when one of the quest numbers in the table (e.g. quest no. 8526724) is clicked, but the click is not working; only the quest number gets printed.
```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

# Open the website in the browser
driver.get('https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787')

# Wait for the table to load
table = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.ID, 'table_id')))

# Find all the quest numbers in the table
quest_numbers = table.find_elements(By.XPATH, '//table//tbody//tr//td[2]')

# Click on each quest number and print the resulting page
for quest_number in quest_numbers:
    quest_number_text = quest_number.text
    print(quest_number_text)
    current_url = driver.current_url
    driver.execute_script("arguments[0].click();", quest_number)
    WebDriverWait(driver, 40).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'posting-second-header')))
    page_source = driver.page_source
    print('Quest Number:', quest_number_text)
    print('Page Source:', page_source)
    time.sleep(2)
    # Go back to the table
    driver.back()

# Close the browser
driver.quit()
```
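A likely cause of the "click not working" symptom: after `driver.back()` the table page is reloaded, so every `WebElement` collected before the navigation becomes stale and the next loop iteration fails. One workaround (a sketch, untested against the live site, reusing the locators from the question above) is to count the rows once and re-locate the target cell on every pass:

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787')

wait = WebDriverWait(driver, 30)
table = wait.until(EC.presence_of_element_located((By.ID, 'table_id')))
row_count = len(table.find_elements(By.XPATH, '//table//tbody//tr//td[2]'))

for i in range(row_count):
    # Re-locate the table and cell on each pass: references collected
    # before the previous click/back() cycle are stale by now.
    table = wait.until(EC.presence_of_element_located((By.ID, 'table_id')))
    cell = table.find_elements(By.XPATH, '//table//tbody//tr//td[2]')[i]
    print('Quest Number:', cell.text)
    driver.execute_script("arguments[0].click();", cell)
    wait.until(EC.presence_of_element_located(
        (By.CLASS_NAME, 'posting-second-header')))
    print('Page Source:', driver.page_source[:200])  # print a short preview
    time.sleep(2)
    driver.back()

driver.quit()
```

This keeps the same overall flow as the original script; the only structural change is that nothing located before a navigation is reused after it.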
# Answer 1

**Score**: 1
Data in that table is dynamic, and is hydrated from another API endpoint. Here is one way of obtaining that data, by scraping that endpoint directly:

```python
import pandas as pd
import requests

headers = {
    'Referer': 'https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)
r = s.get('https://qcpi.questcdn.com/cdn/browse_posting/?search_id=&postings_since_last_login=&draw=1&columns[0][data]=render_my_posting&columns[0][name]=&columns[0][searchable]=false&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=render_post_date&columns[1][name]=&columns[1][searchable]=false&columns[1][orderable]=true&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][data]=render_project_id&columns[2][name]=&columns[2][searchable]=true&columns[2][orderable]=true&columns[2][search][value]=&columns[2][search][regex]=false&columns[3][data]=render_category_search_string&columns[3][name]=&columns[3][searchable]=true&columns[3][orderable]=true&columns[3][search][value]=&columns[3][search][regex]=false&columns[4][data]=render_name&columns[4][name]=&columns[4][searchable]=true&columns[4][orderable]=true&columns[4][search][value]=&columns[4][search][regex]=false&columns[5][data]=bid_date_str&columns[5][name]=&columns[5][searchable]=true&columns[5][orderable]=true&columns[5][search][value]=&columns[5][search][regex]=false&columns[6][data]=render_city&columns[6][name]=&columns[6][searchable]=true&columns[6][orderable]=true&columns[6][search][value]=&columns[6][search][regex]=false&columns[7][data]=render_county&columns[7][name]=&columns[7][searchable]=true&columns[7][orderable]=true&columns[7][search][value]=&columns[7][search][regex]=false&columns[8][data]=state_code&columns[8][name]=&columns[8][searchable]=true&columns[8][orderable]=true&columns[8][search][value]=&columns[8][search][regex]=false&columns[9][data]=render_owner&columns[9][name]=&columns[9][searchable]=true&columns[9][orderable]=true&columns[9][search][value]=&columns[9][search][regex]=false&columns[10][data]=render_solicitor&columns[10][name]=&columns[10][searchable]=true&columns[10][orderable]=true&columns[10][search][value]=&columns[10][search][regex]=false&columns[11][data]=posting_type&columns[11][name]=&columns[11][searchable]=true&columns[11][orderable]=true&columns[11][search][value]=&columns[11][search][regex]=false&columns[12][data]=render_empty&columns[12][name]=&columns[12][searchable]=true&columns[12][orderable]=true&columns[12][search][value]=&columns[12][search][regex]=false&columns[13][data]=render_empty&columns[13][name]=&columns[13][searchable]=true&columns[13][orderable]=true&columns[13][search][value]=&columns[13][search][regex]=false&columns[14][data]=render_empty&columns[14][name]=&columns[14][searchable]=true&columns[14][orderable]=true&columns[14][search][value]=&columns[14][search][regex]=false&columns[15][data]=render_empty&columns[15][name]=&columns[15][searchable]=true&columns[15][orderable]=true&columns[15][search][value]=&columns[15][search][regex]=false&columns[16][data]=project_id&columns[16][name]=&columns[16][searchable]=true&columns[16][orderable]=true&columns[16][search][value]=&columns[16][search][regex]=false&start=0&length=25&search[value]=&search[regex]=false&_=1685987743241')

# Flatten the JSON records into a DataFrame, then strip inline HTML tags
df = pd.json_normalize(r.json()['data'])
for col in df.columns:
    df[col] = df[col].replace(r'<[^<>]*>', '', regex=True)
print(df)
```

Result in terminal:

```
  DT_RowId render_my_posting render_post_date render_project_id render_category_search_string render_name bid_date_str render_city render_county state_code render_owner render_solicitor posting_type render_empty project_id
0        0                         05/12/2023           8526724 Street/Roadway Reconstruction Key No. 22408 3000 E & FOOTHILL RD CURVE, TWI... 06/06/2023 02:00 PM MDT N/A Twin Falls ID Idaho Transportation... Idaho Transportati... Construction Project 8526724
1        1                         05/16/2023           8529878 Traffic Control Devices (Signa... Key No. 24192 SH-75, Ohio Gulch Road Intersec... 06/06/2023 02:00 PM MDT N/A Blaine ID Idaho Transportation... Idaho Transportati... Construction Project 8529878
2        2                         05/18/2023           8534098 Roadway Pavement Markings Key No. 23815 FY24 D6 STRIPING 06/06/2023 02:00 PM MDT N/A Bonneville, Fremont,... ID Idaho Transportation... Idaho Transportati... Construction Project 8534098
3        3                         05/19/2023           8536176 Roadway Pavement Markings Key No. 21842, I-84, FY23 D4 Interstate Striping 06/06/2023 02:00 PM MDT N/A Various ID Idaho Transportation... Idaho Transportati... Construction Project 8536176
4        4                         05/22/2023           8539402 Seal Coating Key No. 20592 / 20482 SH-3, CDA RV BR to I-90... 06/06/2023 02:00 PM MDT N/A Kootenai ID Idaho Transportation... Idaho Transportati... Construction Project 8539402
5        5                         05/22/2023           8539418 Bridges/Overpasses Key No. 23474 US-20; EXIT 343 INTERCHANGE 06/13/2023 02:00 PM MDT N/A Fremont ID Idaho Transportation... Idaho Transportati... Construction Project 8539418
6        6                         05/25/2023           8544737 Pavement - Marking Key No. 23791, FY24 D1 STRIPING 06/13/2023 02:00 PM MDT N/A Kootenai and Shoshon... ID Idaho Transportation... Idaho Transportati... Construction Project 8544737
7        7                         05/25/2023           8544742 Bridge (Replacement or Rehabil... Key No. 20487 FY24 D1 BRIDGE REPAIR 06/13/2023 02:00 PM MDT N/A Kootenai and Shoshon... ID Idaho Transportation... Idaho Transportati... Construction Project 8544742
8        8                         06/05/2023           8554006 Street/Roadway Reconstruction Key No. 24249 SH-11 PIERCE TO GRANGEMONT ROAD... 06/27/2023 02:00 PM MDT N/A Clearwater ID Idaho Transportation... Idaho Transportati... Construction Project 8554006
```
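To see what the post-processing step in that answer does, here is a minimal self-contained sketch. The payload below is made up for illustration, shaped like the endpoint's response (`data` is a list of records whose fields can carry inline HTML, which the regex strips):

```python
import pandas as pd

# Toy payload shaped like the endpoint's JSON response (values invented)
payload = {'data': [
    {'render_project_id': '<a href="#">8526724</a>', 'state_code': 'ID'},
    {'render_project_id': '<a href="#">8529878</a>', 'state_code': 'ID'},
]}

# Flatten the records into a DataFrame, then remove inline HTML tags
# column by column with the same regex used in the answer.
df = pd.json_normalize(payload['data'])
for col in df.columns:
    df[col] = df[col].replace(r'<[^<>]*>', '', regex=True)

print(df['render_project_id'].tolist())  # ['8526724', '8529878']
```

`Series.replace(..., regex=True)` substitutes every match of the pattern inside each string, so `<a href="#">` and `</a>` are dropped while the visible text survives.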
# Answer 2

**Score**: 0
First thing I noticed: change your quotes. I ran your script and it works.
The line:

```python
# URL of the website to scrape
url = ´https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787´
```

must be:

```python
# URL of the website to scrape
url = "https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787"
```
The output I got:

```html
[<tr>
<th class="">Saved Search</th>
<th class="sorting" id="id_searchptbl_th1" onclick="sortTable(1)">Name</th>
<th class="sorting" id="id_searchptbl_th2" onclick="sortTable(2)">Search Criteria</th>
<th class="sorting" id="id_searchptbl_th3" onclick="sortTable(3)">Days Notified</th>
<th class="sorting" id="id_searchptbl_th4" onclick="sortTable(4)">Default</th>
<th class="">Delete</th>
</tr>]
```
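The acute accents are not just a style issue: Python's tokenizer rejects `´` (U+00B4) outright. A small self-contained check (using a placeholder `example.com` URL, not the site from the question) shows the difference:

```python
# ´ is not a valid string delimiter in Python; a line using it fails to
# compile with SyntaxError, while straight quotes compile fine.
bad = 'url = \u00b4https://example.com\u00b4'
good = 'url = "https://example.com"'

try:
    compile(bad, '<string>', 'exec')
    outcome = 'compiled'
except SyntaxError:
    outcome = 'SyntaxError'

print(outcome)  # SyntaxError
compile(good, '<string>', 'exec')  # no exception
```

Accented quotes usually sneak in when code is pasted through a word processor or a chat client with "smart quotes" enabled.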