# Unwanted result when web scraping

## Question
I want to scrape data from the page that opens when clicking a quest no. (such as 8526724) in the table, but the click is not working; only the quest number gets printed.
```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

# Open the website in the browser
driver.get('https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787')

# Wait for the table to load
table = WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, 'table_id')))

# Find all the quest numbers in the table
quest_numbers = table.find_elements(By.XPATH, '//table//tbody//tr//td[2]')

# Click on each quest number and print the resulting page
for quest_number in quest_numbers:
    quest_number_text = quest_number.text
    print(quest_number_text)
    current_url = driver.current_url
    driver.execute_script("arguments[0].click();", quest_number)
    WebDriverWait(driver, 40).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'posting-second-header')))
    page_source = driver.page_source
    print('Quest Number:', quest_number_text)
    print('Page Source:', page_source)
    time.sleep(2)
    # Go back to the table
    driver.back()

# Close the browser
driver.quit()
```
## Answer 1
**Score**: 1
Data in that table is dynamic, and is hydrated from another API endpoint. Here is one way of obtaining that data, by scraping that endpoint directly:
    import pandas as pd
    import requests

    # Headers mimicking the browser request the page itself makes
    headers = {
        'Referer': 'https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
    }

    s = requests.Session()
    s.headers.update(headers)
    r = s.get('https://qcpi.questcdn.com/cdn/browse_posting/?search_id=&postings_since_last_login=&draw=1&columns[0][data]=render_my_posting&columns[0][name]=&columns[0][searchable]=false&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=render_post_date&columns[1][name]=&columns[1][searchable]=false&columns[1][orderable]=true&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][data]=render_project_id&columns[2][name]=&columns[2][searchable]=true&columns[2][orderable]=true&columns[2][search][value]=&columns[2][search][regex]=false&columns[3][data]=render_category_search_string&columns[3][name]=&columns[3][searchable]=true&columns[3][orderable]=true&columns[3][search][value]=&columns[3][search][regex]=false&columns[4][data]=render_name&columns[4][name]=&columns[4][searchable]=true&columns[4][orderable]=true&columns[4][search][value]=&columns[4][search][regex]=false&columns[5][data]=bid_date_str&columns[5][name]=&columns[5][searchable]=true&columns[5][orderable]=true&columns[5][search][value]=&columns[5][search][regex]=false&columns[6][data]=render_city&columns[6][name]=&columns[6][searchable]=true&columns[6][orderable]=true&columns[6][search][value]=&columns[6][search][regex]=false&columns[7][data]=render_county&columns[7][name]=&columns[7][searchable]=true&columns[7][orderable]=true&columns[7][search][value]=&columns[7][search][regex]=false&columns[8][data]=state_code&columns[8][name]=&columns[8][searchable]=true&columns[8][orderable]=true&columns[8][search][value]=&columns[8][search][regex]=false&columns[9][data]=render_owner&columns[9][name]=&columns[9][searchable]=true&columns[9][orderable]=true&columns[9][search][value]=&columns[9][search][regex]=false&columns[10][data]=render_solicitor&columns[10][name]=&columns[10][searchable]=true&columns[10][orderable]=true&columns[10][search][value]=&columns[10][search][regex]=false&columns[11][data]=posting_type&columns[11][name]=&columns[11][searchable]=true&columns[11][orderable]=true&columns[11][search][value]=&columns[11][search][regex]=false&columns[12][data]=render_empty&columns[12][name]=&columns[12][searchable]=true&columns[12][orderable]=true&columns[12][search][value]=&columns[12][search][regex]=false&columns[13][data]=render_empty&columns[13][name]=&columns[13][searchable]=true&columns[13][orderable]=true&columns[13][search][value]=&columns[13][search][regex]=false&columns[14][data]=render_empty&columns[14][name]=&columns[14][searchable]=true&columns[14][orderable]=true&columns[14][search][value]=&columns[14][search][regex]=false&columns[15][data]=render_empty&columns[15][name]=&columns[15][searchable]=true&columns[15][orderable]=true&columns[15][search][value]=&columns[15][search][regex]=false&columns[16][data]=project_id&columns[16][name]=&columns[16][searchable]=true&columns[16][orderable]=true&columns[16][search][value]=&columns[16][search][regex]=false&start=0&length=25&search[value]=&search[regex]=false&_=1685987743241')
    # Flatten the JSON rows into a DataFrame and strip the HTML tags
    # embedded in the rendered cell values
    df = pd.json_normalize(r.json()['data'])
    for col in df.columns:
        df[col] = df[col].replace(r'<[^<>]*>', '', regex=True)
    print(df)
Result in terminal:
     	DT_RowId 	render_my_posting 	render_post_date 	render_project_id 	render_category_search_string 	render_name 	bid_date_str 	render_city 	render_county 	state_code 	render_owner 	render_solicitor 	posting_type 	render_empty 	project_id
    0 	0 		05/12/2023 	8526724 	Street/Roadway Reconstruction 	Key No. 22408 3000 E & FOOTHILL RD CURVE, TWI... 	06/06/2023 02:00 PM MDT 	N/A 	Twin Falls 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8526724
    1 	1 		05/16/2023 	8529878 	Traffic Control Devices (Signa... 	Key No. 24192 SH-75, Ohio Gulch Road Intersec... 	06/06/2023 02:00 PM MDT 	N/A 	Blaine 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8529878
    2 	2 		05/18/2023 	8534098 	Roadway Pavement Markings 	Key No. 23815 FY24 D6 STRIPING 	06/06/2023 02:00 PM MDT 	N/A 	Bonneville, Fremont,... 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8534098
    3 	3 		05/19/2023 	8536176 	Roadway Pavement Markings 	Key No. 21842, I-84, FY23 D4 Interstate Striping 	06/06/2023 02:00 PM MDT 	N/A 	Various 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8536176
    4 	4 		05/22/2023 	8539402 	Seal Coating 	Key No. 20592 / 20482 SH-3, CDA RV BR to I-90... 	06/06/2023 02:00 PM MDT 	N/A 	Kootenai 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8539402
    5 	5 		05/22/2023 	8539418 	Bridges/Overpasses 	Key No. 23474 US-20; EXIT 343 INTERCHANGE 	06/13/2023 02:00 PM MDT 	N/A 	Fremont 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8539418
    6 	6 		05/25/2023 	8544737 	Pavement - Marking 	Key No. 23791, FY24 D1 STRIPING 	06/13/2023 02:00 PM MDT 	N/A 	Kootenai and Shoshon... 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8544737
    7 	7 		05/25/2023 	8544742 	Bridge (Replacement or Rehabil... 	Key No. 20487 FY24 D1 BRIDGE REPAIR 	06/13/2023 02:00 PM MDT 	N/A 	Kootenai and Shoshon... 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8544742
    8 	8 		06/05/2023 	8554006 	Street/Roadway Reconstruction 	Key No. 24249 SH-11 PIERCE TO GRANGEMONT ROAD... 	06/27/2023 02:00 PM MDT 	N/A 	Clearwater 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8554006
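From there, the specific posting the question asks about (quest no. 8526724) can be pulled out with an ordinary pandas filter. A minimal sketch against the `df` built above, using the columns visible in the output (the column selection is just an illustrative subset):

```python
# Select the row whose project_id matches the quest number from the question
target = df[df['project_id'].astype(str) == '8526724']
print(target[['render_project_id', 'render_name', 'bid_date_str']].to_string(index=False))
```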
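The endpoint also takes DataTables-style paging parameters (`start=0&length=25` near the end of the URL), so further pages can presumably be fetched by re-requesting with a larger `start` offset. A minimal sketch, assuming the long endpoint URL above is stored in a variable `endpoint` and that the response keeps the same `data` key:

```python
# Page through the endpoint 25 rows at a time by bumping the `start` offset.
# `endpoint` is assumed to hold the long browse_posting URL shown above.
all_rows = []
for start in range(0, 100, 25):  # first four pages
    page = s.get(endpoint.replace('start=0', f'start={start}')).json()
    rows = page.get('data', [])
    if not rows:  # stop once the server runs out of rows
        break
    all_rows.extend(rows)

df_all = pd.json_normalize(all_rows)
print(len(df_all), 'rows fetched in total')
```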
## Answer 2
**Score**: 0
First thing I noticed: change your quotes. I ran your script and it works.
The line:
```python
# URL of the website to scrape
url = ´https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787´
```
Must be:
```python
# URL of the website to scrape
url = "https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787"
```
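For what it's worth, the acute accent is not a valid string delimiter in Python at all, so the original line never parses. A tiny sketch demonstrating the difference (the failing variant is kept commented out; the exact SyntaxError wording varies by CPython version):

```python
# url = ´https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787´
# -> SyntaxError (recent CPython: invalid character '´' (U+00B4))
url = "https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787"
print(url)  # plain ASCII double quotes parse fine
```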
The output I got:
```html
[<tr>
<th class="">Saved Search</th>
<th class="sorting" id="id_searchptbl_th1" onclick="sortTable(1)">Name</th>
<th class="sorting" id="id_searchptbl_th2" onclick="sortTable(2)">Search Criteria</th>
<th class="sorting" id="id_searchptbl_th3" onclick="sortTable(3)">Days Notified</th>
<th class="sorting" id="id_searchptbl_th4" onclick="sortTable(4)">Default</th>
<th class="">Delete</th>
</tr>]
```