
Unwanted result when web scraping

# Question

I want to scrape data from the page that opens when clicking a quest number (for example, 8526724) in the table, but the click is not working; only the quest number is getting printed.

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

# Open the website in the browser
driver.get('https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787')

# Wait for the table to load
table = WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, 'table_id')))

# Find all the quest numbers in the table
quest_numbers = table.find_elements(By.XPATH, '//table//tbody//tr//td[2]')

# Click on each quest number and print the resulting page
for quest_number in quest_numbers:
    quest_number_text = quest_number.text
    print(quest_number_text)
    current_url = driver.current_url
    driver.execute_script("arguments[0].click();", quest_number)

    WebDriverWait(driver, 40).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'posting-second-header')))
    page_source = driver.page_source
    print('Quest Number:', quest_number_text)
    print('Page Source:', page_source)
    time.sleep(2)

    # Go back to the table
    driver.back()

# Close the browser
driver.quit()
```
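
One caveat worth noting about the loop above: after `driver.back()` the `td` elements collected before the first click typically become stale, and Selenium raises `StaleElementReferenceException` on the next iteration. Below is a minimal sketch (not from the original post) of the usual workaround, re-locating the cells by index on every pass; it assumes the same `table_id` and column position as the script above:

```python
# Count the rows once, then re-locate the cell on each iteration so the
# reference is fresh after every driver.back().
count = len(WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, 'table_id'))
).find_elements(By.XPATH, './/tbody/tr/td[2]'))

for i in range(count):
    table = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.ID, 'table_id')))
    cell = table.find_elements(By.XPATH, './/tbody/tr/td[2]')[i]
    print('Quest Number:', cell.text)
    driver.execute_script("arguments[0].click();", cell)
    WebDriverWait(driver, 40).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'posting-second-header')))
    print('Page Source length:', len(driver.page_source))
    driver.back()
```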


# Answer 1

**Score**: 1

Data in that table is dynamic, and is hydrated from another API endpoint. Here is one way of obtaining that data, by scraping that endpoint directly:

```python
import pandas as pd
import requests

# Mimic the browser's request so the endpoint answers with JSON
headers = {
    'Referer': 'https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)
# Hit the endpoint that feeds the table (query string captured from the browser)
r = s.get('https://qcpi.questcdn.com/cdn/browse_posting/?search_id=&postings_since_last_login=&draw=1&columns[0][data]=render_my_posting&columns[0][name]=&columns[0][searchable]=false&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=render_post_date&columns[1][name]=&columns[1][searchable]=false&columns[1][orderable]=true&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][data]=render_project_id&columns[2][name]=&columns[2][searchable]=true&columns[2][orderable]=true&columns[2][search][value]=&columns[2][search][regex]=false&columns[3][data]=render_category_search_string&columns[3][name]=&columns[3][searchable]=true&columns[3][orderable]=true&columns[3][search][value]=&columns[3][search][regex]=false&columns[4][data]=render_name&columns[4][name]=&columns[4][searchable]=true&columns[4][orderable]=true&columns[4][search][value]=&columns[4][search][regex]=false&columns[5][data]=bid_date_str&columns[5][name]=&columns[5][searchable]=true&columns[5][orderable]=true&columns[5][search][value]=&columns[5][search][regex]=false&columns[6][data]=render_city&columns[6][name]=&columns[6][searchable]=true&columns[6][orderable]=true&columns[6][search][value]=&columns[6][search][regex]=false&columns[7][data]=render_county&columns[7][name]=&columns[7][searchable]=true&columns[7][orderable]=true&columns[7][search][value]=&columns[7][search][regex]=false&columns[8][data]=state_code&columns[8][name]=&columns[8][searchable]=true&columns[8][orderable]=true&columns[8][search][value]=&columns[8][search][regex]=false&columns[9][data]=render_owner&columns[9][name]=&columns[9][searchable]=true&columns[9][orderable]=true&columns[9][search][value]=&columns[9][search][regex]=false&columns[10][data]=render_solicitor&columns[10][name]=&columns[10][searchable]=true&columns[10][orderable]=true&columns[10][search][value]=&columns[10][search][regex]=false&columns[11][data]=posting_type&columns[11][name]=&columns[11][searchable]=true&columns[11][orderable]=true&columns[11][search][value]=&columns[11][search][regex]=false&columns[12][data]=render_empty&columns[12][name]=&columns[12][searchable]=true&columns[12][orderable]=true&columns[12][search][value]=&columns[12][search][regex]=false&columns[13][data]=render_empty&columns[13][name]=&columns[13][searchable]=true&columns[13][orderable]=true&columns[13][search][value]=&columns[13][search][regex]=false&columns[14][data]=render_empty&columns[14][name]=&columns[14][searchable]=true&columns[14][orderable]=true&columns[14][search][value]=&columns[14][search][regex]=false&columns[15][data]=render_empty&columns[15][name]=&columns[15][searchable]=true&columns[15][orderable]=true&columns[15][search][value]=&columns[15][search][regex]=false&columns[16][data]=project_id&columns[16][name]=&columns[16][searchable]=true&columns[16][orderable]=true&columns[16][search][value]=&columns[16][search][regex]=false&start=0&length=25&search[value]=&search[regex]=false&_=1685987743241')
# Flatten the JSON rows into a DataFrame
df = pd.json_normalize(r.json()['data'])
# Strip the HTML tags embedded in the cell values
for col in df.columns:
    df[col] = df[col].replace(r'<[^<>]*>', '', regex=True)
print(df)
```

Result in terminal:

```
     	DT_RowId 	render_my_posting 	render_post_date 	render_project_id 	render_category_search_string 	render_name 	bid_date_str 	render_city 	render_county 	state_code 	render_owner 	render_solicitor 	posting_type 	render_empty 	project_id
0 	0 		05/12/2023 	8526724 	Street/Roadway Reconstruction 	Key No. 22408 3000 E & FOOTHILL RD CURVE, TWI... 	06/06/2023 02:00 PM MDT 	N/A 	Twin Falls 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8526724
1 	1 		05/16/2023 	8529878 	Traffic Control Devices (Signa... 	Key No. 24192 SH-75, Ohio Gulch Road Intersec... 	06/06/2023 02:00 PM MDT 	N/A 	Blaine 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8529878
2 	2 		05/18/2023 	8534098 	Roadway Pavement Markings 	Key No. 23815 FY24 D6 STRIPING 	06/06/2023 02:00 PM MDT 	N/A 	Bonneville, Fremont,... 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8534098
3 	3 		05/19/2023 	8536176 	Roadway Pavement Markings 	Key No. 21842, I-84, FY23 D4 Interstate Striping 	06/06/2023 02:00 PM MDT 	N/A 	Various 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8536176
4 	4 		05/22/2023 	8539402 	Seal Coating 	Key No. 20592 / 20482 SH-3, CDA RV BR to I-90... 	06/06/2023 02:00 PM MDT 	N/A 	Kootenai 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8539402
5 	5 		05/22/2023 	8539418 	Bridges/Overpasses 	Key No. 23474 US-20; EXIT 343 INTERCHANGE 	06/13/2023 02:00 PM MDT 	N/A 	Fremont 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8539418
6 	6 		05/25/2023 	8544737 	Pavement - Marking 	Key No. 23791, FY24 D1 STRIPING 	06/13/2023 02:00 PM MDT 	N/A 	Kootenai and Shoshon... 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8544737
7 	7 		05/25/2023 	8544742 	Bridge (Replacement or Rehabil... 	Key No. 20487 FY24 D1 BRIDGE REPAIR 	06/13/2023 02:00 PM MDT 	N/A 	Kootenai and Shoshon... 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8544742
8 	8 		06/05/2023 	8554006 	Street/Roadway Reconstruction 	Key No. 24249 SH-11 PIERCE TO GRANGEMONT ROAD... 	06/27/2023 02:00 PM MDT 	N/A 	Clearwater 	ID 	Idaho Transportation... 	Idaho Transportati... 	Construction Project 		8554006
```
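
The query string above follows the standard DataTables server-side protocol (`draw`, `start`, `length`, `columns[i][...]`), so more rows can be fetched by varying `start`. Below is a hedged paging sketch, not part of the original answer: it reuses the session `s` from the snippet above, `URL` stands for the full `browse_posting` URL shown there, and it assumes the server keeps honoring `start`/`length` for offsets past the first page (and accepts percent-encoded brackets in parameter names):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

import pandas as pd

def with_paging(url: str, start: int, length: int = 25) -> str:
    # Rewrite only the DataTables paging parameters, leaving the rest
    # of the (very long) query string untouched. Note that urlencode
    # percent-encodes the brackets in keys like columns[0][data].
    parts = urlsplit(url)
    qs = dict(parse_qsl(parts.query, keep_blank_values=True))
    qs.update(start=str(start), length=str(length))
    return urlunsplit(parts._replace(query=urlencode(qs)))

# `s` is the requests.Session from the snippet above; URL is the full
# browse_posting URL used there.
frames = []
for start in range(0, 200, 25):       # walk up to eight pages of 25 rows
    rows = s.get(with_paging(URL, start)).json().get('data', [])
    if not rows:                      # stop at the first empty page
        break
    frames.append(pd.json_normalize(rows))

if frames:
    df_all = pd.concat(frames, ignore_index=True)
    print(len(df_all), 'rows total')
```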




# Answer 2

**Score**: 0

First thing I noticed: change your quotes. I ran your script, and it works.

The line:

```python
# URL of the website to scrape
url = ´https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787´
```

Must be:

```python
# URL of the website to scrape
url = "https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787"
```

The output I got:

```html
[<tr>
<th class="">Saved Search</th>
<th class="sorting" id="id_searchptbl_th1" onclick="sortTable(1)">Name</th>
<th class="sorting" id="id_searchptbl_th2" onclick="sortTable(2)">Search Criteria</th>
<th class="sorting" id="id_searchptbl_th3" onclick="sortTable(3)">Days Notified</th>
<th class="sorting" id="id_searchptbl_th4" onclick="sortTable(4)">Default</th>
<th class="">Delete</th>
</tr>]
```
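
For reference, the broken line fails to parse because U+00B4 (ACUTE ACCENT) is not a valid Python quote character. If the offending character is hard to spot by eye, a small hypothetical helper like the one below (not part of the original answer) flags every non-ASCII character in a pasted line:

```python
def flag_non_ascii(line: str) -> None:
    # Print each non-ASCII character with its column and code point,
    # which makes smart quotes and stray accents easy to spot.
    for i, ch in enumerate(line):
        if ord(ch) > 127:
            print(f'col {i}: {ch!r} (U+{ord(ch):04X})')

flag_non_ascii("url = ´https://qcpi.questcdn.com/cdn/posting/?group=1950787&provider=1950787´")
# col 6: '´' (U+00B4)
# col 76: '´' (U+00B4)
```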
