Web-scraping an ASPX page, looping over the page number from Network > Payload

Question
I am trying to scrape the table with the columns office, cadre, designation, name, and asset_details:
Here P2 ranges from 1 to 38.
In Network > Payload > Form Data there is a page number (__EVENTARGUMENT: Page$x), which cannot simply be passed as an input and has to be ascertained.
http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx?P1=2&P2=7&P3=0&P4=0
Here is my attempt at code:
import json

import requests
import pandas as pd

api_url = "http://bpsm.bihar.gov.in/Assets2019/AssetDetails.aspx?P1=2&P2=33&P3=0&P4=0"
payload = {"P1": "2", "P2": "33", "P3": "0", "P4": "0"}

all_data = []
for P2 in range(1, 39):  # <-- increase from 1 to 200
    print(P2)
    payload['P2'] = P2
    data = requests.post(api_url, json=payload).json()
    data = json.loads(data['d'])
    if not data:
        break
    for name, count in data[0].items():
        all_data.append({
        })
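(For context: WebForms paging like this is normally driven by a hidden-field postback rather than a query parameter. Below is a minimal sketch of that pattern, assuming the page exposes the standard __VIEWSTATE/__EVENTVALIDATION hidden inputs and that the pager posts __EVENTARGUMENT=Page$n; the __EVENTTARGET control name is a guess based on the GridView id used in answer 2 and must be verified in the page source.)

import requests
from bs4 import BeautifulSoup

url = "http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx?P1=2&P2=7&P3=0&P4=0"

with requests.Session() as s:
    # The first GET returns page 1 plus the hidden WebForms state fields.
    soup = BeautifulSoup(s.get(url).text, "html.parser")

    for page in range(2, 6):  # Page$2 .. Page$5; the upper bound is an assumption
        # Echo every hidden input back, as the browser would on a postback.
        state = {
            tag["name"]: tag.get("value", "")
            for tag in soup.select("input[type=hidden]")
            if tag.has_attr("name")
        }
        state["__EVENTTARGET"] = "ctl00$ContentPlaceHolder2$GridView1"  # guess; check the page source
        state["__EVENTARGUMENT"] = f"Page${page}"
        soup = BeautifulSoup(s.post(url, data=state).text, "html.parser")
        # ... extract the table rows from `soup` here ...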
Answer 1

Score: 2
You can use the BeautifulSoup library in Python.
import requests
import pandas as pd
from bs4 import BeautifulSoup

base_url = "http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx"
payload = {"P1": "2", "P2": "", "P3": "0", "P4": "0"}

all_data = []
for P2 in range(1, 39):
    print(P2)
    payload['P2'] = str(P2)
    response = requests.post(base_url, data=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', {'class': 'rgMasterTable'})
    rows = table.find_all('tr')
    for row in rows[1:]:  # Skip the header row
        columns = row.find_all('td')
        office = columns[0].text.strip()
        cadre = columns[1].text.strip()
        designation = columns[2].text.strip()
        name = columns[3].text.strip()
        asset_details = columns[4].text.strip()
        all_data.append({
            'Office': office,
            'Cadre': cadre,
            'Designation': designation,
            'Name': name,
            'Asset Details': asset_details
        })

# Create a DataFrame from the scraped data
df = pd.DataFrame(all_data)
print(df)
The code iterates over the range of P2 values from 1 to 38. For each value, it sends a POST request with the updated payload to fetch the corresponding page. It then uses BeautifulSoup to parse the HTML response and extracts the desired table data. The scraped data is stored in a list of dictionaries, and finally, a DataFrame is created from that list.
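If the markup cooperates, a shorter variant is to let pandas parse the table directly. A minimal sketch, assuming each response contains the data grid as the first table pandas can recognize (read_html also needs lxml or html5lib installed):

import pandas as pd
import requests

base_url = "http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx"

frames = []
for P2 in range(1, 39):
    response = requests.post(base_url, data={"P1": "2", "P2": str(P2), "P3": "0", "P4": "0"})
    # read_html returns a list of every table found in the document;
    # taking the first one is an assumption to verify against the real page.
    frames.append(pd.read_html(response.text)[0])

df = pd.concat(frames, ignore_index=True)
print(df)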
Please let me know if you have any further questions.
Answer 2

Score: 0
Try using requests_html.
First, install the library: pip install requests-html
Then use the following code:
from requests_html import HTMLSession

url = "http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx?P1=2&P2=7&P3=0&P4=0"

session = HTMLSession()
r = session.get(url)
r.html.render(timeout=10)  # render the page in a headless browser

table = r.html.find('#ctl00_ContentPlaceHolder2_GridView1 > tbody > tr:nth-child(n)')
for row in table:
    # Read the cells from `row`, not `table[0]`, so each iteration
    # extracts its own row rather than the first one repeatedly.
    office = row.find('td')[0].text
    cadre = row.find('td')[1].text
    designation = row.find('td')[2].text
    name = row.find('td')[3].text