Web-scraping an ASPX page, looping over the page number from Network > Payload

Question


I am trying to scrape the table with the columns office, cadre, designation, name, and asset_details.

Here P2 ranges from 1 to 38.

Network > Payload > Form Data contains the page number (__EVENTARGUMENT: Page$x), which cannot be passed as an input parameter and has to be ascertained.

http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx?P1=2&P2=7&P3=0&P4=0
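For context on that form field: ASP.NET WebForms pagers submit the page number by calling __doPostBack(target, 'Page$n'), which posts __EVENTTARGET and __EVENTARGUMENT along with hidden state fields such as __VIEWSTATE and __EVENTVALIDATION. Below is a minimal sketch of replaying that postback with requests and BeautifulSoup; the __EVENTTARGET value is an assumption and should be copied from the actual DevTools payload:

import requests
from bs4 import BeautifulSoup

url = "http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx?P1=2&P2=7&P3=0&P4=0"

with requests.Session() as session:
    # GET first: this returns page 1 of the grid plus the hidden state fields.
    soup = BeautifulSoup(session.get(url).text, "html.parser")

    # Replay the postback that the "2" pager link would trigger.
    form = {
        "__EVENTTARGET": "ctl00$ContentPlaceHolder2$GridView1",  # assumed grid id
        "__EVENTARGUMENT": "Page$2",
    }
    # Carry over every hidden input (e.g. __VIEWSTATE, __EVENTVALIDATION).
    for hidden in soup.select("input[type=hidden]"):
        name = hidden.get("name")
        if name:
            form.setdefault(name, hidden.get("value", ""))

    page2 = BeautifulSoup(session.post(url, data=form).text, "html.parser")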

Here is my attempt at code:

import json
import requests
import pandas as pd


api_url = (
    "http://bpsm.bihar.gov.in/Assets2019/AssetDetails.aspx?P1=2&P2=33&P3=0&P4=0"
)
payload = {"P1": "2", "P2": "33", "P3": "0", "P4": "0"}


all_data = []
for P2 in range(1, 39):  # <-- increase from 1 to 200
    print(P2)
    payload['P2'] = P2
    data = requests.post(api_url, json=payload).json()
    data = json.loads(data['d'])
    if not data:
        break
    for name, count in data[0].items():
        all_data.append({
        })

Answer 1

Score: 2


You can use the BeautifulSoup library in Python.

import requests
import pandas as pd
from bs4 import BeautifulSoup

base_url = "http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx"
payload = {"P1": "2", "P2": "", "P3": "0", "P4": "0"}

all_data = []

for P2 in range(1, 39):
    print(P2)
    payload['P2'] = str(P2)
    response = requests.post(base_url, data=payload)
    soup = BeautifulSoup(response.text, 'html.parser')

    table = soup.find('table', {'class': 'rgMasterTable'})
    if table is None:
        # No results table on this page; skip this P2 instead of crashing.
        continue
    rows = table.find_all('tr')

    for row in rows[1:]:  # Skip the header row
        columns = row.find_all('td')
        office = columns[0].text.strip()
        cadre = columns[1].text.strip()
        designation = columns[2].text.strip()
        name = columns[3].text.strip()
        asset_details = columns[4].text.strip()

        all_data.append({
            'Office': office,
            'Cadre': cadre,
            'Designation': designation,
            'Name': name,
            'Asset Details': asset_details
        })

# Create a DataFrame from the scraped data
df = pd.DataFrame(all_data)
print(df)

The code iterates over the range of P2 values from 1 to 38. For each value, it sends a POST request with the updated payload to fetch the corresponding page. It then uses BeautifulSoup to parse the HTML response and extracts the desired table data. The scraped data is stored in a list of dictionaries, and finally, a DataFrame is created from that list.
Please let me know if you have any further questions.
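One caveat relative to the question: this loop only varies P2, so it retrieves the first grid page for each value. When a grid spans several pages, the ASP.NET pager normally renders links whose href calls __doPostBack('...', 'Page$n'), so the page count can be ascertained from the parsed soup. A minimal sketch under that assumption (the helper name is hypothetical):

import re

def last_page(soup):
    # Highest Page$n found among the pager links; 1 if there is no pager.
    pages = [1]
    for a in soup.select("a[href*='Page$']"):
        m = re.search(r"Page\$(\d+)", a.get("href", ""))
        if m:
            pages.append(int(m.group(1)))
    return max(pages)

Pages 2 through last_page(soup) would then be fetched by replaying the postback, as sketched in the question above.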

Answer 2

Score: 0


Try requests_html.

First of all, install the library: pip install requests-html

Then use this code:

from requests_html import HTMLSession

url = "http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx?P1=2&P2=7&P3=0&P4=0"

session = HTMLSession()
r = session.get(url)

# Render the page in a headless browser so the grid HTML is available.
r.html.render(timeout=10)

# All rows of the asset-details GridView.
table = r.html.find('#ctl00_ContentPlaceHolder2_GridView1 > tbody > tr:nth-child(n)')

for row in table:
    office = row.find('td')[0].text
    cadre = row.find('td')[1].text
    designation = row.find('td')[2].text
    name = row.find('td')[3].text
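To end up with the DataFrame the question asks for, the per-row cells can be collected and handed to pandas, mirroring Answer 1. A minimal sketch reusing the table list from above:

import pandas as pd

records = []
for row in table:
    cells = [td.text for td in row.find('td')]
    if len(cells) >= 5:  # skip header/pager rows without data cells
        records.append(dict(zip(
            ['Office', 'Cadre', 'Designation', 'Name', 'Asset Details'], cells)))

df = pd.DataFrame(records)
print(df)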
