Web-scraping an ASPX page, looping over the page number from Network > Payload

Question


I am trying to scrape the table with the columns office, cadre, designation, name, and asset_details.

Here P2 ranges from 1 to 38.

Network > Payload > Form Data contains the page number (__EVENTARGUMENT: Page$x), which can't be supplied directly in the request and has to be ascertained.

http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx?P1=2&P2=7&P3=0&P4=0

Here is my attempt at the code:

import json
import requests
import pandas as pd


api_url = (
    "http://bpsm.bihar.gov.in/Assets2019/AssetDetails.aspx?P1=2&P2=33&P3=0&P4=0"
)
payload = {"P1": "2", "P2": "33", "P3": "0", "P4": "0"}


all_data = []
for P2 in range(1, 39):  # <-- increase from 1 to 200
    print(P2)
    payload['P2'] = P2
    data = requests.post(api_url, json=payload).json()
    data = json.loads(data['d'])
    if not data:
        break
    for name, count in data[0].items():
        all_data.append({
        })
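
For context, paging a WebForms grid like this usually means replaying the page's hidden state fields and setting __EVENTTARGET / __EVENTARGUMENT yourself, rather than passing the page number in the URL. Below is a minimal sketch of that pattern, assuming the pager posts back to the GridView control (the name ctl00$ContentPlaceHolder2$GridView1 is inferred from the grid's id on the page and may need adjusting):

import requests
from bs4 import BeautifulSoup

url = "http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx?P1=2&P2=7&P3=0&P4=0"
session = requests.Session()

def hidden_fields(soup):
    # Collect the WebForms state fields (__VIEWSTATE, __EVENTVALIDATION, ...)
    # that the server expects back on every postback.
    return {inp["name"]: inp.get("value", "")
            for inp in soup.select("input[type=hidden]")}

# GET the first page to obtain the initial state fields.
soup = BeautifulSoup(session.get(url).text, "html.parser")

for page in range(2, 6):  # walk pages 2..5 via the pager postback
    form = hidden_fields(soup)
    form["__EVENTTARGET"] = "ctl00$ContentPlaceHolder2$GridView1"  # assumed control name
    form["__EVENTARGUMENT"] = f"Page${page}"
    soup = BeautifulSoup(session.post(url, data=form).text, "html.parser")
    # ... extract the table rows from `soup` here ...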

Answer 1

Score: 2


You can use the BeautifulSoup library in Python.

import requests
import pandas as pd
from bs4 import BeautifulSoup

base_url = "http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx"
payload = {"P1": "2", "P2": "", "P3": "0", "P4": "0"}

all_data = []

for P2 in range(1, 39):
    print(P2)
    payload['P2'] = str(P2)
    response = requests.post(base_url, data=payload)
    soup = BeautifulSoup(response.text, 'html.parser')

    table = soup.find('table', {'class': 'rgMasterTable'})
    rows = table.find_all('tr')

    for row in rows[1:]:  # Skip the header row
        columns = row.find_all('td')
        office = columns[0].text.strip()
        cadre = columns[1].text.strip()
        designation = columns[2].text.strip()
        name = columns[3].text.strip()
        asset_details = columns[4].text.strip()

        all_data.append({
            'Office': office,
            'Cadre': cadre,
            'Designation': designation,
            'Name': name,
            'Asset Details': asset_details
        })

# Create a DataFrame from the scraped data
df = pd.DataFrame(all_data)
print(df)

The code iterates over the range of P2 values from 1 to 38. For each value, it sends a POST request with the updated payload to fetch the corresponding page. It then uses BeautifulSoup to parse the HTML response and extracts the desired table data. The scraped data is stored in a list of dictionaries, and finally, a DataFrame is created from that list.
Please let me know if you have any further questions.
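
One caveat about the loop above: if a given P2 page comes back without the expected grid, soup.find returns None and table.find_all raises an AttributeError. A hedged guard for the loop body, reusing the same names (and assuming the rgMasterTable class from the answer):

    table = soup.find('table', {'class': 'rgMasterTable'})
    if table is None:
        continue  # no grid on this page; skip to the next P2
    rows = table.find_all('tr')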

Answer 2

Score: 0


Try using requests_html.

First, install the library: pip install requests-html

Then use this code:

from requests_html import HTMLSession

url = "http://bpsm.bihar.gov.in/Assets2020/AssetDetails.aspx?P1=2&P2=7&P3=0&P4=0"

session = HTMLSession()
r = session.get(url)
r.html.render(timeout=10)  # execute the page's JavaScript before parsing

table = r.html.find('#ctl00_ContentPlaceHolder2_GridView1 > tbody > tr:nth-child(n)')

for row in table:
    office = row.find('td')[0].text
    cadre = row.find('td')[1].text
    designation = row.find('td')[2].text
    name = row.find('td')[3].text
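
Note that r.html.render() downloads a Chromium build the first time it is called (requests_html renders through pyppeteer), so the first run can take a while; later runs reuse the cached browser. Also, the URL above fetches only the P2=7 page, so you would still need to loop over P2 (and over Page$x, as discussed above) to cover the whole table.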
