August 10, 2023, 18:44:30

Scrape data from an AJAX webpage with Python

Question

I'm having this issue: I need to scrape a dynamic table's data from this webpage: https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb

This webpage uses AJAX to generate the table that I want to fetch. I have inspected the element and it seems straightforward: I have the request URL with its parameters, but when I send the request I get response code 200 and an empty response body.

I must be doing something wrong, but I'm not sure how to fetch this data in Python even though it seems fairly straightforward. Could anyone help me out?

I want to get the same table as the one that is displayed on the website.

Answer 1

Score: 3

Actually, this page turned out to be a pretty cool challenge!

Breakdown:

- The link to the report sits in the source HTML. The table itself is rendered dynamically by JavaScript, but you can easily scoop the link out.
- The safeargs_data value

  5f5f7265706f72743d504c5f5553455f524242265f5f63616c6c547970653d7026646174613d323032332d30382d3130265f737667737570706f72743d74727565267265736f7572636549443d72656e646572696e6755524c265f5f706167654e756d6265723d31265f5f626174636849443d31383964663635613634642d31

  is just a silly way of obfuscating this value in hex:

  __report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1

- I've decoded it for ease of readability and editing, e.g. the data key (the report date).
- Finally, I use the table_link, the payload data, and the updated headers to make a POST request.
- Then it's easy to get the table out of the JSON and parse it with pandas.
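
To make the hex step concrete, here is a minimal sketch that decodes the safeargs_data value quoted above and re-encodes it with a different date. The helper names decode_safeargs and encode_safeargs are my own, and the replacement date is only an illustration. I'm assuming the endpoint accepts any hex-encoded query string built this way; whether the same __batchID stays valid for other dates is not something the answer confirms.

```python
import binascii
from urllib.parse import parse_qsl, urlencode

# safeargs_data value quoted in the answer
SAFEARGS_HEX = "5f5f7265706f72743d504c5f5553455f524242265f5f63616c6c547970653d7026646174613d323032332d30382d3130265f737667737570706f72743d74727565267265736f7572636549443d72656e646572696e6755524c265f5f706167654e756d6265723d31265f5f626174636849443d31383964663635613634642d31"


def decode_safeargs(hex_value: str) -> dict:
    """Turn the hex string back into a dict of query parameters."""
    query = binascii.unhexlify(hex_value).decode()
    return dict(parse_qsl(query))


def encode_safeargs(params: dict) -> str:
    """Hex-encode a dict of query parameters, mirroring hex_it in the answer."""
    return binascii.hexlify(urlencode(params).encode()).decode()


params = decode_safeargs(SAFEARGS_HEX)
print(params["data"])  # the date the report was requested for

# Ask for a different day (hypothetical date, chosen only for illustration)
params["data"] = "2023-08-09"
print(encode_safeargs(params))
```

Round-tripping through a dict keeps the parameter order, so encoding the unmodified parameters gives back the original hex string.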

By the way, if you decode the hex value and append the decoded safeargs_data to the URL as plain query parameters, you'll still get your report.

Here's a full, decoded URL:

https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb?p_auth=2XVP5Wtz&p_p_id=VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt___action=processEdit&__action=processEdit__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1

Here's my take on it:

import binascii
from io import StringIO
from urllib.parse import urlencode

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

url = (
    "https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/"
    "raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb"
)
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200",
}

with requests.Session() as session:
    # The link to the report endpoint sits in the page's source HTML
    table_link = (
        BeautifulSoup(session.get(url, headers=headers).content, "lxml")
        .select_one("a[class='vui-generic-url']")
        .get("href")
    )
    headers.update({"X-Requested-With": "XMLHttpRequest"})

    # Decoded form of safeargs_data; edit "data" to change the report date
    payload_data = {
        "__report": "PL_USE_RBB",
        "__callType": "p",
        "data": "2023-08-10",
        "_svgsupport": "true",
        "resourceID": "renderingURL",
        "__pageNumber": "1",
        "__batchID": "189df65a64d-1",
    }
    # Re-encode the query string as hex, as the page does
    hex_it = binascii.hexlify(urlencode(payload_data).encode()).decode()
    table_data = session.post(
        table_link,
        data={"safeargs_data": hex_it},
        headers=headers,
    )

    # .replace() is used to get rid of NBSPs
    report_html = table_data.json()["reportContent"].replace("\xa0", "")
    df = pd.read_html(
        # StringIO avoids the FutureWarning that newer pandas versions
        # raise when read_html() is given a literal HTML string
        StringIO(report_html),
        flavor="lxml",
        skiprows=[0],
    )[1]
    df.dropna(how="all", inplace=True)
    df.to_csv("PL_USE_RBB.csv", index=False)
    print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))

This should save a .csv file PL_USE_RBB.csv and then print this:

+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
|   ('Numer iteracji', '24') | ('Początek', '2023-08-10 15:15:34')   | ('Koniec', '2023-08-10 15:16:06')   |   ('Początek', '17') |   ('Koniec', '24') |   ('[MWh]', '47561,000') |
|----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------|
|                         23 | 2023-08-10 14:15:42                   | 2023-08-10 14:16:16                 |                   16 |                 24 |              4.88788e+07 |
|                         22 | 2023-08-10 13:15:39                   | 2023-08-10 13:16:24                 |                   15 |                 24 |              4.50884e+07 |
|                         21 | 2023-08-10 12:15:36                   | 2023-08-10 12:16:10                 |                   14 |                 24 |              4.09294e+07 |
|                         20 | 2023-08-10 11:15:33                   | 2023-08-10 11:16:15                 |                   13 |                 24 |              3.12136e+07 |
|                         19 | 2023-08-10 10:15:41                   | 2023-08-10 10:16:07                 |                   12 |                 24 |              2.55946e+07 |
|                         18 | 2023-08-10 09:15:40                   | 2023-08-10 09:16:05                 |                   11 |                 24 |              2.26086e+07 |
|                         17 | 2023-08-10 08:15:40                   | 2023-08-10 08:16:00                 |                   10 |                 24 |              1.58324e+07 |
|                         16 | 2023-08-10 07:15:35                   | 2023-08-10 07:15:56                 |                    9 |                 24 |              1.11414e+07 |
|                         15 | 2023-08-10 06:15:33                   | 2023-08-10 06:15:52                 |                    8 |                 24 |              1.11796e+07 |
|                         14 | 2023-08-10 05:15:32                   | 2023-08-10 05:15:52                 |                    7 |                 24 |              9.639e+06   |
|                         13 | 2023-08-10 04:15:41                   | 2023-08-10 04:16:11                 |                    6 |                 24 |              9.0502e+06  |
|                         12 | 2023-08-10 03:15:36                   | 2023-08-10 03:15:55                 |                    5 |                 24 |              7.871e+06   |
|                         11 | 2023-08-10 02:15:35                   | 2023-08-10 02:16:03                 |                    4 |                 24 |              8.395e+06   |
|                         10 | 2023-08-10 01:15:41                   | 2023-08-10 01:16:04                 |                    3 |                 24 |              7.8954e+06  |
|                          9 | 2023-08-10 00:15:37                   | 2023-08-10 00:15:55                 |                    2 |                 24 |              8.2582e+06  |
|                          8 | 2023-08-09 23:15:03                   | 2023-08-09 23:15:24                 |                    1 |                 24 |              6.6784e+06  |
|                          7 | 2023-08-09 22:15:08                   | 2023-08-09 22:15:16                 |                    1 |                 24 |         603200           |
|                          6 | 2023-08-09 21:15:12                   | 2023-08-09 21:15:22                 |                    1 |                 24 |              0           |
|                          5 | 2023-08-09 20:15:06                   | 2023-08-09 20:15:12                 |                    1 |                 24 |              0           |
|                          4 | 2023-08-09 19:15:04                   | 2023-08-09 19:15:14                 |                    1 |                 24 |              0           |
|                          3 | 2023-08-09 18:15:11                   | 2023-08-09 18:15:32                 |                    1 |                 24 |              0           |
|                          2 | 2023-08-09 17:15:11                   | 2023-08-09 17:15:22                 |                    1 |                 24 |              0           |
|                          1 | 2023-08-09 16:15:09                   | 2023-08-09 16:15:31                 |                    1 |                 24 |              0           |
+----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
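
One thing worth noting about the output above: the header shows the [MWh] value as '47561,000', while the rows below show values like 4.88788e+07. That suggests the comma in the report is a decimal separator that got read as a thousands separator, so the numeric column may be 1000 times too large. Below is a minimal sketch of converting such strings by hand. The helper name parse_decimal_comma is my own, and the second input is a made-up value, not one taken from the report.

```python
def parse_decimal_comma(text: str) -> float:
    """Convert a number written with a decimal comma, e.g. '47561,000', to float."""
    # Drop ordinary and non-breaking spaces that may be used for digit grouping
    cleaned = text.replace("\xa0", "").replace(" ", "")
    # Swap the decimal comma for a dot so float() can parse it
    return float(cleaned.replace(",", "."))


print(parse_decimal_comma("47561,000"))   # value taken from the header above
print(parse_decimal_comma("48 878,800"))  # hypothetical value with digit grouping
```

pandas.read_html also accepts decimal and thousands arguments, which may be a more direct fix, though I haven't tested that against this report.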
