Scrape data from AJAX webpage with Python

Question


I'm having an issue: I need to scrape a dynamic table's data from this webpage - https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb

This webpage uses AJAX to generate the table that I want to fetch. I have inspected the element and it seems straightforward: I have the request URL with its params, I try to send a request, I get response code 200, but the response is empty.

I must be doing something wrong, but I'm not sure how to fetch this data in Python even though it seems fairly straightforward. Could anyone help me out?

I want to get the same table as the one that is displayed on the website.
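Roughly what I'm trying (a sketch - AJAX_URL stands in for the request URL with params that I copied from the network tab):

    import requests

    # Placeholder for the XHR request URL (with its query params) from DevTools
    AJAX_URL = "https://www.pse.pl/..."

    response = requests.get(AJAX_URL)
    print(response.status_code)  # 200
    print(response.text)         # prints nothing - the body is empty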

Answer 1

Score: 3


Actually, this page turned out to be a pretty cool challenge!

Breakdown:

  • The link to the report sits in the source HTML; the table itself is rendered dynamically by JavaScript, but you can easily scoop the link out.
  • The safeargs_data value

    5f5f7265706f72743d504c5f5553455f524242265f5f63616c6c547970653d7026646174613d323032332d30382d3130265f737667737570706f72743d74727565267265736f7572636549443d72656e646572696e6755524c265f5f706167654e756d6265723d31265f5f626174636849443d31383964663635613634642d31

    is just a silly way of obfuscating, in hex, this value:

    __report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1

  • I've decoded it for readability and for ease of editing, e.g. the data key (see the round-trip sketch right after this list).
  • Finally, I use the table_link, payload data, and updated headers to make a POST request.
  • Then, it's easy to get the table out of the JSON and parse it with pandas.

By the way, if you decode the hex value from the URL and append the safeargs_data parameters to it, you'll still get your report.

Here's a full, decoded URL:

    https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb?p_auth=2XVP5Wtz&p_p_id=VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_VisioPortlet_WAR_visioneoportlet_INSTANCE_xOsekso49yXt___action=processEdit&__action=processEdit__report=PL_USE_RBB&__callType=p&data=2023-08-10&_svgsupport=true&resourceID=renderingURL&__pageNumber=1&__batchID=189df65a64d-1

Here's my take on it:

    import binascii
    from urllib.parse import urlencode

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    from tabulate import tabulate

    url = (
        "https://www.pse.pl/dane-systemowe/funkcjonowanie-kse/"
        "raporty-godzinowe-z-funkcjonowania-rb/iteracje-obslugi-use-w-ramach-rbb"
    )
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200",
    }

    with requests.Session() as session:
        # The report's action URL sits in the static page source
        table_link = (
            BeautifulSoup(session.get(url, headers=headers).content, "lxml")
            .select_one("a[class='vui-generic-url']")
            .get("href")
        )
        headers.update({"X-Requested-With": "XMLHttpRequest"})
        payload_data = {
            "__report": "PL_USE_RBB",
            "__callType": "p",
            "data": "2023-08-10",
            "_svgsupport": "true",
            "resourceID": "renderingURL",
            "__pageNumber": "1",
            "__batchID": "189df65a64d-1",
        }
        # Re-obfuscate the payload into the hex form the endpoint expects
        hex_it = binascii.hexlify(urlencode(payload_data).encode()).decode()
        table_data = session.post(
            table_link,
            data={"safeargs_data": hex_it},
            headers=headers,
        )
        df = pd.read_html(
            # .replace() is used to get rid of NBSPs
            table_data.json()["reportContent"].replace("\xa0", ""),
            flavor="lxml",
            skiprows=[0],
        )[1]
        df.dropna(how="all", inplace=True)
        df.to_csv("PL_USE_RBB.csv", index=False)
        print(tabulate(df, headers="keys", tablefmt="psql", showindex=False))

This should save a .csv file PL_USE_RBB.csv and then print this:

    +----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
    | ('Numer iteracji', '24') | ('Początek', '2023-08-10 15:15:34') | ('Koniec', '2023-08-10 15:16:06') | ('Początek', '17') | ('Koniec', '24') | ('[MWh]', '47561,000') |
    |----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------|
    | 23 | 2023-08-10 14:15:42 | 2023-08-10 14:16:16 | 16 | 24 | 4.88788e+07 |
    | 22 | 2023-08-10 13:15:39 | 2023-08-10 13:16:24 | 15 | 24 | 4.50884e+07 |
    | 21 | 2023-08-10 12:15:36 | 2023-08-10 12:16:10 | 14 | 24 | 4.09294e+07 |
    | 20 | 2023-08-10 11:15:33 | 2023-08-10 11:16:15 | 13 | 24 | 3.12136e+07 |
    | 19 | 2023-08-10 10:15:41 | 2023-08-10 10:16:07 | 12 | 24 | 2.55946e+07 |
    | 18 | 2023-08-10 09:15:40 | 2023-08-10 09:16:05 | 11 | 24 | 2.26086e+07 |
    | 17 | 2023-08-10 08:15:40 | 2023-08-10 08:16:00 | 10 | 24 | 1.58324e+07 |
    | 16 | 2023-08-10 07:15:35 | 2023-08-10 07:15:56 | 9 | 24 | 1.11414e+07 |
    | 15 | 2023-08-10 06:15:33 | 2023-08-10 06:15:52 | 8 | 24 | 1.11796e+07 |
    | 14 | 2023-08-10 05:15:32 | 2023-08-10 05:15:52 | 7 | 24 | 9.639e+06 |
    | 13 | 2023-08-10 04:15:41 | 2023-08-10 04:16:11 | 6 | 24 | 9.0502e+06 |
    | 12 | 2023-08-10 03:15:36 | 2023-08-10 03:15:55 | 5 | 24 | 7.871e+06 |
    | 11 | 2023-08-10 02:15:35 | 2023-08-10 02:16:03 | 4 | 24 | 8.395e+06 |
    | 10 | 2023-08-10 01:15:41 | 2023-08-10 01:16:04 | 3 | 24 | 7.8954e+06 |
    | 9 | 2023-08-10 00:15:37 | 2023-08-10 00:15:55 | 2 | 24 | 8.2582e+06 |
    | 8 | 2023-08-09 23:15:03 | 2023-08-09 23:15:24 | 1 | 24 | 6.6784e+06 |
    | 7 | 2023-08-09 22:15:08 | 2023-08-09 22:15:16 | 1 | 24 | 603200 |
    | 6 | 2023-08-09 21:15:12 | 2023-08-09 21:15:22 | 1 | 24 | 0 |
    | 5 | 2023-08-09 20:15:06 | 2023-08-09 20:15:12 | 1 | 24 | 0 |
    | 4 | 2023-08-09 19:15:04 | 2023-08-09 19:15:14 | 1 | 24 | 0 |
    | 3 | 2023-08-09 18:15:11 | 2023-08-09 18:15:32 | 1 | 24 | 0 |
    | 2 | 2023-08-09 17:15:11 | 2023-08-09 17:15:22 | 1 | 24 | 0 |
    | 1 | 2023-08-09 16:15:09 | 2023-08-09 16:15:31 | 1 | 24 | 0 |
    +----------------------------+---------------------------------------+-------------------------------------+----------------------+--------------------+--------------------------+
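Because the payload is sent decoded, re-running the report for another day is just a matter of swapping the data key. Here's a small, hypothetical wrapper over the same calls (fetch_report is my name for it; session, table_link, headers, and payload_data are the objects from the script above):

    def fetch_report(day: str) -> pd.DataFrame:
        """Fetch the PL_USE_RBB report table for a given YYYY-MM-DD day."""
        payload = {**payload_data, "data": day}
        hexed = binascii.hexlify(urlencode(payload).encode()).decode()
        response = session.post(
            table_link, data={"safeargs_data": hexed}, headers=headers
        )
        df = pd.read_html(
            response.json()["reportContent"].replace("\xa0", ""),
            flavor="lxml",
            skiprows=[0],
        )[1]
        return df.dropna(how="all")

    df = fetch_report("2023-08-09")  # e.g. the previous day's iterations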
