Reading data shown in the Chrome console into Python

Question


I have Python code that evaluates an XPath against a web page (https://www.op.gg/summoners/kr/Hide%20on%20bush):

import requests
import lxml.html as html
import pandas as pd

url_padre = "https://www.op.gg/summoners/br/tercermundista"

link_farm = '//div[@class="stats"]//div[@class="cs"]'

r = requests.get(url_padre)

home = r.content.decode("utf-8")

parser = html.fromstring(home)
farm = parser.xpath(link_farm)

print(farm)

This code prints "[]".

But when I run the same XPath in the Chrome console, $x('//div[@class="stats"]//div[@class="cs"]').map(x=>x.innerText), it prints the numbers I want. My Python code does not.
What is the mistake?

I would like code that fixes it.

--------------------------edit---------------------------


Error                                     Traceback (most recent call last)
c:\Users\GCO\Desktop\Analisis de datos\borradores\fsdfs.ipynb Cell 2 in 3
      1 from playwright.sync_api import sync_playwright
----> 3 with sync_playwright() as p, p.chromium.launch() as browser:
      4     page = browser.new_page()
      5     page.goto("https://www.op.gg/summoners/kr/Hide%20on%20bush", timeout=10000)

File c:\Users\GCO\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\sync_api\_context_manager.py:47, in PlaywrightContextManager.__enter__(self)
     45             self._own_loop = True
     46         if self._loop.is_running():
---> 47             raise Error(
     48                 """It looks like you are using Playwright Sync API inside the asyncio loop.
     49 Please use the Async API instead."""
     50             )
     52         # In Python 3.7, asyncio.Process.wait() hangs because it does not use ThreadedChildWatcher
     53         # which is used in Python 3.8+. This is unix specific and also takes care about
     54         # cleaning up zombie processes. See https://bugs.python.org/issue35621
     55         if (
     56             sys.version_info[0] == 3
     57             and sys.version_info[1] == 7
     58             and sys.platform != "win32"
     59             and isinstance(asyncio.get_child_watcher(), asyncio.SafeChildWatcher)
     60         ):

Error: It looks like you are using Playwright Sync API inside the asyncio loop.
Please use the Async API instead.

Answer 1

Score: 1


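As I understand it, you cannot get dynamically generated content using requests: it only downloads the HTML the server sends and does not run the page's JavaScript, so elements built in the browser are missing.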
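To illustrate the point, here is a minimal sketch that uses only the standard library. The two markup strings are made-up stand-ins, not real op.gg output: one for the bare HTML a plain HTTP client receives, one for the DOM after the browser has run the page's scripts. The same XPath matches nothing in the first and finds the value in the second.

```python
import xml.etree.ElementTree as ET

# Same query as in the question, written relative to the root for ElementTree.
XPATH = ".//div[@class='stats']//div[@class='cs']"

# Hypothetical stand-in for what a plain HTTP client downloads:
# an empty application shell plus the script that would fill it in.
RAW_HTML = (
    "<html><body><div id='app'></div>"
    "<script>/* builds the page */</script></body></html>"
)

# Hypothetical stand-in for the DOM after the browser has run that script.
RENDERED_HTML = (
    "<html><body><div id='app'>"
    "<div class='stats'><div class='cs'><div>CS 250 (8.1)</div></div></div>"
    "</div></body></html>"
)


def cs_texts(markup):
    """Return the text of every node matched by XPATH in the given markup."""
    root = ET.fromstring(markup)
    return ["".join(node.itertext()).strip() for node in root.findall(XPATH)]


print(cs_texts(RAW_HTML))       # []
print(cs_texts(RENDERED_HTML))  # ['CS 250 (8.1)']
```

The Chrome console queries the second kind of document, which is why $x(...) finds the numbers while the requests-based script does not.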

Here is a solution using Playwright, which drives a real browser and lets the page load before parsing.

  1. Install Playwright using pip install playwright
  2. Install the browser and its dependencies using playwright install chromium --with-deps
  3. Run the following code:
from playwright.sync_api import sync_playwright

with sync_playwright() as p, p.chromium.launch() as browser:
    page = browser.new_page()
    page.goto("https://www.op.gg/summoners/kr/Hide%20on%20bush", timeout=10000)
    selector = "//div[@class='stats']//div[@class='cs']/div"
    cs_stats = page.query_selector_all(selector)
    print(len(cs_stats), [cs.inner_text() for cs in cs_stats])

If you want to stick with lxml as the parsing tool, you can use the following code:

from lxml import html
from playwright.sync_api import sync_playwright

with sync_playwright() as p, p.chromium.launch() as browser:
    page = browser.new_page()
    page.goto("https://www.op.gg/summoners/kr/Hide%20on%20bush", timeout=10000)
    selector = "//div[@class='stats']//div[@class='cs']/div"
    c = page.content()
    parser = html.fromstring(c)
    farm = parser.xpath(selector)
    print(len(farm), [cs.text for cs in farm])
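Both snippets above use Playwright's sync API. Inside a Jupyter notebook an asyncio loop is already running, and the sync API then raises the error quoted in the question's edit ("Please use the Async API instead"). The following is a sketch of the same lookup with the async API. The function and constant names are my own, and I have not run it against the live site.

```python
import asyncio

URL = "https://www.op.gg/summoners/kr/Hide%20on%20bush"
SELECTOR = "//div[@class='stats']//div[@class='cs']/div"


async def fetch_cs_texts(url: str = URL, selector: str = SELECTOR) -> list:
    """Open the page in headless Chromium and return the text of each match."""
    # Imported here so the function can be defined before Playwright is installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto(url, timeout=10000)
            # Playwright treats a selector that starts with "//" as XPath.
            return await page.locator(selector).all_inner_texts()
        finally:
            await browser.close()


def main() -> None:
    # Plain script: no event loop is running yet, so asyncio.run is fine here.
    print(asyncio.run(fetch_cs_texts()))
```

In a notebook cell, call it with `texts = await fetch_cs_texts()` instead of `main()`, since `asyncio.run` cannot be used while the notebook's loop is running.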

P.S.

I also noticed that op.gg uses fairly simple HTTP requests that need no authorization. You can get the desired info with this code:

import json
from urllib.request import urlopen
url = "https://op.gg/api/v1.0/internal/bypass/games/kr/summoners/4b4tvMrpRRDLvXAiQ_Vmh5yMOsD0R3GPGTUVfIanp1Httg?&limit=20"
r = urlopen(url)
games = json.load(r).get("data", [])
print(games)

games is a list of dicts that holds all the info you need. The CS stat of each game is stored under the following keys: games[0]["myData"]["stats"]["minion_kill"]
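As a small illustration of that key path, the sketch below walks a made-up payload shaped like the keys just described (real responses carry many more fields, and this is an internal API whose layout may change). It uses .get so that a game with missing stats yields None instead of raising KeyError.

```python
from typing import List, Optional

# Hypothetical sample shaped like the keys described above.
sample_payload = {
    "data": [
        {"myData": {"stats": {"minion_kill": 212}}},
        {"myData": {"stats": {"minion_kill": 187}}},
        {"myData": {}},  # a game with missing stats, to show the fallback
    ]
}


def minion_kills(payload: dict) -> List[Optional[int]]:
    """Return minion_kill for each game, or None where the key path is absent."""
    games = payload.get("data", [])
    return [
        game.get("myData", {}).get("stats", {}).get("minion_kill")
        for game in games
    ]


print(minion_kills(sample_payload))  # [212, 187, None]
```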

The only difficult part is finding the summoner_id of the desired user (4b4tvMrpRRDLvXAiQ_Vmh5yMOsD0R3GPGTUVfIanp1Httg in your example).

Answer 2

Score: 1


You can use this example to load the data from the external API URL and compute the CS value (creep score per minute):

import re
import requests


url = "https://www.op.gg/summoners/kr/Hide%20on%20bush"
api_url = "https://op.gg/api/v1.0/internal/bypass/games/kr/summoners/{summoner_id}?=&limit=20&hl=en_US&game_type=total"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0"
}

html_doc = requests.get(url, headers=headers).text
summoner_id = re.search(r'"summoner_id":"(.*?)"', html_doc).group(1)

data = requests.get(api_url.format(summoner_id=summoner_id), headers=headers).json()

for d in data["data"]:
    stats = d["myData"]["stats"]
    kills = (
        stats["minion_kill"]
        + stats["neutral_minion_kill_team_jungle"]
        + stats["neutral_minion_kill_enemy_jungle"]
        + stats["neutral_minion_kill"]
    )
    cs = kills / (d['game_length_second'] / 60)
    print(f'{cs=:.1f}')

Prints:

cs=6.7
cs=8.5
cs=8.2
cs=1.4
cs=7.3
cs=8.5
cs=6.8
cs=7.7
cs=8.7
cs=8.8
cs=5.6
cs=9.9
cs=7.0
cs=9.6
cs=9.7
cs=5.0
cs=7.5
cs=9.2
cs=9.0
cs=7.9
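The two steps of the script above, finding the summoner_id and computing CS per minute, can also be written as small functions that are easy to check without network access. This is only a sketch: the function names and sample inputs are made up, and the sum of stats is the same one the script uses. Unlike the script, the ID lookup returns None when the pattern is absent instead of failing on .group(1).

```python
import re
from typing import Optional


def extract_summoner_id(html_doc: str) -> Optional[str]:
    """Return the first "summoner_id" value embedded in the page, or None."""
    match = re.search(r'"summoner_id":"(.*?)"', html_doc)
    return match.group(1) if match else None


def cs_per_minute(game: dict) -> float:
    """Total creep score divided by game length in minutes."""
    stats = game["myData"]["stats"]
    kills = (
        stats["minion_kill"]
        + stats["neutral_minion_kill_team_jungle"]
        + stats["neutral_minion_kill_enemy_jungle"]
        + stats["neutral_minion_kill"]
    )
    minutes = game["game_length_second"] / 60
    # Guard against a zero-length game to avoid ZeroDivisionError.
    return kills / minutes if minutes else 0.0


# Made-up inputs, only to exercise the two helpers.
sample_html = '<script>{"summoner_id":"abc123","name":"demo"}</script>'
sample_game = {
    "game_length_second": 1500,
    "myData": {"stats": {
        "minion_kill": 180,
        "neutral_minion_kill_team_jungle": 8,
        "neutral_minion_kill_enemy_jungle": 2,
        "neutral_minion_kill": 10,
    }},
}

print(extract_summoner_id(sample_html))      # abc123
print(extract_summoner_id("<html></html>"))  # None
print(f"{cs_per_minute(sample_game):.1f}")   # 8.0
```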

huangapple
  • Posted on March 10, 2023, 01:12:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75687901.html