2023-03-10 01:12:54

Read data from Chrome console to Python

Question

I have Python code that evaluates an XPath expression against a page of a website (https://www.op.gg/summoners/kr/Hide%20on%20bush):

import requests
import lxml.html as html
import pandas as pd

url_padre = "https://www.op.gg/summoners/br/tercermundista"
link_farm = '//div[@class="stats"]//div[@class="cs"]'

r = requests.get(url_padre)
home = r.content.decode("utf-8")
parser = html.fromstring(home)
farm = parser.xpath(link_farm)
print(farm)

This code prints "[]".

However, when I run the same XPath in the Chrome console, $x('//div[@class="stats"]//div[@class="cs"]').map(x=>x.innerText), it prints the numbers I want. My Python code does not.

What is the mistake? I would like code that fixes it.

--------------------------edit---------------------------

Error                                     Traceback (most recent call last)
c:\Users\GCO\Desktop\Analisis de datos\borradores\fsdfs.ipynb Cell 2 in 3
      1 from playwright.sync_api import sync_playwright
----> 3 with sync_playwright() as p, p.chromium.launch() as browser:
      4     page = browser.new_page()
      5     page.goto("https://www.op.gg/summoners/kr/Hide%20on%20bush", timeout=10000)
File c:\Users\GCO\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\sync_api\_context_manager.py:47, in PlaywrightContextManager.__enter__(self)
     45             self._own_loop = True
     46         if self._loop.is_running():
---> 47             raise Error(
     48                 """It looks like you are using Playwright Sync API inside the asyncio loop.
     49 Please use the Async API instead."""
     50             )
     52         # In Python 3.7, asyncio.Process.wait() hangs because it does not use ThreadedChildWatcher
     53         # which is used in Python 3.8+. This is unix specific and also takes care about
     54         # cleaning up zombie processes. See https://bugs.python.org/issue35621
     55         if (
     56             sys.version_info[0] == 3
     57             and sys.version_info[1] == 7
     58             and sys.platform != "win32"
     59             and isinstance(asyncio.get_child_watcher(), asyncio.SafeChildWatcher)
     60         ):
Error: It looks like you are using Playwright Sync API inside the asyncio loop.
Please use the Async API instead.

Answer 1

Score: 1

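As far as I understand, you cannot get dynamically generated content using requests: it only downloads the HTML the server sends and does not execute the page's JavaScript, which is presumably what fills in these elements in Chrome.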
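
One way to check this before reaching for a browser is to count how many elements with the target class appear in the raw HTML. The sketch below uses only the standard library; the helper name count_elements_with_class and both HTML strings are made up for illustration, and it matches a class token rather than the exact @class="cs" comparison the XPath performs.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements_with_class(html_text, class_name):
    """Hypothetical helper: how many elements carry `class_name`?"""
    counter = ClassCounter(class_name)
    counter.feed(html_text)
    return counter.count


# Invented stand-ins: what a plain HTTP client might receive versus
# what the browser holds after its scripts have run.
served_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
rendered_html = (
    '<html><body><div class="stats"><div class="cs"><div>180 CS</div></div></div></body></html>'
)

print(count_elements_with_class(served_html, "cs"))    # 0
print(count_elements_with_class(rendered_html, "cs"))  # 1
```

With the real page, passing requests.get(url).text to the helper and getting 0 would suggest the elements are created client-side.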

Here is a solution using Playwright, which can load the whole page before parsing.

Install Playwright using pip install playwright
Install the browser and its dependencies using playwright install chromium --with-deps
Run the following code:

from playwright.sync_api import sync_playwright

# Both context managers clean up (browser and driver) on exit.
with sync_playwright() as p, p.chromium.launch() as browser:
    page = browser.new_page()
    page.goto("https://www.op.gg/summoners/kr/Hide%20on%20bush", timeout=10000)  # timeout in ms
    selector = "//div[@class='stats']//div[@class='cs']/div"
    cs_stats = page.query_selector_all(selector)
    print(len(cs_stats), [cs.inner_text() for cs in cs_stats])

If you want to stick with lxml as the parsing tool, you can use the following code:

from lxml import html
from playwright.sync_api import sync_playwright

with sync_playwright() as p, p.chromium.launch() as browser:
    page = browser.new_page()
    page.goto("https://www.op.gg/summoners/kr/Hide%20on%20bush", timeout=10000)  # timeout in ms
    selector = "//div[@class='stats']//div[@class='cs']/div"
    c = page.content()  # HTML of the page after its scripts have run
    parser = html.fromstring(c)
    farm = parser.xpath(selector)
    print(len(farm), [cs.text for cs in farm])
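
The traceback in the question's edit comes from running the sync API inside a Jupyter notebook, where an asyncio loop is already running. A minimal sketch of the same scrape with Playwright's async API follows. The function name fetch_cs_texts and its parameters are my own, and I have not run it in the asker's environment, so treat it as a starting point rather than a confirmed fix.

```python
import asyncio

CS_SELECTOR = "//div[@class='stats']//div[@class='cs']/div"


async def fetch_cs_texts(url, selector=CS_SELECTOR, timeout_ms=10000):
    """Return the inner text of every element matching `selector` on `url`."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto(url, timeout=timeout_ms)
            elements = await page.query_selector_all(selector)
            return [await element.inner_text() for element in elements]
        finally:
            # Close the browser even if navigation or the query fails.
            await browser.close()


# In a notebook cell, where a loop is already running:
#     texts = await fetch_cs_texts("https://www.op.gg/summoners/kr/Hide%20on%20bush")
# In a plain script:
#     texts = asyncio.run(fetch_cs_texts("https://www.op.gg/summoners/kr/Hide%20on%20bush"))
```

The sync examples above should still work unchanged when run as a regular .py script, since no event loop is running there.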

P.S.

I have also noticed that op.gg uses fairly simple HTTP requests that do not need authorization. You can get the desired info with this code:

import json
from urllib.request import urlopen

url = "https://op.gg/api/v1.0/internal/bypass/games/kr/summoners/4b4tvMrpRRDLvXAiQ_Vmh5yMOsD0R3GPGTUVfIanp1Httg?&limit=20"
r = urlopen(url)
games = json.load(r).get("data", [])
print(games)

games is a list of dicts that stores all the info you need. The CS stat of a game is stored under the following keys: games[0]["myData"]["stats"]["minion_kill"]

The only difficult part is finding the summoner_id for the desired user (4b4tvMrpRRDLvXAiQ_Vmh5yMOsD0R3GPGTUVfIanp1Httg in your example).
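
To collect that value for every game rather than only games[0], a small helper along these lines could be used. The name minion_kills and the sample payload are invented; the sample only mimics the key layout described above and is not real API output.

```python
import json


def minion_kills(payload):
    """Return the minion_kill value of each game in an API-style payload."""
    kills = []
    for game in payload.get("data", []):
        stats = game.get("myData", {}).get("stats", {})
        # Skip games that lack the field instead of raising KeyError.
        if "minion_kill" in stats:
            kills.append(stats["minion_kill"])
    return kills


# Invented sample data with the same nesting as the response described above.
sample = json.loads("""
{"data": [
  {"myData": {"stats": {"minion_kill": 212}}},
  {"myData": {"stats": {}}},
  {"myData": {"stats": {"minion_kill": 187}}}
]}
""")

print(minion_kills(sample))  # [212, 187]
```

With the live endpoint, the payload would be json.load(urlopen(url)) rather than the hard-coded sample.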

Answer 2

Score: 1

You can use this example to load the data from the external URL and compute the CS value:

import re
import requests

url = "https://www.op.gg/summoners/kr/Hide%20on%20bush"
api_url = "https://op.gg/api/v1.0/internal/bypass/games/kr/summoners/{summoner_id}?=&limit=20&hl=en_US&game_type=total"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0"
}

# The summoner_id is embedded in the page source, so a regex can pull it out.
html_doc = requests.get(url, headers=headers).text
summoner_id = re.search(r'"summoner_id":"(.*?)"', html_doc).group(1)

data = requests.get(api_url.format(summoner_id=summoner_id), headers=headers).json()

for d in data["data"]:
    stats = d["myData"]["stats"]
    kills = (
        stats["minion_kill"]
        + stats["neutral_minion_kill_team_jungle"]
        + stats["neutral_minion_kill_enemy_jungle"]
        + stats["neutral_minion_kill"]
    )
    # CS per minute: total kills divided by game length in minutes.
    cs = kills / (d['game_length_second'] / 60)
    print(f'{cs=:.1f}')

Prints:

cs=6.7
cs=8.5
cs=8.2
cs=1.4
cs=7.3
cs=8.5
cs=6.8
cs=7.7
cs=8.7
cs=8.8
cs=5.6
cs=9.9
cs=7.0
cs=9.6
cs=9.7
cs=5.0
cs=7.5
cs=9.2
cs=9.0
cs=7.9
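
Note that re.search(...).group(1) raises AttributeError when the pattern is not found, for example if the page layout changes. The sketch below splits the two steps of the script into small functions that can be checked without network access. The function names and both sample inputs are invented for illustration; the regex and the sum of fields are the ones used above.

```python
import re

SUMMONER_ID_RE = re.compile(r'"summoner_id":"(.*?)"')


def find_summoner_id(html_doc):
    """Return the first summoner_id embedded in the page source, or None."""
    match = SUMMONER_ID_RE.search(html_doc)
    return match.group(1) if match else None


def cs_per_minute(game):
    """Sum the four kill counters and divide by game length in minutes."""
    stats = game["myData"]["stats"]
    kills = (
        stats["minion_kill"]
        + stats["neutral_minion_kill_team_jungle"]
        + stats["neutral_minion_kill_enemy_jungle"]
        + stats["neutral_minion_kill"]
    )
    minutes = game["game_length_second"] / 60
    # Guard against a zero-length record instead of dividing by zero.
    return kills / minutes if minutes else 0.0


# Invented inputs that only mimic the shapes used in the script above.
sample_html = '<script>{"summoner_id":"abc123","name":"demo"}</script>'
sample_game = {
    "game_length_second": 1800,
    "myData": {"stats": {
        "minion_kill": 200,
        "neutral_minion_kill_team_jungle": 10,
        "neutral_minion_kill_enemy_jungle": 5,
        "neutral_minion_kill": 25,
    }},
}

print(find_summoner_id(sample_html))        # abc123
print(f"{cs_per_minute(sample_game):.1f}")  # 8.0
```

Returning None from find_summoner_id lets the caller decide how to report a missing ID before any API request is made.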
