June 12, 2023 22:34:09

Extract certain keys of script content with BeautifulSoup

Question

I have extracted the content of a script with BeautifulSoup. The script contains JSON-like structured data.

I want to extract the three "url" values from the first "content" group and the "defeatedBosses" value from the second "content" group.

This is part of the extracted script content:

new WH.Wow.TodayInWow(WH.ge('tiw-standalone'), [{
    "id": "dungeons-and-raids",
    "groups": [{
        "content": {
            "lines": [{
                "icon": "achievement_boss_archaedas",
                "url": "\/affix=9\/tyrannical"
            }, {
                "icon": "spell_shaman_lavasurge",
                "url": "\/affix=3\/volcanic"
            }, {
                "icon": "spell_shadow_bloodboil",
                "url": "\/affix=8\/sanguine"
            }],
            "icons": "large"
        },
        "id": "mythicaffix",
    }, {
        "content": {
            "defeatedBosses": 9,
        },
        "id": "mythic-progression",
        "url": "\/aberrus-the-shadowed-crucible\/overview"
    },

    ...

And my Python (3.11) script so far:

import re
import json
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request('https://www.wowhead.com/today-in-wow', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

soup = BeautifulSoup(html_page, "html.parser")

all_scripts = soup.find_all('script')
script_sp = all_scripts[36]

# My try

model_data = re.search(r"content = ({.*?});", script_sp, flags=re.S)
model_data = model_data.group(1)

model_data = json.loads(model_data)

print(model_data)

I get an error:

TypeError: expected string or bytes-like object, got 'Tag'

Answer 1

Score: 2

> Gives error: TypeError: expected string or bytes-like object, got 'Tag'

re.search expects a string, but all_scripts[36] is a BeautifulSoup Tag. Call .string on it:

> If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

all_scripts = soup.find_all('script')
script_sp = all_scripts[36].string

I have also fixed your regex:

model_data = re.search(r"new WH\.Wow\.TodayInWow\(WH\.ge\('tiw-standalone'\), (\[.*?\](?=\, true\);))", script_sp, flags=re.S)

This prints a large amount of JSON data.
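
To see what the corrected pattern captures without hitting the network, here is a minimal sketch run against a made-up script string. The name sample_script and its shortened contents are assumptions for illustration only; the real payload is far larger.

```python
import json
import re

# Hypothetical, shortened stand-in for the real <script> text.
sample_script = (
    "new WH.Wow.TodayInWow(WH.ge('tiw-standalone'), "
    '[{"id": "dungeons-and-raids", "groups": [{"id": "mythicaffix"}]}]'
    ", true);"
)

# Same idea as the pattern above: capture lazily from "[" up to the
# first "]" that is directly followed by ", true);".
pattern = r"new WH\.Wow\.TodayInWow\(WH\.ge\('tiw-standalone'\), (\[.*?\](?=, true\);))"

match = re.search(pattern, sample_script, flags=re.S)
if match is None:
    raise ValueError("payload not found; the page layout may have changed")

payload = json.loads(match.group(1))
print(payload[0]["id"])
```

The lookahead matters because inner arrays also end with "]"; a lazy match alone would stop at the first one and hand json.loads a truncated string.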

I'll leave extracting the actual desired values to you, as it's too much JSON for me to find the correct path.
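
As a rough sketch of that last step, the values from the question can be reached as shown below. It assumes the parsed data has the shape of the excerpt in the question (group ids "mythicaffix" and "mythic-progression"); the sample string is a trimmed copy of that excerpt with the trailing commas removed so that it is valid JSON. If the real payload contains several entries with the same id, this picks the first one.

```python
import json

# Trimmed sample mirroring the excerpt in the question (an assumption
# about the payload's shape, not the full real data).
sample = r'''
[{
    "id": "dungeons-and-raids",
    "groups": [{
        "content": {
            "lines": [
                {"icon": "achievement_boss_archaedas", "url": "\/affix=9\/tyrannical"},
                {"icon": "spell_shaman_lavasurge", "url": "\/affix=3\/volcanic"},
                {"icon": "spell_shadow_bloodboil", "url": "\/affix=8\/sanguine"}
            ],
            "icons": "large"
        },
        "id": "mythicaffix"
    }, {
        "content": {"defeatedBosses": 9},
        "id": "mythic-progression",
        "url": "\/aberrus-the-shadowed-crucible\/overview"
    }]
}]
'''
data = json.loads(sample)

# Pick the first section with the wanted id, then index its groups by id.
section = next(part for part in data if part["id"] == "dungeons-and-raids")
groups = {group["id"]: group for group in section["groups"]}

affix_urls = [line["url"] for line in groups["mythicaffix"]["content"]["lines"]]
defeated_bosses = groups["mythic-progression"]["content"]["defeatedBosses"]

print(affix_urls)       # ['/affix=9/tyrannical', '/affix=3/volcanic', '/affix=8/sanguine']
print(defeated_bosses)  # 9
```

Looking groups up by id rather than by position keeps the code working if the order of the groups changes.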

Answer 2

Score: 1

Here is an example of how you can download the page, parse the required data, and print some sample information (about US Dungeons & Raids):

import re
import json
from urllib.request import Request, urlopen

req = Request('https://www.wowhead.com/today-in-wow', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read().decode('utf-8')

json_data = re.search(r"TodayInWow\(WH\.ge\('tiw-standalone'\), (.*), true\);", html_page)
json_data = json.loads(json_data.group(1))

# uncomment to print all data:
# print(json.dumps(json_data, indent=4))

for part in json_data:
    if part['id'] == 'dungeons-and-raids' and part['regionId'] == 'US':
        for g in part['groups']:
            print(g['name'], g.get('url', '-'))

Prints:

Mythic+ Affixes /guides/mythic-keystones-and-dungeons
Aberrus, the Shadowed Crucible (Mythic) https://www.wowhead.com/guide/raids/aberrus-the-shadowed-crucible/overview
Conquest Points -

EDIT: For easier searching, I recommend transforming the JSON data from a list into a dictionary:

import re
import json
from urllib.request import Request, urlopen

req = Request(
    "https://www.wowhead.com/today-in-wow", headers={"User-Agent": "Mozilla/5.0"}
)
html_page = urlopen(req).read().decode("utf-8")

json_data = re.search(
    r"TodayInWow\(WH\.ge\('tiw-standalone'\), (.*), true\);", html_page
)
json_data = json.loads(json_data.group(1))

# uncomment to print all data:
# print(json.dumps(json_data, indent=4))

# transform the received data from list to a dictionary (for easier search)
data = {
    (d["id"], d["regionId"]): {dd["id"]: dd for dd in d["groups"]} for d in json_data
}

for line in data[("dungeons-and-raids", "US")]["mythicaffix"]["content"]["lines"]:
    row = line["name"], line["url"]
    if line["name"] == "Tyrannical":
        print(" --> ", *row)
    else:
        print("     ", *row)

Prints:

 -->  Tyrannical /affix=9/tyrannical
      Volcanic /affix=3/volcanic
      Sanguine /affix=8/sanguine
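
The same dictionary can also answer the "defeatedBosses" part of the question. Because scraped data may be missing a region or a group, the sketch below wraps the lookup in a small helper. The helper name dig and the miniature data dictionary are hypothetical; the key names follow the excerpt in the question and the data structure built above.

```python
from typing import Any


def dig(obj: Any, *keys: Any, default: Any = None) -> Any:
    """Follow keys/indexes into nested dicts and lists.

    Returns `default` as soon as any step is missing.
    """
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Hypothetical miniature of the `data` dictionary built above.
data = {
    ("dungeons-and-raids", "US"): {
        "mythic-progression": {"content": {"defeatedBosses": 9}},
    }
}

bosses = dig(
    data, ("dungeons-and-raids", "US"), "mythic-progression", "content", "defeatedBosses"
)
missing = dig(data, ("dungeons-and-raids", "EU"), "mythic-progression", default="-")

print(bosses, missing)  # 9 -
```

Returning a default instead of raising mirrors the g.get('url', '-') call in the first example and avoids a KeyError when the page omits a field.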
