Extract certain keys of script content with BeautifulSoup

Question


I have extracted the content of a certain script with BeautifulSoup. The script contains JSON-like structured data.

I want to extract the three "url" values from the first "content" group and the "defeatedBosses" value from the second "content" group.

This is part of the extracted script content:

new WH.Wow.TodayInWow(WH.ge('tiw-standalone'), [{
    "id": "dungeons-and-raids",
    "groups": [{
        "content": {
            "lines": [{
                "icon": "achievement_boss_archaedas",
                "url": "\/affix=9\/tyrannical"
            }, {
                "icon": "spell_shaman_lavasurge",
                "url": "\/affix=3\/volcanic"
            }, {
                "icon": "spell_shadow_bloodboil",
                "url": "\/affix=8\/sanguine"
            }],
            "icons": "large"
        },
        "id": "mythicaffix",
    }, {
        "content": {
            "defeatedBosses": 9,
        },
        "id": "mythic-progression",
        "url": "\/aberrus-the-shadowed-crucible\/overview"
    }, 

    ...

And my Python (3.11) script so far:

import re
import json
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request('https://www.wowhead.com/today-in-wow', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

soup = BeautifulSoup(html_page, "html.parser")

all_scripts = soup.find_all('script')
script_sp = all_scripts[36]

# My attempt

model_data = re.search(r"content = ({.*?});", script_sp, flags=re.S)
model_data = model_data.group(1)

model_data = json.loads(model_data)

print(model_data)

I get an error:

TypeError: expected string or bytes-like object, got 'Tag'

Answer 1

Score: 2


> Gives error: TypeError: expected string or bytes-like object, got 'Tag'

You should call .string:

> If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

all_scripts = soup.find_all('script')
script_sp = all_scripts[36].string
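As a minimal sketch of the difference, the snippet below parses a tiny, hypothetical page (not the real Wowhead markup) and compares the Tag with its .string. It also shows one possible alternative to the hard-coded index 36: selecting the script by its content, on the assumption that the script count of the page may change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; the real page has many more <script> tags.
html = """
<html><body>
<script>var unrelated = 1;</script>
<script>new WH.Wow.TodayInWow(WH.ge('tiw-standalone'), [{"id": "dungeons-and-raids"}], true);</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
scripts = soup.find_all("script")

tag = scripts[1]          # a bs4 Tag object, which re.search() rejects
text = scripts[1].string  # the script body as a str subclass, which re.search() accepts

# Alternative to a hard-coded index: pick the script by what it contains.
target = next(s for s in scripts if s.string and "TodayInWow" in s.string)

print(type(tag).__name__)
print(text[:21])
```

Selecting by content is only a suggestion here; the original answer keeps the index-based lookup.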

Also, I have fixed your regex to:

model_data = re.search(r"new WH\.Wow\.TodayInWow\(WH\.ge\('tiw-standalone'\), (\[.*?\](?=\, true\);))", script_sp, flags=re.S)

This prints a ton of JSON data.

To get the actual desired values, I'll leave it up to you, as it's too much JSON to find the correct path.
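For the values asked about in the question, a rough sketch might look like the following. The script_text below is a shortened, hypothetical stand-in built from the excerpt in the question (with the trailing commas removed so that it is valid JSON), and taking the first list entry is an assumption; the real data may contain several entries, so the path may need adjusting.

```python
import json
import re

# Hypothetical, shortened stand-in for the script text (valid JSON inside).
script_text = r"""new WH.Wow.TodayInWow(WH.ge('tiw-standalone'), [{
    "id": "dungeons-and-raids",
    "groups": [{
        "content": {"lines": [
            {"icon": "achievement_boss_archaedas", "url": "\/affix=9\/tyrannical"},
            {"icon": "spell_shaman_lavasurge", "url": "\/affix=3\/volcanic"},
            {"icon": "spell_shadow_bloodboil", "url": "\/affix=8\/sanguine"}
        ], "icons": "large"},
        "id": "mythicaffix"
    }, {
        "content": {"defeatedBosses": 9},
        "id": "mythic-progression",
        "url": "\/aberrus-the-shadowed-crucible\/overview"
    }]
}], true);"""

match = re.search(
    r"new WH\.Wow\.TodayInWow\(WH\.ge\('tiw-standalone'\), (\[.*?\](?=, true\);))",
    script_text,
    flags=re.S,
)
sections = json.loads(match.group(1))

# Assumption: the first entry is the wanted "dungeons-and-raids" section.
groups = {g["id"]: g for g in sections[0]["groups"]}

urls = [line["url"] for line in groups["mythicaffix"]["content"]["lines"]]
defeated_bosses = groups["mythic-progression"]["content"]["defeatedBosses"]

print(urls)
print(defeated_bosses)
```

Indexing the groups by their "id" avoids relying on their position in the list, which may not be stable.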

Answer 2

Score: 1


Here is an example of how you can download the page, parse the required data, and print sample information (about US Dungeons & Raids):

import re
import json
from urllib.request import Request, urlopen

req = Request('https://www.wowhead.com/today-in-wow', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read().decode('utf-8')

json_data = re.search(r"TodayInWow\(WH\.ge\('tiw-standalone'\), (.*), true\);", html_page)
json_data = json.loads(json_data.group(1))

# uncomment to print all data:
# print(json.dumps(json_data, indent=4))

for part in json_data:
    if part['id'] == 'dungeons-and-raids' and part['regionId'] == 'US':
        for g in part['groups']:
            print(g['name'], g.get('url', '-'))

Prints:

Mythic+ Affixes /guides/mythic-keystones-and-dungeons
Aberrus, the Shadowed Crucible (Mythic) https://www.wowhead.com/guide/raids/aberrus-the-shadowed-crucible/overview
Conquest Points -

EDIT: For easier lookups, I recommend transforming the JSON data from a list into a dictionary:

import re
import json
from urllib.request import Request, urlopen

req = Request(
    "https://www.wowhead.com/today-in-wow", headers={"User-Agent": "Mozilla/5.0"}
)
html_page = urlopen(req).read().decode("utf-8")

json_data = re.search(
    r"TodayInWow\(WH\.ge\('tiw-standalone'\), (.*), true\);", html_page
)
json_data = json.loads(json_data.group(1))

# uncomment to print all data:
# print(json.dumps(json_data, indent=4))

# transform the received data from list to a dictionary (for easier search)
data = {
    (d["id"], d["regionId"]): {dd["id"]: dd for dd in d["groups"]} for d in json_data
}

for line in data[("dungeons-and-raids", "US")]["mythicaffix"]['content']['lines']:
    l = line['name'], line['url']
    if line['name'] == 'Tyrannical':
        print(' --> ', *l)
    else:
        print('     ', *l)

Prints:

 -->  Tyrannical /affix=9/tyrannical
      Volcanic /affix=3/volcanic
      Sanguine /affix=8/sanguine
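As a possible alternative to anchoring the regex on the trailing ", true);", the standard library's json.JSONDecoder.raw_decode can parse one JSON value from the start of a string and report where it ended. The sketch below is not part of the original answer; it uses a hypothetical page fragment and assumes the marker text from the regex above appears exactly once.

```python
import json

# Hypothetical page fragment; the marker text is taken from the regex above.
page = (
    "<script>new WH.Wow.TodayInWow(WH.ge('tiw-standalone'), "
    '[{"id": "dungeons-and-raids", "regionId": "US", "groups": []}], true);'
    "</script>"
)

marker = "WH.ge('tiw-standalone'), "
start = page.index(marker) + len(marker)
remainder = page[start:]

# raw_decode parses a single JSON value at the start of the string and
# returns it with the index where parsing stopped, so the JavaScript
# that follows the array is simply ignored.
data, end = json.JSONDecoder().raw_decode(remainder)

print(data[0]["id"], data[0]["regionId"])
```

This avoids depending on what follows the array, at the cost of depending on the exact marker text in front of it.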

huangapple
  • Published on June 12, 2023 at 22:34:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76457695.html