Extract certain keys of script content with BeautifulSoup
Question
I have extracted the content of a certain script with BeautifulSoup. The script contains JSON-like structured data.
I want to extract the three "url" values from the first "content" group and the "defeatedBosses" value from the second "content" group.
This is part of the extracted script content:
new WH.Wow.TodayInWow(WH.ge('tiw-standalone'), [{
    "id": "dungeons-and-raids",
    "groups": [{
        "content": {
            "lines": [{
                "icon": "achievement_boss_archaedas",
                "url": "\/affix=9\/tyrannical"
            }, {
                "icon": "spell_shaman_lavasurge",
                "url": "\/affix=3\/volcanic"
            }, {
                "icon": "spell_shadow_bloodboil",
                "url": "\/affix=8\/sanguine"
            }],
            "icons": "large"
        },
        "id": "mythicaffix",
    }, {
        "content": {
            "defeatedBosses": 9,
        },
        "id": "mythic-progression",
        "url": "\/aberrus-the-shadowed-crucible\/overview"
    },
    ...
And my Python (3.11) script so far:
import re
import json
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
req = Request('https://www.wowhead.com/today-in-wow', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()
soup = BeautifulSoup(html_page, "html.parser")
all_scripts = soup.find_all('script')
script_sp = all_scripts[36]
# My attempt
model_data = re.search(r"content = ({.*?});", script_sp, flags=re.S)
model_data = model_data.group(1)
model_data = json.loads(model_data)
print(model_data)
I get an error:
TypeError: expected string or bytes-like object, got 'Tag'
Answer 1
Score: 2
> Gives error: TypeError: expected string or bytes-like object, got 'Tag'
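re.search expects a string, but all_scripts[36] is a BeautifulSoup Tag. Use its .string attribute instead. From the BeautifulSoup documentation: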
> If a tag has only one child, and that child is a NavigableString, the child is made available as .string:
all_scripts = soup.find_all('script')
script_sp = all_scripts[36].string
I have also fixed your regex:
model_data = re.search(r"new WH\.Wow\.TodayInWow\(WH\.ge\('tiw-standalone'\), (\[.*?\](?=\, true\);))", script_sp, flags=re.S)
With the rest of your script unchanged, this prints a large amount of JSON data.
I'll leave finding the actual desired values to you, as there is too much JSON for me to trace the correct path.
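For the values the question asks about, the lookup path can be read off the excerpt in the question. The following is only a minimal sketch, assuming the real payload is valid JSON with the same nesting as that excerpt, and that the wanted groups are the first two entries of the first element's "groups" list. The sample string is a hypothetical, trimmed stand-in for the captured regex group, with the excerpt's trailing commas removed so that json.loads accepts it.

```python
import json

# Hypothetical stand-in for model_data.group(1); it mirrors the excerpt
# from the question, trimmed and without the trailing commas.
sample = r'''
[{
  "id": "dungeons-and-raids",
  "groups": [{
    "content": {
      "lines": [
        {"icon": "achievement_boss_archaedas", "url": "\/affix=9\/tyrannical"},
        {"icon": "spell_shaman_lavasurge", "url": "\/affix=3\/volcanic"},
        {"icon": "spell_shadow_bloodboil", "url": "\/affix=8\/sanguine"}
      ],
      "icons": "large"
    },
    "id": "mythicaffix"
  }, {
    "content": {"defeatedBosses": 9},
    "id": "mythic-progression",
    "url": "\/aberrus-the-shadowed-crucible\/overview"
  }]
}]
'''

data = json.loads(sample)

# Assumption: the first list element is the "dungeons-and-raids" section.
groups = data[0]["groups"]

# The three "url" values of the first "content" group.
affix_urls = [line["url"] for line in groups[0]["content"]["lines"]]

# "defeatedBosses" from the second "content" group.
defeated_bosses = groups[1]["content"]["defeatedBosses"]

print(affix_urls)
print(defeated_bosses)
```

Indexing by position is fragile if the site reorders its sections; matching on the "id" fields ("mythicaffix", "mythic-progression") is likely the safer choice on the live data.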
Answer 2
Score: 1
Here is an example of how you can download the page, parse the required data, and print some sample information (about the US Dungeons & Raids section):
import re
import json
from urllib.request import Request, urlopen
req = Request('https://www.wowhead.com/today-in-wow', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read().decode('utf-8')
json_data = re.search(r"TodayInWow\(WH\.ge\('tiw-standalone'\), (.*), true\);", html_page)
json_data = json.loads(json_data.group(1))
# uncomment to print all data:
# print(json.dumps(json_data, indent=4))
for part in json_data:
    if part['id'] == 'dungeons-and-raids' and part['regionId'] == 'US':
        for g in part['groups']:
            print(g['name'], g.get('url', '-'))
Prints:
Mythic+ Affixes /guides/mythic-keystones-and-dungeons
Aberrus, the Shadowed Crucible (Mythic) https://www.wowhead.com/guide/raids/aberrus-the-shadowed-crucible/overview
Conquest Points -
EDIT: For easier lookup, I recommend transforming the JSON data from a list into a dictionary:
import re
import json
from urllib.request import Request, urlopen
req = Request(
    "https://www.wowhead.com/today-in-wow", headers={"User-Agent": "Mozilla/5.0"}
)
html_page = urlopen(req).read().decode("utf-8")
json_data = re.search(
    r"TodayInWow\(WH\.ge\('tiw-standalone'\), (.*), true\);", html_page
)
json_data = json.loads(json_data.group(1))
# uncomment to print all data:
# print(json.dumps(json_data, indent=4))
# transform the received data from list to a dictionary (for easier search)
data = {
    (d["id"], d["regionId"]): {dd["id"]: dd for dd in d["groups"]} for d in json_data
}
for line in data[("dungeons-and-raids", "US")]["mythicaffix"]['content']['lines']:
    row = line['name'], line['url']
    if line['name'] == 'Tyrannical':
        print(' --> ', *row)
    else:
        print(' ', *row)
Prints:
--> Tyrannical /affix=9/tyrannical
Volcanic /affix=3/volcanic
Sanguine /affix=8/sanguine
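Both snippets above call .group(1) directly on the result of re.search, which raises an AttributeError if the pattern is not found (for example, after a change in the page markup). A hedged sketch of a small wrapper follows; the function name extract_today_in_wow and the sample_html string are hypothetical and not part of the original answer, and the pattern is the one used above, so it still assumes the payload sits on a single line.

```python
import json
import re

# Same pattern as in the answer above; "." does not match newlines here,
# so the payload is assumed to be on one line of the page source.
PATTERN = re.compile(r"TodayInWow\(WH\.ge\('tiw-standalone'\), (.*), true\);")


def extract_today_in_wow(html: str) -> list:
    """Return the parsed JSON array passed to TodayInWow, or raise ValueError."""
    match = PATTERN.search(html)
    if match is None:
        raise ValueError("TodayInWow payload not found; the page layout may have changed")
    return json.loads(match.group(1))


# Hypothetical, heavily trimmed stand-in for the downloaded page.
sample_html = (
    "<script>new WH.Wow.TodayInWow(WH.ge('tiw-standalone'), "
    '[{"id": "dungeons-and-raids", "regionId": "US", "groups": []}], true);</script>'
)

parts = extract_today_in_wow(sample_html)
print(parts[0]["id"], parts[0]["regionId"])
```

In the real script, html_page from urlopen(req).read().decode("utf-8") would be passed in place of sample_html.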