Parsing of HTML using a template
Question
We have a large project with 60+ different sets of data in HTML, and we need to create a JSON object from them.
For example:
Source 1:
<div id="foo">Person Name: Bar</div>
Source 2:
<div id="bar">Person Name: Foo</div>
What tech exists at this point to create a template and say, "Okay, grab this HTML text from this div and put it in this JSON field"?
I saw this, but it's 7 years old:
https://stackoverflow.com/questions/25753368/performant-parsing-of-html-pages-with-node-js-and-xpath
Answer 1
Score: 1
Have you tried the Beautiful Soup and json Python libraries?
They are easy to get started with, if that is your requirement.
Here is a sample code snippet that finds div tags using Beautiful Soup (creating a JSON string with the json library is shown further down):
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<div>"Hello World!"</div>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
div_tags = soup.find_all('div')
print(div_tags[0].text)
# "Hello World!"
If you want to put the text into JSON, store it in a Python dictionary and use the json library to write it to either a new JSON document or an existing one.
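To connect this back to the "template" idea in the question, here is a minimal sketch, not a definitive implementation. It assumes each source can be described by a mapping from JSON field name to CSS selector. The names TEMPLATES, extract, and person_name, and the handling of the "Person Name:" label, are hypothetical and chosen only for illustration; select_one and get_text are regular Beautiful Soup calls.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical templates: one per source, mapping JSON field -> CSS selector.
TEMPLATES = {
    "source1": {"person_name": "div#foo"},
    "source2": {"person_name": "div#bar"},
}

def extract(html, template, label="Person Name:"):
    """Apply a field -> selector template to one HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, selector in template.items():
        node = soup.select_one(selector)
        if node is None:
            # Selector did not match; keep the field but mark it as missing.
            record[field] = None
            continue
        text = node.get_text(strip=True)
        # Assumption: the label prefix is not wanted in the JSON value.
        if text.startswith(label):
            text = text[len(label):].strip()
        record[field] = text
    return record

source1 = '<div id="foo">Person Name: Bar</div>'
source2 = '<div id="bar">Person Name: Foo</div>'

records = [
    extract(source1, TEMPLATES["source1"]),
    extract(source2, TEMPLATES["source2"]),
]
print(json.dumps(records, indent=4))
```

With 60+ sources, the per-source differences would then live in the template data rather than in the code, so adding a source means adding one more dictionary entry.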
Here is a sample code snippet that uses the json library to work with JSON documents:
apod.json
{
"copyright": "Yin Hao",
"date": "2018-10-30",
"explanation": "Meteors have been shooting out from the constellation of Orion. This was expected, as October is the time of year for the Orionids Meteor Shower. Pictured here, over two dozen meteors were caught in successively added exposures last October over Wulan Hada volcano in Inner Mongolia, China. The featured image shows multiple meteor streaks that can all be connected to a single small region on the sky called the radiant, here visible just above and to the left of the belt of Orion, The Orionids meteors started as sand sized bits expelled from Comet Halley during one of its trips to the inner Solar System. Comet Halley is actually responsible for two known meteor showers, the other known as the Eta Aquarids and visible every May. An Orionids image featured on APOD one year ago today from the same location shows the same car. Next month, the Leonids Meteor Shower from Comet Tempel-Tuttle should also result in some bright meteor streaks. Follow APOD on: Facebook, Instagram, Reddit, or Twitter",
"hdurl": "https://apod.nasa.gov/apod/image/1810/Orionids_Hao_2324.jpg",
"media_type": "image",
"service_version": "v1",
"title": "Orionids Meteors over Inner Mongolia",
"url": "https://apod.nasa.gov/apod/image/1810/Orionids_Hao_960.jpg"
}
import json
with open('apod.json', 'r') as f:
    json_text = f.read()
# Decode the JSON string into a Python dictionary.
apod_dict = json.loads(json_text)
print(apod_dict['explanation'])
# Encode the Python dictionary into a JSON string.
new_json_string = json.dumps(apod_dict, indent=4)
print(new_json_string)
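The snippet above only reads and re-encodes a document. As a rough sketch of writing extracted values into "a new JSON document or an existing one", the following uses only the standard library. The helper name merge_into_json_file and the file name people.json are hypothetical, and a temporary directory is used so the example does not touch real files.

```python
import json
import os
import tempfile

def merge_into_json_file(path, new_fields):
    """Load the JSON object stored at path (if the file exists),
    update it with new_fields, and write it back."""
    data = {}
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
    data.update(new_fields)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
    return data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "people.json")
    # First call creates a new JSON document.
    merge_into_json_file(path, {"source1": "Bar"})
    # Second call updates the existing document.
    result = merge_into_json_file(path, {"source2": "Foo"})
    print(result)
```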
Hope it helps!