Parsing HTML using a template

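Question

We have a large project with 60+ different sets of data in HTML, and we need to create a JSON object from it.

For example:

Source 1:

<div id="foo">Person Name: Bar</div>

Source 2:

<div id="bar">Person Name: Foo</div>

What technology currently exists for creating a template that says, "Okay, grab this HTML text from this div and put it in this JSON field"?

I saw this link, but it's 7 years old: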

https://stackoverflow.com/questions/25753368/performant-parsing-of-html-pages-with-node-js-and-xpath

Answer 1

Score: 1

Have you tried the Beautiful Soup and json Python libraries? They are easy to get started with, if that's what you need.

Here is a sample code snippet that finds div tags using Beautiful Soup:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<div>"Hello World!"</div>
<p class="story">...</p>
"""
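# Parse the document with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, 'html.parser')
# find_all returns every matching tag; take the first <div>.
div_tags = soup.find_all('div')
print(div_tags[0].text)
# "Hello World!"  (the quotes are part of the div's text)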
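Applied to the markup from the question, the same idea might look like the sketch below. It assumes the target div is identified by its id and that its text always has the form "label: value"; the helper name extract_field is hypothetical and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def extract_field(html, div_id):
    """Return (label, value) from a div whose text looks like 'label: value'.

    Hypothetical helper: assumes the div holds one colon-separated pair.
    """
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id=div_id)
    if div is None:
        raise ValueError(f"no <div id={div_id!r}> in document")
    # partition splits on the first colon only, so values may contain colons.
    label, _, value = div.get_text().partition(":")
    return label.strip(), value.strip()


label, value = extract_field('<div id="foo">Person Name: Bar</div>', "foo")
print(label, "->", value)
```

Raising an error when the div is missing is one possible choice; returning None may suit a batch job over many pages better.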

If you want to put the text into JSON, store it in a Python dictionary and use the json library to write it into either a new JSON document or an existing one.
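A minimal sketch of that step, using only the standard-library json module. The field name person_name and the sample values are made up for illustration, and the "existing document" is represented as a string so the example stays self-contained.

```python
import json

# Text assumed to have been extracted from the HTML already.
extracted_text = "Bar"

# New document: build a dictionary and serialize it.
record = {"person_name": extracted_text}
new_document = json.dumps(record, indent=4)
print(new_document)

# Existing document: decode it, add the field, and encode it again.
existing_document = '{"source": "Source 1"}'
merged = json.loads(existing_document)
merged["person_name"] = extracted_text
updated_document = json.dumps(merged, indent=4)
print(updated_document)
```

To work with files instead of strings, json.load and json.dump take an open file object in place of the string.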

Here is a sample code snippet that uses the json library to work with JSON documents:

apod.json

{
    "copyright": "Yin Hao",
    "date": "2018-10-30",
    "explanation": "Meteors have been shooting out from the constellation of Orion. This was expected, as October is the time of year for the Orionids Meteor Shower. Pictured here, over two dozen meteors were caught in successively added exposures last October over Wulan Hada volcano in Inner Mongolia, China. The featured image shows multiple meteor streaks that can all be connected to a single small region on the sky called the radiant, here visible just above and to the left of the belt of Orion, The Orionids meteors started as sand sized bits expelled from Comet Halley during one of its trips to the inner Solar System. Comet Halley is actually responsible for two known meteor showers, the other known as the Eta Aquarids and visible every May. An Orionids image featured on APOD one year ago today from the same location shows the same car. Next month, the Leonids Meteor Shower from Comet Tempel-Tuttle should also result in some bright meteor streaks. Follow APOD on: Facebook, Instagram, Reddit, or Twitter",
    "hdurl": "https://apod.nasa.gov/apod/image/1810/Orionids_Hao_2324.jpg",
    "media_type": "image",
    "service_version": "v1",
    "title": "Orionids Meteors over Inner Mongolia",
    "url": "https://apod.nasa.gov/apod/image/1810/Orionids_Hao_960.jpg"
}

import json

# Read the raw JSON text from the file shown above.
with open('apod.json', 'r') as f:
    json_text = f.read()

# Decode the JSON string into a Python dictionary.
apod_dict = json.loads(json_text)
print(apod_dict['explanation'])

# Encode the Python dictionary into a JSON string.
new_json_string = json.dumps(apod_dict, indent=4)
print(new_json_string)
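To get closer to the "template" idea from the question, the two libraries can be combined: describe each source as a mapping from JSON field name to CSS selector, then apply it generically. This is only a sketch under assumptions. The TEMPLATES layout, the apply_template function, and the person_name field are all hypothetical, and it assumes each value may be preceded by a "Label:" prefix that should be dropped.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical templates: one per source, mapping a JSON field name
# to the CSS selector of the element that holds its value.
TEMPLATES = {
    "source1": {"person_name": "div#foo"},
    "source2": {"person_name": "div#bar"},
}


def apply_template(html, template, label_separator=":"):
    """Build a dict by reading each field's text from its selector."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for field, selector in template.items():
        element = soup.select_one(selector)
        if element is None:
            result[field] = None  # selector did not match; leave the field empty
            continue
        text = element.get_text(strip=True)
        # Drop a leading "Label:" prefix such as "Person Name:" if present.
        _, found, value = text.partition(label_separator)
        result[field] = value.strip() if found else text
    return result


pages = {
    "source1": '<div id="foo">Person Name: Bar</div>',
    "source2": '<div id="bar">Person Name: Foo</div>',
}
records = {name: apply_template(html, TEMPLATES[name]) for name, html in pages.items()}
print(json.dumps(records, indent=4))
```

With 60+ sources, the templates could live in a JSON or YAML file of their own, so adding a source means adding a mapping rather than writing new parsing code.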

Hope it helps!


huangapple
  • Posted on May 11, 2023, 07:57:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76223288.html