

Parsing of HTML using a template









We have a large project with 60+ different sets of data in HTML, and we need to create a JSON object from them.

For example:

Source 1:

<div id="foo">Person Name: Bar</div>

Source 2:

<div id="bar">Person Name: Foo</div>

What tech exists at this point to create a template and say, "Okay, grab this HTML text from this div and put it in this JSON field"?

I saw this but it's 7 years old.


Score: 1

Have you tried the Beautiful Soup and json Python libraries?
They are easy to get started with, if that fits your requirements.

Here is a sample code snippet that finds div tags using Beautiful Soup (creating the JSON string with the json library is covered after it):

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="" class="sister" id="link1">Elsie</a>,
<a href="" class="sister" id="link2">Lacie</a> and
<a href="" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<div>"Hello World!"</div>
<p class="story">...</p>
"""

# Parse the document with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, 'html.parser')

# Collect every <div> tag and print its text.
div_tags = soup.find_all('div')
for div in div_tags:
    print(div.get_text())
# "Hello World!"

If you want to put the text into JSON, store it in a Python dictionary and use the json library to write it either to a new JSON document or into an existing one.
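
As a minimal sketch of that step, assuming the div text has already been extracted (for example with `get_text()`), the snippet below strips the "Person Name:" label from the question's sample text, stores the value in a dictionary, and serializes it. The field name `person_name`, the `source` key, and the helper `text_to_record` are made up for illustration.

```python
import json


def text_to_record(div_text, label="Person Name:", field="person_name"):
    """Strip a leading label from the div text and return a one-field dict.

    `label` and `field` are illustrative names, not part of any library.
    """
    value = div_text.strip()
    if value.startswith(label):
        value = value[len(label):].strip()
    return {field: value}


# Text as it appears in the question's first sample div.
record = text_to_record("Person Name: Bar")

# New JSON document: serialize the dictionary directly.
new_document = json.dumps(record, indent=4)

# Existing JSON document: decode it, merge the new field, encode it again.
existing = json.loads('{"source": "foo"}')
existing.update(record)
updated_document = json.dumps(existing, indent=4)

print(new_document)
print(updated_document)
```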

Here is a sample snippet that uses the json library to work with a JSON document:

apod.json:

{
    "copyright": "Yin Hao",
    "date": "2018-10-30",
    "explanation": "Meteors have been shooting out from the constellation of Orion. This was expected, as October is the time of year for the Orionids Meteor Shower. Pictured here, over two dozen meteors were caught in successively added exposures last October over Wulan Hada volcano in Inner Mongolia, China. The featured image shows multiple meteor streaks that can all be connected to a single small region on the sky called the radiant, here visible just above and to the left of the belt of Orion, The Orionids meteors started as sand sized bits expelled from Comet Halley during one of its trips to the inner Solar System. Comet Halley is actually responsible for two known meteor showers, the other known as the Eta Aquarids and visible every May. An Orionids image featured on APOD one year ago today from the same location shows the same car. Next month, the Leonids Meteor Shower from Comet Tempel-Tuttle should also result in some bright meteor streaks. Follow APOD on: Facebook, Instagram, Reddit, or Twitter",
    "hdurl": "",
    "media_type": "image",
    "service_version": "v1",
    "title": "Orionids Meteors over Inner Mongolia",
    "url": ""
}

import json

# Read the raw JSON text from the file.
with open('apod.json', 'r') as f:
    json_text = f.read()

# Decode the JSON string into a Python dictionary.
apod_dict = json.loads(json_text)

# Encode the Python dictionary into a JSON string.
new_json_string = json.dumps(apod_dict, indent=4)
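
To tie the two snippets back to the question, one possible approach is to keep a small per-source "template" that maps each JSON field to a CSS selector, and let a single function apply it. This is only a sketch: the `TEMPLATES` dictionary, the `person_name` field, the `extract` helper, and the label-stripping rule are assumptions for illustration, while `select_one` and `get_text` are regular Beautiful Soup calls.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical templates: one entry per source, mapping JSON field -> CSS selector.
TEMPLATES = {
    "source1": {"person_name": "div#foo"},
    "source2": {"person_name": "div#bar"},
}


def extract(html, template, label="Person Name:"):
    """Apply a field -> selector template to an HTML string and return a dict."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, selector in template.items():
        tag = soup.select_one(selector)
        if tag is None:
            # Selector did not match; keep the field so gaps stay visible.
            record[field] = None
            continue
        text = tag.get_text(strip=True)
        # Drop the leading label used in the question's sample markup.
        if text.startswith(label):
            text = text[len(label):].strip()
        record[field] = text
    return record


# The two sample sources from the question.
sources = {
    "source1": '<div id="foo">Person Name: Bar</div>',
    "source2": '<div id="bar">Person Name: Foo</div>',
}

results = {name: extract(html, TEMPLATES[name]) for name, html in sources.items()}
print(json.dumps(results, indent=4))
```

With this layout, supporting another source would mostly mean adding one more entry to the template dictionary rather than writing a new parser, although real pages may need extra cleanup rules per field.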

Hope it helps!



  • Published on May 11, 2023, 07:57:27



