HTML parser find tag info

Question

I have a project that uses HTMLParser(). I have never worked with this parser, so I read the documentation and found two useful methods I can override to extract information from the site: handle_starttag and handle_data. But I don't understand how to find the tag information I need and pass it to handle_data to print it.

I need to get the price from all span tags on the page:

    <span itemprop="price" content="590">590 dollars</span>

How do I get this?

Answer 1

Score: 1

If every <span> price tag has its itemprop attribute set to "price" and the dollar amount is in the content attribute, then you can do it all in handle_starttag like this:

    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            attrsDict = dict(attrs)
            # .get() avoids a KeyError on <span> tags that have no itemprop
            if tag == 'span' and attrsDict.get('itemprop') == 'price':
                price = attrsDict['content']
                print(price)
                # do something else with `price` here

    # Example test cases
    parser = MyHTMLParser()
    parser.feed("""
    <span itemprop="price" content="590">590 dollars</span>
    <span itemprop="price" content="430">430 dollars</span>
    <span itemprop="price" content="684">684 dollars</span>
    """)
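
If you would rather collect the prices than print them, the same attribute-based idea can be sketched as below. This is only an illustration under a few assumptions: the class name PriceAttrParser and the prices list are made up for the example, the content attribute is assumed to hold a plain number, and spans without a usable content value are simply skipped.

```python
from html.parser import HTMLParser


class PriceAttrParser(HTMLParser):
    """Hypothetical helper: collect the `content` attribute of price spans."""

    def __init__(self):
        super().__init__()
        self.prices = []  # numeric prices, in document order

    def handle_starttag(self, tag, attrs):
        attrs_dict = dict(attrs)
        # .get() returns None for a missing attribute instead of raising KeyError
        if tag == "span" and attrs_dict.get("itemprop") == "price":
            content = attrs_dict.get("content")
            if content is None:
                return  # price span without a content attribute
            try:
                self.prices.append(float(content))
            except ValueError:
                pass  # assumption: silently skip non-numeric values


parser = PriceAttrParser()
parser.feed('''
<span class="label">Price:</span>
<span itemprop="price" content="590">590 dollars</span>
<span itemprop="price" content="430">430 dollars</span>
<span itemprop="price">no content attribute</span>
''')
print(parser.prices)
```

With the sample markup above this prints [590.0, 430.0]: the label span and the span without a content attribute are ignored.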

Answer 2

Score: 1

This example initializes a custom HTMLParser and collects the text between the <span> tags (using handle_data):

    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self._price_tag = None  # set while inside a price <span>
            self.prices = []

        def handle_starttag(self, tag, attrs):
            if tag == "span" and ('itemprop', 'price') in attrs:
                self._price_tag = tag

        def handle_endtag(self, tag):
            if tag == self._price_tag:
                self._price_tag = None

        def handle_data(self, data):
            if self._price_tag:
                self.prices.append(data)

    parser = MyHTMLParser()
    parser.feed("""\
    <html>
    <span itemprop="price" content="570">570 dollars</span>
    <span itemprop="price" content="590">590 dollars</span>
    </html>
    """)
    print(parser.prices)

Prints:

    ['570 dollars', '590 dollars']
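
One thing to watch with the text-based approach: if a price span contains other tags, handle_data is called once per text fragment, and a nested <span> would close the price early. A possible variation, sketched here with made-up names (PriceTextParser, _depth, _chunks), tracks span nesting depth and joins the fragments when the outer price span closes. Treat it as an illustration, not a drop-in replacement.

```python
from html.parser import HTMLParser


class PriceTextParser(HTMLParser):
    """Hypothetical helper: collect the full text of each price span."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._depth = 0    # > 0 while inside a price span (counts nested spans)
        self._chunks = []  # text fragments of the current price span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1  # nested span inside a price span
        elif ("itemprop", "price") in attrs:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # outer price span closed: join the fragments into one string
                self.prices.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


parser = PriceTextParser()
parser.feed(
    '<span itemprop="price" content="570"><b>570</b> dollars</span>'
    '<span itemprop="price" content="590">590 <span>dollars</span></span>'
)
parser.close()
print(parser.prices)
```

For this input the result is ['570 dollars', '590 dollars'], even though each price is split across several text fragments.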

huangapple
  • This article was posted on January 9, 2023 at 01:48:31
  • Please keep the link to this article when reposting: https://go.coder-hub.com/75050076.html