HTML parser find tag info

Question

I have a project that uses HTMLParser(). I have never worked with this parser, so I read the documentation and found two useful methods I can override to extract information from the site: handle_starttag and handle_data. But I don't understand how to find the tag information I need and pass it to handle_data so it can be printed.

I need to get the price from all span tags on the page:

<span itemprop="price" content="590">590 dollars</span>

How do I get this?

Answer 1

Score: 1

If every <span> price tag has an itemprop attribute of "price" and the dollar amount is in the content attribute, then you can do it all in handle_starttag like this:

from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict makes lookups easier
        attrsDict = dict(attrs)
        # .get() avoids a KeyError on <span> tags that have no itemprop attribute
        if tag == 'span' and attrsDict.get('itemprop') == 'price':
            price = attrsDict.get('content')
            print(price)
            # do something else with `price` here


# Example test cases
parser = MyHTMLParser()
parser.feed("""
<span itemprop="price" content="590">590 dollars</span>
<span itemprop="price" content="430">430 dollars</span>
<span itemprop="price" content="684">684 dollars</span>
""")
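If the prices are needed later rather than printed right away, one option is to store them on the parser. The following is only a sketch under stated assumptions: the names PriceAttrParser and extract_prices are hypothetical, and it assumes the content attribute holds a plain decimal number. Spans without a content attribute, or with a value that does not parse, are skipped.

```python
from decimal import Decimal, InvalidOperation
from html.parser import HTMLParser


class PriceAttrParser(HTMLParser):
    """Hypothetical variant that stores prices instead of printing them."""

    def __init__(self):
        super().__init__()
        self.prices = []

    def handle_starttag(self, tag, attrs):
        attrs_dict = dict(attrs)
        if tag != "span" or attrs_dict.get("itemprop") != "price":
            return
        raw = attrs_dict.get("content")
        if raw is None:
            return  # no content attribute (or an attribute without a value)
        try:
            self.prices.append(Decimal(raw))
        except InvalidOperation:
            pass  # assumption: silently skip values that are not numbers


def extract_prices(html):
    """Feed an HTML string to a fresh parser and return the collected prices."""
    parser = PriceAttrParser()
    parser.feed(html)
    parser.close()
    return parser.prices


sample = """
<span itemprop="price" content="590">590 dollars</span>
<span class="label">not a price</span>
<span itemprop="price" content="430">430 dollars</span>
<span itemprop="price">price without a content attribute</span>
"""
prices = extract_prices(sample)
print(prices)
```

Decimal is used here so that the values can be summed or compared without floating-point rounding; int or float would work just as well if the prices are known to be whole numbers.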

Answer 2

Score: 1

This example initializes a custom HTMLParser and collects the text between the <span> tags (using handle_data):

from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._price_tag = None  # set while the parser is inside a price <span>
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ('itemprop', 'price') in attrs:
            self._price_tag = tag

    def handle_endtag(self, tag):
        if tag == self._price_tag:
            self._price_tag = None

    def handle_data(self, data):
        # only keep text that appears inside a price <span>
        if self._price_tag:
            self.prices.append(data)


parser = MyHTMLParser()
parser.feed("""
<html>
    <span itemprop="price" content="570">570 dollars</span>
    <span itemprop="price" content="590">590 dollars</span>
</html>
""")

print(parser.prices)

Prints:

['570 dollars', '590 dollars']
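If a price <span> can contain other tags (for example a nested <b> or another <span>), handle_data is called once per text fragment, and a single flag is cleared by the first closing </span>. The sketch below is one possible way to handle that case, not part of the original answer: the class name NestedPriceParser and the depth counter are assumptions made for illustration. It counts open <span> tags inside a price span and joins the collected fragments when the outer span closes.

```python
from html.parser import HTMLParser


class NestedPriceParser(HTMLParser):
    """Hypothetical variant that tolerates markup nested inside a price span."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # open <span> tags within the current price span (0 = outside)
        self._chunks = []  # text fragments collected for the current price
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1  # a <span> nested inside a price span
        elif ("itemprop", "price") in attrs:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # join the fragments and collapse runs of whitespace
                self.prices.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


nested_parser = NestedPriceParser()
nested_parser.feed(
    '<span itemprop="price" content="570"><b>570</b> '
    '<span class="cur">dollars</span></span>'
    '<span>other</span>'
    '<span itemprop="price" content="590">590 dollars</span>'
)
print(nested_parser.prices)  # ['570 dollars', '590 dollars']
```

For flat markup like the sample in the question, this produces the same list as the simpler flag-based parser.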

huangapple
  • Published on January 9, 2023, 01:48:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75050076.html