HTML parser find tag info
Question
I have a project that uses HTMLParser(). I have never worked with this parser, so I read the documentation and found two useful methods I can override to extract information from the site: handle_starttag and handle_data. But I don't understand how to find the tag information I need and pass it to handle_data to print it.
I need to get the price from all span tags on the page:
<span itemprop="price" content="590">590 dollars</span>
How do I get this?
Answer 1
Score: 1
If every <span> price tag has an itemprop attribute of "price" and the dollar amount is in the content attribute, then you can do it all in handle_starttag like this:
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        attrsDict = dict(attrs)
        # .get() avoids a KeyError on <span> tags without an itemprop attribute
        if tag == 'span' and attrsDict.get('itemprop') == 'price':
            price = attrsDict.get('content')
            print(price)
            # do something else with `price` here


# Example test cases
parser = MyHTMLParser()
parser.feed("""
<span itemprop="price" content="590">590 dollars</span>
<span itemprop="price" content="430">430 dollars</span>
<span itemprop="price" content="684">684 dollars</span>
""")
Answer 2
Score: 1
This example initializes a custom HTMLParser and collects the text between the <span> tags (using handle_data):
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._price_tag = None  # set while the parser is inside a price <span>
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ('itemprop', 'price') in attrs:
            self._price_tag = tag

    def handle_endtag(self, tag):
        if tag == self._price_tag:
            self._price_tag = None

    def handle_data(self, data):
        if self._price_tag:
            self.prices.append(data)


parser = MyHTMLParser()
parser.feed("""
<html>
<span itemprop="price" content="570">570 dollars</span>
<span itemprop="price" content="590">590 dollars</span>
</html>
""")
print(parser.prices)
Prints:
['570 dollars', '590 dollars']
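The flag works because the parser calls the handlers in document order, so nothing is passed from handle_starttag to handle_data directly; state is shared through attributes on the instance. A minimal sketch that records the call order, using a hypothetical TraceParser class that is not part of the original answer:

```python
from html.parser import HTMLParser


class TraceParser(HTMLParser):  # hypothetical helper, for illustration only
    def __init__(self):
        super().__init__()
        self.events = []  # (handler kind, value) in the order they were called

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.events.append(("data", data))


tracer = TraceParser()
tracer.feed('<span itemprop="price" content="570">570 dollars</span>')
tracer.close()
for event in tracer.events:
    print(event)
```

The start event for the span arrives before its text and the end event after it, which is why setting a flag in handle_starttag and clearing it in handle_endtag brackets exactly the data calls that belong to the price tag.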