lxml parse doesn't recognize Indexing
Question
I am trying to scrape and parse some HTML with Python and extract the Italian fiscal code 'VTLLCU86S03I348V'.
If I inspect the page and copy the full XPath of that element, I get no results and I do not understand why. I tried with a plain XPath like '//div/p/a', for example, and it worked. Things become messy when the XPath contains indexing like 'div[3]/div', even though that sounds weird to me as well. I have also noticed that the XPath from the inspector starts with just one '/', even though I think two are required (right?).
Below is a minimal reproducible example. What am I missing?
from io import StringIO

from lxml import etree
import requests

# GET
page = requests.get('https://notariato.it/it/notary/luca-vitale/')

# PARSING
parser = etree.HTMLParser()
tree = etree.parse(StringIO(page.text), parser)
nome_raw = tree.xpath('//html/body/div[3]/section[2]/div/div[1]/div/section/div/div/div/div[6]/div/h4')
for i in range(len(nome_raw)):
    print(nome_raw[i].text)
Thank you!
Answer 1
Score: 1
XPaths starting with a single / are absolute, whereas // matches elements anywhere in the document.
Using indexed elements in an XPath makes it prone to fail if the page structure changes.
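To make the difference concrete, here is a minimal sketch on a made-up HTML snippet (not the actual notariato.it markup): the indexed absolute path stops matching as soon as a sibling disappears, while // keeps finding the element.

from io import StringIO
from lxml import etree

parser = etree.HTMLParser()
html = """<html><body>
  <div>banner</div>
  <div><p>Codice fiscale</p></div>
  <div><h4>VTLLCU86S03I348V</h4></div>
</body></html>"""
tree = etree.parse(StringIO(html), parser)

# Absolute, indexed path: every step (and index) must match the real structure.
print(tree.xpath('/html/body/div[3]/h4/text()'))   # ['VTLLCU86S03I348V']
# '//' finds the element wherever it sits, no indexes needed.
print(tree.xpath('//h4/text()'))                   # ['VTLLCU86S03I348V']

# Remove the banner div and the indexed path silently returns nothing,
# while the position-independent expression still works.
tree2 = etree.parse(StringIO(html.replace('<div>banner</div>', '')), parser)
print(tree2.xpath('/html/body/div[3]/h4/text()'))  # []
print(tree2.xpath('//h4/text()'))                  # ['VTLLCU86S03I348V']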
Giving the "label" of the element as a reference makes the XPath more readable (the rest of the code is the same):
nome_raw = tree.xpath('//div[preceding-sibling::div/div/p[. = "Codice fiscale"]]/div/h4')
for i in range(len(nome_raw)):
    print(nome_raw[i].text)
VTLLCU86S03I348V
Regarding why the indexed XPath didn't work, I did some tests with xmllint and it found the element:
xmllint --html --recover --xpath '/html/body/div[2]/section[2]/div/div[1]/div/section/div/div/div/div[6]/div/h4' aaa.html 2>/dev/null
<h4 class="elementor-heading-title elementor-size-default">VTLLCU86S03I348V</h4>
Both lxml and xmllint are based on libxml2, but I found a version difference between the two:
libxml2 (lxml) : 2.9.9
libxml2 (xmllint): 2.9.14
>>> from lxml import etree
>>> print ("libxml used: ", etree.LIBXML_VERSION)
libxml used: (2, 9, 9)
>>> print ("libxml compiled: ", etree.LIBXML_COMPILED_VERSION)
libxml compiled: (2, 9, 9)
From the lxml docs:
> HTML parsing is similarly simple. The parsers have a recover keyword argument that the HTMLParser sets by default. It lets libxml2 try its best to return a valid HTML tree with all content it can manage to parse. It will not raise an exception on parser errors. You should use libxml2 version 2.6.21 or newer to take advantage of this feature.
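On that note, even in recover mode the parser keeps a record of what it had to skip over. Below is a small sketch that prints the recovery errors collected for the same page via parser.error_log; the exact messages, if any, will depend on the libxml2 version in use.

from io import StringIO
from lxml import etree
import requests

page = requests.get('https://notariato.it/it/notary/luca-vitale/')
parser = etree.HTMLParser()                  # recover=True is the default
tree = etree.parse(StringIO(page.text), parser)

# error_log collects everything libxml2 complained about while recovering
for entry in list(parser.error_log)[:10]:
    print(entry.line, entry.type_name, entry.message)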
Not sure why it fails to parse this BROKEN HTML (xmllint threw lots of errors), but a simple test shows that it finds fewer elements than expected. It should find 6 divs:
divs = tree.xpath('/html/body/div[3]/section[2]/div/div[1]/div/section/div/div/div/div')
for i in range(len(divs)):
    print(i, divs[i].tag)
Result
0 div
1 div
2 div
3 div