lxml parse doesn't recognize indexing

Question

I am trying to scrape and parse some HTML with Python and extract the Italian fiscal code 'VTLLCU86S03I348V'.

If I inspect the page and copy the full XPath of that element, I get no results and I do not understand why. I tried with a normal XPath like '//div/p/a', for example, and it worked. Things become messy when the XPath contains indexing like 'div[3]/div', even though that sounds weird to me as well. I have also noticed that the copied XPath starts with just one '/', even though I think two are required (right?).

Here is a minimal reproducible example. What am I missing?

from io import StringIO  # needed for StringIO below

from lxml import etree
import requests

# GET
page = requests.get('https://notariato.it/it/notary/luca-vitale/')

# PARSING
parser = etree.HTMLParser()
tree = etree.parse(StringIO(page.text), parser)

nome_raw = tree.xpath('//html/body/div[3]/section[2]/div/div[1]/div/section/div/div/div/div[6]/div/h4')

for i in range(len(nome_raw)):
    print(nome_raw[i].text)

Thank you!

Answer 1

Score: 1

XPaths starting with a single / are absolute paths evaluated from the document root, whereas // selects matching nodes anywhere in the document.
Using indexed elements in an XPath makes it prone to fail if the page structure changes.
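
As an illustration, here is a minimal sketch (made-up HTML, not the actual page) of how an absolute, index-based path silently returns nothing once the structure shifts, while a relative path anchored on content keeps working:

from io import StringIO
from lxml import etree

# Two versions of the "same" page; the second gains an extra leading <div>.
html_v1 = '<html><body><div><p>label</p></div><div><h4>CODE</h4></div></body></html>'
html_v2 = '<html><body><div>banner</div><div><p>label</p></div><div><h4>CODE</h4></div></body></html>'

parser = etree.HTMLParser()
tree_v1 = etree.parse(StringIO(html_v1), parser)
tree_v2 = etree.parse(StringIO(html_v2), parser)

# Absolute, index-based path: matches in v1, finds nothing in v2.
absolute = '/html/body/div[2]/h4'
print(len(tree_v1.xpath(absolute)))  # 1
print(len(tree_v2.xpath(absolute)))  # 0

# Relative path anchored on content instead of position: matches in both.
relative = '//div[h4]/h4'
print(len(tree_v1.xpath(relative)))  # 1
print(len(tree_v2.xpath(relative)))  # 1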

Giving the "label" of the element as a reference makes the XPath more readable (the rest of the code is the same):

nome_raw = tree.xpath('//div[preceding-sibling::div/div/p[. = "Codice fiscale"]]/div/h4')
for i in range(len(nome_raw)):
    print(nome_raw[i].text)

VTLLCU86S03I348V

Regarding why the indexed XPath didn't work, I did some tests with xmllint and it did find the element:

xmllint --html --recover --xpath '/html/body/div[2]/section[2]/div/div[1]/div/section/div/div/div/div[6]/div/h4' aaa.html 2>/dev/null
<h4 class="elementor-heading-title elementor-size-default">VTLLCU86S03I348V</h4>

Both lxml and xmllint are based on libxml2, but I found a version difference between the two:

libxml2 (lxml) : 2.9.9
libxml2 (xmllint): 2.9.14

>>> from lxml import etree
>>> print("libxml used:      ", etree.LIBXML_VERSION)
libxml used:       (2, 9, 9)
>>> print("libxml compiled:  ", etree.LIBXML_COMPILED_VERSION)
libxml compiled:   (2, 9, 9)
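
For comparison, the libxml2 version that xmllint uses can be checked from the command line (the exact output format varies by build):

xmllint --version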

From the lxml docs:

> HTML parsing is similarly simple. The parsers have a recover keyword argument that the HTMLParser sets by default. It lets libxml2 try its best to return a valid HTML tree with all content it can manage to parse. It will not raise an exception on parser errors. You should use libxml2 version 2.6.21 or newer to take advantage of this feature.
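
As a quick check (a sketch reusing the URL from the question; the exact messages depend on the libxml2 build), the parser's error_log shows what libxml2 had to recover from while parsing:

from io import StringIO
from lxml import etree
import requests

page = requests.get('https://notariato.it/it/notary/luca-vitale/')

# recover=True is already the default for HTMLParser; spelled out for clarity.
parser = etree.HTMLParser(recover=True)
tree = etree.parse(StringIO(page.text), parser)

# Each log entry records where and why libxml2 had trouble with the markup.
for entry in parser.error_log:
    print(entry.line, entry.message)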

Not sure why it fails to parse this broken HTML (xmllint threw lots of errors), but a simple test shows that it finds fewer elements than expected. It should find 6 divs:

divs = tree.xpath('/html/body/div[3]/section[2]/div/div[1]/div/section/div/div/div/div')

for i in range(len(divs)):
    print(i, divs[i].tag)

Result

0 div
1 div
2 div
3 div
