lxml parse doesn't recognize indexing

Question

I am trying to scrape and parse some HTML with Python and extract the Italian fiscal code 'VTLLCU86S03I348V'.

If I inspect the page and copy the full XPath of that element, I get no results and I do not understand why. I tried with a normal XPath like '//div/p/a', for example, and it worked. Things become messy when the XPath contains indexing like 'div[3]/div', even though that sounds weird to me as well. I have also noticed that the copied XPath starts with just one '/', even though I think two are required (right?).

Here is a minimal reproducible example. What am I missing?

from io import StringIO  # needed for StringIO below

from lxml import etree
import requests

# GET
page = requests.get('https://notariato.it/it/notary/luca-vitale/')

# PARSING
parser = etree.HTMLParser()
tree = etree.parse(StringIO(page.text), parser)

nome_raw = tree.xpath('//html/body/div[3]/section[2]/div/div[1]/div/section/div/div/div/div[6]/div/h4')

for i in range(len(nome_raw)):
    print(nome_raw[i].text)

Thank you!

Answer 1

Score: 1

XPaths starting with a single / are absolute paths evaluated from the document root, whereas // selects matching nodes anywhere in the document.
Using indexed elements in an XPath makes it prone to fail if the page structure changes.
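
As an illustration, here is a minimal sketch (made-up HTML, not the actual page) of how an absolute, index-based path silently returns nothing once the structure shifts, while a relative path anchored on content keeps working:

from io import StringIO
from lxml import etree

# Two versions of the "same" page; the second gains an extra leading <div>.
html_v1 = '<html><body><div><p>label</p></div><div><h4>CODE</h4></div></body></html>'
html_v2 = '<html><body><div>banner</div><div><p>label</p></div><div><h4>CODE</h4></div></body></html>'

parser = etree.HTMLParser()
tree_v1 = etree.parse(StringIO(html_v1), parser)
tree_v2 = etree.parse(StringIO(html_v2), parser)

# Absolute, index-based path: matches in v1, finds nothing in v2.
absolute = '/html/body/div[2]/h4'
print(len(tree_v1.xpath(absolute)))  # 1
print(len(tree_v2.xpath(absolute)))  # 0

# Relative path anchored on content instead of position: matches in both.
relative = '//div[h4]/h4'
print(len(tree_v1.xpath(relative)))  # 1
print(len(tree_v2.xpath(relative)))  # 1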

Giving the "label" of the element as a reference makes the XPath more readable (the rest of the code is the same):

nome_raw = tree.xpath('//div[preceding-sibling::div/div/p[. = "Codice fiscale"]]/div/h4')
for i in range(len(nome_raw)):
    print(nome_raw[i].text)

VTLLCU86S03I348V

Regarding why the indexed XPath didn't work, I did some tests with xmllint and it did find the element:

xmllint --html --recover --xpath '/html/body/div[2]/section[2]/div/div[1]/div/section/div/div/div/div[6]/div/h4' aaa.html 2>/dev/null
<h4 class="elementor-heading-title elementor-size-default">VTLLCU86S03I348V</h4>

Both lxml and xmllint are based on libxml2, but I found a version difference between the two:

libxml2 (lxml) : 2.9.9
libxml2 (xmllint): 2.9.14

>>> from lxml import etree
>>> print("libxml used:      ", etree.LIBXML_VERSION)
libxml used:       (2, 9, 9)
>>> print("libxml compiled:  ", etree.LIBXML_COMPILED_VERSION)
libxml compiled:   (2, 9, 9)
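
For comparison, the libxml2 version that xmllint uses can be checked from the command line (the exact output format varies by build):

xmllint --version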

From the lxml docs:

> HTML parsing is similarly simple. The parsers have a recover keyword argument that the HTMLParser sets by default. It lets libxml2 try its best to return a valid HTML tree with all content it can manage to parse. It will not raise an exception on parser errors. You should use libxml2 version 2.6.21 or newer to take advantage of this feature.
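
As a quick check (a sketch reusing the URL from the question; the exact messages depend on the libxml2 build), the parser's error_log shows what libxml2 had to recover from while parsing:

from io import StringIO
from lxml import etree
import requests

page = requests.get('https://notariato.it/it/notary/luca-vitale/')

# recover=True is already the default for HTMLParser; spelled out for clarity.
parser = etree.HTMLParser(recover=True)
tree = etree.parse(StringIO(page.text), parser)

# Each log entry records where and why libxml2 had trouble with the markup.
for entry in parser.error_log:
    print(entry.line, entry.message)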

Not sure why it fails to parse this broken HTML (xmllint threw lots of errors), but a simple test shows that it finds fewer elements than expected. It should find 6 divs:

divs = tree.xpath('/html/body/div[3]/section[2]/div/div[1]/div/section/div/div/div/div')

for i in range(len(divs)):
    print(i, divs[i].tag)

Result

0 div
1 div
2 div
3 div
