June 13, 2023 18:12:54

Cannot extract span text content using Python Selenium

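Question

I'm creating a Python project whose goal is to extract some data from a real estate portal.
I use Python with the Selenium package, and I locate elements with XPath.

In general everything works fine, but I run into a problem when I try to extract the text of a span element.

The span's HTML is as follows: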

<span class="some-class">
    <svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
        <path d="some-path" fill="currentColor" fill-rule="evenodd">
        </path>
    </svg>
    text to scrap
</span>

I extract this span element using XPath:

my_obj = i.find_element(By.XPATH, './div/div/div[2]/div[3]/div/span')

I think this is correct, because it returns a Selenium object, and when I try to get the class attribute with:

print('my_obj', my_obj.get_attribute('class'))

it returns the correct class, some-class.

My problem is that I cannot extract the text of this span, that is, text to scrap.

I think I have tried everything:

my_obj.text
my_obj.get_attribute('innerText')
my_obj.get_attribute('textContent')
my_obj.get_attribute('innerHTML')

None of the above works.

Any idea what I'm doing wrong?

Answer 1

Score: 1

Given the HTML:

<span class="some-class">
    <svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
        <path d="some-path" fill="currentColor" fill-rule="evenodd">
        </path>
    </svg> 
    text to scrap
</span>

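The text, i.e. text to scrap, is inside a Text Node that is the lastChild of its parent <span>. So to extract the desired text, you can use either of the following approaches:

Using XPath, execute_script() and textContent: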

print(driver.execute_script('return arguments[0].lastChild.textContent;', driver.find_element(By.XPATH, "//span[@class='some-class']")).strip())

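Using a CSS selector, get_attribute() and splitlines():

# For the HTML shown above, the wanted text is the last line of innerHTML once surrounding whitespace is stripped
print(driver.find_element(By.CSS_SELECTOR, "span.some-class").get_attribute("innerHTML").strip().splitlines()[-1].strip())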
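Both one-liners above depend on exactly where the text sits inside the span. A slightly more defensive variant is to join every text node that is a direct child of the span, which skips anything nested inside the svg. The helper below is only a minimal sketch, not part of the original answer: `own_text` and `_OWN_TEXT_JS` are hypothetical names, and `driver` is assumed to be an already initialised Selenium WebDriver.

```python
# JavaScript executed in the browser: join the text nodes that are direct
# children of the element, ignoring text inside nested tags such as <svg>.
_OWN_TEXT_JS = """
return Array.from(arguments[0].childNodes)
    .filter(node => node.nodeType === Node.TEXT_NODE)
    .map(node => node.textContent)
    .join('');
"""


def own_text(driver, element):
    """Return the element's own text, stripped (hypothetical helper)."""
    raw = driver.execute_script(_OWN_TEXT_JS, element)
    # Guard against a None result before stripping whitespace.
    return (raw or "").strip()


# Usage, assuming `driver` and `By` are set up as in the question:
# span = driver.find_element(By.CSS_SELECTOR, "span.some-class")
# print(own_text(driver, span))
```

Because the filtering happens in the browser, the result does not depend on how the page source is indented or split across lines.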

Alternative

As an alternative, you can also use Beautiful Soup, as follows:

from bs4 import BeautifulSoup

html_text = '''
<span class="some-class">
    <svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
        <path d="some-path" fill="currentColor" fill-rule="evenodd">
        </path>
    </svg> 
    text to scrap
</span>
'''

soup = BeautifulSoup(html_text, 'html.parser')
last_text = soup.find("span", {"class": "some-class"}).contents[2]
print(last_text.strip())

Console output:

text to scrap

Another alternative

As another alternative, you can also use lxml.etree, as follows:

from lxml import etree

html_text = '''
<span class="some-class">
    <svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
        <path d="some-path" fill="currentColor" fill-rule="evenodd">
        </path>
    </svg> 
    text to scrap
</span>
'''
x = etree.HTML(html_text)
result = x.xpath('//span[@class="some-class"]/text()[2]')  # text()[2] selects the second text node directly inside the span
print(result[0].strip())  # xpath() returns a list, so take its first element

Console output:

text to scrap
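If installing a third-party parser is not an option, roughly the same idea can be sketched with the standard library's html.parser. This is an illustration under stated assumptions rather than part of the original answer: `OwnTextParser` and `own_text_from_html` are hypothetical names, only the first matching element is read, and every nested tag is assumed to have an explicit closing tag (as the svg and path tags do here), since an unclosed void element such as br would throw off the depth counter.

```python
from html.parser import HTMLParser


class OwnTextParser(HTMLParser):
    """Collect text that is a direct child of the first matching element."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.depth = 0      # 0 = outside the target, 1 = directly inside it
        self.done = False   # set once the target element has been closed
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1  # entering a nested tag such as <svg>
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth == 1:  # keep only text directly inside the target
            self.pieces.append(data)


def own_text_from_html(html, tag="span", class_name="some-class"):
    parser = OwnTextParser(tag, class_name)
    parser.feed(html)
    parser.close()
    return "".join(parser.pieces).strip()


html_text = '''
<span class="some-class">
    <svg width="1em" height="1em" viewBox="0 0 24 24" class="other-some-class">
        <path d="some-path" fill="currentColor" fill-rule="evenodd">
        </path>
    </svg>
    text to scrap
</span>
'''

print(own_text_from_html(html_text))
```

With Selenium, the string passed in could be the element's outerHTML attribute, so that the span itself is part of the parsed markup.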
