Cannot extract span text content using Python Selenium

Question

I'm creating a Python project whose goal is to extract some data from a real estate portal.
I work in Python and use the Selenium package. To find elements, I use XPath expressions.

Generally everything works fine, but when I try to extract the text of a span, I run into a problem.

The span's HTML:

<span class="some-class">
    <svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
        <path d="some-path" fill="currentColor" fill-rule="evenodd">
        </path>
    </svg>
    text to scrap
</span>

I locate this span using XPath:

my_obj = i.find_element(By.XPATH, './div/div/div[2]/div[3]/div/span')

I think the XPath is correct, because it returns a Selenium element, and when I read its class attribute with:

print('my_obj', my_obj.get_attribute('class'))

it returns the correct class, some-class.

My problem is that I cannot extract the text of this span, that is, text to scrap.

I think I have tried everything:

my_obj.text
my_obj.get_attribute('innerText')
my_obj.get_attribute('textContent')
my_obj.get_attribute('innerHTML')

None of the above work.

Any idea what I'm doing wrong?

Answer 1

Score: 1

Given the HTML:

<span class="some-class">
    <svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
        <path d="some-path" fill="currentColor" fill-rule="evenodd">
        </path>
    </svg>
    text to scrap
</span>

The text, i.e. text to scrap, is within a text node that is the lastChild of its parent <span>. So to extract the desired text, you can use either of the following approaches:

  • Using xpath, execute_script() and textContent:

    print(driver.execute_script('return arguments[0].lastChild.textContent;', driver.find_element(By.XPATH, "//span[@class='some-class']")).strip())
    
  • Using css_selector, get_attribute() and splitlines() (for the HTML above, the wanted text is the last line of innerHTML):

    print(driver.find_element(By.CSS_SELECTOR, "span.some-class").get_attribute("innerHTML").splitlines()[-1].strip())
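
Both snippets above depend on where the wanted text sits inside the span. A slightly more general sketch, assuming the goal is the element's own text wherever its text nodes fall among the children, joins every direct text node instead. The names own_text and OWN_TEXT_JS are hypothetical; only execute_script() is Selenium's API.

```python
# Hypothetical helper: return only the text that belongs directly to an
# element, skipping text nested inside child elements such as <svg>.
OWN_TEXT_JS = """
return Array.from(arguments[0].childNodes)
    .filter(node => node.nodeType === Node.TEXT_NODE)
    .map(node => node.textContent)
    .join('');
"""


def own_text(driver, element):
    """Join the direct text nodes of `element` and trim whitespace.

    `driver` is assumed to be a Selenium WebDriver and `element` a
    WebElement obtained from it; the element reaches the script as
    arguments[0].
    """
    raw = driver.execute_script(OWN_TEXT_JS, element)
    return (raw or "").strip()


# Usage (assumes a live browser session):
# span = driver.find_element(By.CSS_SELECTOR, "span.some-class")
# print(own_text(driver, span))
```

Because the script runs in the browser, it reads the live DOM, so it should also pick up text that the page inserts with JavaScript after loading.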
    

Alternative

As an alternative, you can also use Beautiful Soup as follows:

Code Block:

from bs4 import BeautifulSoup

html_text = '''
<span class="some-class">
    <svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
        <path d="some-path" fill="currentColor" fill-rule="evenodd">
        </path>
    </svg>
    text to scrap
</span>
'''

soup = BeautifulSoup(html_text, 'html.parser')
# contents[0] is whitespace, contents[1] is the <svg>, contents[2] is the trailing text node
last_text = soup.find("span", {"class": "some-class"}).contents[2]
print(last_text.strip())

Console Output:

text to scrap
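
The index 2 works because the span has exactly three child nodes. As a rough illustration using only the standard library, and assuming the snippet is well-formed XML (true for this sample, not for arbitrary pages), xml.dom.minidom exposes the same structure:

```python
from xml.dom import minidom

html_text = '''<span class="some-class">
    <svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
        <path d="some-path" fill="currentColor" fill-rule="evenodd">
        </path>
    </svg>
    text to scrap
</span>'''

span = minidom.parseString(html_text).documentElement

# Describe each direct child: whitespace text, the <svg> element, trailing text.
kinds = [
    "text" if node.nodeType == node.TEXT_NODE else node.tagName
    for node in span.childNodes
]
print(kinds)

# The last child is the text node that holds the wanted string.
last_text = span.lastChild.data.strip()
print(last_text)
```

If the real page puts more elements inside the span, the index changes, which is why looking at the last child (or at all direct text nodes) tends to be less brittle than a fixed position.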

Another Alternative

As another alternative, you can also use lxml.etree as follows:

Code Block:

from lxml import etree

html_text = '''
<span class="some-class">
    <svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
        <path d="some-path" fill="currentColor" fill-rule="evenodd">
        </path>
    </svg>
    text to scrap
</span>
'''
x = etree.HTML(html_text)
result = x.xpath('//span[@class="some-class"]/text()[2]')  # second text node directly inside the span
print(result[0].strip())  # xpath() returns a list, so take its first item

Console Output:

text to scrap
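
The text()[2] index can look arbitrary. In ElementTree-style trees, whose conventions lxml also follows, text that comes after a child element is stored as that child's tail rather than on the parent. A small standard-library sketch, assuming well-formed markup as in the sample:

```python
import xml.etree.ElementTree as ET

html_text = '''<span class="some-class">
    <svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
        <path d="some-path" fill="currentColor" fill-rule="evenodd">
        </path>
    </svg>
    text to scrap
</span>'''

span = ET.fromstring(html_text)

# span.text is the whitespace before <svg>; it corresponds to text()[1].
leading = span.text

# The text after </svg> is the tail of the <svg> element; it corresponds to text()[2].
svg = span[0]
trailing = svg.tail

print(repr(leading.strip()))
print(trailing.strip())
```

Here the first text node is only whitespace, so the wanted string is the second one, which matches the index used in the XPath expression.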

huangapple
  • Posted on June 13, 2023 at 18:12:54
  • Please keep this link when reposting: https://go.coder-hub.com/76463837.html