Cannot extract span text content using Python Selenium
Question
I'm creating a Python project whose goal is to extract some data from a real estate portal.
I work in Python and use the Selenium package. To find elements I use XPath.
Generally everything works fine, but when I try to extract the text of a span I run into a problem.
The span's HTML:
<span class="some-class">
<svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
<path d="some-path" fill="currentColor" fill-rule="evenodd">
</path>
</svg>
text to scrap
</span>
I locate this span using XPath:
my_obj = i.find_element(By.XPATH, './div/div/div[2]/div[3]/div/span')
I think the locator is correct, because it returns a Selenium object, and when I read the class attribute using:
print('my_obj', my_obj.get_attribute('class'))
it returns the correct class, some-class.
My problem is that I cannot extract the text of this span, i.e. text to scrap.
I think I have tried everything:
my_obj.text
my_obj.get_attribute('innerText')
my_obj.get_attribute('textContent')
my_obj.get_attribute('innerHTML')
None of the above works.
Any idea what I'm doing wrong?
Answer 1
Score: 1
Given the HTML:
<span class="some-class">
<svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
<path d="some-path" fill="currentColor" fill-rule="evenodd">
</path>
</svg>
text to scrap
</span>
The text, i.e. text to scrap, is within a text node and is the lastChild of its parent <span>. So to extract the desired text you can use either of the following locator strategies:
- Using xpath, execute_script() and textContent:
print(driver.execute_script('return arguments[0].lastChild.textContent;', driver.find_element(By.XPATH, "//span[@class='some-class']")).strip())
- Using css_selector, get_attribute() and splitlines(), taking the last line of innerHTML:
print(driver.find_element(By.CSS_SELECTOR, "span.some-class").get_attribute("innerHTML").splitlines()[-1].strip())
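Both one-liners above depend on where the text node sits inside the span. A slightly more general sketch, assuming only that the wanted text is a direct child of the element, asks the browser to join every direct text node and skip child elements such as the svg. The helper name direct_text and the constant DIRECT_TEXT_JS are made up for this illustration; the only Selenium call used is execute_script(), and driver is expected to be your WebDriver instance.

```python
# JavaScript run in the browser: join only the text nodes that are direct
# children of the element, skipping child elements such as <svg>.
DIRECT_TEXT_JS = """
return Array.from(arguments[0].childNodes)
    .filter(node => node.nodeType === Node.TEXT_NODE)
    .map(node => node.textContent)
    .join('');
"""


def direct_text(driver, element):
    """Return the element's own text with whitespace collapsed (hypothetical helper)."""
    raw = driver.execute_script(DIRECT_TEXT_JS, element)
    # Guard against a null result, then collapse newlines and indentation.
    return " ".join((raw or "").split())
```

With the names from the question this would be called as direct_text(driver, my_obj).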
Alternative
As an alternative you can also use Beautiful Soup as follows:
Code Block:
from bs4 import BeautifulSoup
html_text = '''
<span class="some-class">
<svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
<path d="some-path" fill="currentColor" fill-rule="evenodd">
</path>
</svg>
text to scrap
</span>
'''
soup = BeautifulSoup(html_text, 'html.parser')
last_text = soup.find("span", {"class": "some-class"}).contents[2]
print(last_text.strip())
Console Output:
text to scrap
Another Alternative
As another alternative you can also use lxml.etree as follows:
Code Block:
from lxml import etree
html_text = '''
<span class="some-class">
<svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
<path d="some-path" fill="currentColor" fill-rule="evenodd">
</path>
</svg>
text to scrap
</span>
'''
x = etree.HTML(html_text)
result = x.xpath('//span[@class="some-class"]/text()[2]')  # get the text inside the span
print(result[0].strip())  # xpath() returns a list, so take the first item
Console Output:
text to scrap
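The reason the text has to be addressed as a separate node is that, in an element tree, text following a child element belongs to that child's tail rather than to the parent's text. The sketch below shows this with the standard library's xml.etree.ElementTree instead of lxml. It assumes the snippet is well-formed XML, which holds for this sample but often not for real pages, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

html_text = '''
<span class="some-class">
<svg width="1em" height="1em" viewBox="0 0 24 24" xmlns="http://www.ty.org/1000/svg" class="other-some-class">
<path d="some-path" fill="currentColor" fill-rule="evenodd">
</path>
</svg>
text to scrap
</span>
'''

# Parse the snippet as XML; this only works because the sample is well-formed.
span = ET.fromstring(html_text.strip())

# span.text is the whitespace before <svg>; the wanted text is the tail of <svg>.
pieces = [span.text or ""] + [child.tail or "" for child in span]

# Collapse the surrounding newlines into a clean string.
own_text = " ".join("".join(pieces).split())
print(own_text)
```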