Scrapy: select last descendant node?
Question
I have a dict with selectors which I use to get data:
for key, selector in selectors.items():
    data[key] = response.css(selector).get().strip()
One of the selectors is span::text, but sometimes the text is wrapped in an additional a tag. My solution is to make that entry a list including span a::text:
for key, selector in selectors.items():
    if type(selector) == list:
        for sel in selector:
            data[key] = response.css(sel).get().strip()
            if data[key] not in ["", None]: break
    else:
        data[key] = response.css(selector).get().strip()
Is there a way to change the selector so that it will get the text I want whether there's an a tag or not? I would like the script to be a single line with .get().strip().
Answer 1
Score: 1
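Sure, you can just use 'span *::text'. In Scrapy's CSS extensions, *::text selects all descendant text nodes of the matched element, so it picks up the text whether it sits directly inside the span or inside a nested a tag.
To demonstrate: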
In [1]: from scrapy.selector import Selector
In [2]: html1 = '<span><a>text contents</a></span>'
In [3]: html2 = '<span>text contents</span>'
In [4]: selector1 = Selector(text=html1)
In [5]: selector2 = Selector(text=html2)
In [6]: selector1.css('span *::text').get().strip()
Out[6]: 'text contents'
In [7]: selector2.css('span *::text').get().strip()
Out[7]: 'text contents'