Scrapy: select last descendant node?

Question

I have a dict with selectors which I use to get data:

for key, selector in selectors.items():
    data[key] = response.css(selector).get().strip()

One of the selectors is span::text, but sometimes the text is wrapped in an additional a tag. My solution is to make that entry a list including span a::text:

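for key, selector in selectors.items():
    if isinstance(selector, list):
        # Try each candidate selector until one yields non-empty text.
        for sel in selector:
            # default="" keeps .strip() from failing when nothing matches.
            data[key] = response.css(sel).get(default="").strip()
            if data[key]:
                break
    else:
        data[key] = response.css(selector).get(default="").strip()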

Is there a way to change the selector so that it will get the text I want whether there's an a tag or not? I would like the script to be a single line with .get().strip().

Answer 1

Score: 1

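Sure, you can just use 'span *::text'. In Scrapy's CSS extension, *::text selects all descendant text nodes of the matched element, so it finds the text whether or not it is wrapped in an a tag.

To demonstrate: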

In [1]: from scrapy.selector import Selector

In [2]: html1 = '<span><a>text contents</a></span>'

In [3]: html2 = '<span>text contents</span>'

In [4]: selector1 = Selector(text=html1)

In [5]: selector2 = Selector(text=html2)

In [6]: selector1.css('span *::text').get().strip()
Out[6]: 'text contents'

In [7]: selector2.css('span *::text').get().strip()
Out[7]: 'text contents'

huangapple
  • Posted on 2023-05-25 21:28:24
  • Please keep this link when reposting: https://go.coder-hub.com/76332840.html