Remove whitespace and line breaks from text extracted with Python scraping

Question

I am having trouble extracting text from a web page. I am using Scrapy with an XPath selector.

The page contains markup like this:

<div class="snippet-content">
    <h2>First Child</h2>
    <p>Hello</p>
    This is large text ..........
</div>

I need the text that follows the two child elements. This is the selector I am using:

text = response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get()

The text is extracted correctly, but it contains extra whitespace, NBSP (non-breaking space) characters, and \r\n line breaks.

For example, the extracted text looks like this:

"         \r\nRemarks byNBSP Deputy Prime Minister andNBSP Coordinating Minister for Economic Policies Heng Swee Keat at the Opening of the Bilingualism Carnival on 8 April 2023.                                "

Is there a way to get clean, sanitized text without the leading and trailing whitespace, line-break characters, and NBSP characters?

Answer 1

Score: 1

You can use the XPath function normalize-space, but it does more than remove whitespace from the beginning and end of a string. It also collapses every run of spaces or other whitespace characters to a single space, wherever the run occurs in the string.

Alternatively, you can use Python's str.strip method, which by default (with no argument) removes whitespace characters only from the beginning and end of a string.

Examples:

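# Option 1: let XPath trim both ends and collapse inner whitespace runs
text = response.xpath('normalize-space(//div[contains(@class, "snippet-content")]/text()[last()])').get()
# Option 2: extract the raw text node, then trim both ends in Python
text = response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get().strip()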
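Neither option above removes NBSP characters from the middle of the string: XPath's normalize-space only treats space, tab, CR, and LF as whitespace, and str.strip only touches the two ends. The following is a minimal sketch of one way to handle that in plain Python. The names clean_text and xpath_like_normalize are hypothetical helpers, not part of Scrapy, and the sample string assumes the "NBSP" marker in the question stands for the character "\xa0".

```python
import re

# Sample modeled on the question's output; "\xa0" is assumed to be the NBSP.
raw = "         \r\nRemarks by\xa0 Deputy Prime Minister and\xa0 Coordinating Minister   \r\n   "


def clean_text(value):
    """Collapse all whitespace (including NBSP) to single spaces and trim."""
    if value is None:  # .get() returns None when the selector matches nothing
        return ""
    # str.split() with no argument splits on any Unicode whitespace,
    # which covers "\xa0", "\r" and "\n"; join rebuilds with single spaces.
    return " ".join(value.split())


def xpath_like_normalize(value):
    """Rough emulation of normalize-space: only space, tab, CR, LF count."""
    return re.sub(r"[ \t\r\n]+", " ", value).strip(" \t\r\n")


stripped = raw.strip()                 # ends trimmed, inner NBSP kept
normalized = xpath_like_normalize(raw)  # runs collapsed, inner NBSP kept
cleaned = clean_text(raw)               # runs collapsed, NBSP replaced

print(repr(stripped))
print(repr(normalized))
print(repr(cleaned))
```

In a spider this could be applied as clean_text(response.xpath(...).get()), which also avoids calling .strip() on None when the selector matches nothing.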

huangapple
  • Posted on April 11, 2023, 01:38:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75979362.html