Remove whitespace and line breaks from extracted text when scraping with Python
Question
I am having trouble extracting text from a web page. I am using Scrapy with an XPath selector for this.
The page contains markup like this:
<div class="snippet-content">
<h2>First Child</h2>
<p>Hello</p>
This is large text ..........
</div>
I need the text that follows the two immediate child elements. The selector I am using is:
text = response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get()
The text is extracted correctly, but it contains extra whitespace, non-breaking spaces (shown as NBPS below), and \r\n line breaks.
For example, the extracted text looks like this:
" \r\nRemarks byNBPS Deputy Prime Minister andNBPS Coordinating Minister for Economic Policies Heng Swee Keat at the Opening of the Bilingualism Carnival on 8 April 2023. "
Is there a way to get clean, sanitized text without the leading and trailing whitespace, the line breaks, and the non-breaking space characters?
Answer 1
Score: 1
You can use the XPath function normalize-space, but it does more than remove whitespace from the beginning and end of a string. If the string also contains runs of spaces or other whitespace characters, it collapses each run into a single space, wherever it occurs in the string.
Alternatively, you can use Python's str.strip method, which by default (with no argument) removes whitespace characters only from the beginning and end of a string.
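Examples:
# Option 1: trim the ends and collapse inner whitespace runs inside the XPath expression
text = response.xpath('normalize-space(//div[contains(@class, "snippet-content")]/text()[last()])').get()
# Option 2: trim only the ends in Python (raises AttributeError if nothing matched and get() returned None)
text = response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get().strip()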
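Neither option above is guaranteed to remove non-breaking spaces from the middle of the string: XPath 1.0's normalize-space treats only space, tab, carriage return, and line feed as whitespace, and str.strip touches only the two ends. As a minimal sketch, assuming the stray characters are ordinary Unicode whitespace such as U+00A0, the cleanup can be done in plain Python after extraction. The helper name clean_text is hypothetical and not part of Scrapy.

```python
def clean_text(raw):
    """Collapse every run of whitespace into one space and trim both ends.

    Hypothetical helper for illustration. str.split() with no argument
    splits on any Unicode whitespace, which in Python 3 includes the
    non-breaking space (U+00A0) as well as \r, \n, and tabs.
    """
    if raw is None:
        # .get() returns None when the selector matched nothing
        return ""
    return " ".join(raw.split())


# Sample input assumed to mirror the question: a leading line break,
# a non-breaking space followed by a normal space, and trailing spaces.
sample = " \r\nRemarks by\xa0 Deputy Prime Minister  "
cleaned = clean_text(sample)
```

In a spider this would be applied to the selector result, for example clean_text(response.xpath(...).get()), which also avoids the error that .get().strip() raises when nothing matches.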