April 11, 2023 01:38:49

Remove whitespace and line breaks from extracted text when scraping with Python

Question

I am having trouble extracting text from a web page. I am using Scrapy with an XPath selector.

The page contains markup like this:

<div class="snippet-content">
    <h2>First Child</h2>
    <p>Hello</p>
    This is large text ..........
</div>

I need the text that comes after the two immediate child elements. The selector I am using is:

text = response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get()

The text is extracted correctly, but it contains leading and trailing whitespace, non-breaking spaces (marked NBPS in the sample below), and \r\n line breaks.

For example, the extracted text looks like this:

"         \r\nRemarks byNBPS Deputy Prime Minister andNBPS Coordinating Minister for Economic Policies Heng Swee Keat at the Opening of the Bilingualism Carnival on 8 April 2023.                                "

Is there a way to get clean text without the surrounding whitespace, line-break characters, and non-breaking spaces?

Answer 1

Score: 1

You can use the XPath function normalize-space, but it does more than strip whitespace from the beginning and end of a string: it also collapses every run of spaces or other whitespace characters into a single space, wherever the run occurs in the string.

Alternatively, you can use Python's str.strip method, which by default (with no argument) removes whitespace only from the beginning and end of a string.

Examples:

text = response.xpath('normalize-space(//div[contains(@class, "snippet-content")]/text()[last()])').get()

text = response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get().strip()
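
Neither example above is guaranteed to deal with non-breaking spaces inside the text. XPath 1.0's normalize-space treats only space, tab, carriage return, and line feed as whitespace, so a real U+00A0 character would survive it, and str.strip only touches the two ends of the string. Calling .get().strip() also raises an AttributeError when the selector matches nothing, because .get() returns None in that case. A minimal pure-Python sketch that covers both points might look like the following. The helper name clean_text is hypothetical, and the sample string assumes the NBPS markers in the question stand for actual U+00A0 characters, which the question does not confirm.

```python
from typing import Optional


def clean_text(raw: Optional[str]) -> str:
    """Collapse every whitespace run (including U+00A0) and trim both ends."""
    if raw is None:
        # Selector.get() returns None when the XPath matches nothing.
        return ""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which covers spaces, \r, \n, \t and the non-breaking space (U+00A0).
    return " ".join(raw.split())


# Assumed sample: \xa0 stands in for the NBPS markers shown in the question.
sample = "         \r\nRemarks by\xa0 Deputy Prime Minister and\xa0 Coordinating Minister        "
print(clean_text(sample))
```

In a spider, the result of the original selector could be passed straight through, for example clean_text(response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get()). Note that this replaces each non-breaking space with an ordinary space rather than deleting it, so words separated only by U+00A0 stay separated.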
