Question

Take an average mainstream URL like:

https://people.com/books/jay-shetty-announces-new-book-8-rules-of-love/

As a human, it's pretty easy to just copy and paste the text of the article. But is there a standard way in 2023 to get just the text programmatically?

Using curl to fetch the HTML isn't enough, because some sites only render the text via JavaScript.
Using PhantomJS or a headless browser sounds like the way to go, but then what's the modern technique for getting just the text and ignoring everything else?

Answer 1

Score: 2

I'm going to answer my own question and recommend chromedp in Go. Because the page is loaded in a real headless browser, its JavaScript gets to run. With `chromedp.WaitReady("body")` followed by `chromedp.Nodes("//p[text()] | //li[text()]", &res)`, you wait for the body to be ready and then read the p and li elements that contain text, like so:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/chromedp/cdproto/cdp"
	"github.com/chromedp/chromedp"
)

func main() {
	url := "https://anyurl.com"

	// Create a browser context; cancel releases its resources on exit.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// Load the page, wait for <body>, then collect every <p> and <li>
	// that has a direct text node.
	var res []*cdp.Node
	err := chromedp.Run(ctx,
		chromedp.Navigate(url),
		chromedp.WaitReady("body"),
		chromedp.Nodes("//p[text()] | //li[text()]", &res),
	)
	if err != nil {
		log.Fatal(err)
	}

	for _, item := range res {
		// Read the inner HTML of each matched node via its full XPath.
		// Inline markup such as <a> or <b> is kept in the result.
		var innerHTML string
		err := chromedp.Run(ctx,
			chromedp.InnerHTML(item.FullXPath(), &innerHTML),
		)
		if err != nil {
			log.Printf("skipping %s: %v", item.FullXPath(), err)
			continue
		}

		fmt.Println(innerHTML)
	}
}
