What's the best way to get just the user-readable word content of a page?

Question

Take your average mainstream URL, like:

https://people.com/books/jay-shetty-announces-new-book-8-rules-of-love/

It's pretty easy for a human to just copy and paste the text of the article. But is there any standard way in 2023 to get just the text?

  1. Using curl to fetch the HTML isn't perfect, because sometimes the site only renders the text via JavaScript.

  2. Using PhantomJS or a headless browser sounds like the way, but then what's the modern technique for getting just the text and ignoring the non-text?

Answer 1

Score: 2

I'm going to answer my own question and recommend chromedp in Go. If you run `chromedp.WaitReady("body")` followed by `chromedp.Nodes("//p[text()] | //li[text()]", &res)`, the page is loaded in a headless browser, so its JavaScript has a chance to run before the nodes are collected, and then you can read the p or li text elements like so:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/chromedp/cdproto/cdp"
	"github.com/chromedp/chromedp"
)

func main() {
	url := "https://anyurl.com"

	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// run task list
	var res []*cdp.Node
	err := chromedp.Run(ctx,
		chromedp.Navigate(url),
		chromedp.WaitReady("body"),
		chromedp.Nodes("//p[text()] | //li[text()]", &res),
	)
	if err != nil {
		log.Fatal(err)
	}

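	// Read each matched node back by its full XPath.
	for _, item := range res {
		var innerHTML string
		err := chromedp.Run(ctx,
			chromedp.InnerHTML(item.FullXPath(), &innerHTML),
		)
		if err != nil {
			// Don't silently drop failures; report the node and move on.
			log.Printf("skipping %s: %v", item.FullXPath(), err)
			continue
		}

		fmt.Println(innerHTML)
	}
}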
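
One thing to watch: InnerHTML gives you markup, so a paragraph with a link in it still comes back with the `<a>` tag inside. If you want only the words, a rough sketch of a cleanup step could look like the following. Note that `stripTags` is a hypothetical helper I made up for illustration, not something chromedp provides, and it assumes the fragments are simple inline markup with no `>` inside attribute values:

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// stripTags is a hypothetical helper (not part of chromedp). It drops
// everything between '<' and '>', decodes HTML entities, and collapses runs
// of whitespace. It is a naive scanner, not a real HTML parser.
func stripTags(fragment string) string {
	var b strings.Builder
	inTag := false
	for _, r := range fragment {
		switch {
		case r == '<':
			inTag = true
		case r == '>' && inTag:
			inTag = false
		case !inTag:
			b.WriteRune(r)
		}
	}
	text := html.UnescapeString(b.String())
	return strings.Join(strings.Fields(text), " ")
}

func main() {
	// prints: Hello world & friends
	fmt.Println(stripTags(`Hello <a href="/x">world</a> &amp; friends`))
}
```

You would call it on each `innerHTML` string before printing. For anything beyond simple fragments (scripts, comments, odd attributes), a proper parser such as golang.org/x/net/html is probably the safer choice.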

huangapple
  • Posted on May 30, 2023 at 01:07:13
  • If you repost, please keep the link to this article: https://go.coder-hub.com/76359161.html