What's the best way to get just the user-readable text content of a page?
Question
Take an average mainstream URL like:
https://people.com/books/jay-shetty-announces-new-book-8-rules-of-love/
As a human, it's pretty easy to just copy and paste the text of the article. But is there any standard way in 2023 to get just the text?
- Using curl to fetch the served HTML isn't perfect, because some sites only render the text via JavaScript.
- Using PhantomJS or a headless browser sounds like the way to go, but then what's the modern technique for getting just the text and ignoring the non-text?
Answer 1
Score: 2
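I'm going to answer my own question and recommend chromedp in Go. chromedp drives a real Chrome instance, so the page's JavaScript runs before you read anything. After chromedp.WaitReady("body"), you can select the p and li elements that contain text with chromedp.Nodes("//p[text()] | //li[text()]", &res) and read them, like so.
Note that WaitReady("body") only waits for the body element to be ready; content that scripts inject later may need an additional wait on a more specific selector.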
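package main

import (
	"context"
	"fmt"
	"log"

	"github.com/chromedp/cdproto/cdp"
	"github.com/chromedp/chromedp"
)

func main() {
	url := "https://anyurl.com"

	// Create a chromedp context; cancel releases the browser on exit.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// Load the page, wait for <body>, then collect every <p> and <li>
	// element that has a text node as a direct child.
	var res []*cdp.Node
	err := chromedp.Run(ctx,
		chromedp.Navigate(url),
		chromedp.WaitReady("body"),
		chromedp.Nodes("//p[text()] | //li[text()]", &res),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Print the inner HTML of each matched node, addressed by its full XPath.
	for _, item := range res {
		var innerHTML string
		err := chromedp.Run(ctx,
			chromedp.InnerHTML(item.FullXPath(), &innerHTML),
		)
		if err != nil {
			// Skip nodes that can no longer be resolved instead of aborting.
			log.Printf("skipping %s: %v", item.FullXPath(), err)
			continue
		}
		fmt.Println(innerHTML)
	}
}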
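One caveat: the loop above prints inner HTML, so the output can still contain inline markup such as a and em tags, plus character entities. A rough way to reduce each fragment to plain text is sketched below using only the standard library. The helper name stripTags is my own and not part of chromedp, and it is a naive scanner rather than an HTML parser: it assumes no ">" inside attribute values and no script or style content, and text on either side of a br or block-level tag will run together.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// stripTags is a hypothetical helper (not a chromedp API). It drops
// everything between '<' and '>', decodes character entities, and
// collapses runs of whitespace into single spaces.
func stripTags(fragment string) string {
	var b strings.Builder
	inTag := false
	for _, r := range fragment {
		switch {
		case r == '<':
			inTag = true
		case r == '>' && inTag:
			inTag = false
		case !inTag:
			b.WriteRune(r)
		}
	}
	// Decode entities such as &amp; only after the tags are gone, so an
	// escaped "&lt;b&gt;" survives as literal text.
	text := html.UnescapeString(b.String())
	return strings.Join(strings.Fields(text), " ")
}

func main() {
	fragment := `Read <a href="/books">the <em>new</em> book</a> &amp; more.`
	fmt.Println(stripTags(fragment))
}
```

In the loop above you would call fmt.Println(stripTags(innerHTML)) instead of printing innerHTML directly. For anything beyond simple fragments, a real HTML parser is the safer choice.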