How to extract only text from HTML in Golang?
Question
To extract text from HTML, I use a fully HTML5-compliant tokenizer and parser (golang.org/x/net/html), like this:
s := `
<p>Links:</p><ul><li><a href="foo">Foo</a><li>
<a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
<script type='text/javascript'>
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */
</script>`
domDocTest := html.NewTokenizer(strings.NewReader(s))
for tokenType := domDocTest.Next(); tokenType != html.ErrorToken; {
    if tokenType != html.TextToken {
        tokenType = domDocTest.Next()
        continue
    }
    TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
    if len(TxtContent) > 0 {
        fmt.Printf("%s\n", TxtContent)
    }
    tokenType = domDocTest.Next()
}
But I got this result:
Links:
Foo
BarBaz
TEXT
I
WANT
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */
I don't want the CDATA content. Any idea how to get only the text content?
Answer 1
Score: 10
As indicated by @Eric Pauley, I look at both TextTokens and StartTagTokens, and skip any text that directly follows a script start tag. Here is my solution:
s := `
<p>Links:</p><ul><li><a href="foo">Foo</a><li>
<a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
<script type='text/javascript'>
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */
</script>`
domDocTest := html.NewTokenizer(strings.NewReader(s))
// Most recent start tag; stays at its zero value until one is seen
var previousStartTokenTest html.Token
loopDomTest:
for {
    tt := domDocTest.Next()
    switch tt {
    case html.ErrorToken:
        break loopDomTest // End of the document (or a read error), done
    case html.StartTagToken:
        previousStartTokenTest = domDocTest.Token()
    case html.EndTagToken:
        // Forget the start tag so that text after </script> is not skipped too
        previousStartTokenTest = html.Token{}
    case html.TextToken:
        if previousStartTokenTest.Data == "script" {
            continue // Text inside <script>: not wanted
        }
        // Text() already returns unescaped text for ordinary text tokens
        TxtContent := strings.TrimSpace(string(domDocTest.Text()))
        if len(TxtContent) > 0 {
            fmt.Printf("%s\n", TxtContent)
        }
    }
}
Answer 2
Score: 5
If you use github.com/PuerkitoBio/goquery, it's pretty easy to achieve what you want.
- First, use document.Find() to select the elements you want to remove; in your case the script elements, so document.Find("script").
- Then, remove them from the document using element.Remove().
- Finally, print/get the text using document.Text().
So, the final code would be:
package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span><script type='text/javascript'>/* <![CDATA[ */var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};/* ]]> */</script>`
    p := strings.NewReader(s)
    doc, err := goquery.NewDocumentFromReader(p)
    if err != nil {
        log.Fatal(err)
    }
    // Remove every <script> element so its contents are left out of the text
    doc.Find("script").Each(func(i int, el *goquery.Selection) {
        el.Remove()
    })
    fmt.Println(doc.Text()) // Links:FooBarBazTEXT I WANT
}
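Note that the output above runs words together ("Links:FooBarBaz") because the text nodes are concatenated with no separator. If separated words are preferred, one option is to collect the text fragments individually and join them afterwards. The following is a minimal sketch of that joining step only, using the standard library; joinText and the sample fragments slice are hypothetical and stand in for whatever your extraction loop produces.

```go
package main

import (
	"fmt"
	"strings"
)

// joinText (hypothetical helper) splits every fragment on whitespace,
// drops the empty pieces, and joins the remaining words with single spaces.
func joinText(fragments []string) string {
	var parts []string
	for _, f := range fragments {
		// strings.Fields splits on any run of whitespace, so newlines and
		// tabs between elements collapse as well.
		parts = append(parts, strings.Fields(f)...)
	}
	return strings.Join(parts, " ")
}

func main() {
	// Sample fragments, roughly what the text nodes of the example HTML contain.
	fragments := []string{"Links:", "Foo", "\n", "BarBaz", "TEXT ", "I", " WANT"}
	fmt.Println(joinText(fragments)) // Links: Foo BarBaz TEXT I WANT
}
```

This trades exact whitespace fidelity for readability: inline elements such as the b tag end up separated by a space either way, but a word split across two elements with no whitespace between them would also gain a space.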