How to extract only text from HTML in Golang?

Question

To extract text from HTML, I use the fully HTML5-compliant tokenizer and parser from golang.org/x/net/html, like this:

	s := `
<p>Links:</p><ul><li><a href="foo">Foo</a><li>
<a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
<script type='text/javascript'>
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */
</script>`

	domDocTest := html.NewTokenizer(strings.NewReader(s))
	for tokenType := domDocTest.Next(); tokenType != html.ErrorToken; {
		if tokenType != html.TextToken {
			tokenType = domDocTest.Next()
			continue
		}
		TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
		if len(TxtContent) > 0 {
			fmt.Printf("%s\n", TxtContent)
		}
		tokenType = domDocTest.Next()
	}

But I got this result:

Links:
Foo
BarBaz
TEXT
I
WANT
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */

I don't want the CDATA content from the script element. Any idea how to get only the text content?

Answer 1

Score: 10

As suggested by @Eric Pauley, I look at both TextTokens and StartTagTokens, and skip any text whose most recent start tag is script.
Here is my solution:

	s := `
<p>Links:</p><ul><li><a href="foo">Foo</a><li>
<a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
<script type='text/javascript'>
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */
</script>`

	domDocTest := html.NewTokenizer(strings.NewReader(s))
	// Holds the most recent start tag; it is an empty token until one is seen.
	previousStartTokenTest := domDocTest.Token()
loopDomTest:
	for {
		tt := domDocTest.Next()
		switch {
		case tt == html.ErrorToken:
			break loopDomTest // End of the document, done
		case tt == html.StartTagToken:
			// Remember which element the following text belongs to.
			previousStartTokenTest = domDocTest.Token()
		case tt == html.TextToken:
			// Text that follows a <script> start tag is JavaScript, not page text.
			if previousStartTokenTest.Data == "script" {
				continue
			}
			TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
			if len(TxtContent) > 0 {
				fmt.Printf("%s\n", TxtContent)
			}
		}
	}

Answer 2

Score: 5

If you use github.com/PuerkitoBio/goquery, it's pretty easy to achieve what you want.

  • First, use document.Find() to select the elements you want to remove. In your case these are the script elements, so document.Find("script").

  • Then, remove them from the document using element.Remove().

  • Finally, print or get the remaining text using document.Text().

So, the final code would be:

package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span><script type='text/javascript'>/* <![CDATA[ */var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};/* ]]> */</script>`

	p := strings.NewReader(s)
	doc, _ := goquery.NewDocumentFromReader(p)

	doc.Find("script").Each(func(i int, el *goquery.Selection) {
		el.Remove()
	})

	fmt.Println(doc.Text()) // Links:FooBarBazTEXT I WANT
}

huangapple
  • Posted on June 9, 2017, 01:00:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/44441665.html