goquery: Extract text from one HTML tag and add it to the next tag

Question

Yeah, sorry that the title explains nothing. I'll need to use an example.

This is a continuation of another question I posted which solved one problem but not all of them. I've put most of the background info from that question into this one. Also, I've only been looking into Go for about 5 days (and I only started learning code a couple months ago), so I'm 90% sure that I'm close to figuring out what I want and that the problem is that I've got some silly syntax mistakes.

###Situation###

I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database). Here's what it looks like:

<html>
    <body>
        <h1>
            <span class="text">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="text">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

###Objective###

I'd like to:

  1. Extract the content of the <span class="text"> inside each <h1>.
  2. Insert this extracted content at the front of the content of the <span class="text"> inside a <p>, concatenating the two.
  3. Only do this for the <p> tag that immediately follows the <h1> tag.
  4. Do this for all of the <h1> tags on the page.

Once again, an example explains this better. This is what I want the result to look like:

<html>
    <body>
        <p>
            <span class="text">Go totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <p>
            <span class="text">debugger should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle</span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

###Solution Attempts###

Because distinguishing the <h1> tags further from the <p> tags would provide more parsing options, I've figured out how to change the class attributes of the <h1> tags to this:

<html>
    <body>
        <h1>
            <span class="title">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="title">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

with this code:

html_code := strings.NewReader(`
code_example_above
`)
doc, _ := goquery.NewDocumentFromReader(html_code)
doc.Find("h1").Each(func(i int, s *goquery.Selection) {
	s.SetAttr("class", "title")
	class, _ := s.Attr("class")
	if class == "title" {
		fmt.Println(class, s.Text())
	}
})

I know that I can select the <p> (with its "text" span) that follows the <h1> (with its "title" span) with either doc.Find("h1+p") or s.Next() inside the doc.Find("h1").Each function:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
	s.SetAttr("class", "title")
	class, _ := s.Attr("class")
	if class == "title" {
		fmt.Println(class, s.Text())
		fmt.Println(s.Next().Text())
	}
})

I can't figure out how to insert the text from the <h1> "title" span into the <p> "text" span. I've tried quite a few variations of s.After(), s.Before(), and s.Append(), for example:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
	s.SetAttr("class", "title")
	class, _ := s.Attr("class")
	if class == "title" {
		s.After(s.Text())
		fmt.Println(s.Next().Text())
	}
})

but I can't figure out how to do exactly what I want.

If I use s.After(s.Next().Text()) instead, I get this error output:

panic: expected identifier, found 5 instead

goroutine 1 [running]:
code.google.com/p/cascadia.MustCompile(0xc2082f09a0, 0x62, 0x62)
	/home/*/go/src/code.google.com/p/cascadia/selector.go:59 +0x77
github.com/PuerkitoBio/goquery.(*Selection).After(0xc2082ea630, 0xc2082f09a0, 0x62, 0x5)
	/home/*/go/src/github.com/PuerkitoBio/goquery/manipulation.go:18 +0x32
main.func·001(0x0, 0xc2082ea630)
	/home/*/go/test2.go:78 +0x106
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc2082ea600, 0x7cb678, 0x2)
	/home/*/go/src/github.com/PuerkitoBio/goquery/iteration.go:7 +0x173
main.ExampleScrape()
	/home/*/go/test2.go:82 +0x213
main.main()
	/home/*/go/test2.go:175 +0x1b

goroutine 9 [runnable]:
net/http.(*persistConn).readLoop(0xc208047ef0)
	/usr/lib/go/src/net/http/transport.go:928 +0x9ce
created by net/http.(*Transport).dialConn
	/usr/lib/go/src/net/http/transport.go:660 +0xc9f

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
	/usr/lib/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 10 [select]:
net/http.(*persistConn).writeLoop(0xc208047ef0)
	/usr/lib/go/src/net/http/transport.go:945 +0x41d
created by net/http.(*Transport).dialConn
	/usr/lib/go/src/net/http/transport.go:661 +0xcbc
exit status 2

(The lines of my script don't match the lines of the examples above, but "line 72" of my script contains the code s.After(s.Next().Text()). I don't know what exactly panic: expected identifier, found 5 instead means.)

###Summary###

In summary, my problem is that I can't quite wrap my head around how to use goquery to add text to a tag.

I think I'm close. Would any gopher Jedis be able and willing to help this padawan?

Answer 1

Score: 3

Something like the code below does the job. It finds all <h1> nodes, then looks through the <span> nodes inside each <h1> for one with class text. It then takes the element that follows the <h1>; if that element is a <p> whose first child is a <span> with class text, it replaces that <span> with a new <span> holding the combined text and removes the <h1>.

I wonder if it's possible to create nodes using goquery without writing html...

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

var htmlCode = `<html>
...
</html>`

func main() {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlCode))
	if err != nil {
		log.Fatal(err)
	}
	doc.Find("h1").Each(func(i int, h1 *goquery.Selection) {
		h1.Find("span").Each(func(j int, s *goquery.Selection) {
			if !s.HasClass("text") {
				return
			}
			// goquery returns an empty selection rather than nil, so check
			// the element type and the match count instead of comparing to nil.
			p := h1.Next()
			if !p.Is("p") {
				return
			}
			ps := p.Children().First()
			if ps.Length() > 0 && ps.HasClass("text") {
				// Swap the paragraph's first span for one holding the combined text.
				ps.ReplaceWithHtml(
					fmt.Sprintf("<span class=\"text\">%s%s</span>", s.Text(), ps.Text()))
				h1.Remove()
			}
		})
	})
	htmlResult, err := doc.Html()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(htmlResult)
}

huangapple
  • Posted on January 8, 2015, 04:39:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/27828242.html