goquery: Concatenate a tag with the one that follows

Question

For some background info, I'm new to Go (3 or 4 days), but I'm starting to get more comfortable with it.

I'm trying to use goquery to parse a webpage (eventually I want to put some of the data in a database). An example is the easiest way to explain my problem:

<html>
    <body>
        <h1>
            <span class="text">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="text">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

I'd like to:

  1. Extract the content of the <span class="text"> inside each <h1>.
  2. Insert (and concatenate) this extracted content at the start of the <span class="text"> inside a <p>.
  3. Only do this for the <p> tag that immediately follows the <h1> tag.
  4. Do this for all of the <h1> tags on the page.

So this is what I want it to look like:

<html>
    <body>
        <p>
            <span class="text">Go totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <p>
            <span class="text">debugger should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle</span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

With the code starting off like this,

package main

import (
    "fmt"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    html_code := strings.NewReader(`code_example_above`)
    doc, _ := goquery.NewDocumentFromReader(html_code)

I know that I can read the <span class="text"> inside each <h1> with:

h1_tag := doc.Find("h1 .text")

I also know that I can add the content of the <h1> "text" span to the content of the <p> "text" span with this:

doc.Find("p .text").Before("h1 .text")

^But this command inserts the content of every single <h1> "text" span before every single <p> "text" span.

Then, I found out how to get a step closer to what I want:

doc.Find("p .text").First().Before("h1 .text")

^This command inserts the content of every single <h1> "text" span only before the first <p> "text" span (which is closer to what I want).

I also tried using goquery's Each() function, but I could not get any closer to what I wanted with that method (though I'm sure there's a way to do it with Each(), right?)

My biggest issue is that I can't figure out how to associate each <h1> "text" span with the <p> "text" span that immediately follows it.

If it helps, an <h1> "text" span is always followed by a <p> "text" span on the web pages I'm trying to parse.

My brain's out of juice. Do any Go geniuses know how to do this and are willing to explain it? Thanks in advance.

EDIT

I found out something else I can do:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    nex := s.Next().Text()
    fmt.Println(s.Text(), nex, "\n\n")
})

^This prints out what I want: the content of each <h1> "text" span followed by the content of the <p> that immediately follows it. I had thought that s.Next() would return the next instance of <h1>, but it returns the next tag in doc after the current element of the *goquery.Selection being iterated. Is that correct?

Or, as mattn pointed out, I could also use doc.Find("h1+p").

I'm still having trouble appending the <h1> "text" content to the <p> "text" content. I'll post that as a separate question, because this one can be broken down into several questions and mattn has already answered one of them.

Answer 1

Score: 1

I'm not sure what code you are writing with goquery, but what you are looking for may be the adjacent sibling selector.

h1+p

This matches every p tag that immediately follows an h1 tag (the selection contains the p tags, not the h1 tags).

huangapple
  • Posted on January 6, 2015, 07:04:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/27789446.html