2013-12-07 06:05:42 · go

Go: regexp FindAll and ReplaceAll in a single pass

Question

I'm parsing a web page to get some values inside tags, but I'm not interested in the tags themselves, only in their content.

I'm using regexp.FindAll to get all the matching expressions (including the tags) and then ReplaceAll on each match to strip the tags. Running the regexp twice takes twice as long, of course, and I'd like to avoid that.

Is there a way to apply both functions in a single pass, or an equivalent regexp?

Of course, I could write a function to remove the tags, but in some cases that could be more complex because the tags have variable length (like <a href="">), and a regexp can take care of this.

A simple example of my code is here (it won't run in the playground): http://play.golang.org/p/uGKjzmylSY

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"regexp"
)

func main() {
	res, err := http.Get("http://www.elpais.es")
	if err != nil {
		panic(err)
	}

	body, err := ioutil.ReadAll(res.Body)
	res.Body.Close()
	if err != nil {
		panic(err)
	}
	fmt.Println("body: ", len(body), cap(body))

	r := regexp.MustCompile("<li>(.+)</li>")

	// First pass: find all matches, including the <li> tags.
	out := r.FindAll(body, -1)

	for i, v := range out[:10] {
		fmt.Printf("%d: %s\n", i, v)
	}

	// Second pass: run the regexp again on each match to remove the tags.
	out2 := make([][]byte, len(out))
	for i, v := range out {
		out2[i] = r.ReplaceAll(v, []byte("$1"))
	}

	for i, v := range out2[:10] {
		fmt.Printf("%d: %s\n", i, v)
	}
}

By the way, I understand that regexps should not be used to parse HTML. I'm only interested in some of the innermost tags, not in the structure or nesting, so I suppose it is OK here.

Answer 1

Score: 5

Recommendation: use goquery for this task. It is very simple to use and greatly reduces the amount of code you need.

Example:

doc, _ := goquery.NewDocument("http://www.elpais.es")
text := doc.Find("li").Slice(10, -1).Text()

Regarding your question, use FindAllSubmatch to extract the match directly:

r := regexp.MustCompile("<li>(.+)</li>")

// Find all matches together with their submatches:
// v[0] is the whole match (with the <li> tags), v[1] is the first group.
out := r.FindAllSubmatch(body, -1)

for i, v := range out[:10] {
	fmt.Printf("%d: %s\n", i, v[1])
}
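
For readers who want to try the single-pass approach without fetching a page, here is a minimal, self-contained sketch. It makes a few assumptions that are not part of the original answer: the helper name extractListItems is hypothetical, the input is an inline sample rather than the downloaded page, and the pattern uses a non-greedy (.+?) so that two items on the same line are reported separately.

```go
package main

import (
	"fmt"
	"regexp"
)

// liContent captures the text between <li> and </li>. The non-greedy (.+?)
// is an assumption that differs from the pattern in the question: it keeps
// two items that share a line from being merged into one match.
var liContent = regexp.MustCompile(`<li>(.+?)</li>`)

// extractListItems is a hypothetical helper. It scans body once and returns
// only the first capture group of every match, without the surrounding tags.
func extractListItems(body []byte) []string {
	matches := liContent.FindAllSubmatch(body, -1)
	items := make([]string, 0, len(matches))
	for _, m := range matches {
		// m[0] is the whole match including the tags; m[1] is the group.
		items = append(items, string(m[1]))
	}
	return items
}

func main() {
	// Inline sample standing in for the downloaded page.
	body := []byte("<ul>\n<li>first</li><li>second</li>\n<li><a href=\"/x\">third</a></li>\n</ul>")

	// Ranging over the result avoids the out[:10] slice, which panics
	// when fewer than ten matches are found.
	for i, item := range extractListItems(body) {
		fmt.Printf("%d: %s\n", i, item)
	}
}
```

Because the capture group is returned alongside each match, no second ReplaceAll pass is needed. Note that "." does not match a newline by default in Go's regexp package, so an item that spans several lines would not be found by this pattern.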
