Go: regexp FindAll and ReplaceAll in a single pass
Question
I'm parsing a web page to get some values inside tags, but I'm not interested in the tags themselves, only in their content.
I'm using regexp.FindAll to get all the matching expressions (including the tags) and then ReplaceAll on each match to strip the tags. Running the regexp twice takes roughly twice as long, of course, and I'd like to avoid that.
Is there a way to apply both functions in a single pass, or an equivalent regexp?
Of course, I could write a function to remove the tags, but in some cases that would be more complex because tags can have variable length (like <a href="">), whereas a regexp can take care of this.
A simple example of my code is here (it won't run in the playground): http://play.golang.org/p/uGKjzmylSY
func main() {
	res, err := http.Get("http://www.elpais.es")
	if err != nil {
		panic(err)
	}
	body, err := ioutil.ReadAll(res.Body)
	res.Body.Close()
	// Check the read error before using body.
	if err != nil {
		panic(err)
	}
	fmt.Println("body: ", len(body), cap(body))
	r := regexp.MustCompile("<li>(.+)</li>")
	// First pass: find all matches, including the <li> tags.
	// Note: out[:10] assumes the page yields at least 10 matches.
	out := r.FindAll(body, -1)
	for i, v := range out[:10] {
		fmt.Printf("%d: %s\n", i, v)
	}
	// Second pass: run the regexp again on each match to strip the tags.
	out2 := make([][]byte, len(out))
	for i, v := range out {
		out2[i] = r.ReplaceAll(v, []byte("$1"))
	}
	for i, v := range out2[:10] {
		fmt.Printf("%d: %s\n", i, v)
	}
}
By the way, I understand that regexps should not be used to parse HTML. I'm only interested in some of the innermost tags, not in the structure or nesting, so I suppose it is OK here.
Answer 1
Score: 5
Recommendation: use goquery for this task. It is very simple to use and cuts down your code considerably.
Example:
doc, _ := goquery.NewDocument("http://www.elpais.es")
text := doc.Find("li").Slice(10, -1).Text()
Regarding your question, use FindAllSubmatch to extract the capture group directly, in a single pass:
r := regexp.MustCompile("<li>(.+)</li>")
// Find all matches along with their submatches:
// v[0] is the whole match (with tags), v[1] is the content of the first group.
out := r.FindAllSubmatch(body, -1)
for i, v := range out[:10] {
	fmt.Printf("%d: %s\n", i, v[1])
}
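For anyone who wants to try this without fetching a page, here is a minimal self-contained sketch of the same idea on a fixed input. The helper name firstGroups and the sample HTML are made up for illustration. Two things differ from the snippet above and are my own assumptions: the patterns use the non-greedy (.+?) so that several items on one line are matched separately, and the second pattern shows one possible way to tolerate variable-length opening tags such as <a href="">.

```go
package main

import (
	"fmt"
	"regexp"
)

// Assumption: non-greedy (.+?) instead of the (.+) used above, so that
// adjacent elements on the same line do not collapse into one match.
var (
	liContent = regexp.MustCompile(`<li>(.+?)</li>`)
	// Hypothetical pattern for tags with attributes: [^>]* skips them.
	aContent = regexp.MustCompile(`<a\b[^>]*>(.+?)</a>`)
)

// firstGroups is a hypothetical helper: it scans body once with re and
// returns the first capture group of every match.
func firstGroups(re *regexp.Regexp, body []byte) [][]byte {
	matches := re.FindAllSubmatch(body, -1)
	out := make([][]byte, 0, len(matches))
	for _, m := range matches {
		// m[0] is the whole match including tags, m[1] is the group.
		out = append(out, m[1])
	}
	return out
}

func main() {
	body := []byte("<ul><li>first</li><li>second</li>\n<li>third</li></ul>" +
		`<a href="http://example.com">link</a>`)

	for i, v := range firstGroups(liContent, body) {
		fmt.Printf("li %d: %s\n", i, v)
	}
	for i, v := range firstGroups(aContent, body) {
		fmt.Printf("a %d: %s\n", i, v)
	}
}
```

On this sample input it prints the three list items (first, second, third) followed by the link text, with no second regexp pass. Note that the returned slices point into body rather than copying it, so copy them if body will be modified or reused.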