Extract text content from HTML in Golang

Question

One way to extract the substrings between two markers in Go is a regular expression, using the regexp package. Here is an example:

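package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	longString := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

	newString := getInnerStrings("<p>", "</p>", longString)

	fmt.Println(newString)
	// output: this is paragraph
	//         this is paragraph 2
}

// getInnerStrings returns every substring found between start and end,
// trimmed of surrounding whitespace and joined with newlines.
func getInnerStrings(start, end, str string) string {
	// QuoteMeta escapes the delimiters so they match literally;
	// (?s) lets "." match newlines inside the captured text.
	re := regexp.MustCompile("(?s)" + regexp.QuoteMeta(start) + "(.*?)" + regexp.QuoteMeta(end))
	matches := re.FindAllStringSubmatch(str, -1)

	inner := make([]string, 0, len(matches))
	for _, match := range matches {
		inner = append(inner, strings.TrimSpace(match[1]))
	}

	return strings.Join(inner, "\n")
}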

This code uses a regular expression to match the content between <p> and </p> and combines the matches into a single string.

Original question:

What's the best way to extract inner substrings from strings in Golang?

input:

"Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

output:

"this is paragraph \n
 this is paragraph 2"

Is there any string package/library for Go that already does something like this?

package main

import (
	"fmt"
	"strings"
)

func main() {
	longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

	newString := getInnerStrings("<p>", "</p>", longString)

	fmt.Println(newString)
	//output: this is paragraph \n
	//        this is paragraph 2

}

func getInnerStrings(start, end, str string) string {
	//Brain Freeze
	//Regex?
	//Bytes Loop?
}

thanks

Answer 1

Score: 6


Don't use regular expressions to try and interpret HTML. Use a fully capable HTML tokenizer and parser.

I recommend you read this article on CodingHorror.
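
As a rough illustration of the tokenizer approach, the sketch below walks the input token by token and keeps only the text inside <p> elements. It rests on several assumptions: it uses the standard library's encoding/xml decoder in its lenient, HTML-friendly mode (Strict = false, HTMLAutoClose, HTMLEntity) purely so that it runs without extra dependencies, and paragraphTexts is a hypothetical helper name. For real-world pages, a dedicated HTML parser such as golang.org/x/net/html is closer to what this answer recommends.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// paragraphTexts (hypothetical helper) collects the trimmed text found
// inside each top-level <p> element of src.
func paragraphTexts(src string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(src))
	// Lenient settings that encoding/xml documents for HTML-like input.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var out []string
	var cur strings.Builder
	depth := 0 // how many <p> elements are currently open
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if strings.EqualFold(t.Name.Local, "p") {
				depth++
			}
		case xml.EndElement:
			if strings.EqualFold(t.Name.Local, "p") && depth > 0 {
				depth--
				if depth == 0 {
					out = append(out, strings.TrimSpace(cur.String()))
					cur.Reset()
				}
			}
		case xml.CharData:
			// Text outside any <p> is the "junk" and is skipped.
			if depth > 0 {
				cur.Write(t)
			}
		}
	}
}

func main() {
	in := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	texts, err := paragraphTexts(in)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(strings.Join(texts, "\n"))
}
```

Unlike a pattern match on the raw string, the token loop keeps working when the tags carry attributes or differ in letter case, which is the main argument for a parser here.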

Answer 2

Score: 1


Here is a function that I use a lot.

// GetInnerSubstring returns the text between the first occurrence of prefix
// and the next occurrence of suffix. An empty prefix means "from the start of
// str"; an empty suffix means "to the end of str". With a non-empty prefix it
// returns "" when prefix is missing or no suffix follows it.
// Requires the "strings" package.
func GetInnerSubstring(str string, prefix string, suffix string) string {
	var beginIndex, endIndex int
	beginIndex = strings.Index(str, prefix)
	if beginIndex == -1 {
		beginIndex = 0
		endIndex = 0
	} else if len(prefix) == 0 {
		beginIndex = 0
		endIndex = strings.Index(str, suffix)
		if endIndex == -1 || len(suffix) == 0 {
			endIndex = len(str)
		}
	} else {
		beginIndex += len(prefix)
		endIndex = strings.Index(str[beginIndex:], suffix)
		if endIndex == -1 {
			if strings.Index(str, suffix) < beginIndex {
				endIndex = beginIndex
			} else {
				endIndex = len(str)
			}
		} else {
			if len(suffix) == 0 {
				endIndex = len(str)
			} else {
				endIndex += beginIndex
			}
		}
	}

	return str[beginIndex:endIndex]
}

You can try it at the playground, https://play.golang.org/p/Xo0SJu0Vq4.
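
GetInnerSubstring returns only the first match, while the question asks for every paragraph. A minimal sketch of how the same strings.Index idea might be extended to collect all matches follows; innerSubstrings is a hypothetical name, and trimming the surrounding whitespace is an assumption based on the output the question expects.

```go
package main

import (
	"fmt"
	"strings"
)

// innerSubstrings (hypothetical helper) returns every substring of str that
// sits between prefix and the next suffix, scanning left to right.
func innerSubstrings(str, prefix, suffix string) []string {
	var out []string
	// Empty delimiters would match everywhere, so treat them as "no match".
	if prefix == "" || suffix == "" {
		return out
	}
	for {
		i := strings.Index(str, prefix)
		if i == -1 {
			return out
		}
		str = str[i+len(prefix):] // drop everything up to and including prefix
		j := strings.Index(str, suffix)
		if j == -1 {
			return out // unterminated match: ignore it
		}
		out = append(out, strings.TrimSpace(str[:j]))
		str = str[j+len(suffix):] // continue after the suffix
	}
}

func main() {
	in := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	fmt.Println(strings.Join(innerSubstrings(in, "<p>", "</p>"), "\n"))
}
```

Returning a slice rather than a pre-joined string leaves the choice of separator to the caller.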

Answer 3

Score: 0


StrExtract retrieves a string between two delimiters.

> StrExtract(sExper, sAdelim, sCdelim, nOccur)
>
> sExper: the string to search.
>
> sAdelim: the delimiter that marks the beginning of the text to extract.
>
> sCdelim: the delimiter that marks the end of the text to extract.
>
> nOccur: the occurrence of sAdelim in sExper (counting from 1) at which
> to start the extraction.


package main

import (
	"fmt"
	"strings"
)

func main() {
	s := "a11ba22ba333ba4444ba55555ba666666b"
	fmt.Println("StrExtract1: ", StrExtract(s, "a", "b", 5)) // prints "StrExtract1:  55555"
}

// StrExtract returns the text between the nOccur-th sAdelim and the first
// sCdelim after it, or "" when either delimiter is missing from that segment.
func StrExtract(sExper, sAdelim, sCdelim string, nOccur int) string {
	// Element i (i >= 1) of the split starts right after the i-th sAdelim.
	aExper := strings.Split(sExper, sAdelim)

	if len(aExper) <= nOccur {
		return ""
	}

	sMember := aExper[nOccur]
	// Keep only what precedes the first sCdelim in that segment.
	aExper = strings.Split(sMember, sCdelim)

	if len(aExper) == 1 {
		return ""
	}

	return aExper[0]
}
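
StrExtract returns "" both when nothing is found and when the delimiters enclose an empty string, and it would panic on a negative nOccur. The sketch below is one possible variant, not the answer's own code: nthBetween is a hypothetical name, it assumes Go 1.18 or later for strings.Cut, and it reports success through a second return value. One behavioural difference to be aware of: it looks for the closing delimiter anywhere after the n-th opening one, not only before the next opening delimiter.

```go
package main

import (
	"fmt"
	"strings"
)

// nthBetween (hypothetical helper) returns the text between the n-th
// occurrence of open (counting from 1) and the first close after it.
// The boolean is false when either delimiter cannot be found.
func nthBetween(s, open, close string, n int) (string, bool) {
	if n < 1 || open == "" || close == "" {
		return "", false
	}
	rest := s
	// Skip past the first n opening delimiters.
	for i := 0; i < n; i++ {
		_, after, found := strings.Cut(rest, open)
		if !found {
			return "", false
		}
		rest = after
	}
	inner, _, found := strings.Cut(rest, close)
	return inner, found
}

func main() {
	s := "a11ba22ba333ba4444ba55555ba666666b"
	if v, ok := nthBetween(s, "a", "b", 5); ok {
		fmt.Println(v)
	}
}
```

With the sample string from the answer, the fifth occurrence yields 55555, matching StrExtract.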

huangapple
  • Published on January 8, 2014, 23:48:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/21000277.html