Extract text content from HTML in Golang
Question
One way to extract the inner substrings of a string in Go is with a regular expression, using the regexp package. Here is an example:
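package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	longString := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	newString := getInnerStrings("<p>", "</p>", longString)
	fmt.Println(newString)
	// Output:
	// this is paragraph
	// this is paragraph 2
}

// getInnerStrings returns every substring found between start and end,
// trimmed of surrounding whitespace and joined with newlines.
func getInnerStrings(start, end, str string) string {
	// QuoteMeta escapes the delimiters so they are matched literally;
	// (?s) lets "." match newlines, and ".*?" stops at the nearest end delimiter.
	re := regexp.MustCompile("(?s)" + regexp.QuoteMeta(start) + "(.*?)" + regexp.QuoteMeta(end))
	matches := re.FindAllStringSubmatch(str, -1)
	inner := make([]string, 0, len(matches))
	for _, match := range matches {
		inner = append(inner, strings.TrimSpace(match[1]))
	}
	return strings.Join(inner, "\n")
}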
This code uses a regular expression to match the content between <p> and </p>, then joins the matched results into a single string and returns it.
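Fixed delimiters like these only match a bare <p> tag. As a rough sketch of how the same regexp approach could tolerate attributes, mixed case, and line breaks, the example below uses a case-insensitive pattern in which "." also matches newlines. The helper name paragraphTexts and the pattern itself are assumptions made for illustration, not part of the original post, and the pattern still cannot cope with nested or malformed markup.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// paragraphRe is an assumed pattern for illustration:
//   (?is)       case-insensitive, and "." also matches newlines
//   <p\b[^>]*>  an opening p tag with optional attributes (but not <pre>)
//   (.*?)       the shortest possible body before the next </p>
var paragraphRe = regexp.MustCompile(`(?is)<p\b[^>]*>(.*?)</p>`)

// paragraphTexts (hypothetical helper) returns the trimmed body of each match.
func paragraphTexts(s string) []string {
	var out []string
	for _, m := range paragraphRe.FindAllStringSubmatch(s, -1) {
		out = append(out, strings.TrimSpace(m[1]))
	}
	return out
}

func main() {
	in := "junk <p class=\"intro\"> first </p> junk <P>second\nline</P>"
	for _, p := range paragraphTexts(in) {
		fmt.Printf("%q\n", p)
	}
}
```

Returning a slice rather than one joined string leaves the choice of separator to the caller.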
Original question:
What's the best way to extract inner substrings from strings in Golang?
input:
"Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
output:
"this is paragraph \n
this is paragraph 2"
Is there any string package/library for Go that already does something like this?
package main

import (
	"fmt"
	"strings"
)

func main() {
	longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	newString := getInnerStrings("<p>", "</p>", longString)
	fmt.Println(newString)
	//output: this is paragraph \n
	//        this is paragraph 2
}

func getInnerStrings(start, end, str string) string {
	//Brain Freeze
	//Regex?
	//Bytes Loop?
}
thanks
Answer 1
Score: 6
Don't use regular expressions to try and interpret HTML. Use a fully capable HTML tokenizer and parser.
I recommend you read this article on CodingHorror.
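As a hedged illustration of the tokenizer approach, the sketch below walks the input token by token with the standard library's encoding/xml decoder in non-strict mode, using the settings that package documents for HTML-like input. This is a stand-in chosen to keep the example free of dependencies; a dedicated HTML parser such as golang.org/x/net/html is likely the better fit for real pages. The helper name extractParagraphs is hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// extractParagraphs (hypothetical helper) collects the text inside each
// outermost <p> element by tracking start and end tokens.
func extractParagraphs(input string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(input))
	// Settings suggested by the encoding/xml docs for HTML-like input.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var (
		out   []string
		buf   strings.Builder
		depth int // number of <p> elements currently open
	)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if strings.EqualFold(t.Name.Local, "p") {
				depth++
			}
		case xml.EndElement:
			if strings.EqualFold(t.Name.Local, "p") && depth > 0 {
				depth--
				if depth == 0 {
					out = append(out, strings.TrimSpace(buf.String()))
					buf.Reset()
				}
			}
		case xml.CharData:
			if depth > 0 {
				buf.Write(t) // keep text found inside a paragraph
			}
		}
	}
}

func main() {
	s := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	texts, err := extractParagraphs(s)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(strings.Join(texts, "\n"))
}
```

Because the decoder reports elements rather than raw characters, tags with attributes and text nested in child elements are handled without changing the matching logic.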
Answer 2
Score: 1
Here is a function that I use a lot.
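// GetInnerSubstring returns the part of str between the first prefix and
// the next suffix. It returns "" when prefix is not found, or when a
// non-empty prefix is found but no suffix follows it. An empty prefix means
// "from the start of str" and an empty suffix means "to the end of str".
func GetInnerSubstring(str string, prefix string, suffix string) string {
	var beginIndex, endIndex int
	beginIndex = strings.Index(str, prefix)
	if beginIndex == -1 {
		// prefix not found: return an empty string
		beginIndex = 0
		endIndex = 0
	} else if len(prefix) == 0 {
		// empty prefix: take everything up to the first suffix
		beginIndex = 0
		endIndex = strings.Index(str, suffix)
		if endIndex == -1 || len(suffix) == 0 {
			endIndex = len(str)
		}
	} else {
		// skip past the prefix, then look for the suffix in the remainder
		beginIndex += len(prefix)
		endIndex = strings.Index(str[beginIndex:], suffix)
		if endIndex == -1 {
			// no suffix after the prefix
			if strings.Index(str, suffix) < beginIndex {
				endIndex = beginIndex
			} else {
				endIndex = len(str)
			}
		} else {
			if len(suffix) == 0 {
				endIndex = len(str)
			} else {
				// convert the offset within the remainder to an index into str
				endIndex += beginIndex
			}
		}
	}
	return str[beginIndex:endIndex]
}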
You can try it at the playground, https://play.golang.org/p/Xo0SJu0Vq4.
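GetInnerSubstring returns only the first match. As a sketch of how the same index-based idea could collect every occurrence, the hypothetical helper allBetween below (not from the original answer) repeatedly cuts the input at the prefix and then at the suffix. It relies on strings.Cut, so it assumes Go 1.18 or later.

```go
package main

import (
	"fmt"
	"strings"
)

// allBetween (hypothetical helper) returns every substring that sits
// between prefix and the next suffix, scanning left to right.
func allBetween(s, prefix, suffix string) []string {
	var out []string
	if prefix == "" || suffix == "" {
		return out // empty delimiters are treated as "no match" in this sketch
	}
	for {
		// Drop everything up to and including the next prefix.
		_, rest, found := strings.Cut(s, prefix)
		if !found {
			return out
		}
		// Split what follows at the next suffix.
		inner, after, found := strings.Cut(rest, suffix)
		if !found {
			return out // unterminated prefix: stop without a partial match
		}
		out = append(out, inner)
		s = after
	}
}

func main() {
	s := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	for _, part := range allBetween(s, "<p>", "</p>") {
		fmt.Println(strings.TrimSpace(part))
	}
}
```

The helper keeps surrounding whitespace in each result, and the caller trims it, so the function stays usable for delimiters where whitespace matters.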
Answer 3
Score: 0
StrExtract retrieves a string between two delimiters.
> StrExtract(sExper, sAdelim, sCdelim, nOccur)
>
> sExper: Specifies the expression to search.
>
> sAdelim: Specifies the string that delimits the beginning of the text to extract.
>
> sCdelim: Specifies the string that delimits the end of the text to extract.
>
> nOccur: Specifies at which occurrence of sAdelim in sExper (counting from 1) to start
> the extraction.
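package main

import (
	"fmt"
	"strings"
)

func main() {
	s := "a11ba22ba333ba4444ba55555ba666666b"
	// Extract the text after the 5th "a", up to the next "b".
	fmt.Println("StrExtract1: ", StrExtract(s, "a", "b", 5))
	// Output: StrExtract1:  55555
}

// StrExtract returns the text between the nOccur-th occurrence of sAdelim
// and the next sCdelim, or "" if either delimiter is missing.
func StrExtract(sExper, sAdelim, sCdelim string, nOccur int) string {
	// Element n of the split is the text that follows the n-th sAdelim.
	aExper := strings.Split(sExper, sAdelim)
	if nOccur < 0 || len(aExper) <= nOccur {
		return ""
	}
	sMember := aExper[nOccur]
	aExper = strings.Split(sMember, sCdelim)
	if len(aExper) == 1 {
		// no closing delimiter after this occurrence
		return ""
	}
	return aExper[0]
}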
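To connect this to the question's input, here is a small usage sketch that calls StrExtract with increasing occurrence numbers until it returns an empty string. StrExtract is reproduced with the same logic as in the answer so that the example runs on its own; the loop and its stopping rule are assumptions for illustration, and an empty paragraph would end the loop early.

```go
package main

import (
	"fmt"
	"strings"
)

// StrExtract follows the logic of the answer's function, repeated here
// only so that this example is self-contained.
func StrExtract(sExper, sAdelim, sCdelim string, nOccur int) string {
	aExper := strings.Split(sExper, sAdelim)
	if len(aExper) <= nOccur {
		return ""
	}
	sMember := aExper[nOccur]
	aExper = strings.Split(sMember, sCdelim)
	if len(aExper) == 1 {
		return ""
	}
	return aExper[0]
}

func main() {
	s := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	// Occurrences are counted from 1; stop at the first empty result.
	for n := 1; ; n++ {
		part := StrExtract(s, "<p>", "</p>", n)
		if part == "" {
			break
		}
		fmt.Println(strings.TrimSpace(part))
	}
}
```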