2014-01-08 23:48:36 · go

Extract text content from HTML in Golang

Question

What's the best way to extract inner substrings from a string in Go?

input:

"Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

output:

"this is paragraph \n
 this is paragraph 2"

Is there a string package or library for Go that already does something like this?

package main

import (
	"fmt"
	"strings"
)

func main() {
	longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

	newString := getInnerStrings("<p>", "</p>", longString)

	fmt.Println(newString)
	//output: this is paragraph \n
	//        this is paragraph 2

}
func getInnerStrings(start, end, str string) string {
	//Brain Freeze
	//Regex?
	//Bytes Loop?
}

thanks

One way to fill in getInnerStrings is a regular expression from the regexp package. It treats the input as plain text rather than HTML, so it only suits simple, well-formed input like the example above:

package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	longString := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

	newString := getInnerStrings("<p>", "</p>", longString)

	fmt.Println(newString)
	//output: this is paragraph
	//        this is paragraph 2
}

// getInnerStrings returns every substring found between start and end,
// trimmed of surrounding whitespace and joined with newlines.
func getInnerStrings(start, end, str string) string {
	// QuoteMeta escapes the delimiters so they match literally; (?s) lets "."
	// span newlines, and ".*?" keeps each match as short as possible.
	re := regexp.MustCompile("(?s)" + regexp.QuoteMeta(start) + "(.*?)" + regexp.QuoteMeta(end))
	matches := re.FindAllStringSubmatch(str, -1)

	inner := make([]string, 0, len(matches))
	for _, match := range matches {
		inner = append(inner, strings.TrimSpace(match[1]))
	}

	return strings.Join(inner, "\n")
}

This code matches the content between <p> and </p> with a regular expression and joins the matches into a single string.

Answer 1

Score: 6

Don't use regular expressions to try to interpret HTML. Use a fully capable HTML tokenizer and parser.

I recommend reading this article on CodingHorror.
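
As a rough illustration of the tokenizer approach recommended here: the usual choice in Go is the golang.org/x/net/html package. To stay within the standard library, the sketch below swaps in encoding/xml in non-strict mode with its HTML auto-close and entity tables. That stand-in is an assumption of this example, not something the answer names, and it is more fragile on real-world HTML than a dedicated HTML parser. The function name innerText is hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// innerText (hypothetical helper) returns the trimmed text found inside
// every <tag> element of src, walking the input token by token.
func innerText(src, tag string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	// Non-strict settings documented by encoding/xml for HTML-like input.
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var out []string
	var buf strings.Builder
	depth := 0 // how many <tag> elements are currently open
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == tag {
				depth++
			}
		case xml.EndElement:
			if t.Name.Local == tag && depth > 0 {
				depth--
				if depth == 0 {
					out = append(out, strings.TrimSpace(buf.String()))
					buf.Reset()
				}
			}
		case xml.CharData:
			// Only keep text that sits inside an open <tag>.
			if depth > 0 {
				buf.Write(t)
			}
		}
	}
}

func main() {
	longString := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

	paragraphs, err := innerText(longString, "p")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(strings.Join(paragraphs, "\n"))
}
```

Unlike a regular expression, a token walk of this kind keeps track of element nesting, which is the main reason to prefer a parser once the input stops being trivially simple.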

Answer 2

Score: 1

Here is a function I use a lot.

// GetInnerSubstring returns the text between the first occurrence of prefix
// and the first occurrence of suffix after it. An empty prefix starts at the
// beginning of str, and an empty suffix runs to the end of str. If prefix is
// not found, it returns "". Requires the "strings" package.
func GetInnerSubstring(str string, prefix string, suffix string) string {
	var beginIndex, endIndex int
	beginIndex = strings.Index(str, prefix)
	if beginIndex == -1 {
		beginIndex = 0
		endIndex = 0
	} else if len(prefix) == 0 {
		beginIndex = 0
		endIndex = strings.Index(str, suffix)
		if endIndex == -1 || len(suffix) == 0 {
			endIndex = len(str)
		}
	} else {
		beginIndex += len(prefix)
		endIndex = strings.Index(str[beginIndex:], suffix)
		if endIndex == -1 {
			if strings.Index(str, suffix) < beginIndex {
				endIndex = beginIndex
			} else {
				endIndex = len(str)
			}
		} else {
			if len(suffix) == 0 {
				endIndex = len(str)
			} else {
				endIndex += beginIndex
			}
		}
	}

	return str[beginIndex:endIndex]
}

You can try it at the playground: https://play.golang.org/p/Xo0SJu0Vq4.
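
GetInnerSubstring returns only the first match, while the question asks for every paragraph. A minimal sketch of the "loop over the string" idea, assuming Go 1.18 or later for strings.Cut, might look like this. The name allBetween and the choice to trim whitespace are assumptions made for this example, not part of the answer.

```go
package main

import (
	"fmt"
	"strings"
)

// allBetween (hypothetical helper) collects every substring of s that lies
// between start and end, scanning left to right without regular expressions.
func allBetween(s, start, end string) []string {
	// Empty delimiters would match everywhere, so they are rejected here.
	if start == "" || end == "" {
		return nil
	}
	var out []string
	for {
		// Skip everything up to and including the next start delimiter.
		_, rest, ok := strings.Cut(s, start)
		if !ok {
			return out
		}
		// Take the text up to the next end delimiter.
		inner, after, ok := strings.Cut(rest, end)
		if !ok {
			return out
		}
		out = append(out, strings.TrimSpace(inner))
		s = after // continue scanning after the end delimiter
	}
}

func main() {
	longString := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	fmt.Println(strings.Join(allBetween(longString, "<p>", "</p>"), "\n"))
	// prints:
	// this is paragraph
	// this is paragraph 2
}
```

Like the function above, this treats the input as plain text, so it carries the same caveats for real HTML as any other string-matching approach.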

Answer 3

Score: 0

StrExtract retrieves a string between two delimiters.

> StrExtract(sExper, sAdelim, sCdelim, nOccur)
>
> sExper: Specifies the expression to search.
>
> sAdelim: Specifies the string that delimits the beginning of the text to extract.
>
> sCdelim: Specifies the string that delimits the end of the text to extract.
>
> nOccur: Specifies at which occurrence of sAdelim in sExper to start
> the extraction.

Go Play

package main

import (
	"fmt"
	"strings"
)

func main() {
	s := "a11ba22ba333ba4444ba55555ba666666b"
	fmt.Println("StrExtract1: ", StrExtract(s, "a", "b", 5))
	// prints "StrExtract1:  55555"
}

// StrExtract returns the text that follows the nOccur-th occurrence of
// sAdelim in sExper, up to the next sCdelim. It returns "" if there is no
// such occurrence or no closing delimiter after it.
func StrExtract(sExper, sAdelim, sCdelim string, nOccur int) string {

	aExper := strings.Split(sExper, sAdelim)

	// A negative nOccur would otherwise panic on the index below.
	if nOccur < 0 || len(aExper) <= nOccur {
		return ""
	}

	sMember := aExper[nOccur]
	aExper = strings.Split(sMember, sCdelim)

	if len(aExper) == 1 {
		return ""
	}

	return aExper[0]
}
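
To relate nOccur back to the question, the sketch below calls StrExtract once per occurrence of the opening delimiter and gathers the results. StrExtract is adapted from the answer above, with a guard against a negative nOccur. The helper extractAll is hypothetical, and it assumes non-empty delimiters and that empty matches can be skipped, since StrExtract returns "" both for an empty match and for a missing closing delimiter.

```go
package main

import (
	"fmt"
	"strings"
)

// StrExtract is adapted from the answer above.
func StrExtract(sExper, sAdelim, sCdelim string, nOccur int) string {
	aExper := strings.Split(sExper, sAdelim)
	if nOccur < 0 || len(aExper) <= nOccur {
		return ""
	}
	aExper = strings.Split(aExper[nOccur], sCdelim)
	if len(aExper) == 1 {
		return ""
	}
	return aExper[0]
}

// extractAll (hypothetical helper) tries every occurrence of start, from the
// first to the last, and keeps the non-empty, whitespace-trimmed results.
func extractAll(s, start, end string) []string {
	var out []string
	n := strings.Count(s, start)
	for i := 1; i <= n; i++ {
		if inner := strings.TrimSpace(StrExtract(s, start, end, i)); inner != "" {
			out = append(out, inner)
		}
	}
	return out
}

func main() {
	longString := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	fmt.Println(strings.Join(extractAll(longString, "<p>", "</p>"), "\n"))
	// prints:
	// this is paragraph
	// this is paragraph 2
}
```

Each call to StrExtract splits the whole input again, so this is convenient for short strings rather than efficient for long ones.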
