March 21, 2014 03:50:03 · go

Go - Getting the text of a single particular HTML element from a document with a known structure

Question

In a little script I'm writing, I make a POST to a web service and receive an HTML document in response. This document is largely irrelevant to my needs, with the exception of the contents of a single textarea. This textarea is the only textarea in the page and it has a particular name that I know ahead of time. I want to grab that text without worrying about anything else in the document. Currently I'm using regex to get the correct line and then to delete the tags, but I feel like there's probably a better way.

Here's what the document looks like:

<html><body>
<form name="query" action="http://www.example.net/action.php" method="post">
	<textarea type="text" name="nameiknow"/>The text I want</textarea>
	<div id="button">
		<input type="submit" value="Submit" />
	</div>
</form>
</body></html>

And here's how I'm currently getting the text:

s := string(body)

// Gets the line I want
r, _ := regexp.Compile("<textarea.*name=(\"|')nameiknow(\"|').*textarea>")
s = r.FindString(s)

// Deletes the tags
r, _ = regexp.Compile("<[^>]*>")
s = r.ReplaceAllString(s, "")

I think using a full HTML parser might be a bit too much in this case, which is why I went in this direction, though for all I know there's something much better out there.

I appreciate any advice you may have.

Answer 1

Score: 4

Take a look at this package: https://github.com/PuerkitoBio/goquery. It's like jQuery but for Go. It allows you to do things like

text := doc.Find("strong").Text()

Full working example:

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

var s = `<html><body>
<form name="query" action="http://www.example.net/action.php" method="post">
    <textarea type="text" name="nameiknow">The text I want</textarea>
    <div id="button">
        <input type="submit" value="Submit" />
    </div>
</form>
</body></html>`

func main() {
	// Parse the HTML; NewDocumentFromReader accepts any io.Reader.
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(s))
	if err != nil {
		log.Fatal(err)
	}
	// Text returns the combined text of the matched elements.
	text := doc.Find("textarea").Text()
	fmt.Println(text)
}

Prints: "The text I want".

Answer 2

Score: 2

Parsing HTML with regex is not best practice, but as you asked, here it is:

(<textarea\b[^>]*\bname\s*=\s*(?:\"|')\s*nameiknow\s*(?:\"|')[^<]*<\/textarea>)
