2016-05-01 19:18:05 · go

Regexp to find images in html (golang)

Question

I'm parsing an XML RSS feed from a couple of different sources and I want to find the images in the HTML.

I did some research and found a regex that I think might work:

/<img[^>]+src="?([^"\s]+)"?\s*\/>/g

but I'm having trouble using it in Go. I get errors because I don't know how to search with that expression.

I tried using it as a string, but it doesn't escape properly with either single or double quotes. I also tried using it bare, just as written, and that gives me an error too.

Any ideas?

Answer 1

Score: 8

Using a proper HTML parser is always the better way to parse HTML, but a cheap, hackish regex can also work fine. Here's an example:

import "regexp"

// imgRE captures the src value of an <img> tag quoted with either " or '.
var imgRE = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)

// If your img tags are properly formed with double quotes, use this instead; it's more efficient.
// var imgRE = regexp.MustCompile(`<img[^>]+\bsrc="([^"]+)"`)

// findImages returns the src of every <img> tag found in htm.
func findImages(htm string) []string {
	imgs := imgRE.FindAllStringSubmatch(htm, -1)
	out := make([]string, len(imgs))
	for i := range out {
		out[i] = imgs[i][1]
	}
	return out
}
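
As background on why the attempt in the question fails: the `/.../g` wrapper is JavaScript/Perl literal syntax, not part of the pattern. In Go the pattern is passed as an ordinary string (a backtick raw string avoids escaping quotes and backslashes), and the role of the `g` flag is played by calling a `FindAll...` method with `-1`. The sketch below compiles the question's pattern that way; `questionRE`, `extractSrcs`, and the sample input are made-up names and data for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
)

// questionRE is the pattern from the question with the /.../g wrapper removed.
// Inside a raw string, \s and \/ reach the regexp package unchanged.
var questionRE = regexp.MustCompile(`<img[^>]+src="?([^"\s]+)"?\s*\/>`)

// extractSrcs returns the first capture group of every match of re in html.
// Passing -1 asks for all matches, which is what the g flag does elsewhere.
func extractSrcs(re *regexp.Regexp, html string) []string {
	var out []string
	for _, m := range re.FindAllStringSubmatch(html, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	// Hypothetical input: one self-closing tag and one plain tag.
	sample := `<p><img class="x" src="a.jpg" /><img src="b.png"></p>`
	fmt.Println(extractSrcs(questionRE, sample))
}
```

Because the question's pattern ends in `\/>`, it only matches self-closing tags, so the second tag in the sample is skipped. The pattern in this answer drops that tail and therefore matches both forms.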

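
For the parser route mentioned at the start of this answer, here is a minimal sketch. A dedicated HTML parser such as the external `golang.org/x/net/html` package is the usual choice; to stay within the standard library, this sketch instead assumes `encoding/xml` in non-strict mode is good enough for the markup in your feeds, which may not hold for badly broken HTML. `imageSrcs` is a hypothetical helper name.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// imageSrcs walks the token stream and collects the src attribute of every
// <img> element. Non-strict mode with HTMLAutoClose and HTMLEntity lets
// encoding/xml cope with typical HTML such as unclosed <img> tags.
func imageSrcs(htm string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(htm))
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var out []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			// Return whatever was collected before the parse error.
			return out, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(start.Name.Local, "img") {
			continue
		}
		for _, attr := range start.Attr {
			if strings.EqualFold(attr.Name.Local, "src") {
				out = append(out, attr.Value)
			}
		}
	}
}

func main() {
	// Hypothetical input mixing an unclosed and a self-closing img tag.
	srcs, err := imageSrcs(`<p>a <img src="one.jpg" alt="x"> b <img src='two.png'/></p>`)
	fmt.Println(srcs, err)
}
```

Unlike the regex, this reads attributes by name, so attribute order and quoting style do not matter.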

Answer 2

Score: -3

Sorry, I haven't worked with Go before, but this seems to work. I tried it at

https://tour.golang.org/welcome/1

package main

import (
	"fmt"
	"regexp"
)

func main() {
	myString := `<img src='img1single.jpg'><img src="img2double.jpg">`
	myRegex := regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)
	// Each match is a slice: index 0 is the full matched text, index 1 the captured src.
	for _, m := range myRegex.FindAllStringSubmatch(myString, -1) {
		fmt.Println(m[1])
	}
}

I suggest using HtmlAgilityPack (a .NET library) to parse anything DOM- or XML-like.

Read the document with:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sourceHtml);

Then query it with XPath. A regex is fine too, but issues with groups and the like make the job more complex:

doc.DocumentNode.SelectSingleNode(XPath here)

or

doc.DocumentNode.SelectNodes("//img")  // this should give all img tags

I suggest this because it seems the RSS feed serves some HTML content. So get the XML, parse it with XMLDoc to get the HTML content you need, then get all the images from that HTML.
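
Since the question is about Go, the same workflow (get the XML, pull out the HTML content, then extract the images) might look like the sketch below. It assumes an RSS 2.0 layout where each `item` carries its HTML in `description`; real feeds may use other elements. The names `rssFeed`, `rssItem`, `srcRE`, `imagesByItem`, and the sample feed are all made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"regexp"
)

// rssFeed models only the parts of an RSS 2.0 document used here.
type rssFeed struct {
	Items []rssItem `xml:"channel>item"`
}

type rssItem struct {
	Title       string `xml:"title"`
	Description string `xml:"description"`
}

// srcRE is the same pattern used in the Go snippet earlier in this answer.
var srcRE = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)

// Hypothetical feed: one description in CDATA, one entity-escaped.
const sampleFeed = `<rss version="2.0"><channel>
<item><title>First</title><description><![CDATA[<p><img src="https://example.com/a.jpg"></p>]]></description></item>
<item><title>Second</title><description>&lt;img alt='' src='https://example.com/b.png'&gt;</description></item>
</channel></rss>`

// imagesByItem maps each item title to the image URLs in its description.
// xml.Unmarshal has already undone CDATA and entity escaping, so the
// description field holds plain HTML by the time the regex runs.
func imagesByItem(feed []byte) (map[string][]string, error) {
	var doc rssFeed
	if err := xml.Unmarshal(feed, &doc); err != nil {
		return nil, err
	}
	out := make(map[string][]string, len(doc.Items))
	for _, item := range doc.Items {
		for _, m := range srcRE.FindAllStringSubmatch(item.Description, -1) {
			out[item.Title] = append(out[item.Title], m[1])
		}
	}
	return out, nil
}

func main() {
	images, err := imagesByItem([]byte(sampleFeed))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(images)
}
```

Doing the XML step first means the regex only ever sees the item's HTML, not the surrounding feed markup.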
For an open-ended answer: after the comment, I think you just need a regex. My pattern is

<img.+?src=[\"'](.+?)[\"'].*?>

For the input

<img src='img1single.jpg'>
<img src="img2double.jpg">

the result seems fine. In .NET, you get the value in a foreach via

.Groups[1].Value

Regards.
