Regexp to find images in html (golang)

Question

I'm parsing an XML RSS feed from a couple of different sources, and I want to find the images in the HTML.

I did some research and found a regex that I think might work:

/<img[^>]+src="?([^"\s]+)"?\s*\/>/g

but I have trouble using it in Go. It gives me errors because I don't know how to make it search with that expression.

I tried using it as a string, but it doesn't escape properly with either single or double quotes. I also tried using it bare, just like that, and it gives me an error.

Any ideas?

Answer 1

Score: 8

Using a proper HTML parser is always better for parsing HTML; however, a cheap, hackish regex can also work fine. Here's an example:

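import "regexp"

// imgRE captures the src value of an <img> tag quoted with either ' or ".
var imgRE = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)

// If your img tags are always well formed with double quotes, use this instead; it's more efficient.
// var imgRE = regexp.MustCompile(`<img[^>]+\bsrc="([^"]+)"`)

// findImages returns the src value of every <img> tag found in htm.
func findImages(htm string) []string {
	imgs := imgRE.FindAllStringSubmatch(htm, -1)
	out := make([]string, len(imgs))
	for i := range out {
		out[i] = imgs[i][1] // index 1 is the first capture group
	}
	return out
}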
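
As a side note on the escaping trouble described in the question: the `/.../g` wrapper is JavaScript regex-literal syntax, not part of the pattern, and Go's regexp package takes the pattern as a plain string. A minimal sketch, assuming the pattern is written as a raw string literal (backticks) so the double quotes and backslashes need no extra escaping. The names `questionRE` and `firstGroups` are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// questionRE is the pattern from the question with the JavaScript
// delimiters (/.../) and the g flag removed. Inside a raw string
// literal the quotes and backslashes are taken literally.
var questionRE = regexp.MustCompile(`<img[^>]+src="?([^"\s]+)"?\s*/>`)

// firstGroups returns capture group 1 of every match of re in s.
// Passing -1 to FindAllStringSubmatch plays the role of the g flag.
func firstGroups(re *regexp.Regexp, s string) []string {
	var out []string
	for _, m := range re.FindAllStringSubmatch(s, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	page := `<p><img src="a.jpg" /> and <img src="b.png"></p>`
	fmt.Println(firstGroups(questionRE, page))
}
```

Run as written, this prints `[a.jpg]`: the question's pattern insists on a closing `/>`, so the second tag, which is not self-closing, is skipped. The pattern in this answer does not have that requirement.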

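
The answer recommends a proper parser but does not show one. The following is only a sketch of that idea, not the answerer's code: it assumes the standard library's encoding/xml decoder in its documented non-strict, HTML-tolerant configuration as a stand-in, and the function name `imgSrcs` is hypothetical. A dedicated HTML parser such as golang.org/x/net/html would likely cope better with badly broken markup.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// imgSrcs walks the token stream of htm and collects the src attribute
// of every img element. It is an illustrative helper, not a full HTML parser.
func imgSrcs(htm string) []string {
	d := xml.NewDecoder(strings.NewReader(htm))
	// Settings the encoding/xml documentation suggests for HTML-like input.
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var out []string
	for {
		tok, err := d.Token()
		if err != nil {
			// io.EOF at the end of input, or markup the decoder cannot
			// handle: stop and return what was collected so far.
			return out
		}
		el, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(el.Name.Local, "img") {
			continue
		}
		for _, a := range el.Attr {
			if strings.EqualFold(a.Name.Local, "src") {
				out = append(out, a.Value)
			}
		}
	}
}

func main() {
	page := `<p>text <img src='a.jpg'><img src="b.png" alt="x"></p>`
	for _, src := range imgSrcs(page) {
		fmt.Println(src)
	}
}
```

The trade-off is that a tokenizer sees attributes as structured data, so quoting style and attribute order stop mattering, at the cost of more code than a one-line regex.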

Answer 2

Score: -3

Sorry, I haven't worked with Go before, but this seems to work. I tried it at

https://tour.golang.org/welcome/1

package main

import (
	"fmt"
	"regexp"
)

func main() {
	myString := `<img src='img1single.jpg'><img src="img2double.jpg">`
	myRegex := regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)
	imgTags := myRegex.FindAllStringSubmatch(myString, -1)
	// Print capture group 1 (the src value) of each match.
	for _, tag := range imgTags {
		fmt.Println(tag[1])
	}
}

I suggest using HtmlAgilityPack (a .NET library) to parse any DOM/XML-like content.

Read the document with:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sourceHtml);

Then query it with an XPath expression. Regex is fine too, but capture-group issues and the like make the job more complex.

doc.DocumentNode.SelectSingleNode(XPath here)

or

doc.DocumentNode.SelectNodes("//img")  // this should give all img tags

I suggest this because it seems the RSS feed serves some HTML content.
So get the XML, parse it with XMLDoc to extract the HTML content you need, then get all the images this way.
That is the general answer.

After the comment, I think you just need a regex; my pattern is:

<img.+?src=[\"'](.+?)[\"'].*?>

For the input:

<img src='img1single.jpg'>
<img src="img2double.jpg">

the result seems fine. In .NET, you read each match's value in a foreach loop via:

.Groups[1].Value
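
Since the question is about Go, here is a hedged sketch of how the same lazy pattern might be used there. It assumes Go's regexp package, where index 1 of each submatch slice plays the role of `Groups[1].Value`; the names `lazyImgRE` and `srcValues` are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// lazyImgRE is the pattern from this answer written as a Go raw string,
// so the quote characters need no backslash escapes.
var lazyImgRE = regexp.MustCompile(`<img.+?src=["'](.+?)["'].*?>`)

// srcValues returns capture group 1 (the src value) for every match in s.
func srcValues(s string) []string {
	var out []string
	for _, m := range lazyImgRE.FindAllStringSubmatch(s, -1) {
		out = append(out, m[1]) // m[1] corresponds to Groups[1].Value in .NET
	}
	return out
}

func main() {
	input := "<img src='img1single.jpg'>\n<img src=\"img2double.jpg\">"
	for _, src := range srcValues(input) {
		fmt.Println(src)
	}
}
```

One caveat of this pattern: the closing `["']` does not have to be the same character as the opening quote, so a src value that itself contains the other quote character would be cut short.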

Regards.

Posted on May 1, 2016 at 19:18:05