Regexp to find images in html (golang)
Question
I'm parsing XML RSS feeds from a couple of different sources, and I want to find the images in the HTML they contain.
I did some research and found a regex that I think might work:
/<img[^>]+src="?([^"\s]+)"?\s*\/>/g
but I'm having trouble using it in Go. I get errors because I don't know how to search with that expression.
I tried passing it as a string, but it doesn't escape properly with either single or double quotes. I also tried using it bare, exactly as written above, and that gives an error too.
Any ideas?
Answer 1
Score: 8
Using a proper HTML parser is always the better way to parse HTML, but a cheap, hackish regex can also work fine. Here's an example:
var imgRE = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)

// If your img tags are well formed and always use double quotes, use this
// pattern instead, it's more efficient.
// var imgRE = regexp.MustCompile(`<img[^>]+\bsrc="([^"]+)"`)

// findImages returns the src value of every img tag matched in htm.
func findImages(htm string) []string {
	imgs := imgRE.FindAllStringSubmatch(htm, -1)
	out := make([]string, len(imgs))
	for i := range out {
		out[i] = imgs[i][1] // index 1 is the first capture group, the src value
	}
	return out
}
Answer 2
Score: -3
Sorry, I haven't worked with Go before, but this seems to work. I tried it at
https://tour.golang.org/welcome/1
package main

import (
	"fmt"
	"regexp"
)

func main() {
	var myString = `<img src='img1single.jpg'><img src="img2double.jpg">`
	var myRegex = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)
	var imgTags = myRegex.FindAllStringSubmatch(myString, -1)
	// Each match holds the whole matched text at index 0 and the src value at index 1.
	for _, tag := range imgTags {
		fmt.Println(tag[1])
	}
}
I suggest using HtmlAgilityPack (a .NET library) to parse any DOM/XML-like content.
Load the document with:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sourceHtml);
Then query it with XPath. A regex works too, but dealing with capture groups and the like makes the job more complex.
doc.DocumentNode.SelectSingleNode(XPath here)
or
doc.DocumentNode.SelectNodes("//img") // this should give all img tags
I suggest this because it seems the RSS feed serves some HTML content.
So get the XML, parse it with XMLDoc to extract the HTML content you need, then get all the images as shown above.
To round out the answer: after the comment, I think you just need a regex.
My pattern is
<img.+?src=[\"'](.+?)[\"'].*?>
For the input
<img src='img1single.jpg'>
<img src="img2double.jpg">
the result looks fine.
In .NET, you get each value in a foreach loop via
.Groups[1].Value
Regards.