Regexp to find images in html (golang)
Question
I'm parsing XML RSS feeds from a couple of different sources, and I want to find the images in the HTML.
I did some research and found a regex that I think might work:
/<img[^>]+src="?([^"\s]+)"?\s*\/>/g
but I'm having trouble using it in Go. I get errors because I don't know how to search with that expression.
I tried using it as a string, but it doesn't escape properly with either single or double quotes. I also tried using it bare, exactly as written, and that gives me an error too.
Any ideas?
Answer 1
Score: 8
Using a proper HTML parser is always the better way to parse HTML, but a cheap, hackish regex can also work fine. Here's an example:
var imgRE = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)

// If your img tags are well formed and always use double quotes, use this
// instead; it's more efficient.
// var imgRE = regexp.MustCompile(`<img[^>]+\bsrc="([^"]+)"`)

// findImages returns the src value of every img tag matched in htm.
func findImages(htm string) []string {
	imgs := imgRE.FindAllStringSubmatch(htm, -1)
	out := make([]string, len(imgs))
	for i := range out {
		out[i] = imgs[i][1] // index 0 is the whole match, index 1 the captured src
	}
	return out
}
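As for why the pattern from the question fails: the surrounding `/.../g` is JavaScript regex-literal syntax, which Go does not have. In Go the pattern is passed as a string, a backtick raw string avoids escaping the quotes and backslashes, and passing -1 to the FindAll functions plays the role of the g flag. Below is a minimal sketch of the question's pattern ported that way; the names `questionRE` and `srcsFromQuestionPattern` and the sample input are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// questionRE is the pattern from the question with the /.../g wrapper removed.
// The raw string literal (backticks) keeps the double quotes and \s unescaped.
// The name is hypothetical.
var questionRE = regexp.MustCompile(`<img[^>]+src="?([^"\s]+)"?\s*/>`)

// srcsFromQuestionPattern returns the captured src values. Passing -1 asks for
// every match, which is what the g flag did in the original expression.
func srcsFromQuestionPattern(htm string) []string {
	var out []string
	for _, m := range questionRE.FindAllStringSubmatch(htm, -1) {
		out = append(out, m[1]) // m[0] is the whole match, m[1] the src
	}
	return out
}

func main() {
	// Made-up sample: two self-closing tags and one that is not self-closing.
	sample := `<img src="a.jpg" /><img class="x" src="b.png"/><img src="c.gif">`
	fmt.Println(srcsFromQuestionPattern(sample))
}
```

With this sample the program prints [a.jpg b.png]. The third tag is skipped because the question's pattern insists on a closing `/>`, which is one reason the looser pattern above may suit feed content better.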
Answer 2
Score: -3
Sorry, I haven't worked with Go before, but this seems to work.
I tried it at
https://tour.golang.org/welcome/1
package main

import (
	"fmt"
	"regexp"
)

func main() {
	var myString = `<img src='img1single.jpg'><img src="img2double.jpg">`
	var myRegex = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)
	var imgTags = myRegex.FindAllStringSubmatch(myString, -1)
	// Each match holds the full text at index 0 and the captured src at index 1.
	for _, tag := range imgTags {
		fmt.Println(tag[1])
	}
}
I suggest using HtmlAgilityPack (a .NET library) to parse any DOM/XML kind of content.
Load the document with:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sourceHtml);
Then query it with XPath. Regex works too, but capture groups and similar issues make the job more complex:
doc.DocumentNode.SelectSingleNode(XPath here)
or
doc.DocumentNode.SelectNodes("//img") // this should give all img tags
I suggest this because the RSS feed seems to serve some HTML content.
So get the XML,
parse it with XMLDoc to get the HTML content you need,
then get all the images this way.
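Since the question is about Go, the same flow (decode the XML, pull out the HTML, then collect the images) can be sketched with the standard library instead. This is only a rough sketch under assumptions: it assumes an RSS 2.0 layout with the HTML inside each item's description element (some feeds put it elsewhere), and the `rss` struct, `imagesFromFeed` function, and sample feed are made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"regexp"
)

// rss mirrors only the parts of an RSS 2.0 document used here (assumed layout).
type rss struct {
	Items []struct {
		Description string `xml:"description"`
	} `xml:"channel>item"`
}

// imgRE is the pattern from the Go snippet above.
var imgRE = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)

// imagesFromFeed decodes the feed and collects the img src values found in
// each item's description. encoding/xml unwraps CDATA and entity-escaped HTML.
func imagesFromFeed(feed []byte) ([]string, error) {
	var doc rss
	if err := xml.Unmarshal(feed, &doc); err != nil {
		return nil, err
	}
	var out []string
	for _, it := range doc.Items {
		for _, m := range imgRE.FindAllStringSubmatch(it.Description, -1) {
			out = append(out, m[1])
		}
	}
	return out, nil
}

func main() {
	// Made-up feed with one item whose description carries HTML in CDATA.
	feed := []byte(`<rss version="2.0"><channel><item>
<description><![CDATA[<p><img src="a.jpg"> and <img src='b.png'></p>]]></description>
</item></channel></rss>`)
	imgs, err := imagesFromFeed(feed)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(imgs)
}
```

Decoding the XML first means the regex only ever sees the item's HTML, already unescaped, rather than the raw feed text.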
To complete the answer after the comment: I think you just need a regex.
My pattern is:
<img.+?src=[\"'](.+?)[\"'].*?>
For the input:
<img src='img1single.jpg'>
<img src="img2double.jpg">
the result looks fine.
In .NET, you get each value in a foreach loop via:
.Groups[1].Value
Regards.