How to extract linked images from an HTML page with PHP/regexp

Question

I'm looking for some PHP code or a regular expression (I'm not that skilled with regexps) to extract just the linked images from an HTML file. In other words, just the chunks of HTML that look like:

<a href=...><img src=...></a>

I know how to extract images and links separately:

$links = $dom->getElementsByTagName('a');
$images = $dom->getElementsByTagName('img');

but not how to extract the two tags when one is nested inside the other. I haven't found anything by googling either. So is what I want to do uncommon, or very difficult?

Could you help me? Thanks.

Answer 1

Score: 1

You could use the following XPath query:

//a[./img]

which means any <a> element that has an <img> as its direct child.

Using PHP's DOM API, it would look like this:

$domDocument = new \DOMDocument();
$domDocument->loadHTML($html);

$xpath = new \DOMXPath($domDocument);
$imageLinks = $xpath->query('//a[./img]');

Demo: https://3v4l.org/GXAbC

If the image can be further down the DOM tree, you can change the XPath query to this:

//a[.//img]
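
As a minimal sketch of how the result of that query might be consumed, the snippet below loops over the matched <a> elements and reads each link's href together with the src of the first image inside it. The sample markup in $html and the $found array are made up for illustration, and the call to libxml_use_internal_errors(true) is an optional addition (not part of the answer above) that silences the warnings loadHTML() tends to raise on imperfect real-world markup.

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<html><body>'
      . '<a href="/page1"><img src="one.png"></a>'
      . '<a href="/page2">plain text link</a>'
      . '<a href="/page3"><span><img src="three.png"></span></a>'
      . '</body></html>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$domDocument = new DOMDocument();
$domDocument->loadHTML($html);

$xpath = new DOMXPath($domDocument);

$found = [];
// .//img matches an <img> at any depth below the <a>.
foreach ($xpath->query('//a[.//img]') as $link) {
    // First <img> inside this link, evaluated relative to $link.
    $img = $xpath->query('.//img', $link)->item(0);
    $found[] = [
        'href' => $link->getAttribute('href'),
        'src'  => $img->getAttribute('src'),
        // Serialize the whole <a>...</a> chunk back to HTML.
        'html' => $domDocument->saveHTML($link),
    ];
}

print_r($found);
```

With this sample input, the plain text link is skipped and the two image links are kept, including the one whose image is wrapped in a <span>.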

Answer 2

Score: 0

A solution without XPath could be:

$links = $domDocument->getElementsByTagName('a');
foreach ($links as $link) {
    $img = $link->getElementsByTagName('img');
    // get the first element of the DOMNodeList
    print_r($img->item(0));
}
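
Note that this loop visits every link, including links with no image inside, for which item(0) is null. One possible variation, sketched here under the assumption that only links containing an image are wanted, checks the length of the node list first. The helper name linkedImages and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: returns the <a> elements that contain at least one <img>.
function linkedImages(DOMDocument $domDocument): array
{
    $result = [];
    foreach ($domDocument->getElementsByTagName('a') as $link) {
        // getElementsByTagName() searches all descendants, not only direct children.
        $images = $link->getElementsByTagName('img');
        if ($images->length > 0) {
            $result[] = $link;
        }
    }
    return $result;
}

// Made-up sample input, for illustration only.
libxml_use_internal_errors(true);
$domDocument = new DOMDocument();
$domDocument->loadHTML('<a href="/a"><img src="a.png"></a><a href="/b">text</a>');

foreach (linkedImages($domDocument) as $link) {
    $img = $link->getElementsByTagName('img')->item(0);
    echo $link->getAttribute('href'), ' -> ', $img->getAttribute('src'), PHP_EOL;
}
```

Because getElementsByTagName() looks at all descendants, this behaves like the //a[.//img] query rather than the stricter //a[./img].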

huangapple
  • Posted on January 3, 2020 at 19:46:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59578071.html