July 27, 2023 21:10:36

How to obtain all URLs from an external page?

Question

My goal is to obtain all URLs from [this page][1].

This is what I have tried:

```php
<?php
$url = 'https://www.coop.ch/de/aktionen/wochenaktionen/aktionen-fleisch-fisch/c/m_1380?q=' . urlencode(':relevance') . '&sort=specialOffers&pageSize=10000&page=1';

// Download the raw HTML as served, before any JavaScript runs.
$html = file_get_contents($url);

$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings about malformed markup
$dom->loadHTML($html);
libxml_clear_errors();

// Collect every anchor element in the static markup.
$links = $dom->getElementsByTagName('a');

foreach ($links as $index => $link) {
    $href = $link->getAttribute('href');
    echo "$href <br>";
}
```

That gives me some of the URLs, but not all of them, most likely because some of the content is loaded dynamically.

I also tried the JavaScript Fetch API, but that fails because of the CORS policy.

How do I get the URLs?

Update

The comment from [KIKO Software][2] resolved this issue. But I would also like to obtain the URLs from [this website][3]. This is more difficult, since I don't even see the URLs in the source code, probably because the website is based on Angular. Is there a way to handle this?

[1]: https://www.coop.ch/de/aktionen/wochenaktionen/aktionen-fleisch-fisch/c/m_1380?q=%3Arelevance&sort=specialOffers&pageSize=10000&page=1
[2]: https://stackoverflow.com/users/3986005/kiko-software
[3]: https://www.migros.ch/de/offers/home?context=instore&gad=1&gclid=Cj0KCQjw2eilBhCCARIsAG0Pf8tJLOLm9ncwTQe0b-h2aWV2GIr7iQeEg8cAO8GK6eCl5ggnyLxBnPUaAlKWEALw_wcB

Answer 1

Score: 1

I think the URLs you're interested in are located in a script element that is used as a data block and contains JSON-LD.

I agree that you would normally use DOMDocument to parse HTML, but here a simple string extraction may do the job.

Here is my attempt:

$url = 'https://www.coop.ch/de/aktionen/wochenaktionen/aktionen-fleisch-fisch/c/m_1380?q=' . urlencode(':relevance') . '&sort=specialOffers&pageSize=10000&page=1';
$html = file_get_contents($url);
$json = str_before('',
str_after('
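
As a rough illustration of the data-block idea described above (not the answerer's original code), the sketch below parses the page with DOMDocument, decodes each `<script type="application/ld+json">` block, and collects every string stored under a `url` key. The helper names `extractJsonLd` and `collectUrls`, the sample HTML, and the assumption that the links sit under `url` keys are all hypothetical; the JSON-LD layout of the real page would need to be checked first.

```php
<?php
// Hypothetical helper: decode every JSON-LD data block found in an HTML string.
function extractJsonLd(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings from real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $blocks = [];
    foreach ($dom->getElementsByTagName('script') as $script) {
        // Only script elements declared as JSON-LD are data blocks of interest.
        if (strtolower($script->getAttribute('type')) !== 'application/ld+json') {
            continue;
        }
        $data = json_decode($script->textContent, true);
        if (is_array($data)) {
            $blocks[] = $data;
        }
    }
    return $blocks;
}

// Hypothetical helper: collect every string stored under a "url" key, at any depth.
function collectUrls(array $data): array
{
    $urls = [];
    array_walk_recursive($data, function ($value, $key) use (&$urls) {
        if ($key === 'url' && is_string($value)) {
            $urls[] = $value;
        }
    });
    return array_values(array_unique($urls));
}

// Made-up sample standing in for the downloaded page.
$sample = <<<'HTML'
<html><head>
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"@type": "ListItem", "url": "https://example.com/p/1"},
  {"@type": "ListItem", "url": "https://example.com/p/2"}
]}
</script>
</head><body><a href="/static">static link</a></body></html>
HTML;

foreach (extractJsonLd($sample) as $block) {
    foreach (collectUrls($block) as $url) {
        echo $url, PHP_EOL;
    }
}
```

With a real page, `$sample` would be replaced by the string returned from `file_get_contents($url)`. Parsing the script element with DOMDocument instead of cutting the string by hand avoids depending on the exact spelling of the opening tag, at the cost of a full HTML parse.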
