How to obtain all URLs from an external page?

Question

My goal is to obtain all URLs from [this page][1].

This is what I have tried:

```php
<?php
$url = 'https://www.coop.ch/de/aktionen/wochenaktionen/aktionen-fleisch-fisch/c/m_1380?q=' . urlencode(':relevance') . '&sort=specialOffers&pageSize=10000&page=1';

// Download the raw HTML as the server delivers it (no JavaScript is executed).
$html = file_get_contents($url);

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // suppress warnings about malformed markup
$dom->loadHTML($html);
libxml_clear_errors();

// Print the href of every <a> element found in the static markup.
$links = $dom->getElementsByTagName('a');

foreach ($links as $index => $link) {
    $href = $link->getAttribute('href');
    echo "$href <br>";
}
```

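That gives me some of the URLs, but not all of them, most likely because some of the content is loaded dynamically.

I also tried the JavaScript Fetch API, but that does not work because of the CORS policy.

How do I get the URLs?

Update

The comment from [KIKO Software][2] resolved this issue. But I would also like to obtain the URLs from [this website][3]. That is more difficult, since I don't even see the URLs in the source code, probably because the website is based on Angular. Is there a way to handle this?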


[1]: https://www.coop.ch/de/aktionen/wochenaktionen/aktionen-fleisch-fisch/c/m_1380?q=%3Arelevance&sort=specialOffers&pageSize=10000&page=1
[2]: https://stackoverflow.com/users/3986005/kiko-software
[3]: https://www.migros.ch/de/offers/home?context=instore&gad=1&gclid=Cj0KCQjw2eilBhCCARIsAG0Pf8tJLOLm9ncwTQe0b-h2aWV2GIr7iQeEg8cAO8GK6eCl5ggnyLxBnPUaAlKWEALw_wcB

Answer 1

Score: 1

I think the URLs you're interested in are located in a script element that is used as a data block and contains JSON-LD.

Now I agree that you would normally use DOMDocument to parse HTML, but perhaps here a simple string extraction would do the job.

Here is my attempt:

$url = 'https://www.coop.ch/de/aktionen/wochenaktionen/aktionen-fleisch-fisch/c/m_1380?q=' . urlencode(':relevance') . '&sort=specialOffers&pageSize=10000&page=1';
$html = file_get_contents($url);
$json = str_before('',
str_after('
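
As a minimal sketch of the JSON-LD idea described in this answer, the following reads every `application/ld+json` data block from a page and collects the values stored under `url` keys. It uses DOMDocument rather than string extraction, so it is a swapped-in variant and not necessarily the answerer's code. The helper names `extractJsonLdBlocks` and `collectUrls`, the sample markup, and the assumption that the wanted links sit under `url` keys are all illustrative; the real page's JSON structure would need to be checked.

```php
<?php
// Sketch: collect every "url" value from the JSON-LD data blocks of an HTML page.
// Helper names and the sample markup are hypothetical, chosen for illustration.

/** Return the decoded contents of all <script type="application/ld+json"> elements. */
function extractJsonLdBlocks(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate malformed real-world HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $blocks = [];
    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $data = json_decode($script->textContent, true);
        if ($data !== null) {            // skip blocks that are not valid JSON
            $blocks[] = $data;
        }
    }
    return $blocks;
}

/** Recursively gather the string values stored under any "url" key. */
function collectUrls($node, array &$urls = []): array
{
    if (is_array($node)) {
        foreach ($node as $key => $value) {
            if ($key === 'url' && is_string($value)) {
                $urls[] = $value;
            } else {
                collectUrls($value, $urls);
            }
        }
    }
    return $urls;
}

// Assumed sample input standing in for the downloaded page.
$sampleHtml = <<<'HTML'
<html><head>
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"@type": "ListItem", "url": "https://example.com/p/1"},
  {"@type": "ListItem", "url": "https://example.com/p/2"}
]}
</script>
</head><body><a href="/other">Other</a></body></html>
HTML;

$urls = [];
foreach (extractJsonLdBlocks($sampleHtml) as $block) {
    collectUrls($block, $urls);
}
print_r($urls);
```

For the live page, `$sampleHtml` would be replaced by the `$html` string returned from `file_get_contents($url)`. This only helps when the server already embeds the data in the delivered HTML; a page that builds its content entirely in the browser would not expose it this way.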