Extract links for a specific domain from a website's HTML

Question

Below is my code to extract links from a given URL. When I view the page source of that URL, there is a link on the domain https://fs1.pdisk.pro:183, but it does not appear among the links my code extracts.

<?php
function extractLinks($url) {

  // Get the HTML content of the page.
  $html = file_get_contents($url);

  // Create a DOMDocument object.
  $dom = new DOMDocument();
  @$dom->loadHTML($html);

  // Get all the <source> elements (media sources nested inside <video> or <audio>).
  $anchors = $dom->getElementsByTagName('source');

  // Create an array to store the links.
  $links = array();

  // Loop through the <source> elements.
  foreach ($anchors as $anchor)
  {
    // Get the src attribute of the <source> element.
    $href = $anchor->getAttribute('src');

    // Add the link to the links array.
    $links[] = $href;
  }

  // Return the links array as JSON.
  return json_encode($links);
}

// Get the URL of the website to extract links from.
$url = 'http://pdisk.investro1.com/how-to-buy-life-insurance-online-qfevac8cq8x4.html';

// Extract the links from the website.
$links = extractLinks($url);

// Print the links. extractLinks() already returns a JSON string, so it must not be encoded a second time.
echo $links;

Can someone help me extract all the links for the required domain from the given URL and, if possible, redirect to the extracted link and return the response in JSON format, like url=link?

Answer 1

Score: 0

You are asking for code to scrape a website.
It may be illegal to take certain content without the source owner's consent.

That said, the link with port :183 is not in an <a> tag. It is in a <source> tag nested inside a <video> element.
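To illustrate why the link goes missing, here is a minimal sketch that parses an inline HTML snippet and compares what an <a> lookup and a <source> lookup return. The markup and the path /d/example/video.mp4 are hypothetical stand-ins for the real page, which may be structured differently.

```php
<?php
// Hypothetical sample markup; the real page's HTML may differ.
$html = <<<'HTML'
<html><body>
  <a href="/about.html">About</a>
  <video controls>
    <source src="https://fs1.pdisk.pro:183/d/example/video.mp4" type="video/mp4">
  </video>
</body></html>
HTML;

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them
// (libxml may complain about HTML5 tags such as <video>).
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Collect href values from <a> elements.
$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}

// Collect src values from <source> elements.
$sources = [];
foreach ($dom->getElementsByTagName('source') as $source) {
    $sources[] = $source->getAttribute('src');
}

echo json_encode(['a' => $hrefs, 'source' => $sources], JSON_UNESCAPED_SLASHES), PHP_EOL;
```

In this sample the :183 link shows up only in the source list, which matches the behaviour described in the question.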

Please change the line
$anchors = $dom->getElementsByTagName('a'); to $anchors = $dom->getElementsByTagName('source');.

Also change the line $href = $anchor->getAttribute('href'); to $href = $anchor->getAttribute('src');.
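Putting both changes together, one possible sketch filters the <source> links by host and emits a single JSON object in the url=link shape the question asks for. The function name extractSourceLinks, its parameters, and the sample URLs are assumptions made for illustration and are not from the original post.

```php
<?php
/**
 * Return the src of every <source> element in $html whose host equals $wantedHost.
 * Hypothetical helper: the name and signature are illustrative only.
 */
function extractSourceLinks(string $html, string $wantedHost): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('source') as $source) {
        $src = $source->getAttribute('src');
        // parse_url() returns the host without the port, e.g. "fs1.pdisk.pro".
        $host = parse_url($src, PHP_URL_HOST);
        if (is_string($host) && strcasecmp($host, $wantedHost) === 0) {
            $links[] = $src;
        }
    }
    return $links;
}

// Inline sample so the sketch runs without network access (hypothetical paths).
$html = '<video><source src="https://fs1.pdisk.pro:183/d/example/video.mp4">'
      . '<source src="https://cdn.example.com/other.mp4"></video>';

$links = extractSourceLinks($html, 'fs1.pdisk.pro');

// Encode exactly once; null is used when no matching link was found.
$json = json_encode(['url' => $links[0] ?? null], JSON_UNESCAPED_SLASHES);
echo $json, PHP_EOL;
```

For a live page, the HTML would come from file_get_contents($url), whose return value should be checked for false before parsing. To redirect instead of printing JSON, header('Location: ' . $links[0]) could be sent before any other output, assuming a matching link was found.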

Beware:
Web scraping requires the owner's permission to extract data from the source website.

huangapple
  • Posted on May 15, 2023 at 14:04:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76251259.html