Golang: How to download a page from Internet with absolute links in html

Question

From this:

<head>
  <link rel="stylesheet" href="styles.css">
</head>
<body>
  <img src="img.jpg" alt="" width="500" height="600">

I want to get this:

<head>
  <link rel="stylesheet" href="http://bbc.com/styles.css">
</head>
<body>
  <img src="http://bbc.com/img.jpg" alt="" width="500" height="600">

When I download a page, it contains relative links to CSS, images, etc. How can I convert the HTML while downloading so that all links in it are absolute rather than relative? I use this answer to download a page (https://stackoverflow.com/questions/40643030/how-to-get-webpage-content-into-a-string-using-go):

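package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	s := OnPage("http://bbc.com/")

	// Print, not Printf: the page may contain '%' characters.
	fmt.Print(s)
}

// OnPage downloads link and returns the response body as a string.
func OnPage(link string) string {
	res, err := http.Get(link)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	content, err := io.ReadAll(res.Body)
	if err != nil {
		log.Fatal(err)
	}
	return string(content)
}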

Answer 1

Score: 1

One way is to use regular expressions to replace the relevant parts of the HTML string. Here is how you can do it (I assume all links on the page are relative and that the page URL ends with a slash; if not, you should adjust the code):

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"regexp"
)

func main() {
	s := OnPage("http://bbc.com/")

	// Print, not Printf: the page may contain '%' characters.
	fmt.Print(s)
}

// OnPage downloads link and prefixes every <img src> and <link href>
// value with link. It assumes all of those values are relative.
func OnPage(link string) string {
	res, err := http.Get(link)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	content, err := io.ReadAll(res.Body)
	if err != nil {
		log.Fatal(err)
	}
	html := string(content)
	re := regexp.MustCompile(`(<img[^>]+src)="([^"]+)"`)
	updatedHTML := re.ReplaceAllString(html, `$1="`+link+`$2"`)
	re = regexp.MustCompile(`(<link[^>]+href)="([^"]+)"`)
	// Run the second replacement on updatedHTML (not html) so the <img> changes are kept.
	updatedHTML = re.ReplaceAllString(updatedHTML, `$1="`+link+`$2"`)
	return updatedHTML
}
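
The adjustment mentioned above, for pages that mix relative and absolute links, could look roughly like the sketch below. It is not part of the original answer. Instead of concatenating strings, it resolves each value against the page URL with `url.ResolveReference` from the standard `net/url` package, so values that are already absolute stay as they are. The names `Absolutize` and `attrRe` are made up for this illustration, and the regular expression only covers double-quoted `src` and `href` attributes; `srcset`, CSS `url(...)` references and a `<base>` tag are not handled. A real HTML parser such as `golang.org/x/net/html` would likely be more robust.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
	"regexp"
)

// attrRe matches a whitespace-preceded src="..." or href="..." attribute.
// Assumption: values are double-quoted; other quoting styles are left untouched.
var attrRe = regexp.MustCompile(`(\s(?:src|href))="([^"]*)"`)

// Absolutize (hypothetical helper) resolves every matched src/href value in
// html against base. Values that fail to parse are left unchanged.
func Absolutize(html, base string) (string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	out := attrRe.ReplaceAllStringFunc(html, func(m string) string {
		parts := attrRe.FindStringSubmatch(m)
		ref, err := url.Parse(parts[2])
		if err != nil {
			return m // keep the attribute as it was
		}
		// ResolveReference returns absolute references unchanged and
		// resolves relative ones against baseURL.
		return parts[1] + `="` + baseURL.ResolveReference(ref).String() + `"`
	})
	return out, nil
}

func main() {
	page := `<link rel="stylesheet" href="styles.css">
<img src="/img/a.jpg" alt="">
<a href="https://example.org/x">x</a>`

	out, err := Absolutize(page, "http://bbc.com/news/")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out)
}
```

In `OnPage`, the two `ReplaceAllString` calls could then be replaced by a single call such as `Absolutize(html, link)`.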

Answer 2

Score: 1

I built a PHP package (installed with Composer) for downloading content from any URL, including images, CSS, JS, and video.

Check it out: https://github.com/Riaz-Mahmud/Websitebackup

Installation

composer require backdoor/websitebackup

Usage

use Backdoor\WebsiteBackup\WebsiteBackup;

function siteBackup(){

    $url = 'link to your website page to backup';
    $path = 'path to save backup file';

    $websiteBackup = new WebsiteBackup();
    $backup = $websiteBackup->backup($url, $path);

}

huangapple
  • Posted on October 11, 2022 at 20:10:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/74027919.html