
Golang: How to download a page from Internet with absolute links in html


From this:

<head>
  <link rel="stylesheet" href="styles.css">
</head>
<body>
  <img src="img.jpg" alt="" width="500" height="600">

I want to get this:

<head>
  <link rel="stylesheet" href="http://bbc.com/styles.css">
</head>
<body>
  <img src="http://bbc.com/img.jpg" alt="" width="500" height="600">

When I download a page, it contains relative links to CSS, images, etc. How can I convert the HTML page, while downloading it, so that all of its links are absolute rather than relative? I used this answer to download a page (https://stackoverflow.com/questions/40643030/how-to-get-webpage-content-into-a-string-using-go):

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	s := OnPage("http://bbc.com/")
	fmt.Print(s)
}

func OnPage(link string) string {
	res, err := http.Get(link)
	if err != nil {
		log.Fatal(err)
	}
	content, err := ioutil.ReadAll(res.Body)
	res.Body.Close()
	if err != nil {
		log.Fatal(err)
	}
	return string(content)
}

Answer 1

Score: 1


You have to use regular expressions to replace the needed portions of the HTML string. Here is how you can do it (this assumes all links on the page are relative; if not, you should adjust the code):

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
)

func main() {
	s := OnPage("http://bbc.com/")
	fmt.Print(s)
}

func OnPage(link string) string {
	res, err := http.Get(link)
	if err != nil {
		log.Fatal(err)
	}
	content, err := ioutil.ReadAll(res.Body)
	res.Body.Close()
	if err != nil {
		log.Fatal(err)
	}
	html := string(content)
	// Prefix <img src="..."> values with the base URL.
	re := regexp.MustCompile(`(<img[^>]+src)="([^"]+)"`)
	updatedHTML := re.ReplaceAllString(html, `$1="`+link+`$2"`)
	// Do the same for <link href="..."> values. Note this must run on
	// updatedHTML, not html, or the first replacement would be discarded.
	re = regexp.MustCompile(`(<link[^>]+href)="([^"]+)"`)
	updatedHTML = re.ReplaceAllString(updatedHTML, `$1="`+link+`$2"`)
	return updatedHTML
}
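
The regex approach above prepends the base URL unconditionally, so any link that is already absolute would be corrupted. A stdlib-only sketch of a more forgiving variant resolves each `src`/`href` value against the base with `net/url`'s `ResolveReference`, which leaves absolute URLs untouched (the helper name `absolutize` is made up for this example):

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// absolutize rewrites src/href attribute values in html so they are
// resolved against base; values that are already absolute are unchanged.
func absolutize(html, base string) string {
	baseURL, err := url.Parse(base)
	if err != nil {
		return html
	}
	re := regexp.MustCompile(`(src|href)="([^"]+)"`)
	return re.ReplaceAllStringFunc(html, func(m string) string {
		parts := re.FindStringSubmatch(m) // [full match, attr name, value]
		ref, err := url.Parse(parts[2])
		if err != nil {
			return m // leave unparseable values as-is
		}
		// ResolveReference returns ref itself when it is already absolute.
		return parts[1] + `="` + baseURL.ResolveReference(ref).String() + `"`
	})
}

func main() {
	page := `<link rel="stylesheet" href="styles.css">` +
		`<img src="http://cdn.example.com/img.jpg">`
	fmt.Println(absolutize(page, "http://bbc.com/"))
}
```

Like the regex answer it is still a text-level rewrite, so for arbitrary real-world HTML a proper parser such as golang.org/x/net/html would be more robust.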

Answer 2

Score: 1


I built a package for downloading content from any URL, including images, CSS, JS, and video.

Check it out: https://github.com/Riaz-Mahmud/Websitebackup

Installation

composer require backdoor/websitebackup

Usage

use Backdoor\WebsiteBackup\WebsiteBackup;

function siteBackup(){

    $url = 'link to your website page to backup';
    $path = 'path to save backup file';

    $websiteBackup = new WebsiteBackup();
    $backup = $websiteBackup->backup($url, $path);

}

huangapple
  • Published on October 11, 2022, 20:10:02
  • Please retain this link when reposting: https://go.coder-hub.com/74027919.html