2015-09-17 08:38:35 go


Resolving absolute path from relative path

Question

I'm making a web crawler and I'm trying to figure out how to get an absolute path from a relative path.
I picked two test sites: one built with Ruby on Rails (RoR) and one built with PyroCMS.

In the latter, I found href attributes with the link "index.php". So if I'm currently crawling http://example.com/xyz, my crawler appends the link and produces http://example.com/xyz/index.php. The problem is that I should be appending to the root instead, i.e. the result should have been http://example.com/index.php. And when I then crawl http://example.com/xyz/index.php, I find another "index.php", which gets appended again.

On the RoR site, by contrast, relative paths start with '/', so I can easily tell that they are relative to the site root.

I can handle the index.php case, but if I start handling cases manually there may be many more rules to take care of. I'm sure there's an easier way to get this done.

Answer 1

Score: 1

In Go, the path package is your friend.

You can get the directory (or folder) part of a path with path.Dir(), e.g.:

p := "/xyz/index.php"
dir := path.Dir(p)
fmt.Println("dir:", dir) // prints "dir: /xyz"

If you find a link with a rooted path (one that starts with a slash), you can use it as-is.

If it is relative, you can join it with the dir above using path.Join(). Join() will also "clean" the resulting path:

p2 := path.Join(dir, "index.php")
fmt.Println("p2:", p2)
p3 := path.Join(dir, "./index.php")
fmt.Println("p3:", p3)
p4 := path.Join(dir, "../index.php")
fmt.Println("p4:", p4)

Output:

p2: /xyz/index.php
p3: /xyz/index.php
p4: /index.php
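
The two cases above (rooted link versus relative link) can be folded into one small function. This is only a sketch: resolvePath is a hypothetical helper name, and it assumes both arguments are plain paths without a scheme, host, query string, or fragment.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolvePath is a hypothetical helper, not part of the path package.
// current is the path of the page being crawled; link is the href found on it.
func resolvePath(current, link string) string {
	if strings.HasPrefix(link, "/") {
		// Rooted link: use it as-is, only cleaned.
		return path.Clean(link)
	}
	// Relative link: resolve it against the directory of the current page.
	return path.Join(path.Dir(current), link)
}

func main() {
	fmt.Println(resolvePath("/xyz/index.php", "index.php"))    // /xyz/index.php
	fmt.Println(resolvePath("/xyz/index.php", "../index.php")) // /index.php
	fmt.Println(resolvePath("/xyz/index.php", "/about"))       // /about
	fmt.Println(resolvePath("/xyz", "index.php"))              // /index.php
}
```

The last call is the case from the question: path.Dir("/xyz") is "/", so "index.php" found on /xyz ends up at the root. Note that path.Clean() also drops a trailing slash, which may matter if a site treats "/about" and "/about/" differently.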

The "cleaning" that path.Join() performs is done by path.Clean(), which you can of course also call yourself on any path. Its rules are:

> 1. Replace multiple slashes with a single slash.
> 2. Eliminate each . path name element (the current directory).
> 3. Eliminate each inner .. path name element (the parent directory) along with the non-.. element that precedes it.
> 4. Eliminate .. elements that begin a rooted path: that is, replace "/.." by "/" at the beginning of a path.
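
To see each rule in isolation, here is a short sketch with one input per rule. The inputs are made up for illustration, and cleanAll is a hypothetical wrapper that just calls path.Clean() on each of them.

```go
package main

import (
	"fmt"
	"path"
)

// cleanAll is a hypothetical helper that applies path.Clean to each input.
func cleanAll(inputs ...string) []string {
	out := make([]string, 0, len(inputs))
	for _, in := range inputs {
		out = append(out, path.Clean(in))
	}
	return out
}

func main() {
	cleaned := cleanAll(
		"/xyz//index.php",   // rule 1: multiple slashes
		"/xyz/./index.php",  // rule 2: "." element
		"/xyz/../index.php", // rule 3: inner ".." element
		"/../index.php",     // rule 4: ".." at the start of a rooted path
	)
	for _, c := range cleaned {
		fmt.Println(c)
	}
	// Prints:
	// /xyz/index.php
	// /xyz/index.php
	// /index.php
	// /index.php
}
```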

And if you have a "full" URL (with scheme, host, etc.), you can use the url.Parse() function to obtain a url.URL value from the raw URL string. It splits the URL into its components for you, so you can get the path like this:

uraw := "http://example.com/xyz/index.php"
u, err := url.Parse(uraw)
if err != nil {
	fmt.Println("Invalid url:", err)
	return // u is nil here, so don't touch u.Path
}
fmt.Println("Path:", u.Path)

Output:

Path: /xyz/index.php
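
Once both the page address and the link are parsed as url.URL values, the standard library can also do the whole resolution in one step with the ResolveReference() method of url.URL. That method is not mentioned above, so treat the following as an alternative sketch; resolveLink is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink is a hypothetical helper: it resolves href against the URL
// of the page it was found on and returns the absolute URL as a string.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference follows the RFC 3986 rules for relative references.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "http://example.com/xyz/index.php"
	for _, href := range []string{"index.php", "../index.php", "/about"} {
		abs, err := resolveLink(page, href)
		if err != nil {
			fmt.Println("Invalid url:", err)
			continue
		}
		fmt.Println(href, "->", abs)
	}
}
```

With the page URL http://example.com/xyz (no trailing slash), "index.php" resolves to http://example.com/index.php, which is the result the question asks for. Links that are already absolute come back unchanged.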

Try all the examples on the Go Playground.
