Resolving absolute path from relative path

Question

I'm building a web crawler and trying to work out how to get an absolute path from a relative one.
I picked two test sites: one built with Ruby on Rails (ROR) and one built with Pyro CMS.

On the latter, I found href attributes containing the link "index.php". If I'm currently crawling http://example.com/xyz, my crawler appends the link and produces http://example.com/xyz/index.php. The problem is that I should be appending to the root instead, i.e. the result should be http://example.com/index.php. And when I then crawl http://example.com/xyz/index.php, I find another "index.php", which gets appended again.

On the ROR site, by contrast, relative paths start with '/', so I could easily tell that they are relative to the site root.

I can handle the index.php case, but if I start doing this by hand there may be many rules to take care of. I'm sure there's an easier way to get this done.

Answer 1

Score: 1

In Go, package path is your friend.

You can get the directory or folder from a path with path.Dir(), e.g.

p := "/xyz/index.php"
dir := path.Dir(p)
fmt.Println("dir:", dir) // Output: dir: /xyz
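
As a small runnable sketch of how this applies to the page from the question: `/xyz` has no trailing slash, so `path.Dir()` treats `xyz` as the last element and returns the root. That is why a link to `index.php` found on that page belongs under `/`. The helper name `parentDir` is only an illustrative wrapper and is not part of the original answer.

```go
package main

import (
	"fmt"
	"path"
)

// parentDir is an illustrative wrapper around path.Dir: it returns
// everything except the last element of a slash-separated path.
func parentDir(p string) string {
	return path.Dir(p)
}

func main() {
	fmt.Println(parentDir("/xyz/index.php")) // /xyz
	fmt.Println(parentDir("/xyz"))           // /
	fmt.Println(parentDir("/xyz/"))          // /xyz
}
```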

If you find a link with a rooted path (one that starts with a slash), you can use it as-is.

If the link is relative, you can join it with the dir above using path.Join(). Join() will also "clean" the resulting path:

p2 := path.Join(dir, "index.php")
fmt.Println("p2:", p2)
p3 := path.Join(dir, "./index.php")
fmt.Println("p3:", p3)
p4 := path.Join(dir, "../index.php")
fmt.Println("p4:", p4)

Output:

p2: /xyz/index.php
p3: /xyz/index.php
p4: /index.php
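
The two cases above (rooted link versus relative link) can be combined into one function. The following is a minimal sketch, not a complete resolver: `resolvePath` is a hypothetical helper name, and it assumes that both arguments are plain URL paths without a scheme, host, query string, or fragment.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolvePath is a hypothetical helper. current is the path of the page
// being crawled, href is the link found on that page.
func resolvePath(current, href string) string {
	if strings.HasPrefix(href, "/") {
		// Rooted link: use it as-is.
		return href
	}
	// Relative link: join it with the directory of the current page.
	// path.Join also cleans the result ("." and ".." are resolved).
	return path.Join(path.Dir(current), href)
}

func main() {
	fmt.Println(resolvePath("/xyz", "index.php"))              // /index.php
	fmt.Println(resolvePath("/xyz/index.php", "index.php"))    // /xyz/index.php
	fmt.Println(resolvePath("/xyz/index.php", "../index.php")) // /index.php
	fmt.Println(resolvePath("/xyz/index.php", "/contact"))     // /contact
}
```

With the page from the question (`/xyz`), the link `index.php` resolves to `/index.php`, so repeated crawling no longer keeps appending the same segment.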

The "cleaning" performed by path.Join() is done by path.Clean(), which you can also call directly on any path. It applies these rules:

> 1. Replace multiple slashes with a single slash.
> 2. Eliminate each . path name element (the current directory).
> 3. Eliminate each inner .. path name element (the parent directory) along with the non-.. element that precedes it.
> 4. Eliminate .. elements that begin a rooted path: that is, replace "/.." by "/" at the beginning of a path.
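
Each of the four rules can be seen in isolation with a one-line input. This is a small sketch with made-up example paths; `normalize` is just an illustrative wrapper around `path.Clean()`.

```go
package main

import (
	"fmt"
	"path"
)

// normalize is an illustrative wrapper around path.Clean.
func normalize(p string) string {
	return path.Clean(p)
}

func main() {
	fmt.Println(normalize("/a//b"))     // rule 1: /a/b
	fmt.Println(normalize("/a/./b"))    // rule 2: /a/b
	fmt.Println(normalize("/a/b/../c")) // rule 3: /a/c
	fmt.Println(normalize("/../a"))     // rule 4: /a
}
```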

If you have a "full" URL (with scheme, host, etc.), you can use the url.Parse() function from package net/url to obtain a url.URL value from the raw URL string. It splits the URL into its components for you, so you can get the path like this:

uraw := "http://example.com/xyz/index.php"
u, err := url.Parse(uraw)
if err != nil {
	fmt.Println("Invalid url:", err)
	return // u is nil on error, so stop here
}
fmt.Println("Path:", u.Path)

Output:

Path: /xyz/index.php
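
Once both the page URL and the link are parsed, the standard library can also do the whole resolution in one step. The answer above does not mention it, but `url.URL` has a `ResolveReference()` method that resolves a reference against a base URL following RFC 3986. The sketch below is one possible way to use it for the crawler; `resolveURL` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveURL is a hypothetical helper: it resolves href, as found on the
// page at pageURL, into an absolute URL string.
func resolveURL(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference handles relative paths, rooted paths,
	// and hrefs that are already absolute URLs.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	pages := []string{"http://example.com/xyz", "http://example.com/xyz/index.php"}
	for _, page := range pages {
		abs, err := resolveURL(page, "index.php")
		if err != nil {
			fmt.Println("Invalid url:", err)
			continue
		}
		fmt.Println(page, "+ index.php =>", abs)
	}
}
```

For the page `http://example.com/xyz` this yields `http://example.com/index.php`, which is the result the question asks for.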

Try all the examples on the Go Playground.

huangapple
  • Published on 2015-09-17 08:38:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/32620870.html