Resolving absolute path from relative path
Question
I'm writing a web crawler and I'm trying to work out how to get an absolute URL from a relative path.
I picked two test sites: one built with Ruby on Rails (ROR) and one built with Pyro CMS.
On the latter, I found href attributes containing the link "index.php". If I'm currently crawling http://example.com/xyz, my crawler appends the link and produces http://example.com/xyz/index.php. The problem is that I should be appending to the root instead, i.e. the result should have been http://example.com/index.php. So when I then crawl http://example.com/xyz/index.php, I find another "index.php", which gets appended again.
On the ROR site, by contrast, a relative path that starts with '/' tells me right away that it is relative to the site root.
I can handle the index.php case, but if I start doing this manually there may be many more rules I need to take care of. I'm sure there's an easier way to get this done.
Answer 1
Score: 1
In Go, the path package is your friend.
You can get the directory (the parent folder) of a path with path.Dir(), e.g.:
p := "/xyz/index.php"
dir := path.Dir(p)
fmt.Println("dir:", dir) // Output: dir: /xyz
If you find a link with a rooted path (one that starts with a slash), you can use it as-is.
If it is relative, you can join it with the dir above using path.Join(). Join() will also "clean" the resulting path:
p2 := path.Join(dir, "index.php")
fmt.Println("p2:", p2)
p3 := path.Join(dir, "./index.php")
fmt.Println("p3:", p3)
p4 := path.Join(dir, "../index.php")
fmt.Println("p4:", p4)
Output:
p2: /xyz/index.php
p3: /xyz/index.php
p4: /index.php
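The two rules above (use a rooted link as-is, otherwise join it with the current directory) can be wrapped in one small helper. This is only a sketch: resolvePath is a hypothetical name, not something from the path package, and it looks at the path component only, ignoring scheme, host, query strings and fragments.

```go
package main

import (
	"fmt"
	"path"
)

// resolvePath is a hypothetical helper (not part of package path).
// current is the path of the page being crawled, href is a link found on it.
func resolvePath(current, href string) string {
	if path.IsAbs(href) {
		// Rooted link: use it as-is, only normalised.
		return path.Clean(href)
	}
	// Relative link: resolve it against the directory of the current page.
	return path.Join(path.Dir(current), href)
}

func main() {
	fmt.Println(resolvePath("/xyz/index.php", "index.php"))    // /xyz/index.php
	fmt.Println(resolvePath("/xyz/index.php", "../index.php")) // /index.php
	fmt.Println(resolvePath("/xyz/index.php", "/about"))       // /about
	fmt.Println(resolvePath("/xyz", "index.php"))              // /index.php
}
```

The last call is the case from the question: path.Dir("/xyz") is "/", so the link lands at the root. Keep in mind that path.Clean() and path.Join() drop a trailing slash, which may matter on some sites.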
The "cleaning" performed by path.Join() is done by path.Clean(), which you can of course also call yourself on any path. Its rules are:
> 1. Replace multiple slashes with a single slash.
> 2. Eliminate each . path name element (the current directory).
> 3. Eliminate each inner .. path name element (the parent directory) along with the non-.. element that precedes it.
> 4. Eliminate .. elements that begin a rooted path: that is, replace "/.." by "/" at the beginning of a path.
And if you have a "full" URL (with scheme, host, etc.), you can use the url.Parse() function to obtain a url.URL value from the raw URL string. It splits the URL into its components for you, so you can get the path like this:
uraw := "http://example.com/xyz/index.php"
u, err := url.Parse(uraw)
if err != nil {
	fmt.Println("Invalid url:", err)
	return // u is nil on error, so don't go on to read u.Path
}
fmt.Println("Path:", u.Path)
Output:
Path: /xyz/index.php
Try all the examples on the Go Playground.
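For the crawler case in the question, it may be simpler to let net/url do the whole resolution instead of joining paths by hand. The sketch below uses URL.ResolveReference(), which resolves a reference against a base URL following RFC 3986. The helper name absolute and the page URL are assumptions made for this example only.

```go
package main

import (
	"fmt"
	"net/url"
)

// absolute is a hypothetical helper: it resolves href, as found on the page
// at pageURL, into an absolute URL string.
func absolute(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	links := []string{"index.php", "/about", "../index.php", "http://other.example/x"}
	for _, href := range links {
		abs, err := absolute("http://example.com/xyz/page.php", href)
		if err != nil {
			fmt.Println("Invalid url:", err)
			continue
		}
		fmt.Println(href, "->", abs)
	}
}
```

With the base http://example.com/xyz (no trailing slash), the link index.php resolves to http://example.com/index.php, which is the result the question expects. With http://example.com/xyz/ it resolves to http://example.com/xyz/index.php instead, so the trailing slash of the page URL matters.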