Recursively downloading a remote HTTP directory through Java

Question

I want to create a function that downloads a remote directory (e.g. "https://server.net/production/current/") over HTTP to a local folder. I don't control the remote directory, so I can't just create a convenient tarball. I found plenty of questions about retrieving individual files, but none that matched my use case.

To give you an idea of what I am referring to, here is a sample of what the directory looks like in a browser.

[Image: browser view of the remote directory listing]

In other words, I want to create a function equivalent to this wget command, where Y is the local destination folder and X is the remote directory to retrieve. I would call wget directly, but I want a cross-platform solution that works on Windows without additional setup.

wget -r -np -R "index.html*" -P Y X

The end goal is a Java function like the one shown below.

/**
 * Recursively downloads all of the files in a remote HTTPS directory to the local destination
 * folder.
 * @param remoteFolder a folder URL (Ex: "https://server.net/production/current/")
 * @param destination a local folder (Ex: "C:\Users\Home\project\production")
 */
public static void downloadDirectory(String remoteFolder, String destination) {}

The function can assume that there are no circular dependencies in the remote directory and that the destination folder exists and is empty.

Answer 1

Score: 2

I was hoping there was some magic function or best practice in java.io or maybe Apache commons-io to do this, but since it sounds like none exists, I wrote my own version that manually goes through the HTML page and follows links.

I'm just going to leave this answer here in case someone else has the same question or someone knows a way to improve my version.

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.FileUtils;

private static final Pattern HREF_PATTERN = Pattern.compile("href=\"(.*?)\"");

/**
 * Recursively downloads all of the files in a remote HTTPS directory to a local
 * destination folder. This implementation requires that the destination string
 * ends in a file delimiter. If you don't know if it does, append "/" to the end
 * just to be safe.
 * 
 * @param src remote folder URL (Ex: "https://server.net/production/current/")
 * @param dst local folder to copy into (Ex: "C:\Users\Home\project\production\")
 */
public static void downloadDirectory(String src, String dst) throws IOException {
    List<String> hrefs = new ArrayList<>(8);

    // Read the listing page line by line and collect every href found on each line.
    // try-with-resources closes the scanner even if reading fails.
    try (Scanner in = new Scanner(new URL(src).openStream(), "UTF-8").useDelimiter("\n")) {
        while (in.hasNext()) {
            Matcher match = HREF_PATTERN.matcher(in.next());

            while (match.find())
                hrefs.add(match.group(1));
        }
    }

    for (String next : hrefs) {
        // Never climb back up to the parent folder.
        if (next.equals("../"))
            continue;

        if (next.endsWith("/"))
            // A trailing slash marks a subfolder: recurse into it.
            downloadDirectory(src + next, dst + next);
        else
            // Anything else is treated as a file; commons-io creates missing parent folders.
            FileUtils.copyURLToFile(new URL(src + next), new File(dst + next));
    }
}
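
For anyone who wants to avoid the commons-io dependency, here is a rough sketch of the same idea using only the JDK (Java 9+ for readAllBytes). It is not the version I actually ran: the class name DirectoryMirror and the helper childLinks are made up for this example. It assumes the listing page uses double-quoted href attributes and that the source URL ends with "/". The helper resolves each link against the folder URL and keeps only links strictly below it, which roughly mirrors wget's -np and also drops sort links such as "?C=N;O=D" that some servers add to their listings.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DirectoryMirror {
    private static final Pattern HREF_PATTERN = Pattern.compile("href=\"(.*?)\"");

    /** Hypothetical helper: returns the links in the listing HTML that point strictly below base. */
    public static List<URI> childLinks(URI base, String html) {
        List<URI> children = new ArrayList<>();
        Matcher match = HREF_PATTERN.matcher(html);
        while (match.find()) {
            String href = match.group(1);
            // Skip empty links, sort links ("?C=N;O=D") and in-page anchors.
            if (href.isEmpty() || href.startsWith("?") || href.startsWith("#"))
                continue;
            URI target;
            try {
                target = base.resolve(href).normalize();
            } catch (IllegalArgumentException e) {
                continue; // not a parsable URI reference, ignore it
            }
            // Keep only links below the base folder (drops "../" and other hosts).
            String t = target.toString();
            String b = base.toString();
            if (t.startsWith(b) && t.length() > b.length() && !children.contains(target))
                children.add(target);
        }
        return children;
    }

    /** Assumes src ends with "/" and dst is an existing local folder. */
    public static void downloadDirectory(URI src, Path dst) throws IOException {
        String html;
        try (InputStream in = src.toURL().openStream()) {
            html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        for (URI child : childLinks(src, html)) {
            // Path of the child relative to the folder, e.g. "sub/" or "file.txt".
            String name = src.relativize(child).getPath();
            Path target = dst.resolve(name).normalize();
            if (!target.startsWith(dst.normalize()))
                continue; // refuse to write outside the destination folder
            if (name.endsWith("/")) {
                Files.createDirectories(target);
                downloadDirectory(child, target); // recurse into the subfolder
            } else {
                try (InputStream in = child.toURL().openStream()) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

Calling it would look something like DirectoryMirror.downloadDirectory(URI.create("https://server.net/production/current/"), Paths.get("C:\\Users\\Home\\project\\production")), with java.nio.file.Paths imported at the call site. Like the regex version above, this only works for servers that return a plain HTML index page; a real HTML parser would be more robust if the markup gets complicated.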

huangapple
  • Posted on August 25, 2020, 03:46:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63567791.html