July 22, 2020, 03:55:30

Web scraping using multithreading

Question

I wrote some code to look up movie names on IMDB, but if, for instance, I search for "Harry Potter", I find more than one movie. I would like to use multithreading, but I don't have much knowledge in this area.

I am using the strategy design pattern to search across several websites, and inside one of the methods I have this code:

for (Element element : elements) {
    String searchedUrl = element.select("a").attr("href");
    String movieName = element.select("h2").text();
    if (movieName.matches(patternMatcher)) {
        Result result = new Result();
        result.setName(movieName);
        result.setLink(searchedUrl);
        result.setTitleProp(super.imdbConnection(movieName));
        System.out.println(movieName + " " + searchedUrl);
        resultList.add(result);
    }
}

For each element (a movie name), the super.imdbConnection(movieName) line opens a new connection to IMDB to look up ratings and other details.

The problem is that I would like all the connections to run at the same time, because with 5-6 movies found, the process takes much longer than expected.

I am not asking for code; I want some ideas. I thought about creating an inner class that implements Runnable and using it, but I don't see the point of that.

How can I rewrite that loop to use multithreading?

I am using Jsoup for parsing; Element and Elements are from that library.

Answer 1

Score: 2

The simplest way is parallelStream():

List<Result> resultList = elements.parallelStream()
        // The lambda parameter must be named element, since the body refers to it by that name
        .map(element -> {
            String searchedUrl = element.select("a").attr("href");
            String movieName = element.select("h2").text();
            if (movieName.matches(patternMatcher)) {
                Result result = new Result();
                result.setName(movieName);
                result.setLink(searchedUrl);
                result.setTitleProp(super.imdbConnection(movieName));
                System.out.println(movieName + " " + searchedUrl);
                return result;
            } else {
                // Non-matching elements are dropped by the filter below
                return null;
            }
        }).filter(Objects::nonNull)
        .collect(Collectors.toList());
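
As a rough, self-contained illustration of the same idea (not part of the original answer), the sketch below replaces the Jsoup elements with plain title strings and uses a hypothetical slowLookup method, which only sleeps, as a stand-in for super.imdbConnection(movieName). The class and method names are assumptions made for the example. One thing to keep in mind: parallel streams run on the common ForkJoinPool by default, whose size follows the number of CPU cores, so the speedup for blocking network calls may be smaller than the number of movies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class ParallelLookupSketch {

    // Hypothetical stand-in for super.imdbConnection(movieName):
    // it only sleeps to mimic network latency.
    static String slowLookup(String movieName) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "rating for " + movieName;
    }

    // Looks up every title that matches the pattern, in parallel.
    // Non-matching titles are mapped to null and filtered out.
    static List<String> lookupAll(List<String> titles, String pattern) {
        return titles.parallelStream()
                .map(title -> title.matches(pattern)
                        ? title + " -> " + slowLookup(title)
                        : null)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Harry Potter A", "Some Other Movie", "Harry Potter B", "Harry Potter C");

        long start = System.nanoTime();
        List<String> results = lookupAll(titles, "Harry Potter.*");
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        results.forEach(System.out::println);
        System.out.println("elapsed ms: " + elapsedMs);
    }
}
```

Because the source list is ordered, collect(Collectors.toList()) returns the results in the original order even though the lookups themselves may finish in any order.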

If you don't like parallelStream() and want to use threads, you can do this:

// Placeholder: in the question, elements is the Jsoup Elements collection being looped over
List<Element> elements = new ArrayList<>();
// Create a function which returns an implementation of Callable
// input: Element
// output: Callable<Result>
// A lambda is used rather than an anonymous class, so that super still refers
// to the enclosing class's parent and super.imdbConnection(...) compiles
Function<Element, Callable<Result>> scrapFunction = element -> () -> {
    String searchedUrl = element.select("a").attr("href");
    String movieName = element.select("h2").text();
    if (movieName.matches(patternMatcher)) {
        Result result = new Result();
        result.setName(movieName);
        result.setLink(searchedUrl);
        result.setTitleProp(super.imdbConnection(movieName));
        System.out.println(movieName + " " + searchedUrl);
        return result;
    } else {
        return null;
    }
};
// Create a fixed pool of threads, one per element
// (newFixedThreadPool throws IllegalArgumentException for a size of 0)
ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, elements.size()));
// Submit a Callable<Result> for every Element
// by using scrapFunction.apply(...)
List<Future<Result>> futures = elements.stream()
        .map(e -> executor.submit(scrapFunction.apply(e)))
        .collect(Collectors.toList());
// No further tasks will be submitted; tasks already submitted still run to completion
executor.shutdown();
// Collect all results from the futures, blocking on each one in turn
List<Result> resultList = futures.stream()
        .map(f -> {
            try {
                return f.get();
            } catch (Exception ignored) {
                // A failed lookup is dropped from the results
                return null;
            }
        }).filter(Objects::nonNull)
        .collect(Collectors.toList());
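
A possible variation, sketched here under assumptions and not taken from the original answer, is to cap the pool size instead of creating one thread per element, and to use CompletableFuture.supplyAsync with that pool. The class name, the maxThreads parameter, and the slowLookup method (a sleeping stand-in for super.imdbConnection(movieName)) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class BoundedLookupSketch {

    // Hypothetical stand-in for super.imdbConnection(movieName):
    // it only sleeps to mimic network latency.
    static String slowLookup(String movieName) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "rating for " + movieName;
    }

    // Runs one lookup per title on a pool of at most maxThreads threads
    // and returns the results in the same order as the input titles.
    static List<String> lookupAll(List<String> titles, int maxThreads) {
        // Pool size is at least 1, because newFixedThreadPool rejects 0
        int poolSize = Math.max(1, Math.min(titles.size(), maxThreads));
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        try {
            // Start all lookups first, so they run concurrently
            List<CompletableFuture<String>> futures = titles.stream()
                    .map(title -> CompletableFuture.supplyAsync(() -> slowLookup(title), executor))
                    .collect(Collectors.toList());
            // Then wait for each one; join() rethrows a failure as CompletionException
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            // Let the pool threads exit once the submitted work is done
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Harry Potter A", "Harry Potter B", "Harry Potter C");
        lookupAll(titles, 4).forEach(System.out::println);
    }
}
```

Capping the pool keeps a search with many hits from opening an unbounded number of connections to the same site at once; the right limit depends on the site and is a judgment call.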
