Web scraping using multithreading

Question

I wrote some code to look up movie names on IMDB, but a search for "Harry Potter", for instance, returns more than one movie. I would like to use multithreading, but I don't have much experience in this area.

I am using the strategy design pattern to search across several websites. For instance, one of the methods contains this code:

for (Element element : elements) {
    String searchedUrl = element.select("a").attr("href");
    String movieName = element.select("h2").text();
    if (movieName.matches(patternMatcher)) {
        Result result = new Result();
        result.setName(movieName);
        result.setLink(searchedUrl);
        result.setTitleProp(super.imdbConnection(movieName));
        System.out.println(movieName + " " + searchedUrl);
        resultList.add(result);
    }
}

For each element (a movie name), the super.imdbConnection(movieName) line opens a new connection to IMDB to look up ratings and other details.

The problem is that these connections run one after another, so when 5-6 movies are found the process takes much longer than expected. I would like all of the connections to run at the same time.

I am not asking for code; I want some ideas. I thought about creating an inner class that implements Runnable and using that, but I don't see how it would help.

How can I rewrite that loop to use multithreading?

I am using Jsoup for parsing; Element and Elements are from that library.

Answer 1

Score: 2

The simplest way is to use parallelStream():

// needs java.util.List, java.util.Objects and java.util.stream.Collectors
List<Result> resultList = elements.parallelStream()
        .map(element -> {
            String searchedUrl = element.select("a").attr("href");
            String movieName = element.select("h2").text();
            if (movieName.matches(patternMatcher)) {
                Result result = new Result();
                result.setName(movieName);
                result.setLink(searchedUrl);
                // the blocking IMDB lookup now runs on several threads at once
                result.setTitleProp(super.imdbConnection(movieName));
                System.out.println(movieName + " " + searchedUrl);
                return result;
            } else {
                // non-matching elements are dropped by the filter below
                return null;
            }
        })
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
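
To try the parallelStream() idea without Jsoup or a network connection, here is a minimal sketch. The class name ParallelLookupSketch and the methods slowLookup and lookupAll are made up for illustration; slowLookup stands in for super.imdbConnection(movieName) and only sleeps to mimic network latency. The sketch filters before mapping instead of returning null, which is one possible simplification and not the code from the answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ParallelLookupSketch {

    // Hypothetical stand-in for super.imdbConnection(movieName):
    // sleeps briefly to mimic a slow network call.
    static String slowLookup(String movieName) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return "details for " + movieName;
    }

    // Keeps only the titles that match the pattern, then looks them up in parallel.
    // Collecting to a list keeps the original order of the titles.
    static List<String> lookupAll(List<String> titles, String pattern) {
        return titles.parallelStream()
                .filter(title -> title.matches(pattern))
                .map(ParallelLookupSketch::slowLookup)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Harry Potter 1", "Other Movie", "Harry Potter 2");
        System.out.println(lookupAll(titles, "Harry Potter.*"));
    }
}
```

Parallel streams run on the shared common ForkJoinPool, whose default size is derived from the number of available processors, so on a machine with few cores not every lookup will necessarily run at the same moment.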

If you don't like parallelStream() and want to manage the threads yourself, you can do this:

// needs java.util.List, java.util.Objects, java.util.function.Function,
// java.util.stream.Collectors and java.util.concurrent.*

// elements is the Jsoup Elements collection from the question's loop

// create a function which returns an implementation of `Callable`
// input: Element
// output: Callable<Result>
// written as a lambda: inside an anonymous Callable class, `super` would refer
// to the anonymous class's own parent, so super.imdbConnection(...) would not compile
Function<Element, Callable<Result>> scrapFunction = element -> () -> {
    String searchedUrl = element.select("a").attr("href");
    String movieName = element.select("h2").text();
    if (movieName.matches(patternMatcher)) {
        Result result = new Result();
        result.setName(movieName);
        result.setLink(searchedUrl);
        result.setTitleProp(super.imdbConnection(movieName));
        System.out.println(movieName + " " + searchedUrl);
        return result;
    } else {
        return null;
    }
};

// create a fixed pool with one thread per element
// (at least 1, because newFixedThreadPool rejects a size of 0)
ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, elements.size()));

// submit a Callable<Result> for every Element
// by using scrapFunction.apply(...)
List<Future<Result>> futures = elements.stream()
        .map(e -> executor.submit(scrapFunction.apply(e)))
        .collect(Collectors.toList());

// collect all results from the futures; get() blocks until the task has finished
List<Result> resultList = futures.stream()
        .map(future -> {
            try {
                return future.get();
            } catch (Exception ignored) {
                // a failed lookup is skipped
                return null;
            }
        })
        .filter(Objects::nonNull)
        .collect(Collectors.toList());

// let the pool threads exit once all tasks are done
executor.shutdown();
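
A pool with one thread per element can grow large when a search returns many hits. The following is a rough sketch of a capped variant, not part of the original answer. The class name BoundedPoolSketch, the cap MAX_THREADS, and the methods slowLookup and lookupAll are assumptions made for illustration; slowLookup again stands in for super.imdbConnection(movieName).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BoundedPoolSketch {

    // Assumed upper bound on simultaneous connections; tune as needed.
    static final int MAX_THREADS = 8;

    // Hypothetical stand-in for super.imdbConnection(movieName).
    static String slowLookup(String movieName) throws InterruptedException {
        Thread.sleep(100);
        return "details for " + movieName;
    }

    static List<String> lookupAll(List<String> titles) {
        List<String> results = new ArrayList<>();
        if (titles.isEmpty()) {
            return results; // a pool of size 0 would be rejected
        }
        ExecutorService executor =
                Executors.newFixedThreadPool(Math.min(titles.size(), MAX_THREADS));
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String title : titles) {
                tasks.add(() -> slowLookup(title));
            }
            // invokeAll waits for every task and returns the futures in task order
            for (Future<String> future : executor.invokeAll(tasks)) {
                try {
                    results.add(future.get());
                } catch (ExecutionException ex) {
                    // report the failed lookup and keep the remaining results
                    System.err.println("lookup failed: " + ex.getCause());
                }
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdown(); // always release the pool threads
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(lookupAll(Arrays.asList("Harry Potter 1", "Harry Potter 2")));
    }
}
```

Shutting the executor down in a finally block releases the threads even when a lookup fails, and reporting the cause of an ExecutionException avoids silently losing errors.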

huangapple
  • Posted on July 22, 2020 at 03:55:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63022139.html