July 27, 2020 03:32:02

CompletableFuture to collect results to hashmap

Question

I am reading a list of PDF files from a database, parsing them, and performing some tasks with them.
While reading this list of PDFs, I noticed that extracting images from a PDF takes the most time, and I don't want to block my main thread while reading images. So I want to extract the images in a separate thread.
I want to read images from one PDF after another, without loading all the PDFs into memory at once (due to memory constraints). So I want just two threads: the main thread (which reads some text from the PDF and does some other work) and a second thread that extracts the images and returns a set of image objects.

One caveat is that images from different PDFs can have identical content, so I want to remove duplicate images, using a checksum or some other means, before collecting the results.
I don't want to hold the images in memory until all tasks have completed; I want to remove duplicates as and when I get the result for each PDF.

So the real question is: I need to submit multiple tasks to a thread pool of size 1 and remove duplicates as and when each result arrives, so that I don't have to hold images in memory for long.

Below is a sketch of what I have tried.
I have removed unnecessary details such as the images and their content, and reduced the code to a String-based problem.

public static void main(String[] args) throws InterruptedException, ExecutionException {
    Map<String, Integer> uniqueImages = new HashMap<>();
    ExecutorService newFixedThreadPool = Executors.newFixedThreadPool(1);
    List<CompletableFuture<String>> futureList = new ArrayList<>();
    for (int i = 0; i < 20000; i++) {
        CompletableFuture<String> obj = CompletableFuture.supplyAsync(() -> {
            // Assume lots of duplicates
            return UUID.randomUUID().toString();
        }, newFixedThreadPool).thenApply((x) -> {
            if (uniqueImages.containsKey(x)) {
                int val = uniqueImages.get(x);
                uniqueImages.put(x, val + 1);
            }
            else {
                uniqueImages.put(x, 1);
            }
            return x;
        });
        futureList.add(obj);
    }

    for (CompletableFuture<String> future : futureList) {
        future.get();
    }
    System.out.println(uniqueImages.size());
}

I'm worried about whether this code really works or whether it will throw a ConcurrentModificationException.

Does the uniqueImages map really contain the unique images and their counts?
Are there any hidden issues?
Is there a better way to solve my use case?

Answer 1

Score: 1

As long as you use a thread pool of size 1 and the map is only ever updated from that one thread, there is no problem.
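
One way to make sure the map really is touched by only one thread is sketched below. It is an illustration under assumptions, not the asker's actual code: the checksum strings stand in for extracted images, and SingleWorkerCounter and countOnWorker are hypothetical names. A plain thenApply callback may run in the calling thread when the future has already completed by the time the callback is attached, so the sketch attaches the update with thenAcceptAsync on the same single-thread executor. It also uses Map.merge in place of the containsKey/put branches and shuts the executor down so the JVM can exit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SingleWorkerCounter {

    // Hypothetical helper: counts how often each checksum occurs,
    // doing both the "extraction" and the map update on one worker thread.
    static Map<String, Integer> countOnWorker(List<String> checksums) {
        Map<String, Integer> uniqueImages = new HashMap<>();
        ExecutorService worker = Executors.newFixedThreadPool(1);
        try {
            List<CompletableFuture<Void>> futures = new ArrayList<>();
            for (String checksum : checksums) {
                futures.add(CompletableFuture
                        // Stand-in for the slow image extraction step.
                        .supplyAsync(() -> checksum, worker)
                        // Passing the same executor keeps every map update on the worker thread.
                        .thenAcceptAsync(x -> uniqueImages.merge(x, 1, Integer::sum), worker));
            }
            // Wait for all updates; completion makes the writes visible to the caller.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            return uniqueImages;
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countOnWorker(Arrays.asList("a", "b", "a", "c", "a")));
    }
}
```

Because each result is folded into the map as soon as its task finishes, only the checksum and its count stay in memory, not the extracted data itself.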

However, if there are multiple threads, the code is indeed not thread-safe and is prone to data races.
The following scenario demonstrates this:

Consider this code:

if(uniqueImages.containsKey(x)) {
    int val = uniqueImages.get(x);
    uniqueImages.put(x, val+1);
}
else {
    uniqueImages.put(x, 1);
}

Imagine that threads 1 and 2 both return the same string and reach the line if(uniqueImages.containsKey(x)) at the same time.
The check returns false in both threads, so uniqueImages.put(x, 1); is called in both.
You will then get a wrong count (1 instead of 2), and unsynchronized writes can also leave the HashMap in an inconsistent state. A ConcurrentModificationException is not guaranteed: HashMap raises it only on a best-effort basis, typically while the map is being iterated.

If you plan to use multiple threads, you must use a ConcurrentHashMap, and the check-then-update step must be performed atomically as well.
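
A minimal sketch of the multi-threaded variant follows, again with strings standing in for images. ConcurrentCounter and countConcurrently are hypothetical names, and the pool size is an arbitrary choice for illustration. Swapping in ConcurrentHashMap alone would not fix the scenario above, because containsKey followed by put is still two separate steps; the sketch instead relies on ConcurrentHashMap.merge, which performs the read-modify-write atomically per key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ConcurrentCounter {

    // Hypothetical helper: counts checksums using several worker threads.
    static Map<String, Integer> countConcurrently(List<String> checksums, int threads) {
        Map<String, Integer> uniqueImages = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Void>> futures = new ArrayList<>();
            for (String checksum : checksums) {
                futures.add(CompletableFuture
                        // Stand-in for the slow image extraction step.
                        .supplyAsync(() -> checksum, pool)
                        // merge is atomic on ConcurrentHashMap, so no increment is lost
                        // regardless of which thread runs this callback.
                        .thenAccept(x -> uniqueImages.merge(x, 1, Integer::sum)));
            }
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            return uniqueImages;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> checksums = new ArrayList<>();
        for (int i = 0; i < 20000; i++) {
            checksums.add("img-" + (i % 100)); // 100 distinct values, many duplicates
        }
        Map<String, Integer> counts = countConcurrently(checksums, 4);
        System.out.println(counts.size() + " unique, img-0 seen " + counts.get("img-0") + " times");
    }
}
```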
