CompletableFuture to collect results to hashmap
Question
I am reading a list of PDF files from a database, parsing them, and performing some tasks with them.
While reading this list of PDFs, I found that extracting images from a PDF takes more time, and I don't want to block my main thread while reading images. So I want to extract the images in a separate thread.
I want to read images from one PDF after another, without loading all the PDFs into memory at once (due to memory concerns). So I want just two threads: the main thread (which reads some text from the PDF and does some other work) and a second thread that extracts images and returns a set of image objects.
One caveat is that images from the PDFs can be identical in content, so I want to remove duplicate images, using a checksum or some other means, before collecting the results.
I don't want to hold images in memory until all tasks have completed; I want to remove duplicates as and when I get the result for one PDF.
So the real question is: I need to submit multiple tasks to a thread pool of size 1 and remove duplicates as and when each result arrives, so that I don't hold images in memory for long.
Below is what I have tried.
I have removed unnecessary things from the code, such as the images and their content, and converted it to a String-based problem.
public static void main(String[] args) throws InterruptedException, ExecutionException {
    Map<String, Integer> uniqueImages = new HashMap<>();
    ExecutorService newFixedThreadPool = Executors.newFixedThreadPool(1);
    List<CompletableFuture<String>> futureList = new ArrayList<>();
    for (int i = 0; i < 20000; i++) {
        CompletableFuture<String> obj = CompletableFuture.supplyAsync(() -> {
            // Assume a lot of duplicates
            return UUID.randomUUID().toString();
        }, newFixedThreadPool).thenApply((x) -> {
            if (uniqueImages.containsKey(x)) {
                int val = uniqueImages.get(x);
                uniqueImages.put(x, val + 1);
            }
            else {
                uniqueImages.put(x, 1);
            }
            return x;
        });
        futureList.add(obj);
    }
    for (CompletableFuture<String> future : futureList) {
        future.get();
    }
    System.out.println(uniqueImages.size());
}
I'm worried about whether this code really works or whether it throws a ConcurrentModificationException.
- Does the uniqueImages map really contain the unique images and their counts?
- Are there any hidden issues?
- Is there a better way to solve my use case?
Answer 1
Score: 1
As long as you are using a thread pool of size 1, there is no problem.
However, if there are multiple threads, the code is indeed not thread-safe and is prone to data races.
We can demonstrate this with the following scenario.
Consider this code:
if (uniqueImages.containsKey(x)) {
    int val = uniqueImages.get(x);
    uniqueImages.put(x, val + 1);
}
else {
    uniqueImages.put(x, 1);
}
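Imagine thread 1 and thread 2 both return the same string and reach the line if(uniqueImages.containsKey(x)) together.
The if condition will evaluate to false in both threads, and uniqueImages.put(x, 1); will be called in both threads.
Because HashMap is not thread-safe, either the unsynchronized writes will corrupt the map's internal state or you will get a wrong count (1 instead of 2). A ConcurrentModificationException is typically associated with iterating over a map while it is modified, so it is not the failure to expect here.
If you plan to use multiple threads, then you must use a ConcurrentHashMap, and the check-and-update should be a single atomic call such as merge, because separate containsKey and put calls can still interleave.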
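The multi-threaded variant could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name ImageDeduplicator and the helper countOccurrences are hypothetical, and plain strings stand in for image checksums, as in the question. It relies on ConcurrentHashMap.merge, which performs the read-increment-write as one atomic step. Note also that a non-async stage such as thenApply or thenAccept may run on the thread that completes the future or on the thread that registers the stage, so a thread-safe map is the safer choice even with a pool of size 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ImageDeduplicator {

    // Hypothetical helper: counts how often each checksum occurs.
    // The strings stand in for image checksums, as in the question.
    static Map<String, Integer> countOccurrences(List<String> checksums, int poolSize) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<CompletableFuture<Void>> futures = new ArrayList<>();
            for (String checksum : checksums) {
                futures.add(CompletableFuture
                        // Placeholder for the real image extraction work.
                        .supplyAsync(() -> checksum, pool)
                        // merge inserts 1 or adds 1 to the existing count atomically.
                        .thenAccept(key -> counts.merge(key, 1, Integer::sum)));
            }
            // Wait for every task before reading the totals.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        } finally {
            // Shut the pool down so the JVM can exit.
            pool.shutdown();
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countOccurrences(Arrays.asList("a", "b", "a", "c", "a"), 4);
        System.out.println(counts.get("a") + " " + counts.size()); // prints 3 3
    }
}
```

Only the key and its count stay in the map; each task's result can be discarded as soon as merge returns, which matches the goal of not holding images in memory until all tasks finish.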