Reducing operation time by using parallel stream

Question

In my Java 8 Spring Boot application, I have a list of 40000 records. For each record, I have to call an external API and save the result to the database. How can I do this as quickly as possible? Each API call takes about 20 seconds to complete. I used a parallel stream to reduce the time, but it made no noticeable difference.

if (!mainList.isEmpty()) {
    AtomicInteger counter = new AtomicInteger();
    List<List<PolicyAddressDto>> secondList =
        new ArrayList<List<PolicyAddressDto>>(
            mainList.stream()
                .collect(Collectors.groupingBy(it -> counter.getAndIncrement() / subArraySize))
                .values());
    for (List<PolicyAddressDto> listOfList : secondList) {
        listOfList.parallelStream()
            .forEach(t -> {
                // listDomain1 and listDomain2 are declared globally
                callAtheniumData(t, listDomain1, listDomain2);
            });

        if (!listDomain1.isEmpty()) {
            listDomain1Repository.saveAll(listDomain1);
        }
        if (!listDomain2.isEmpty()) {
            listDomain2Repository.saveAll(listDomain2);
        }
    }
}

Answer 1

Score: 1

Solving a problem in parallel always involves performing more actual work than doing it sequentially. Overhead is involved in splitting the work among several threads and joining or merging the results. Problems like converting short strings to lower-case are small enough that they are in danger of being swamped by the parallel splitting overhead.

As far as I can see, the API call response is not being saved.
Also, the API calls are all independent of one another.

Could we try creating a new thread for each API call?

for (List&lt;PolicyAddressDto&gt; listOfList : secondList) {
			listOfList.parallelStream()
					.forEach(t -&gt; {
						new Thread(() -&gt;{callAtheniumData(t, listDomain1, listDomain2)}).start(); 
					});
	}
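
Starting one raw thread per record could mean up to 40000 live threads, and nothing waits for them before the results are used. A bounded thread pool is a common alternative. The following is only a sketch under assumptions: `BoundedApiCaller`, `callAll` and `poolSize` are hypothetical names, and `call` stands in for whatever `callAtheniumData` does, rewritten to return its result instead of appending to shared lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of the original code.
final class BoundedApiCaller {

    /**
     * Applies `call` to every record on a fixed-size pool and returns the
     * results in input order. `poolSize` caps the number of calls in flight.
     */
    static <T, R> List<R> callAll(List<T> records, int poolSize, Function<T, R> call) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T record : records) {
                Callable<R> task = () -> call.apply(record);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>(futures.size());
            for (Future<R> future : futures) {
                try {
                    results.add(future.get()); // blocks until this call has finished
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for a call", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("A call failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown(); // stop accepting work; already submitted tasks still run
        }
    }
}
```

Because each task returns its value through a `Future`, only the calling thread builds the result list, so no shared list is mutated from several threads. The collected results can then be passed to `saveAll` once per batch.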

Answer 2

Score: 0

That's because a parallel stream runs on the common ForkJoinPool, which by default has (number of cores - 1) worker threads, with the calling thread also taking part. If every call to the external API takes 20 seconds and you have 4 cores, only 3 or 4 requests are in flight at a time, each waiting 20 seconds.
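
To put numbers on this, here is a rough back-of-the-envelope sketch. It assumes every call takes exactly 20 seconds and that calls run in full rounds of `concurrency` at a time; `ThroughputEstimate` and its methods are hypothetical names used only for illustration.

```java
import java.util.concurrent.ForkJoinPool;

// Hypothetical helper for estimating total wall-clock time.
final class ThroughputEstimate {

    /** Parallelism of the common pool that parallel streams use by default. */
    static int commonPoolParallelism() {
        return ForkJoinPool.getCommonPoolParallelism();
    }

    /** Hours needed for `records` calls of `secondsPerCall` each, `concurrency` at a time. */
    static double estimatedHours(int records, double secondsPerCall, int concurrency) {
        long rounds = (records + concurrency - 1) / concurrency; // ceiling division
        return rounds * secondsPerCall / 3600.0;
    }
}
```

With these assumptions, `estimatedHours(40_000, 20, 3)` comes to roughly 74 hours, while `estimatedHours(40_000, 20, 100)` comes to roughly 2.2 hours, provided the external API can actually sustain 100 concurrent requests.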

You can increase the concurrency of your calls as described in https://stackoverflow.com/a/21172732/574147, but I think you would just be moving the problem elsewhere.
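
One widely used way to raise the concurrency of a parallel stream is to start it from inside a dedicated `ForkJoinPool`. The sketch below assumes that approach; it relies on behaviour that is commonly observed but not guaranteed by the stream documentation, and `CustomPoolStream`, `mapInPool` and `parallelism` are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original code.
final class CustomPoolStream {

    /** Maps `items` through `call` with a parallel stream started inside its own pool. */
    static <T, R> List<R> mapInPool(List<T> items, int parallelism, Function<T, R> call) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // A stream started from a pool thread runs its tasks in that pool
            // (an implementation detail, so treat it as an assumption).
            Callable<List<R>> task =
                () -> items.parallelStream().map(call).collect(Collectors.toList());
            return pool.submit(task).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the stream", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A call failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting with `Collectors.toList()` instead of appending inside `forEach` keeps the results in input order and avoids writing to a shared list from several threads.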

A response time of 20 seconds is really slow for an API. If the work is a genuinely complex, CPU-bound computation, how could that service answer 10 concurrent requests with the same performance? It probably couldn't.

If, on the other hand, the work is I/O-bound and takes 20 seconds, you probably need a service that can accept (and process!) a list of elements.
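
If such a list-accepting service existed (an assumption; nothing in the question says the external API offers one), the client side would mostly come down to cutting the 40000 records into batches. A minimal sketch, with `Batches`, `partition` and `batchSize` as hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for splitting a list into fixed-size batches.
final class Batches {

    /** Splits `items` into consecutive batches of at most `batchSize` elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

With a batch size of 500, for example, 40000 records would become 80 requests. The same helper could also replace the `groupingBy` and `AtomicInteger` construction in the question without relying on a side-effecting counter.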

Answer 3

Score: 0

> Each of the API calls will take about 20 secs to complete.

Your external API is the bottleneck. There's really nothing your code can do on the client side to speed it up except parallelize the process. You've already done that, so if the external API belongs to your organization, you need to look into performance improvements there. If not, you can offload the processing, for example via Kafka to Apache NiFi or StreamSets, so that your Spring Boot API doesn't have to wait for hours while the data is processed.

huangapple
  • Posted on October 20, 2020, 18:53:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64443670.html