2023年6月22日 17:43:47go评论108阅读模式

英文:

Parallel processing instead of Interators.forEachRemaining?

问题

我有这段代码，它从数据库中获取了20万条记录，然后批量处理这些记录。我注意到处理每个批次（每批处理5000条记录）大约需要3分钟（基本上是2小时的处理时间）。
我已经在数据库中为entryId使用了索引以加快查找记录的速度。我想知道的是是否可能让这个批量处理以某种方式并行运行。
如果不清楚的话，我愿意重写问题。

private void processBatch(List&lt;X&gt; list, List&lt;OutX&gt; built) {
    list.stream().map(m -&gt; {
            X xEntry = repository
                    .findLatestUpdateById(m.getEntryId());
            if (xEntry == null) {
                return null;
            }
            XOut out = buildXOut(x);
            return out;
    }).filter(Objects::nonNull).forEach(built::add);
}

更新1：
我尝试了以下方式来更新第一部分，这将处理时间缩短到大约25分钟。
还有其他建议吗？如何使这个过程更快？

List<x> lists = repository
.findBySourceFileCreationDate(getDateFromFilename(sourceFileName));

int batchSize = properties.getBuildOutput().getSize();
List<OutX> built = new ArrayList<>();// 我应该使用线程安全的向量吗？

ForkJoinPool forkJoinPool = new ForkJoinPool();
List<ForkJoinTask<?>> tasks = new ArrayList<>();

for (int i = 0; i < lists.size(); i += batchSize) {
int endIndex = Math.min(i + batchSize, lists.size());
List<X> batch = lists.subList(i, endIndex);
ForkJoinTask<?> task = forkJoinPool.submit(() -> processBatch(batch, built));
tasks.add(task);
}

tasks.forEach(ForkJoinTask::join);
forkJoinPool.shutdown();


<details>
<summary>英文:</summary>
I have this piece of code that is getting 200k entries from the database and afterwards is processing it in batches. What I noticed is that the processBatch (5000 entries per batch) it takes around 3 minutes to process (2 hours processing basically)
I am already using indexes for the database for the entryId to speedup finding entries.  What I am wondering is if is possible to have this batch processing somehow maybe running in parallel.
If is not clear I would gladly refactor the question

     Stream&lt;X&gt; streams= repository
            .streamBySourceFileCreationDate(getDateFromFilename(sourceFileName));
    int batchSize = properties.getBuildOutput().getSize();
    List&lt;OutX&gt; built = new ArrayList&lt;&gt;();
    Iterators.partition(streams.iterator(), batchSize)
            .forEachRemaining(list -&gt; processBatch(list, built));

private void processBatch(List<X> list, List<OutX> built) {
list.stream().map(m -> {
X xEntry = repository
.findLatestUpdateById(m.getEntryId());
if (xEntry == null) {
return null;
}
XOut out = buildXOut(x);
return out;
}).filter(Objects::nonNull).forEach(built::add);
}

UPDATE 1:
I tried like this to update the first part and this reduced the processing to around 25 min. 
Any other suggestion maybe how to make this faster ?

List<x> lists = repository
.findBySourceFileCreationDate(getDateFromFilename(sourceFileName));

int batchSize = properties.getBuildOutput().getSize();
List<OutX> built = new ArrayList<>();// Should I use maybe a vector for thread safe ?

    ForkJoinPool forkJoinPool = new ForkJoinPool();
    List&lt;ForkJoinTask&lt;?&gt;&gt; tasks = new ArrayList&lt;&gt;();
    for (int i = 0; i &lt; lists.size(); i += batchSize) {
        int endIndex = Math.min(i + batchSize, lists.size());
        List&lt;X&gt; batch = lists.subList(i, endIndex);
        ForkJoinTask&lt;?&gt; task = forkJoinPool.submit(() -&gt; processBatch(batch, built));
        tasks.add(task);
    }
    tasks.forEach(ForkJoinTask::join);
    forkJoinPool.shutdown();


</details>
# 答案1
**得分**: 1
以下是已经翻译好的代码部分：
```java
One of many possible options:
    List<X> readIn = repository.findBySourceFileCreationDate(getDateFromFilename(sourceFileName));
    List<OutX> built = readIn.parallelStream()
        .map(m ->
            {
              X xEntry = repository.findLatestUpdateById(m.getEntryId());
              if (xEntry == null) {
                return null;
              }
              return buildXOut(x);
            })
        .filter(Objects::nonNull)
        .collect(Collectors.toList());

英文:

One of many possible options:

List&lt;X&gt; readIn = repository.findBySourceFileCreationDate(getDateFromFilename(sourceFileName));
List&lt;OutX&gt; built = readIn.parallelStream()
    .map(m -&gt; 
        {
          X xEntry = repository.findLatestUpdateById(m.getEntryId());
          if (xEntry == null) {
            return null;
          }
          return buildXOut(x);
        })
    .filter(Objects::nonNull)
    .collect(Collectors.toList());

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

并行处理而不是使用Iterators.forEachRemaining？

问题

org.postgresql.util.PSQLException: 错误: 在或附近出现语法错误 “"-"”

Java，将字符串’yyyy-MM-dd’T’HH:mm:ss’解析为ZonedDateTime，不改变时间

访问 Google 电子表格 API，使用筛选条件获取数据。

In Tomcat vulnerability CVE-2020-13943 Detail, what is concurrent streams for a connection?

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。