Spring Data repository: list vs stream

Question

What are the recommendations for when to define a method that returns a List versus one that returns a Stream in a Spring Data repository?

https://docs.spring.io/spring-data/jpa/docs/current/reference/html/#repositories.query-streaming

Example:

interface UserRepository extends Repository<User, Long> {

  List<User> findAllByLastName(String lastName);

  Stream<User> streamAllByFirstName(String firstName);

  // Other methods defined.
}

Please note that I am not asking about Page and Slice here; they are clear to me, and I found their description in the documentation.


My assumptions (am I wrong?):

  1. A Stream does not load all the records into the Java heap. Instead, it loads k records into the heap and processes them one by one; then it loads another k records, and so on.

  2. A List does load all the records into the Java heap at once.

  3. If I need a background batch job (for example, to calculate analytics), I could use a stream operation because I would not load all the records into the heap at once.

  4. If I need to return a REST response with all the records, I will need to load them into RAM anyway and serialize them into JSON. In this case, it makes sense to load a list at once.


I have seen some developers collect the stream into a list before returning a response.

class UserController {

    public ResponseEntity<List<User>> getUsers() {
        return new ResponseEntity<>(
                repository.streamByFirstName()
                        // OK, for a mapper it is nice syntactic sugar.
                        // Let's imagine there is no map for now...
                        // .map(someMapper)
                        .collect(Collectors.toList()),
                HttpStatus.OK);
    }
}

For this case, I do not see any advantage in using a Stream; using a List produces the same end result.

Are there, then, any examples where using a List is justified?

Answer 1

Score: 21

tl;dr

The primary differences between a Collection and a Stream are the following two aspects:

  1. Time to first result: when does the client code see the first element?
  2. The state of resources while processing: in what state are the underlying infrastructure resources while the stream is processed?

Working with collections

Let's talk this through with an example. Let's say we need to read 100k Customer instances from a repository. The way you (have to) handle the result gives a hint at both of the aspects described above.

List<Customer> result = repository.findAllBy();

The client code receives that list only once all elements have been completely read from the underlying data store, not a moment before. By then, however, the underlying database connections may already have been closed. For example, in a Spring Data JPA application you will see the underlying EntityManager closed and the entities detached unless you actively keep it open in a broader scope, e.g. by annotating the surrounding methods with @Transactional or by using an OpenEntityManagerInViewFilter. Also, you don't need to actively close the resources.

Working with streams

A stream will have to be handled like this:

@Transactional
void someMethod() {

  // try-with-resources closes the stream and releases the resources bound to it
  try (Stream<Customer> result = repository.streamAllBy()) {
    // … processing goes here
  }
}

With a Stream, processing can start as soon as the first element (e.g. a row in a database) arrives and is mapped. That is, you can already consume elements while the rest of the result set is still being processed. That also means that the underlying resources need to be actively kept open, as they are usually bound to the repository method invocation. Note how the Stream also has to be actively closed (try-with-resources), as it binds underlying resources and we somehow have to signal it to close them.
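
The difference in time to first result can be sketched without any database. The following is a plain-Java simulation, not Spring Data code: `readRow`, `findAllBy` and `streamAllBy` are hypothetical stand-ins that only log the order in which rows are read, processed and released.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Plain-Java simulation; no Spring Data or database is involved.
class TimeToFirstResultDemo {

    // Records the order in which rows are read and processed.
    static final List<String> events = new ArrayList<>();

    // Hypothetical stand-in for reading and mapping one row.
    static String readRow(int i) {
        events.add("read " + i);
        return "customer-" + i;
    }

    // List variant: every row is read before the caller sees anything.
    static List<String> findAllBy(int rows) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            result.add(readRow(i));
        }
        return result;
    }

    // Stream variant: a row is read only when the consumer asks for it.
    static Stream<String> streamAllBy(int rows) {
        return IntStream.range(0, rows)
                .mapToObj(TimeToFirstResultDemo::readRow)
                .onClose(() -> events.add("closed"));
    }

    static List<String> runWithList() {
        events.clear();
        for (String customer : findAllBy(3)) {
            events.add("process " + customer);
        }
        return new ArrayList<>(events);
    }

    static List<String> runWithStream() {
        events.clear();
        // try-with-resources triggers the onClose handler at the end.
        try (Stream<String> stream = streamAllBy(3)) {
            stream.forEach(customer -> events.add("process " + customer));
        }
        return new ArrayList<>(events);
    }
}
```

In this sketch, `runWithList()` logs all three reads before the first "process" entry, while `runWithStream()` interleaves reads with processing and ends with "closed".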

With JPA, without @Transactional the Stream cannot be processed properly, as the underlying EntityManager is closed when the method returns. You'd see a few elements processed, but then an exception in the middle of the processing.
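
That failure mode can be imitated in plain Java. This is only a simulation under stated assumptions: `FakeConnection` is a hypothetical stand-in for the EntityManager or connection, and the two "prefetched" rows are an assumed driver buffer that explains why a few elements succeed before the exception.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Plain-Java simulation of a cursor whose connection may close too early.
class ClosedResourceDemo {

    // Hypothetical stand-in for a connection or EntityManager.
    static class FakeConnection {
        boolean open = true;
    }

    static final int TOTAL_ROWS = 5;
    static final int PREFETCHED = 2; // rows assumed to be buffered already

    // Returns a lazy stream over TOTAL_ROWS rows read through the connection.
    static Stream<Integer> streamRows(FakeConnection connection) {
        Iterator<Integer> cursor = new Iterator<Integer>() {
            int position = 0;

            @Override
            public boolean hasNext() {
                return position < TOTAL_ROWS;
            }

            @Override
            public Integer next() {
                // Buffered rows survive; later rows need a live connection.
                if (position >= PREFETCHED && !connection.open) {
                    throw new IllegalStateException("connection is closed");
                }
                return position++;
            }
        };
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(cursor, Spliterator.ORDERED), false);
    }

    // No surrounding transaction: the resource is gone before consumption.
    static List<Integer> consumeWithoutTransaction() {
        FakeConnection connection = new FakeConnection();
        Stream<Integer> rows = streamRows(connection);
        connection.open = false; // simulates the close on method return
        List<Integer> processed = new ArrayList<>();
        try {
            rows.forEach(processed::add);
        } catch (IllegalStateException e) {
            processed.add(-1); // marker: processing failed part-way through
        }
        return processed;
    }

    // Surrounding scope keeps the connection open until processing is done.
    static List<Integer> consumeWithinTransaction() {
        FakeConnection connection = new FakeConnection();
        List<Integer> processed = new ArrayList<>();
        try (Stream<Integer> rows = streamRows(connection)) {
            rows.forEach(processed::add);
        } finally {
            connection.open = false;
        }
        return processed;
    }
}
```

Here `consumeWithoutTransaction()` processes the two buffered rows and then fails, whereas `consumeWithinTransaction()` processes all five.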

Downstream usage

So while you can theoretically use a Stream to e.g. build up JSON arrays efficiently, it significantly complicates the picture, as you need to keep the core resources open until you've written all elements. That usually means writing the code that maps objects to JSON and writes them to the response manually (using e.g. Jackson's ObjectMapper and the HttpServletResponse).
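
As a rough idea of what "writing them to the response manually" involves, here is a minimal plain-Java sketch. It is not the Jackson-based approach mentioned above: the `StringWriter` is a stand-in for the HTTP response, the elements are plain strings, and `quote` is a deliberately simplified, hypothetical escaper that only handles quotes and backslashes.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.Iterator;
import java.util.stream.Stream;

// Writes a JSON array element by element instead of collecting a list first.
class StreamingJsonDemo {

    // Simplified escaping (assumption): only backslashes and double quotes.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Each element is written as soon as the stream produces it.
    static void writeJsonArray(Stream<String> names, Writer out) throws IOException {
        out.write("[");
        Iterator<String> it = names.iterator();
        boolean first = true;
        while (it.hasNext()) {
            if (!first) {
                out.write(",");
            }
            out.write(quote(it.next()));
            first = false;
        }
        out.write("]");
        out.flush();
    }

    // The stream stays open until the last element has been written.
    static String render(String... names) {
        StringWriter response = new StringWriter(); // stand-in for the response body
        try (Stream<String> stream = Arrays.stream(names)) {
            writeJsonArray(stream, response);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return response.toString();
    }
}
```

The point of the sketch is the shape of the code: the write loop has to sit inside the try-with-resources block, which is exactly the extra responsibility that returning a List avoids.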

Memory footprint

While the memory footprint will likely improve, this mostly stems from the fact that you're likely avoiding the creation of intermediate and additional collections in the mapping steps (ResultSet -> Customer -> CustomerDTO -> JSON object). Elements already processed are not guaranteed to be evicted from memory, as they might be held onto for other reasons. Again, in JPA for example, you have to keep the EntityManager open as it controls the resource lifecycle, so all elements stay bound to that EntityManager and are kept around until all elements are processed.
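
The "intermediate collections" point can be illustrated with a small sketch. The three mapping functions below are hypothetical placeholders for the ResultSet -> Customer -> CustomerDTO -> JSON steps and operate on plain strings; only the structure of the two variants matters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Compares list-per-step mapping with a single stream pipeline.
class MappingStepsDemo {

    // Hypothetical mapping steps, reduced to string operations.
    static String toEntity(String row) {
        return row.trim();
    }

    static String toDto(String entity) {
        return entity.toUpperCase();
    }

    static String toJson(String dto) {
        return "{\"name\":\"" + dto + "\"}";
    }

    // Collection style: every step materializes a complete list.
    static List<String> viaIntermediateLists(List<String> rows) {
        List<String> entities = new ArrayList<>();
        for (String row : rows) {
            entities.add(toEntity(row));
        }
        List<String> dtos = new ArrayList<>();
        for (String entity : entities) {
            dtos.add(toDto(entity));
        }
        List<String> json = new ArrayList<>();
        for (String dto : dtos) {
            json.add(toJson(dto));
        }
        return json;
    }

    // Stream style: each element passes through all steps in turn,
    // so only the final list is materialized.
    static List<String> viaPipeline(List<String> rows) {
        return rows.stream()
                .map(MappingStepsDemo::toEntity)
                .map(MappingStepsDemo::toDto)
                .map(MappingStepsDemo::toJson)
                .collect(Collectors.toList());
    }
}
```

Both variants return the same result; the pipeline simply skips the two throwaway lists. It does not, by itself, release elements that something else (such as an open EntityManager) still references.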

Answer 2

Score: 2

A Stream and a Collection both represent a group of objects, but the problem with Collection and its implementations is that a Collection implementation holds all of its elements in memory. Stream was introduced in Java 8 to tackle this problem (and some others). Imagine a Collection with an infinite number of elements. Could you have one? Of course not: no matter how large your memory is, you will run out of memory at some point. A Stream does not have this problem. It can have an infinite number of elements because they are not stored in memory; they are generated on demand.
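
A small sketch of the "generated on demand" idea, using only the JDK: `Stream.iterate` describes an unbounded sequence, and elements are produced only as far as the pipeline asks for them. The class and method names are illustrative.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// An unbounded stream is fine as long as the pipeline is cut off somewhere.
class InfiniteStreamDemo {

    // Returns the first `count` square numbers from an infinite sequence.
    static List<Integer> firstSquares(int count) {
        return Stream.iterate(1, n -> n + 1) // 1, 2, 3, ... without end
                .map(n -> n * n)
                .limit(count)                // only `count` elements are generated
                .collect(Collectors.toList());
    }
}
```

Calling `firstSquares(5)` yields `[1, 4, 9, 16, 25]`. Note that this shows lazy generation in general; whether a repository-backed Stream fetches rows lazily depends on the underlying store and driver.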

Back to your question: imagine what happens if a great many records match the last name in your first query, findAllByLastName. You will get an OutOfMemoryError, but Stream solves this problem: no matter how many records meet your criteria, you won't get an OutOfMemoryError.
A Stream does not load all objects into memory; it loads them on demand, so it performs better on queries with large results.

So the answers to your questions:

  1. Yes, it loads elements into memory on demand, so it reduces memory consumption and the number of query calls to the database.

  2. Yes, a List loads all records that meet the criteria when you call that method.

  3. Yes, if you want to iterate through the records that meet some criteria and do some processing, you should use the Stream variant.

  4. This is the tricky one. In a way, no: when you are using WebFlux or similar approaches to reactive programming, I think it is better to go for the Stream variant.

Important note: regarding the developers you mentioned who collect the stream into a list before returning a response, they could boost performance by using WebFlux and returning the Stream itself. That is a much better approach.

huangapple
  • Posted on July 27, 2020 at 20:57:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63115831.html