Spring Data repository: list vs stream
Question
What are the recommendations for when to define a List method versus a Stream method in a Spring Data repository?
https://docs.spring.io/spring-data/jpa/docs/current/reference/html/#repositories.query-streaming
Example:
interface UserRepository extends Repository<User, Long> {
    List<User> findAllByLastName(String lastName);
    Stream<User> streamAllByFirstName(String firstName);
    // Other methods defined.
}
Please note that I am not asking about Page and Slice here. They are clear to me, and I found their description in the documentation.
My assumptions (am I wrong?):
- Stream does not load all the records into the Java heap. Instead, it loads k records into the heap and processes them one by one; then it loads another k records, and so on.
- List loads all the records into the Java heap at once.
- If I need a background batch job (for example, to calculate analytics), I could use a stream operation because I will not load all the records into the heap at once.
- If I need to return a REST response with all the records, I will need to load them into RAM anyway and serialize them into JSON. In this case, it makes sense to load a list at once.
I saw that some developers collect the stream into a list before returning a response.
class UserController {
    public ResponseEntity<List<User>> getUsers() {
        return new ResponseEntity<>(
            repository.streamByFirstName()
                // OK, as a place to plug in a mapper, this is nice syntactic sugar.
                // Let's imagine there is no mapping for now...
                // .map(someMapper)
                .collect(Collectors.toList()),
            HttpStatus.OK);
    }
}
For this case, I do not see any advantage in using a Stream; using a List would produce the same end result.
Are there, then, any examples where using a List is justified?
Answer 1
Score: 21
tl;dr
The primary difference between a Collection and a Stream comes down to the following two aspects:
- Time to first result – when does the client code see the first element?
- The state of resources while processing - in what state are underlying infrastructure resources while the stream is processed?
Working with collections
Let's talk this through with an example. Let's say we need to read 100k Customer instances from a repository. The way you (have to) handle the result gives a hint at both of the aspects described above.
List<Customer> result = repository.findAllBy();
The client code will receive that list only once all elements have been completely read from the underlying data store, not a moment before. By then, however, the underlying database connections may already have been closed. For example, in a Spring Data JPA application you will see the underlying EntityManager closed and the entities detached unless you actively keep it open in a broader scope, e.g. by annotating the surrounding methods with @Transactional or by using an OpenEntityManagerInViewFilter. Also, you don't need to actively close the resources.
Working with streams
A stream will have to be handled like this:
@Transactional
void someMethod() {
    try (Stream<Customer> result = repository.streamAllBy()) {
        // … processing goes here
    }
}
With a Stream, the processing can start as soon as the first element (e.g. a row in a database) arrives and is mapped. That is, you can already consume elements while other parts of the result set are still being processed. That also means that the underlying resources need to be kept open actively, as they are usually bound to the repository method invocation. Note how the Stream also has to be closed actively (try-with-resources), as it binds underlying resources and we somehow have to signal it to close them.
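The difference in time to first result, and the need to close the stream, can be illustrated without any database. The following is a minimal sketch in plain Java: FakeCursor is a hypothetical stand-in for a database cursor (it is not a Spring Data type), and the counters exist only to make the behaviour observable.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical stand-in for a database cursor; not a Spring Data type.
class FakeCursor {
    final AtomicInteger rowsFetched = new AtomicInteger();
    final AtomicBoolean closed = new AtomicBoolean();
    private final int totalRows;

    FakeCursor(int totalRows) {
        this.totalRows = totalRows;
    }

    // Eager: every row is read before the caller sees anything.
    List<String> list() {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < totalRows; i++) {
            rows.add(fetch(i));
        }
        closed.set(true); // nothing is left for the caller to release
        return rows;
    }

    // Lazy: a row is fetched only when the stream asks for it.
    Stream<String> stream() {
        Iterator<String> rows = new Iterator<String>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < totalRows;
            }

            @Override
            public String next() {
                return fetch(index++);
            }
        };
        Spliterator<String> spliterator =
                Spliterators.spliteratorUnknownSize(rows, Spliterator.ORDERED);
        // onClose runs when the caller closes the stream, e.g. via try-with-resources.
        return StreamSupport.stream(spliterator, false).onClose(() -> closed.set(true));
    }

    private String fetch(int i) {
        rowsFetched.incrementAndGet();
        return "row-" + i;
    }
}

class LazyVersusEagerDemo {
    public static void main(String[] args) {
        FakeCursor eager = new FakeCursor(100_000);
        String firstFromList = eager.list().get(0);
        System.out.println(firstFromList + " seen after fetching " + eager.rowsFetched.get() + " rows");

        FakeCursor lazy = new FakeCursor(100_000);
        try (Stream<String> rows = lazy.stream()) {
            String firstFromStream = rows.findFirst().get();
            System.out.println(firstFromStream + " seen after fetching " + lazy.rowsFetched.get() + " rows");
        }
        System.out.println("stream resources released: " + lazy.closed.get());
    }
}
```

In this sketch the list variant touches all 100,000 rows before the first element is visible, while the stream variant touches one. A real driver may prefetch rows in batches, so the actual numbers depend on the data store and its fetch-size settings.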
With JPA, the Stream cannot be processed properly without @Transactional, as the underlying EntityManager is closed when the method returns. You'd see a few elements processed, followed by an exception in the middle of the processing.
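The failure mode can be reproduced in miniature. The sketch below assumes a hypothetical FakeSession that plays the role of a transaction-scoped resource such as an EntityManager; it is not JPA code. It also simplifies: it fails on the very first row, whereas a real driver may already have buffered a few rows before the exception surfaces.

```java
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical stand-in for a transaction-scoped resource such as an EntityManager.
class FakeSession implements AutoCloseable {
    private boolean open = true;

    // Returns a lazy stream whose rows can only be read while the session is open.
    Stream<Integer> streamRows(int total) {
        Iterator<Integer> rows = new Iterator<Integer>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < total;
            }

            @Override
            public Integer next() {
                if (!open) {
                    throw new IllegalStateException("session closed while reading row " + index);
                }
                return index++;
            }
        };
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(rows, Spliterator.ORDERED), false);
    }

    @Override
    public void close() {
        open = false;
    }
}

class ClosedSessionDemo {
    // Mimics a repository call without a surrounding transaction:
    // the session is closed as soon as this method returns.
    static Stream<Integer> streamWithoutTransaction() {
        try (FakeSession session = new FakeSession()) {
            return session.streamRows(3);
        }
    }

    public static void main(String[] args) {
        try {
            streamWithoutTransaction().forEach(System.out::println);
        } catch (IllegalStateException e) {
            System.out.println("processing failed: " + e.getMessage());
        }
    }
}
```

Consuming the stream inside the scope that owns the session, which is what @Transactional on the calling method provides, avoids the exception.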
Downstream usage
So while you can theoretically use a Stream to, e.g., build up JSON arrays efficiently, it significantly complicates the picture, as you need to keep the core resources open until you've written all elements. That usually means writing the code that maps objects to JSON and writes them to the response manually (using e.g. Jackson's ObjectMapper and the HttpServletResponse).
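To give a rough idea of what writing the response manually involves, here is a minimal sketch in plain Java. It assumes the elements are already serialized JSON values and writes to a generic Writer; JsonArrayStreamer and writeJsonArray are hypothetical names, and a real application would delegate serialization to a JSON library and write to the servlet response instead.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;
import java.util.stream.Stream;

class JsonArrayStreamer {
    // Writes each element as soon as it is available instead of collecting a List first.
    // Assumption: every element is already a valid, serialized JSON value.
    static void writeJsonArray(Stream<String> jsonValues, Writer out) {
        // The stream is closed here, after the last element has been written.
        try (Stream<String> values = jsonValues) {
            out.write('[');
            Iterator<String> it = values.iterator();
            while (it.hasNext()) {
                out.write(it.next());
                if (it.hasNext()) {
                    out.write(',');
                }
            }
            out.write(']');
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        writeJsonArray(Stream.of("{\"id\":1}", "{\"id\":2}"), out);
        System.out.println(out); // prints [{"id":1},{"id":2}]
    }
}
```

The point of the sketch is the lifecycle: the stream, and whatever resources back it, must stay open for the entire duration of the write, which is exactly the complication described above.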
Memory footprint
While the memory footprint will likely improve, this mostly stems from the fact that you're likely avoiding the intermediate creation of collections and additional collections in mapping steps (ResultSet -> Customer -> CustomerDTO -> JSON Object). Elements already processed are not guaranteed to be evicted from memory, as they might be held onto for other reasons. Again, in JPA for example, you'd have to keep the EntityManager open as it controls the resource lifecycle, and thus all elements will stay bound to that EntityManager and will be kept around until all elements are processed.
Answer 2
Score: 2
Stream and Collection both represent a group of objects, but the problem with Collection and its implementations is that they hold all of their elements in memory. Stream was introduced in Java 8 to tackle this problem (among others). Imagine a Collection with an infinite number of elements. Could you have one? Of course not: no matter how large your memory is, you will run out of memory at some point. Stream does not have this problem. You can have an infinite number of elements in a Stream because they are not stored in memory; they are generated on demand.
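The on-demand generation described here is a property of java.util.stream in general and can be seen with an unbounded source. The sketch below is plain Java with a hypothetical class name; it says nothing about how a particular repository or driver buffers query results.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class InfiniteStreamDemo {
    // Takes the first n squares from a conceptually infinite sequence.
    static List<Integer> firstSquares(int n) {
        return Stream.iterate(1, i -> i + 1)   // unbounded source: 1, 2, 3, ...
                .map(i -> i * i)
                .limit(n)                      // only n elements are ever produced
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(firstSquares(5)); // prints [1, 4, 9, 16, 25]
    }
}
```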
Back to your question: imagine what happens if a great many records match the lastname in your first query, findAllByLastName. You will get an OutOfMemoryError. Stream solves this problem: no matter how many records meet your criteria, you won't get an OutOfMemoryError.
Stream does not load all the objects into memory; it loads them on demand, so it performs better on queries with large results.
So the answers to your questions are:
- Yes, it loads elements into memory on demand, so it reduces memory consumption and the number of query calls to the database.
- Yes, List loads all records that meet the criteria when you call that method.
- Yes, if you want to iterate through the records that meet some criteria and do some processing, you should use the Stream one.
- This is the tricky one. In a way, no: when you are using WebFlux or similar approaches to reactive programming, I think it is better to go for the Stream one.
Important note: regarding the developers you mentioned who collect the stream into a list before returning a response, they could boost performance by using WebFlux and returning the Stream itself. It is a much better approach.