JPA starts to consume more and more memory after each iteration

# Question
Currently I am trying to store some news from a web API with the help of JPA. I have three entities I need to store: Webpage, NewsPost, and the Query that returned the news post. I have one table for each of the three. My simplified JPA entities look like the following:
```java
@Entity
@Data
@Table(name = "NewsPosts", schema = "data")
@EqualsAndHashCode
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class NewsPost {
    @Id
    @Column(name = "id")
    private long id;

    @Basic
    @Column(name = "subject")
    private String subject;

    @Basic
    @Column(name = "post_text")
    private String postText;

    @ManyToOne(fetch = FetchType.LAZY, cascade = CascadeType.MERGE)
    @JoinColumn(name = "newsSite")
    private NewsSite site;

    @ManyToMany(fetch = FetchType.EAGER, cascade = CascadeType.MERGE)
    @JoinTable(name = "query_news_post", joinColumns = @JoinColumn(name = "newsid"), inverseJoinColumns = @JoinColumn(name = "queryid"))
    private Set<QueryEntity> queries;
}
```
```java
@Entity
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
@Table(name = "queries", schema = "data")
@EqualsAndHashCode
public class QueryEntity {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Column(name = "id")
    private int id;

    @EqualsAndHashCode.Exclude
    @Basic
    @Column(name = "query")
    private String query;

    // needs to be excluded, otherwise we can create a stack overflow because of circular references...
    @EqualsAndHashCode.Exclude
    @ToString.Exclude
    @ManyToMany(mappedBy = "queries", fetch = FetchType.LAZY, cascade = CascadeType.MERGE)
    Set<NewsPost> posts;
}
```
```java
@Entity
@Data
@Table(name = "sites", schema = "data")
@EqualsAndHashCode
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class NewsSite {
    @Id
    @Column(name = "SiteId")
    private long id;

    @Basic
    @Column(name = "SiteName")
    private String site;
}
```
Currently I'm doing the following: I create the query and retrieve its results. Then I start crawling: I get the objects back from the web API in paginated fashion with a page size of 100 news posts, and I use an object mapper to map the JSON response to my entity classes.
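For illustration, the JSON-to-object step might look roughly like this (a sketch; `NewsPostDto` and the page shape are assumptions, not taken from the question):

```java
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.util.List;

public class PageMapper {

    private final ObjectMapper objectMapper = new ObjectMapper();

    // Maps the raw JSON body of one result page (100 posts per page)
    // to a list of intermediate POJOs.
    public List<NewsPostDto> mapPage(String pageJson) throws IOException {
        return objectMapper.readValue(pageJson, new TypeReference<List<NewsPostDto>>() {});
    }
}
```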
Afterwards I tried two different things:

1. I added the query ID as a `Set` to the NewsPost and wrote it back to the DB with the `EntityManager`'s merge option. This worked quite well until I got a NewsPost again for another query; the merge then overwrote the old query association with the new one. To solve this I tried approach 2.
2. I check whether the NewsPost already exists; if it does, I retrieve the post, add the new query to the existing ones, and merge it back to the database as before. When doing this, the first batches work quite well and I get the expected results, but then suddenly the application starts to consume more and more memory during the third batch. I attached a screenshot from Java VisualVM. Does somebody have an idea why this happens?
Edit:
As some questions were raised in the comments, I would like to provide the answers here.
I think the crawling part works fine. The web API returns JSON. I'm using the Jackson mapper to map it to a POJO, and afterwards I'm using the Dozer mapper to convert the POJO to the entity. (Yes, I need the intermediate POJO step for other purposes in the application; this is working fine.)
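As a rough sketch, that Dozer step could look like the following (assuming Dozer 6.x and POJO field names that match the entity):

```java
import com.github.dozermapper.core.DozerBeanMapperBuilder;
import com.github.dozermapper.core.Mapper;

// One shared mapper instance; Dozer copies fields with matching names by default.
Mapper dozer = DozerBeanMapperBuilder.buildDefault();

// Convert one intermediate POJO into the JPA entity.
NewsPost entity = dozer.map(dto, NewsPost.class);
```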
Regarding the writing with the EntityManager, I'm not sure if I'm doing it correctly.
At first I created a JPA repository to check whether a post already exists (to get the old query IDs and avoid the overwriting issue in the (queryid, postid) join table). My JPA repository looks as follows:
```java
@Repository
public interface PostRepo extends JpaRepository<NewsPost, Long> {
    NewsPost getById(long id);
}
```
To update the posts, I do the following:
```java
private void updatePosts(List<NewsPost> posts) {
    posts.forEach(post -> {
        NewsPost foundPost = postRepo.getById(post.getId());
        if (foundPost != null) {
            post.getQueries().addAll(foundPost.getQueries());
        }
    });
}
```
I'm currently writing my entities as follows: I have a list of entities that also contains the updated posts, and I have an autowired EntityManagerFactory in the class that handles the writing.
```java
EntityManager em = entityManagerFactory.createEntityManager();
try {
    EntityTransaction transaction = em.getTransaction();
    transaction.begin();
    entities.forEach(entity -> em.merge(entity));
    em.flush();
    transaction.commit();
} finally {
    em.clear();
    em.close();
}
```
I'm pretty sure the problem is the writing process. If I keep the logic of my software the same but skip the merge, or just print or dump the entities to a file, everything works fast and no error appears, so it seems to be an issue with the merge call?
Regarding the question whether my program dies because of the memory consumption: it depends. If I run it on my Mac, it consumes up to 8+ gigabytes of RAM, but macOS handles this and swaps the RAM to disk. If I run it as a Docker container on CentOS, the process is killed due to insufficient memory.
Don't know if this is relevant, but I'm using OpenJDK 11, Spring Boot 2.2.6, and a MySQL 8 database.
I configured JPA as follows in my application.yml:
```yaml
spring:
  main:
    allow-bean-definition-overriding: true
  datasource:
    url: "jdbc:mysql://db"
    username: user
    password: secret
    driver-class-name: com.mysql.cj.jdbc.Driver
    test-while-idle: true
    validation-query: Select 1
  jpa:
    database-platform: org.hibernate.dialect.MySQL8Dialect
    hibernate:
      ddl-auto: none
    properties:
      hibernate:
        event:
          merge:
            entity_copy_observer: allow
```
# Answer 1
**Score**: 1
If the merge process is the problem, a quick fix to keep memory consumption low in the `EntityManager` could be to add `em.flush()` and `em.clear()` after every merge:
```java
EntityTransaction transaction = em.getTransaction();
transaction.begin();
entities.forEach(entity -> {
    em.merge(entity);
    em.flush();
    em.clear();
});
transaction.commit();
```
However, I think you should change your model. Loading all the existing queries of every post just to add new ones is very inefficient. You could model the N-M relation as its own entity and persist only the new relations.
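A minimal sketch of such a join entity, using the same Lombok style as the question (all names are assumptions):

```java
import java.io.Serializable;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.IdClass;
import javax.persistence.Table;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

// Each row is one (post, query) pair, so a new association is a plain
// INSERT and the existing associations never have to be loaded.
@Entity
@Data
@NoArgsConstructor
@AllArgsConstructor
@IdClass(QueryNewsPost.Key.class)
@Table(name = "query_news_post", schema = "data")
public class QueryNewsPost {

    @Id
    @Column(name = "newsid")
    private long newsId;

    @Id
    @Column(name = "queryid")
    private int queryId;

    // Composite key class; Lombok's @Data supplies equals/hashCode.
    @Data
    @NoArgsConstructor
    @AllArgsConstructor
    public static class Key implements Serializable {
        private long newsId;
        private int queryId;
    }
}
```

Persisting the relations discovered in a batch then boils down to inserting new rows.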
# Answer 2
**Score**: 1
Solved it on my own by experimenting. I created an entity for the many-to-many relation. Afterwards I created CRUD repositories for each entity and used saveAll from the CRUD repository. This works fine, also with respect to memory: the GC now produces the expected sawtooth pattern in the memory visualization. But I still have no clue why the many-to-many relation I created before, with the join table in the annotation, caused the memory-management issues. Could somebody explain why this solves my problem? Is @ManyToMany creating circular dependencies? But as far as I know, the GC also finds circular dependencies.
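A sketch of what that could look like, reusing the join entity sketched in Answer 1 (repository and variable names are assumptions):

```java
// CRUD repository for the join entity; saveAll inserts the new
// (post, query) pairs without loading any existing associations.
@Repository
public interface QueryNewsPostRepo extends CrudRepository<QueryNewsPost, QueryNewsPost.Key> {
}
```

Writing a batch then becomes a plain `queryNewsPostRepo.saveAll(newRelations)`.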
# Answer 3
**Score**: 0
An EAGER relation in a ManyToMany loads many objects. Regarding LAZY relations, make sure to fetch them: if you don't, walking the complete object graph to convert it to JSON or a POJO will issue a query for every object that has not been initialized by a fetch, which is dangerous. If you don't need all of them, you can use the @JsonIgnore annotation.
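For example, the lazy back-reference on the `QueryEntity` side could be kept out of serialization like this (a sketch):

```java
import com.fasterxml.jackson.annotation.JsonIgnore;

// Jackson skips this collection, so serializing a QueryEntity no longer
// touches the lazy posts (which would otherwise trigger extra queries or
// a LazyInitializationException outside a transaction).
@JsonIgnore
@ManyToMany(mappedBy = "queries", fetch = FetchType.LAZY, cascade = CascadeType.MERGE)
Set<NewsPost> posts;
```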