Apache Ignite 2.14: Getting "partition data has been lost" error for ignite-sys-atomic-cache
Question
I have an Apache Ignite 2.14 cluster of 3 nodes running on Kubernetes. All my caches have one backup copy.
After enabling persistence on the default data region a couple of months ago, I started getting the exception CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) whenever one or two nodes restarted, either as a result of a deployment or for some other reason.
It was worrying, but I learned to fix it by running control.sh --cache reset_lost_partitions cacheName.
This time, after two nodes restarted due to some transient failure, I started getting an error that I couldn't fix by running the mentioned command:
Caused by: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=ignite-sys-atomic-cache@default-ds-group, partition=985, key=UserKeyCacheObjectImpl [part=985, val=GridCacheInternalKeyImpl [name=alias, grpName=default-ds-group], hasValBytes=true]]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:214)
It looks like this time the issue involves a system cache, ignite-sys-atomic-cache@default-ds-group. I guess it is related to the AtomicSequence object that I use in the application to generate IDs. The error occurs exactly when I'm trying to use AtomicLong.
My questions are:
- Why might this happen?
- Is it possible to fix it without destroying the cluster and reloading all the data from scratch (which would take a day or two)?
- How can I prevent similar issues in the future?
Thank you in advance!
P.S. On GridGain Portal the following error is reported: Cache [default-ds-group] has zero partition copies.
Answer 1
Score: 1
To fix it you can run:
control.sh --cache reset_lost_partitions default-ds-group,default-volatile-ds-group@volatileDsMemPlc
Some system caches are partitioned and can lose partitions, just like normal user caches.
The command above should help in your case.
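To illustrate why a partitioned cache with one backup can lose partitions when two of three nodes are down at the same time, here is a minimal sketch. It uses a toy round-robin placement (owner i of partition p is node (p + i) % nodes), which is an assumption for illustration only and not Ignite's actual affinity function; the class and method names are hypothetical, and 1024 is used as the partition count.

```java
import java.util.HashSet;
import java.util.Set;

public class PartitionLossSketch {
    /**
     * Counts partitions whose every owner (primary plus backups) is among the down nodes.
     * Placement is a toy round-robin scheme, not Ignite's real affinity function.
     */
    static int countLost(int nodes, int partitions, int backups, int... downNodes) {
        Set<Integer> down = new HashSet<>();
        for (int n : downNodes) {
            down.add(n);
        }

        int lost = 0;
        for (int p = 0; p < partitions; p++) {
            boolean anyOwnerAlive = false;
            // Owner 0 is the primary, owners 1..backups are the backup copies.
            for (int i = 0; i <= backups && i < nodes; i++) {
                if (!down.contains((p + i) % nodes)) {
                    anyOwnerAlive = true;
                    break;
                }
            }
            if (!anyOwnerAlive) {
                lost++;
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        // 3 nodes, 1 backup, nodes 1 and 2 down: some partitions have no owner left.
        System.out.println("1 backup, 2 nodes down:  " + countLost(3, 1024, 1, 1, 2));
        // 3 nodes, 1 backup, only node 1 down: every partition still has an owner.
        System.out.println("1 backup, 1 node down:   " + countLost(3, 1024, 1, 1));
        // 3 nodes, 2 backups, nodes 1 and 2 down: node 0 holds a copy of everything.
        System.out.println("2 backups, 2 nodes down: " + countLost(3, 1024, 2, 1, 2));
    }
}
```

In this toy model, a single node restart never removes all owners of a partition, while two simultaneous restarts do for roughly a third of the partitions. With two backups on three nodes, every node holds a copy of every partition.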
As a workaround, you can change the backup factor and the group name:
https://www.gridgain.com/sdk/latest/javadoc/org/apache/ignite/configuration/AtomicConfiguration.html#setBackups-int-
https://www.gridgain.com/sdk/latest/javadoc/org/apache/ignite/configuration/AtomicConfiguration.html#setGroupName-java.lang.String-
If the data structure is volatile, it will have the group name "default-volatile-ds-group". Otherwise, if no group name is given, the name will be "default-ds-group". As far as I know, some of the cache creation logic depends on this group name.
Try the following example for your data structure:
AtomicConfiguration cfg = new AtomicConfiguration().setGroupName("testgrp");
cfg.setBackups(1);
cfg.setCacheMode(CacheMode.PARTITIONED);
IgniteAtomicReference<String> ref = ignite.atomicReference("ref", cfg, "d", true);
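Since the question uses AtomicSequence rather than an atomic reference, the same configuration can be sketched for a sequence. The sequence name "idSeq" and the backup count of 2 are assumptions chosen for a three-node cluster; the example starts a local node with the default configuration only so that it is self-contained. How a changed configuration interacts with a sequence that already exists under the default group is not verified here.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteAtomicSequence;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.AtomicConfiguration;

public class SequenceWithCustomGroup {
    public static void main(String[] args) {
        // Starts a node with the default configuration; a real deployment
        // would use its own IgniteConfiguration instead.
        try (Ignite ignite = Ignition.start()) {
            AtomicConfiguration cfg = new AtomicConfiguration()
                .setGroupName("testgrp")              // custom group instead of default-ds-group
                .setBackups(2)                        // assumption: two backups on a 3-node cluster
                .setCacheMode(CacheMode.PARTITIONED);

            // "idSeq" is a hypothetical name; 0 is the initial value,
            // and true means the sequence is created if it does not exist.
            IgniteAtomicSequence seq = ignite.atomicSequence("idSeq", cfg, 0, true);

            long id = seq.incrementAndGet();
            System.out.println("Generated id: " + id);
        }
    }
}
```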
Regards,
Andrei