Actor Cluster Sharding: remembered entities not recreated properly during rolling restart of nodes
Problem
In a 3-node Actor cluster, remembered entities are not recreated properly while doing a rolling restart of the 3 nodes.
The shards are completely rebalanced, but some of the entities are not recreated.
Cluster Configurations
akka.cluster.sharding.remember-entities = on
akka.cluster.sharding.remember-entities-store = ddata
akka.cluster.sharding.distributed-data.durable.keys = []
akka.remote.artery {
  enabled = on
  transport = tcp
}
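For context, a region with remember-entities enabled is started roughly as in the sketch below (classic Cluster Sharding Java API). The type name, entity Props and message extractor are placeholders rather than details from this post, and withRememberEntities(true) is simply the programmatic counterpart of the remember-entities = on flag above.

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.cluster.sharding.ClusterSharding;
import akka.cluster.sharding.ClusterShardingSettings;
import akka.cluster.sharding.ShardRegion;

public class ShardingSetup {

  // Starts a sharded entity type with remember-entities enabled.
  // The type name "Entity" is a placeholder; entityProps and extractor come from the caller.
  public static ActorRef startRegion(ActorSystem system, Props entityProps,
                                     ShardRegion.MessageExtractor extractor) {
    ClusterShardingSettings settings =
        ClusterShardingSettings.create(system).withRememberEntities(true);
    return ClusterSharding.get(system).start("Entity", entityProps, settings, extractor);
  }
}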
At the start, each of the 3 nodes has 100 shards with 1000 actors, for a total of 300 shards and 3000 actors.
- Node 1 -- 100 Shards \ 1000 Actors
- Node 2 -- 100 Shards \ 1000 Actors
- Node 3 -- 100 Shards \ 1000 Actors
1. When Node 1 goes down, the shards on Node 1 are rebalanced to Node 2 and Node 3, with all the remembered entities recreated on those nodes.
- Node 1 -- Down
- Node 2 -- 150 Shards \ 1500 Actors
- Node 3 -- 150 Shards \ 1500 Actors
2. When Node 1 comes back up, Node 2 goes down a few moments later. The shards and the remembered entities from Node 2 are recreated on Node 1.
- Node 1 -- 150 Shards \ 1500 Actors
- Node 2 -- Down
- Node 3 -- 150 Shards \ 1500 Actors
3. When Node 2 comes back up, Node 3 goes down a few moments later. The shards and the remembered entities from Node 3 are recreated on Node 2, but some of the entities are not recreated on Node 2. All the shards are rebalanced anyway.
- Node 1 -- 150 Shards \ 1500 Actors
- Node 2 -- 150 Shards \ 1423 Actors
- Node 3 -- Down
The issue here is:
When we restart Node 3 right after Node 2 has joined the cluster, the recreation of remembered entities is inconsistent.
In the meantime, messages are being sent to the actors on the cluster.
What can be the bottleneck here when Node 3 is restarted right after Node 2 joins?
Tried
1. If we do not restart Node 3, there is no issue with the entities.
2. If we restart Node 3 on its own some time after the rolling restart, there is no problem.
3. Increased/decreased the shard count.
4. Changed akka.cluster.distributed-data.majority-min-cap from the default 5 to 3; the issue still persists (see the config sketch below).
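For reference, Cluster Sharding's remember-entities replicator reads its own settings block, akka.cluster.sharding.distributed-data (the same block used above for durable.keys), so an override of majority-min-cap may also need to be applied there rather than only under akka.cluster.distributed-data. The values below are illustrative, not a recommendation:

// Settings for the replicator used by Cluster Sharding (remember-entities store)
akka.cluster.sharding.distributed-data {
  majority-min-cap = 3   // default 5
}

// Settings for the general-purpose Distributed Data replicator
akka.cluster.distributed-data {
  majority-min-cap = 3   // default 5
}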
Are there any configurations that need to be tuned?
Which part do we need to debug further to find the root cause?
Answer 1
Score: 0
Answer to my own question.
While debugging further, we found that the issue is related to replicating the remembered entities across the nodes in between the frequent restarts.
Debugging
- In the Actor cluster, a Replicator actor is created on each node for every role.
- The remembered entities are stored using a key-value approach. Within a shard there are five keys, and the entities are stored using an ORSet data structure.
- These key-value pairs are replicated across the nodes using the gossip protocol.
To retrieve these key-value pairs, we can send a Get message to the Replicator actor. This lets us fetch the desired key values stored in the Actor cluster.
// Read only this node's local view of the key (no remote consistency round-trip).
// replicator, orSetKey and timeout1 come from the surrounding context.
Key<ORSet<String>> rememberEntitiesKey = ORSetKey.create(orSetKey);
Replicator.Get<ORSet<String>> getCmd = new Replicator.Get<>(rememberEntitiesKey, Replicator.readLocal());
CompletableFuture<Object> ack = Patterns.ask(replicator, getCmd, timeout1).toCompletableFuture();
Object result = ack.get(5000, TimeUnit.MILLISECONDS);
Replicator.GetSuccess<ORSet<String>> orSet = (Replicator.GetSuccess<ORSet<String>>) result;

The entity ids held in the local ddata are returned by ---> orSet.dataValue().getElements()
To find out the current state of the remembered entities on each node, we can check the local data of that node. By looking at the local data, we can see how the remembered entities are currently stored.
For the case mentioned in the question, the local replicator's remembered-entities data looked like this:
- Initially, all nodes in the cluster have 3000 remembered entities.
- When Node 1 goes down, Node 2 and Node 3 still have the 3000 remembered entities in memory.
- When Node 1 comes back up, Node 2 goes down. Node 3 retains all 3000 remembered entities, and Node 1 replicates from Node 3 to restore its data.
- When Node 2 comes back up, Node 3 goes down. However, Node 1 had not fully replicated from Node 3 by then, so it only has around 2900+ entities in memory. Node 2 retrieves the missing entities from Node 1.
- When Node 3 comes back up, it replicates from both Node 1 and Node 2, resulting in the recreation of only those 2900+ entities.
- As a result, the cluster ends up with only the 2900+ entities in its final state.
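To make the per-node comparison above repeatable, the helper below applies the same Get pattern as the earlier snippet to a whole list of keys and returns the distinct entity ids visible in this node's local ddata. It is a debugging sketch, not a drop-in utility: the key names you pass in are assumed to match the ORSet keys actually used by your shard type, and the replicator reference is whatever replicator actor you queried above.

import akka.actor.ActorRef;
import akka.cluster.ddata.Key;
import akka.cluster.ddata.ORSet;
import akka.cluster.ddata.ORSetKey;
import akka.cluster.ddata.Replicator;
import akka.pattern.Patterns;
import java.time.Duration;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;

public class RememberedEntitiesProbe {

  // Collects the distinct entity ids present in this node's local ddata for the given keys.
  // keyNames is an assumption: it must list the ORSet keys your shard type really uses.
  public static Set<String> localEntityIds(ActorRef replicator, List<String> keyNames) throws Exception {
    Set<String> ids = new HashSet<>();
    for (String keyName : keyNames) {
      Key<ORSet<String>> key = ORSetKey.create(keyName);
      Replicator.Get<ORSet<String>> get = new Replicator.Get<>(key, Replicator.readLocal());
      Object reply = Patterns.ask(replicator, get, Duration.ofSeconds(5))
          .toCompletableFuture()
          .get(5000, TimeUnit.MILLISECONDS);
      if (reply instanceof Replicator.GetSuccess) {
        @SuppressWarnings("unchecked")
        Replicator.GetSuccess<ORSet<String>> success = (Replicator.GetSuccess<ORSet<String>>) reply;
        ids.addAll(success.dataValue().getElements());
      }
      // NotFound / GetFailure replies are simply skipped here; log them in real debugging code.
    }
    return ids;
  }
}

Running this on each node after every restart step makes the 3000-vs-2900+ divergence described above directly observable.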
Issue: under-replication in between the frequent restarts
Due to the frequent restarts, the Replicator may not manage to fully replicate the data before the next node goes down, resulting in incomplete or inconsistent data in the cluster.
By tuning the properties below we resolved this issue.
We configured these values:
akka.cluster.sharding.distributed-data {
  gossip-interval = 500 ms              // default 2 s
  notify-subscribers-interval = 100 ms  // default 500 ms
}
After implementing these changes, even during frequent rolling restarts the remembered-entities data is fully replicated to all nodes and all the entities are recreated successfully.
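One way to verify the effect of this tuning during later rolling restarts is to subscribe to a remember-entities key and watch the local replica converge; notify-subscribers-interval controls how quickly such a subscriber is told about changes. The watcher below is an illustration under the same assumption as before, namely that you know the ddata key names used by your shard type; the class name is made up.

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.Props;
import akka.cluster.ddata.Key;
import akka.cluster.ddata.ORSet;
import akka.cluster.ddata.ORSetKey;
import akka.cluster.ddata.Replicator;
import akka.event.Logging;
import akka.event.LoggingAdapter;

// Logs the size of one remember-entities ORSet whenever the local replica changes,
// which makes under-replication visible while nodes are being restarted.
public class RememberedEntitiesWatcher extends AbstractActor {

  private final LoggingAdapter log = Logging.getLogger(getContext().getSystem(), this);

  public static Props props(ActorRef replicator, String keyName) {
    return Props.create(RememberedEntitiesWatcher.class,
        () -> new RememberedEntitiesWatcher(replicator, keyName));
  }

  private RememberedEntitiesWatcher(ActorRef replicator, String keyName) {
    Key<ORSet<String>> key = ORSetKey.create(keyName);
    // Ask the replicator to notify this actor whenever the local value of the key changes.
    replicator.tell(new Replicator.Subscribe<>(key, getSelf()), ActorRef.noSender());
  }

  @Override
  public Receive createReceive() {
    return receiveBuilder()
        .match(Replicator.Changed.class, changed -> {
          ORSet<String> set = (ORSet<String>) changed.dataValue();
          log.info("Key {} now holds {} remembered entity ids locally",
              changed.key(), set.getElements().size());
        })
        .build();
  }
}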