Cassandra Node refuses to Join Cluster "Compaction Executor" error

Question


We have a 3-node Cassandra cluster running the following version:

[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]

Node1 stopped communicating with the rest of the cluster this morning. The logs showed this:

ERROR [CompactionExecutor:242] 2020-09-15 19:24:48,753 CassandraDaemon.java:235 - Exception in thread Thread[CompactionExecutor:242,1,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,749 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[MutationStage-2,5,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,771 StorageService.java:466 - Stopping gossiper
ERROR [MutationStage-2] 2020-09-15 19:24:56,791 StorageService.java:476 - Stopping native transport
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,541 LogTransaction.java:277 - Transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377] indicates txn was not completed, trying to abort it now
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,545 LogTransaction.java:280 - Failed to abort transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377]
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,566 LogTransaction.java:225 - Unable to delete /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377/md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log as it does not exist, see debug log file for stack trace

Cassandra starts up fine on the "broken node", but refuses to rejoin the cluster.

When I run nodetool status, I get this:

**Error: The node does not have system_traces yet, probably still bootstrapping**

Gossip is not running. I've tried disabling and re-enabling it, with no luck.

I've also tried both a repair and a rebuild; both came back with no errors at all.

Any and all help would be appreciated.

Thanks.

Answer 1

Score: 3


The symptoms you describe indicate to me that the node had some form of hardware failure and that the data/ disk is possibly inaccessible.

In cases like this, the disk failure policy configured in cassandra.yaml kicks in:

disk_failure_policy: stop

With the stop policy, Cassandra shuts down gossip and the client transports. This would explain why gossip is unavailable (on default port 7000) and why the node is not accepting any client connections either (on default CQL port 9042).
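If you want a quick way to see which of those ports still answer, here is a minimal Python sketch. The host value and the port_open helper are my own assumptions for illustration, not anything shipped with Cassandra. Note that a listening socket on 7000 only shows the inter-node port is bound; it does not by itself confirm that gossip is active, so treat the result as a hint.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False


if __name__ == "__main__":
    host = "127.0.0.1"  # assumption: replace with the address the node listens on
    # Default ports named in the answer; adjust if cassandra.yaml overrides them
    for name, port in (("inter-node/gossip", 7000), ("CQL native transport", 9042)):
        state = "open" if port_open(host, port) else "closed or unreachable"
        print(f"{name} port {port}: {state}")
```

Run it on the broken node first, then from one of the healthy nodes with the broken node's address, to tell a stopped transport apart from a network problem.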

If there is an impending hardware failure, there's a good chance the disk/volume is mounted as read-only. There's also the possibility that the disk is full. Check the operating system logs for clues and you will likely need to escalate the issue to your sysadmin team. Cheers!
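To check the read-only and disk-full possibilities before digging through the OS logs, something like the following Python sketch may help. The check_data_dir function is a hypothetical helper, the path is the one from the log lines in the question, and the script assumes a Unix-like host. It should be run as the same user that Cassandra runs as, since it briefly creates a temporary file in the directory to test writes.

```python
import os
import shutil
import tempfile


def check_data_dir(path: str) -> dict:
    """Collect basic health facts about the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    flags = os.statvfs(path).f_flag  # mount flags (Unix only)
    report = {
        "path": path,
        "read_only_mount": bool(flags & os.ST_RDONLY),
        "percent_used": round(100 * usage.used / usage.total, 1) if usage.total else 0.0,
        "writable": False,
    }
    try:
        # Creating a temp file (removed automatically) is the most direct write test
        with tempfile.NamedTemporaryFile(dir=path):
            report["writable"] = True
    except OSError:
        # Read-only mount, full disk, or missing permissions all end up here
        pass
    return report


if __name__ == "__main__":
    data_dir = "/mnt/cass-a/data"  # path taken from the log lines in the question
    if os.path.isdir(data_dir):
        print(check_data_dir(data_dir))
    else:
        print(f"{data_dir} is not accessible")
```

A read_only_mount of True, or writable of False with percent_used close to 100, would point at the two causes mentioned above. Anything else means the OS logs are the next place to look.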

huangapple
  • Posted on September 17, 2020 at 17:25:34
  • Please keep this link when reposting: https://go.coder-hub.com/63935063.html