Cassandra Node refuses to Join Cluster "Compaction Executor" error
Question
We have a 3-node Cassandra cluster running the following version:
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Node1 stopped communicating with the rest of the cluster this morning; the logs showed this:
ERROR [CompactionExecutor:242] 2020-09-15 19:24:48,753 CassandraDaemon.java:235 - Exception in thread Thread[CompactionExecutor:242,1,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,749 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[MutationStage-2,5,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,771 StorageService.java:466 - Stopping gossiper
ERROR [MutationStage-2] 2020-09-15 19:24:56,791 StorageService.java:476 - Stopping native transport
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,541 LogTransaction.java:277 - Transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377] indicates txn was not completed, trying to abort it now
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,545 LogTransaction.java:280 - Failed to abort transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377]
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,566 LogTransaction.java:225 - Unable to delete /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377/md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log as it does not exist, see debug log file for stack trace
Cassandra starts up fine on the "broken node", but refuses to rejoin the cluster.
When I do a nodetool status I get this:
**Error: The node does not have system_traces yet, probably still bootstrapping**
Gossip is not running; I've tried disabling and re-enabling it, with no joy.
I've also tried both a repair and a rebuild; both came back with no errors at all.
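For reference, the gossip toggle and the repair/rebuild attempts above roughly correspond to the following nodetool operations (the source datacenter name is a placeholder):

```bash
# Check whether gossip is running on the local node
nodetool statusgossip

# Toggle gossip off and back on
nodetool disablegossip
nodetool enablegossip

# Repair this node, then rebuild it by streaming from another datacenter
# ("DC1" is a placeholder for the actual source datacenter name)
nodetool repair
nodetool rebuild DC1
```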
Any and all help would be appreciated.
Thanks.
Answer 1
Score: 3
The symptoms you described indicate to me that the node had some form of hardware failure and the data/ disk is possibly inaccessible.
In instances like this, the disk failure policy in cassandra.yaml kicked in:

disk_failure_policy: stop
This would explain why gossip is unavailable (on default port 7000) and the node would not be accepting any client connections either (on default CQL port 9042).
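A quick way to confirm this is to check which policy the node is actually configured with and whether the client transport really is down; a minimal sketch, assuming the package-default config path /etc/cassandra/cassandra.yaml (adjust for your install):

```bash
# Show the configured disk failure policy
# (valid values in 3.11 include die, stop_paranoid, stop, best_effort, ignore)
grep -n 'disk_failure_policy' /etc/cassandra/cassandra.yaml

# Confirm gossip and the native transport really are down on this node
nodetool statusgossip
nodetool statusbinary

# With the native transport stopped, nothing should be listening on CQL port 9042
ss -ltn | grep ':9042'
```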
If there is an impending hardware failure, there's a good chance the disk/volume is mounted as read-only. There's also the possibility that the disk is full. Check the operating system logs for clues; you will likely need to escalate the issue to your sysadmin team. Cheers!
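If you want to rule out the read-only and full-disk cases yourself before escalating, here is a minimal sketch (the mount point is taken from the data path in the error logs; adjust as needed):

```bash
# Is the data volume mounted read-only? (look for "ro" in the mount options)
mount | grep '/mnt/cass-a'

# Is the data volume out of space or out of inodes?
df -h  /mnt/cass-a
df -hi /mnt/cass-a

# Kernel messages usually show the underlying I/O errors or a forced read-only remount
dmesg -T | grep -iE 'i/o error|read-only|remount'
```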