Cassandra Node refuses to Join Cluster "Compaction Executor" error
Question
We have a 3-node Cassandra cluster running the following version:
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Node1 stopped communicating with the rest of the cluster this morning; the logs showed this:
ERROR [CompactionExecutor:242] 2020-09-15 19:24:48,753 CassandraDaemon.java:235 - Exception in thread Thread[CompactionExecutor:242,1,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,749 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[MutationStage-2,5,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,771 StorageService.java:466 - Stopping gossiper
ERROR [MutationStage-2] 2020-09-15 19:24:56,791 StorageService.java:476 - Stopping native transport
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,541 LogTransaction.java:277 - Transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377] indicates txn was not completed, trying to abort it now
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,545 LogTransaction.java:280 - Failed to abort transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377]
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,566 LogTransaction.java:225 - Unable to delete /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377/md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log as it does not exist, see debug log file for stack trace
Cassandra starts up fine on the "broken node", but refuses to rejoin the cluster.
When I do a nodetool status I get this:
**Error: The node does not have system_traces yet, probably still bootstrapping**
Gossip is not running; I've tried disabling and re-enabling it, with no joy.
I've also tried both a repair and a rebuild; both came back with no errors at all.
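For reference, the gossip toggle and the repair/rebuild attempts above roughly correspond to the following nodetool operations (the source datacenter name is a placeholder):

```bash
# Check whether gossip is running on the local node
nodetool statusgossip

# Toggle gossip off and back on
nodetool disablegossip
nodetool enablegossip

# Repair this node, then rebuild it by streaming from another datacenter
# ("DC1" is a placeholder for the actual source datacenter name)
nodetool repair
nodetool rebuild DC1
```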
Any and all help would be appreciated.
Thanks.
Answer 1
Score: 3
The symptoms you described indicate to me that the node had some form of hardware failure and the data/ disk is possibly inaccessible.
In instances like this, the disk failure policy in cassandra.yaml kicked in:

disk_failure_policy: stop
This would explain why gossip is unavailable (on default port 7000) and the node would not be accepting any client connections either (on default CQL port 9042).
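A quick way to confirm this is to check which policy the node is actually configured with and whether the client transport really is down; a minimal sketch, assuming the package-default config path /etc/cassandra/cassandra.yaml (adjust for your install):

```bash
# Show the configured disk failure policy
# (valid values in 3.11 include die, stop_paranoid, stop, best_effort, ignore)
grep -n 'disk_failure_policy' /etc/cassandra/cassandra.yaml

# Confirm gossip and the native transport really are down on this node
nodetool statusgossip
nodetool statusbinary

# With the native transport stopped, nothing should be listening on CQL port 9042
ss -ltn | grep ':9042'
```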
If there is an impending hardware failure, there's a good chance the disk/volume is mounted as read-only. There's also the possibility that the disk is full. Check the operating system logs for clues; you will likely need to escalate the issue to your sysadmin team. Cheers!
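If you want to rule out the read-only and full-disk cases yourself before escalating, here is a minimal sketch (the mount point is taken from the data path in the error logs; adjust as needed):

```bash
# Is the data volume mounted read-only? (look for "ro" in the mount options)
mount | grep '/mnt/cass-a'

# Is the data volume out of space or out of inodes?
df -h  /mnt/cass-a
df -hi /mnt/cass-a

# Kernel messages usually show the underlying I/O errors or a forced read-only remount
dmesg -T | grep -iE 'i/o error|read-only|remount'
```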