How to Solve Akka Cluster Problems

Author: 魁首哥



Background

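In a recent project we deployed an Akka (2.6.8) cluster on Kubernetes (k8s) and hit the following problem: as long as an unreachable node was not manually dealt with, other nodes could not join the cluster.
Concretely, the pod of one non-seed node was restarted and rescheduled onto a different k8s node, so it came back with a new IP. The cluster, however, kept trying to connect to the old node address (IP), and that caused the failure.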

Root cause analysis

First, let's look at the concept of gossip convergence, as described in the Akka documentation:

Gossip convergence cannot occur while any nodes are unreachable. The nodes need to become reachable again, or moved to the down and removed states (see the Cluster Membership Lifecycle section). This only blocks the leader from performing its cluster membership management and does not influence the application running on top of the cluster. For example this means that during a network partition it is not possible to add more nodes to the cluster. The nodes can join, but they will not be moved to the up state until the partition has healed or the unreachable nodes have been downed.

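In other words, while any node is unreachable the cluster cannot reach gossip convergence. That does not affect the application running on the cluster, but it stops the leader from managing membership: newly joined nodes are never moved to Up until the partition heals or the unreachable nodes are downed.
Clearly, Akka requires every node to be either reachable or down before the members can agree on the cluster state.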
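To make the rule concrete, here is a deliberately simplified Java model (Java 16+). The names ConvergenceSketch, Member and Status and the addresses are hypothetical; it is not Akka code, and it ignores the "seen" bookkeeping that real gossip uses. It assumes, as the quote says, that convergence requires every unreachable member to be downed, and replays our scenario: the member at the pod's old IP stays unreachable, so the rescheduled pod stays Joining until the stale member is downed.

```java
import java.util.List;

// Simplified, hypothetical model of the convergence rule quoted above; not Akka code.
// Real gossip also tracks which members have "seen" the latest state, omitted here.
final class ConvergenceSketch {

    enum Status { JOINING, UP, DOWN }

    record Member(String address, Status status, boolean reachable) {}

    // Convergence is blocked by any unreachable member that has not been downed.
    static boolean canConverge(List<Member> members) {
        return members.stream().allMatch(m -> m.reachable() || m.status() == Status.DOWN);
    }

    // The leader promotes JOINING members to UP only when convergence is possible.
    static List<Member> leaderActions(List<Member> members) {
        if (!canConverge(members)) {
            return members; // joiners stay JOINING
        }
        return members.stream()
                .map(m -> m.status() == Status.JOINING
                        ? new Member(m.address(), Status.UP, m.reachable())
                        : m)
                .toList();
    }

    public static void main(String[] args) {
        // Hypothetical addresses: the rescheduled pod got a new IP and nothing
        // listens on the old one, so that member can never become reachable again.
        Member seed = new Member("akka://sys@10.0.0.1:25520", Status.UP, true);
        Member stale = new Member("akka://sys@10.0.0.7:25520", Status.UP, false);
        Member rescheduled = new Member("akka://sys@10.0.0.9:25520", Status.JOINING, true);

        List<Member> before = List.of(seed, stale, rescheduled);
        System.out.println(leaderActions(before).get(2).status()); // JOINING

        // Downing the stale member (manually or via SBR) unblocks the leader.
        Member downed = new Member(stale.address(), Status.DOWN, false);
        List<Member> after = List.of(seed, downed, rescheduled);
        System.out.println(leaderActions(after).get(2).status()); // UP
    }
}
```

The point of the model is that nothing in the cluster can make the old address reachable again, so the only way out is to down it.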

The Membership Lifecycle section says the same:

If a node is unreachable then gossip convergence is not possible and therefore most leader actions are impossible (for instance, allowing a node to become a part of the cluster). To be able to move forward, the node must become reachable again or the node must be explicitly “downed”. This is required because the state of an unreachable node is unknown and the cluster cannot know if the node has crashed or is only temporarily unreachable because of network issues or GC pauses. See the section about User Actions below for ways a node can be downed.

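So an unreachable node must either become reachable again or be downed. Unreachability might only be caused by network jitter, or by GC pauses that overload the server; Akka cannot tell these cases apart from a crash, so it simply keeps trying to reconnect, indefinitely.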

Solution

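Having found the cause, we need a fix, and the official documentation describes it: move unreachable nodes to the down state instead of leaving them unreachable. There are two ways to do that: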

Down the node explicitly by sending an HTTP request
Enable the Split Brain Resolver (SBR)
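For the first option, one possibility is the Cluster HTTP module of Akka Management. The Java (11+) sketch below only builds the request, and sends it only when run with --send. It assumes that module is enabled on a healthy node at its default port 8558 and that it downs a member on PUT /cluster/members/{address} with the form field operation=Down. Treat the endpoint, port and addresses as assumptions to check against the Akka Management documentation for your version; DownViaHttp is a hypothetical name.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// ASSUMPTIONS (check the Akka Management docs for your version): a healthy node runs
// akka-management with the cluster HTTP routes on the default port 8558, and it downs
// a member on PUT /cluster/members/{address} with the form field operation=Down.
final class DownViaHttp {

    // Builds the request only; nothing is sent here.
    static HttpRequest downRequest(String managementBaseUrl, String memberAddress) {
        URI uri = URI.create(managementBaseUrl + "/cluster/members/" + memberAddress);
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .PUT(HttpRequest.BodyPublishers.ofString("operation=Down"))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical addresses: a healthy node's management endpoint and the stale member.
        HttpRequest request = downRequest("http://10.0.0.1:8558", "akka://sys@10.0.0.7:25520");
        System.out.println(request.method() + " " + request.uri());

        // Only talk to a real cluster when explicitly asked to.
        if (args.length > 0 && args[0].equals("--send")) {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }
}
```

This is manual work: someone (or some script) has to notice the stale member and issue the call, which is why an automatic mechanism is usually preferable in k8s.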

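We leave the first approach for readers to explore and use the second. SBR offers five strategies: static-quorum, keep-majority, keep-oldest, down-all and lease-majority.
We chose keep-majority; for the pros, cons and use cases of each strategy, see the Strategies section of the official documentation.
Here is the Akka configuration for the keep-majority strategy: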

akka.coordinated-shutdown.exit-jvm = on
akka.coordinated-shutdown.exit-code = 0
akka.cluster.downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
akka.cluster.split-brain-resolver.down-all-when-unstable = off
akka.cluster.split-brain-resolver.stable-after = 20s
akka.cluster.split-brain-resolver.active-strategy = keep-majority
akka.cluster.split-brain-resolver.keep-majority.role = "admin"

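Setting	Description
akka.coordinated-shutdown.exit-jvm	Whether to exit the JVM when the node is removed from the cluster: on or off
akka.coordinated-shutdown.exit-code	Exit status code used when the JVM exits
akka.cluster.downing-provider-class	Set to akka.cluster.sbr.SplitBrainResolverProvider to enable SBR
akka.cluster.split-brain-resolver.down-all-when-unstable	How long the cluster may stay unstable (beyond stable-after) before all nodes are downed: on, off, or a duration such as 15s
akka.cluster.split-brain-resolver.stable-after	How long membership and reachability must stay unchanged before SBR acts, i.e. roughly how long a node stays unreachable before SBR downs it
akka.cluster.split-brain-resolver.active-strategy	The strategy to use; here keep-majority
akka.cluster.split-brain-resolver.keep-majority.role	If set, only members with this role are counted when SBR makes its decision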
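The two timing settings interact. As far as we understand the comments in Akka's reference.conf (check them for your version), SBR waits until reachability has been stable for stable-after before deciding; if no decision can be made within stable-after + down-all-when-unstable, counted from the first unreachability event, all nodes are downed, and the value on means 3/4 of stable-after but at least 4 seconds. The hypothetical SbrTimingSketch below only computes that deadline; its parser understands on, off and whole seconds such as 15s, nothing else.

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical sketch of how stable-after and down-all-when-unstable combine,
// following the comments in Akka's reference.conf (verify for your Akka version).
final class SbrTimingSketch {

    // Extra margin after stable-after; empty means down-all-when-unstable = off.
    // Only "on", "off" and whole seconds such as "15s" are parsed in this sketch.
    static Optional<Duration> downAllMargin(String setting, Duration stableAfter) {
        switch (setting) {
            case "off":
                return Optional.empty();
            case "on": {
                // Derived value: 3/4 of stable-after, but not less than 4 seconds.
                Duration derived = stableAfter.multipliedBy(3).dividedBy(4);
                Duration floor = Duration.ofSeconds(4);
                return Optional.of(derived.compareTo(floor) < 0 ? floor : derived);
            }
            default:
                if (!setting.endsWith("s")) {
                    throw new IllegalArgumentException("unsupported value: " + setting);
                }
                long seconds = Long.parseLong(setting.substring(0, setting.length() - 1));
                return Optional.of(Duration.ofSeconds(seconds));
        }
    }

    // Time from the first unreachability event after which all nodes are downed
    // if SBR still has not reached a decision.
    static Optional<Duration> downAllDeadline(String setting, Duration stableAfter) {
        return downAllMargin(setting, stableAfter).map(stableAfter::plus);
    }

    public static void main(String[] args) {
        Duration stableAfter = Duration.ofSeconds(20);
        System.out.println(downAllDeadline("on", stableAfter));  // Optional[PT35S]
        System.out.println(downAllDeadline("10s", stableAfter)); // Optional[PT30S]
        System.out.println(downAllDeadline("off", stableAfter)); // Optional.empty
    }
}
```

With stable-after = 20s, on would give a 35-second deadline; the configuration above uses off, so SBR never falls back to downing every node.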

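Note on akka.cluster.split-brain-resolver.keep-majority.role: if, for some other reason, only a minority of the nodes (fewer than half of the cluster) is left, and those nodes have this role, they will not shut down.
Without this setting, the minority nodes all exit, which brings down the whole cluster.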
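The following Java (16+) sketch models the keep-majority rule as the Akka documentation describes it: a node whose reachable side holds the majority (counting only members with the configured role, if one is set) downs the unreachable side, otherwise it downs its own side, and a tie goes to the side holding the lowest address. It is a simplified model with hypothetical names (KeepMajoritySketch, plain string ordering for addresses), not Akka's implementation. The example replays the situation from the note: two admin nodes cut off from three nodes without that role.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

// Simplified, hypothetical model of SBR's keep-majority rule; not Akka's implementation.
final class KeepMajoritySketch {

    record Node(String address, Set<String> roles) {}

    enum Decision { DOWN_UNREACHABLE, DOWN_REACHABLE }

    static Node node(String address, String... roles) {
        return new Node(address, Set.of(roles));
    }

    // Count only members carrying `role`; a null role counts every member.
    static List<Node> withRole(List<Node> nodes, String role) {
        return role == null
                ? nodes
                : nodes.stream().filter(n -> n.roles().contains(role)).toList();
    }

    // Decision taken by a node that still reaches `reachable` (itself included).
    static Decision decide(List<Node> reachable, List<Node> unreachable, String role) {
        List<Node> r = withRole(reachable, role);
        List<Node> u = withRole(unreachable, role);
        if (r.isEmpty() && u.isEmpty()) {
            // Sketch choice: refuse to decide when nobody has the role.
            throw new IllegalArgumentException("no members with role " + role);
        }
        if (r.size() > u.size()) return Decision.DOWN_UNREACHABLE; // we are the majority
        if (r.size() < u.size()) return Decision.DOWN_REACHABLE;   // we are the minority
        // Tie: the side holding the lowest address survives. Plain string order is
        // a simplification of Akka's address ordering.
        String lowest = Stream.concat(r.stream(), u.stream())
                .map(Node::address)
                .min(String::compareTo)
                .orElseThrow();
        boolean lowestIsOurs = r.stream().anyMatch(n -> n.address().equals(lowest));
        return lowestIsOurs ? Decision.DOWN_UNREACHABLE : Decision.DOWN_REACHABLE;
    }

    public static void main(String[] args) {
        // Seen from 10.0.0.1: two "admin" nodes are cut off from three nodes without that role.
        List<Node> reachable = List.of(
                node("akka://sys@10.0.0.1:25520", "admin"),
                node("akka://sys@10.0.0.2:25520", "admin"));
        List<Node> unreachable = List.of(
                node("akka://sys@10.0.0.3:25520"),
                node("akka://sys@10.0.0.4:25520"),
                node("akka://sys@10.0.0.5:25520"));
        System.out.println(decide(reachable, unreachable, null));    // DOWN_REACHABLE: 2 < 3
        System.out.println(decide(reachable, unreachable, "admin")); // DOWN_UNREACHABLE: 2 > 0
    }
}
```

Without a role the two admin nodes are the minority and down themselves; with role = "admin" only they are counted, so they keep running while the three other nodes, which see zero admin members on their side, down themselves.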

Thanks for reading. That covers how we solved this Akka Cluster problem; how well it works in your own setup still needs to be verified in practice. This is Yisu Cloud (亿速云), and we will keep publishing articles on related topics.


Published on 2022-01-10 23:39:56
