Dubbo + Nacos: Using Them This Way Loses High Availability (Part 3)


Now that we have this hypothesis, let's verify it right away.

Reading further into the nacos source code, we find that nacos provides data-consistency guarantees between cluster nodes using the Raft protocol (a consistency protocol built around leader election, briefly introduced at the end). The source code is shown below:

[Figure: screenshot of the Nacos Raft consistency source code]
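As background for reading that code: Raft only makes progress when a candidate collects votes from a strict majority of the configured cluster members. The following is a minimal, self-contained sketch of that majority rule (my own illustration, not Nacos's actual implementation):

```java
// Minimal sketch of Raft's majority (quorum) rule.
// Illustration only - not code from Nacos.
public class RaftQuorum {
    // A cluster of n members needs floor(n/2) + 1 votes
    // to elect a leader or commit a log entry.
    static int majority(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // A candidate wins the election iff it gathers at least
    // a majority of votes (its own vote included).
    static boolean canElectLeader(int votesReceived, int clusterSize) {
        return votesReceived >= majority(clusterSize);
    }

    public static void main(String[] args) {
        System.out.println(majority(3));          // a 3-node cluster needs 2 votes
        System.out.println(canElectLeader(2, 3)); // 2 live voters out of 3: enough
        System.out.println(canElectLeader(1, 3)); // 1 live voter out of 3: not enough
    }
}
```

By this arithmetic, a 3-node cluster should in principle survive the loss of one node; keep that in mind while reading the failure analysis that follows.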
Since there is a leader-election protocol, why did communication still fail? Let's dig into the nacos-server exception messages. When nacos-server-1 was shut down, several kinds of exceptions appeared under the nacos-server logs directory.

In naming-raft.log, the following exception:
java.lang.NullPointerException: null
	at com.alibaba.nacos.naming.consistency.persistent.raft.RaftCore.signalDelete(RaftCore.java:275)
	at com.alibaba.nacos.naming.consistency.persistent.raft.RaftConsistencyServiceImpl.remove(RaftConsistencyServiceImpl.java:72)
	at com.alibaba.nacos.naming.consistency.DelegateConsistencyServiceImpl.remove(DelegateConsistencyServiceImpl.java:53)
	at com.alibaba.nacos.naming.core.ServiceManager.easyRemoveService(ServiceManager.java:434)
	at com.alibaba.nacos.naming.core.ServiceManager$EmptyServiceAutoClean.lambda$null$1(ServiceManager.java:902)
	at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
	at com.alibaba.nacos.naming.core.ServiceManager$EmptyServiceAutoClean.lambda$null$2(ServiceManager.java:891)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.concurrent.ConcurrentHashMap$EntrySpliterator.forEachRemaining(ConcurrentHashMap.java:3606)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1870)
	at java.util.concurrent.ForkJoinPool.externalHelpComplete(ForkJoinPool.java:2467)
	at java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:324)
	at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:405)
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
	at com.alibaba.nacos.naming.core.ServiceManager$EmptyServiceAutoClean.lambda$run$3(ServiceManager.java:891)
	at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
	at com.alibaba.nacos.naming.core.ServiceManager$EmptyServiceAutoClean.run(ServiceManager.java:881)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2023-09-07 08:19:25,262 ERROR Raft remove failed.

In naming-push.log, the following exception:
java.lang.IllegalStateException: unable to find ackEntry for key: 10.21.140.23,43998,31247629183519634, ack json: {"type": "push-ack", "lastRefTime":"31247629183519634", "data":""}
	at com.alibaba.nacos.naming.push.PushService$Receiver.run(PushService.java:677)
	at java.lang.Thread.run(Thread.java:748)
2023-09-07 08:17:38,533 ERROR [NACOS-PUSH] error while receiving ack data

In naming-distro.log, the following exception:
2023-09-07 08:19:39,904 ERROR receive responsible key timestamp of com.alibaba.nacos.naming.iplist.ephemeral.dev-jzj##DEFAULT_GROUP@@providers:com.bm001.league.ordercenter.api.AdCluePoolApi:1.0: from 10.20.1.13:8848

Putting these exceptions together, we can infer that when nacos-server-1 was shut down, only 2 machines remained in the nacos-server cluster, and an exception occurred while they were electing a leader via the Raft protocol. As a result, the consumer could not find the leader node and could not establish proper communication, so it could not obtain the provider's metadata.
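The NullPointerException in RaftCore.signalDelete is consistent with a leader reference being null while the election is unsettled. Below is a hypothetical minimal reproduction of that failure mode; the class and field names are my own simplified stand-ins for Nacos's RaftCore and peer set, under the assumption that the leader is dereferenced without a null check:

```java
// Hypothetical sketch of how a null leader during re-election can surface as
// the NullPointerException seen in naming-raft.log.
// Names are simplified stand-ins, not Nacos's actual classes.
class Peer {
    String ip;
    Peer(String ip) { this.ip = ip; }
}

class PeerSet {
    // null whenever no leader has been elected yet (e.g. mid re-election)
    volatile Peer leader;
    Peer getLeader() { return leader; }
}

public class RaftCoreSketch {
    final PeerSet peers = new PeerSet();

    // Removing a persistent service record must be forwarded to the leader;
    // dereferencing the leader without a null check throws when none exists.
    void signalDelete(String key) {
        String leaderIp = peers.getLeader().ip; // NPE if election has not settled
        System.out.println("forward delete of " + key + " to " + leaderIp);
    }

    public static void main(String[] args) {
        RaftCoreSketch core = new RaftCoreSketch();
        try {
            core.signalDelete("providers:com.example.DemoApi");
        } catch (NullPointerException e) {
            // Mirrors the "Raft remove failed" ERROR line in the log above
            System.out.println("Raft remove failed: no leader elected");
        }
    }
}
```

In this sketch the background cleanup task (EmptyServiceAutoClean in the real stack trace) keeps firing while no leader exists, so every removal attempt fails with the same NPE.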
Let's keep verifying this inference!
Next, we shut down nacos-server-1 and nacos-server-2 at the same time, leaving only 1 nacos-server running: calls between the microservices became normal again. With a single node, leader election succeeds, and the consumer quickly established communication with the nacos-server. After all 3 nodes were started again, everything was also normal. This confirms that the 2 remaining nacos-server nodes did indeed have a leader-election problem.
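One nuance worth noting: by Raft's own quorum arithmetic, 2 surviving voters out of a 3-member cluster still form a majority, so a textbook Raft cluster should have kept electing a leader. The observed failure therefore points at the implementation rather than the protocol itself. A small simulation of the three scenarios from the experiment, under two simplifying assumptions of mine: every live node votes for the same candidate, and the lone surviving node effectively operates as a single-member cluster:

```java
// Toy election check for the three scenarios in the experiment above.
// Simplifying assumption: every live node votes for the same candidate.
public class ElectionScenarios {
    static int majority(int members) { return members / 2 + 1; }

    // A leader is elected iff the live voters reach a majority
    // of the configured membership.
    static boolean elected(int liveVoters, int configuredMembers) {
        return liveVoters >= majority(configuredMembers);
    }

    public static void main(String[] args) {
        // All 3 configured nodes alive: elected (3 >= 2)
        System.out.println(elected(3, 3)); // true
        // nacos-server-1 down, 2 of 3 alive: in theory still elects (2 >= 2),
        // yet the logs show the cluster failed here - an implementation issue.
        System.out.println(elected(2, 3)); // true
        // Only 1 node left, treated as a standalone single-member cluster:
        // it elects itself immediately (1 >= majority(1) = 1),
        // matching the observation that calls recovered.
        System.out.println(elected(1, 1)); // true
    }
}
```

Note the contrast with a strict 3-member view of the last scenario: 1 live voter against a configured membership of 3 would not reach quorum, which is why the single-node recovery suggests the surviving node was no longer counting the dead peers toward the majority.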

