Best Practices for Monitoring and Alerting on CoreDNS in K8s Clusters

2024/1/24 14:02:46

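This article is shared from the Huawei Cloud community post "Best Practices for Monitoring and Alerting on CoreDNS in K8s Clusters" (K8s集群CoreDNS监控告警最佳实践), author: 可以交个朋友.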

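1. Background

CoreDNS is a key component of a K8s cluster. It is mainly responsible for service discovery and domain name resolution inside the cluster. Resolution failures or resolution timeouts during operation need attention.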

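2. Solution Overview

The CCE cluster add-on kube-prometheus-stack can monitor CoreDNS metrics and provides an out-of-the-box dashboard view, so you can keep checking whether each CoreDNS runtime metric is healthy.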

[Figure: CoreDNS dashboard; to reach it, choose Monitoring > Dashboard]

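CCE Prometheus monitoring data is all sent to the Huawei Cloud AOM 2.0 service. In AOM 2.0 you can view the Prometheus metrics and, based on actual business needs, set up metric-based alert notifications.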

[Figure: the AOM instance that CCE Prometheus is connected to]

[Figure: AOM 2.0 view showing the metric data of the AOM instance]

3. Key CoreDNS Metrics

Make sure Prometheus is already scraping the CoreDNS metrics successfully.
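One way to check this is to ask a Prometheus-compatible HTTP API whether any coredns_dns_requests_total series exist. The sketch below is only an illustration and is not part of the original setup: the base URL is a placeholder, the helper names (build_query_url, series_count, coredns_is_scraped) are hypothetical, and it assumes the standard Prometheus /api/v1/query endpoint and its JSON response shape.

```python
import json
import urllib.parse
import urllib.request

# Assumption: placeholder address; replace it with your own Prometheus endpoint.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def series_count(response_body: str) -> int:
    """Count the series in an instant-query response; 0 if the query failed."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        return 0
    return len(payload.get("data", {}).get("result", []))


def coredns_is_scraped(base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> bool:
    """True if at least one coredns_dns_requests_total series is present."""
    url = build_query_url(base_url, "coredns_dns_requests_total")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return series_count(resp.read().decode("utf-8")) > 0
```

Calling coredns_is_scraped() against a reachable endpoint returns False when no CoreDNS series have been collected, which usually points to a scrape configuration problem rather than to CoreDNS itself.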

CoreDNS request rate: sum(rate(coredns_dns_requests_total{}[5m])) by (proto,instance)
CoreDNS request rate (grouped by record type): sum(rate(coredns_dns_requests_total{}[5m])) by (type,instance)
CoreDNS request rate (DO flag set): sum(rate(coredns_dns_do_requests_total{}[5m])) by (instance)
CoreDNS UDP request packet size:
P99: histogram_quantile(0.99,sum(rate(coredns_dns_request_size_bytes_bucket{proto="udp"}[5m])) by(le,proto,instance))
P90: histogram_quantile(0.90,sum(rate(coredns_dns_request_size_bytes_bucket{proto="udp"}[5m])) by(le,proto,instance))
P50: histogram_quantile(0.50,sum(rate(coredns_dns_request_size_bytes_bucket{proto="udp"}[5m])) by(le,proto,instance))
CoreDNS TCP request packet size:
P99: histogram_quantile(0.99,sum(rate(coredns_dns_request_size_bytes_bucket{proto="tcp"}[5m])) by(le,proto,instance))
P90: histogram_quantile(0.90,sum(rate(coredns_dns_request_size_bytes_bucket{proto="tcp"}[5m])) by(le,proto,instance))
P50: histogram_quantile(0.50,sum(rate(coredns_dns_request_size_bytes_bucket{proto="tcp"}[5m])) by(le,proto,instance))
CoreDNS response rate (grouped by response code): sum(rate(coredns_dns_responses_total{}[5m])) by(rcode,instance)
CoreDNS response latency:
P99: histogram_quantile(0.99,sum(rate(coredns_dns_request_duration_seconds_bucket{}[5m])) by(le,job,instance))
P90: histogram_quantile(0.90,sum(rate(coredns_dns_request_duration_seconds_bucket{}[5m])) by(le,job,instance))
P50: histogram_quantile(0.50,sum(rate(coredns_dns_request_duration_seconds_bucket{}[5m])) by(le,job,instance))
CoreDNS UDP response packet size:
P99: histogram_quantile(0.99,sum(rate(coredns_dns_response_size_bytes_bucket{proto="udp"}[5m])) by(le,proto,instance))
P90: histogram_quantile(0.90,sum(rate(coredns_dns_response_size_bytes_bucket{proto="udp"}[5m])) by(le,proto,instance))
P50: histogram_quantile(0.50,sum(rate(coredns_dns_response_size_bytes_bucket{proto="udp"}[5m])) by(le,proto,instance))
CoreDNS TCP response packet size:
P99: histogram_quantile(0.99,sum(rate(coredns_dns_response_size_bytes_bucket{proto="tcp"}[5m])) by(le,proto,instance))
P90: histogram_quantile(0.90,sum(rate(coredns_dns_response_size_bytes_bucket{proto="tcp"}[5m])) by(le,proto,instance))
P50: histogram_quantile(0.50,sum(rate(coredns_dns_response_size_bytes_bucket{proto="tcp"}[5m])) by(le,proto,instance))
Number of DNS records cached by CoreDNS: sum(coredns_cache_entries{}) by(type,instance)
CoreDNS cache hit rate (hits per second): sum(rate(coredns_cache_hits_total{}[5m])) by (type,instance)
CoreDNS cache miss rate (misses per second): sum(rate(coredns_cache_misses_total{}[5m])) by (type,instance)
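The P99, P90, and P50 queries in this list differ only in the quantile, the histogram name, and the label selector, so they can be generated from one template. The following is a minimal sketch of that idea; the function name quantile_query and its parameters are hypothetical and exist only for illustration.

```python
QUANTILES = (0.99, 0.90, 0.50)


def quantile_query(metric: str, quantile: float, group_by: tuple,
                   selector: str = "", window: str = "5m") -> str:
    """Build a histogram_quantile query for a CoreDNS histogram metric.

    metric is the histogram base name without the _bucket suffix;
    group_by lists the labels to keep in addition to le.
    """
    labels = ",".join(("le",) + tuple(group_by))
    return (
        f"histogram_quantile({quantile:.2f},"
        f"sum(rate({metric}_bucket{{{selector}}}[{window}])) by({labels}))"
    )


if __name__ == "__main__":
    # Print the three response latency queries from the list.
    for q in QUANTILES:
        print(quantile_query("coredns_dns_request_duration_seconds", q, ("job", "instance")))
```

Passing selector='proto="udp"' with the coredns_dns_request_size_bytes or coredns_dns_response_size_bytes metric yields the packet size queries in the same way.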

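Of these, focus mainly on P99 CoreDNS response latency, CoreDNS request rate, and CoreDNS cache hit rate. Because the domain name resolution timeout is typically 2s, a preliminary upper threshold of 1s for P99 response latency is a reasonable start; refine it later based on the actual monitoring data.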
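The cache counters are exposed as per-second rates of hits and misses, so a percentage has to be derived from them. The sketch below shows one way to do that and to compare a P99 latency sample against the 1s starting threshold. The function names are hypothetical, and the ratio formula hits / (hits + misses) is an assumption added here for illustration.

```python
from typing import Optional


def cache_hit_ratio(hits_per_s: float, misses_per_s: float) -> Optional[float]:
    """Return hits / (hits + misses), or None when there was no cache traffic.

    An equivalent PromQL form (an assumption, not from the original list):
    sum(rate(coredns_cache_hits_total[5m]))
      / (sum(rate(coredns_cache_hits_total[5m]))
         + sum(rate(coredns_cache_misses_total[5m])))
    """
    total = hits_per_s + misses_per_s
    if total == 0:
        return None
    return hits_per_s / total


def latency_breaches(p99_seconds: float, threshold_seconds: float = 1.0) -> bool:
    """True when the P99 response latency is above the threshold."""
    return p99_seconds > threshold_seconds
```

With a 2s resolution timeout, a 1s threshold fires while clients still have headroom before their lookups actually time out.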

4. How to Alert on CoreDNS Metrics

Go to the alert management tab in AOM.

Configure the alert rule.

Choose a metric alert rule; the rule can be configured with a PromQL statement.
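As an illustration of what such a PromQL statement might look like, the sketch below composes a P99 response latency expression with a comparison against the 1s starting threshold. It assumes standard Prometheus comparison semantics; whether AOM takes the threshold inside the expression or in a separate field is not covered here, and the function name p99_latency_alert_expr is hypothetical.

```python
def p99_latency_alert_expr(threshold_seconds: float = 1.0, window: str = "5m") -> str:
    """Return a PromQL expression that matches instances whose P99 CoreDNS
    response latency exceeds threshold_seconds over the given window."""
    quantile = (
        "histogram_quantile(0.99,"
        f"sum(rate(coredns_dns_request_duration_seconds_bucket{{}}[{window}])) by(le,job,instance))"
    )
    return f"{quantile} > {threshold_seconds:g}"


if __name__ == "__main__":
    print(p99_latency_alert_expr())
```

Keeping job and instance in the grouping means the expression returns one series per CoreDNS instance, so an alert can name the instance that is slow.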

Configure the alert notification rule.

When the metric alert rule is triggered, the alert arrives in the mailbox.
