以下文章来源于回滚吧代码 ,作者回滚君
好记性不如烂笔头,一直想把自己的学到的东西写下来,热切盼望与同行们交流提高。喜欢可以关注我,专注Java源码、架构、算法和面试。
Exporters:监控数据采集器,将数据通过Http的方式暴露给Prometheus Server;
Prometheus Server:负责对监控数据的获取,存储以及查询;获取的监控数据需要是指定的Metrics 格式,这样才能处理监控数据;对于查询Prometheus提供了PromQL方便对数据进行查询汇总,当然Prometheus本身也提供了Web UI;
AlertManager:Prometheus支持通过PromQL来创建告警规则,如果满足规则则创建一条告警,后续的告警流程就交给AlertManager,其提供了多种告警方式包括email,webhook等方式;
PushGateway:正常情况下Prometheus Server能够直接与Exporter进行通信,然后pull数据;当网络需求无法满足时就可以使用PushGateway作为中转站了;
社区提供
范围 | 常用Exporter |
用户自定义
独立运行
集成到应用中
<metric name>{<label name>=<label value>, ...}
metric name:指标的名称,主要反映被监控样本的含义 a-zA-Z_:*
_
label name: 标签 反映了当前样本的特征维度 [a-zA-Z0-9_]*
label value: 各个标签的值,不限制格式
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="nonheap",id="Metaspace",} -1.0
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="heap",id="PS Eden Space",} 1.033895936E9
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="nonheap",id="Code Cache",} 2.5165824E8
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="nonheap",id="Compressed Class Space",} 1.073741824E9
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="heap",id="PS Survivor Space",} 2621440.0
jvm_memory_max_bytes{application="springboot-actuator-prometheus-test",area="heap",id="PS Old Gen",} 2.09190912E9
Counter
# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total{application="springboot-actuator-prometheus-test",} 6.3123664E9
Gauge
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads{application="springboot-actuator-prometheus-test",} 20.0
Histogram和Summary
# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 0.008
jvm_gc_pause_seconds_count{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Allocation Failure",} 38.0
jvm_gc_pause_seconds_sum{action="end of minor GC",application="springboot-actuator-prometheus-test",cause="Allocation Failure",} 0.134
jvm_gc_pause_seconds_count{action="end of major GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",application="springboot-actuator-prometheus-test",cause="Metadata GC Threshold",} 0.073
Prometheus UI
Grafana
操作符
rabbitmq_queue_messages>0
聚合函数
sum
(求和), min
(最小值),max
(最大值),avg
(平均值)等等;sum(rabbitmq_queue_messages)>0
- name: queue-messages-warning
rules:
- alert: queue-messages-warning
expr: sum(rabbitmq_queue_messages{job='rabbit-state-metrics'}) > 500
labels:
team: webhook-warning
annotations:
summary: High queue-messages usage detected
threshold: 500
current: '{{ $value }}'
alert:告警规则的名称;
expr:基于PromQL表达式告警触发条件;
labels:自定义标签,通过其关联到具体Alertmanager上;
annotations:用于指定一组附加信息,比如用于描述告警详细信息的文字等;
global:
resolve_timeout: 5m
route:
receiver: webhook
group_wait: 30s
group_interval: 1m
repeat_interval: 5m
group_by:
- alertname
routes:
- receiver: webhook
group_wait: 10s
match:
team: webhook-warning
receivers:
- name: webhook
webhook_configs:
- url: 'http://ip:port/api/v1/monitor/alert-receiver'
send_resolved: true
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: '18'
generation: 18
labels:
app: prometheus
name: prometheus
namespace: monitoring
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
containers:
- image: 'prom/prometheus:latest'
imagePullPolicy: Always
name: prometheus-0
ports:
- containerPort: 9090
name: p-port
protocol: TCP
resources:
requests:
cpu: 250m
memory: 512Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/prometheus
name: config-volume
- image: 'prom/alertmanager:latest'
imagePullPolicy: Always
name: prometheus-1
ports:
- containerPort: 9093
name: a-port
protocol: TCP
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/alertmanager
name: alertcfg
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
volumes:
- name: data
persistentVolumeClaim:
claimName: monitoring-nfs-pvc
- configMap:
defaultMode: 420
name: prometheus-config
name: config-volume
- configMap:
defaultMode: 420
name: alert-config
name: alertcfg
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- 'rabbitmq_warn.yml'
alerting:
alertmanagers:
- static_configs:
- targets: ['127.0.0.1:9093']
scrape_configs:
- job_name: 'rabbit-state-metrics'
static_configs:
- targets: ['ip:port']
查看Exporter
查看Alerts
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: '1'
generation: 1
labels:
app: grafana
name: grafana
namespace: monitoring
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- image: grafana/grafana
imagePullPolicy: Always
name: grafana
ports:
- containerPort: 3000
protocol: TCP
resources: {}
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: '3'
labels:
k8s-app: rabbitmq-exporter
name: rabbitmq-exporter
namespace: monitoring
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 2
selector:
matchLabels:
k8s-app: rabbitmq-exporter
template:
metadata:
labels:
k8s-app: rabbitmq-exporter
spec:
containers:
- env:
- name: PUBLISH_PORT
value: '9098'
- name: RABBIT_CAPABILITIES
value: 'bert,no_sort'
- name: RABBIT_USER
value: xxxx
- name: RABBIT_PASSWORD
value: xxxx
- name: RABBIT_URL
value: 'http://ip:15672'
image: kbudde/rabbitmq-exporter
imagePullPolicy: IfNotPresent
name: rabbitmq-exporter
ports:
- containerPort: 9098
protocol: TCP
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
PrometheusMeterRegistry
和 CollectorRegistry
来以Prometheus 可以抓取的格式收集和导出指标数据;/prometheus
端点暴露出来。Prometheus 可以抓取该端点以定期获取度量标准数据。# HELP tomcat_sessions_created_sessions_total
# TYPE tomcat_sessions_created_sessions_total counter
tomcat_sessions_created_sessions_total{application="springboot-actuator-prometheus-test",} 1782.0
# HELP tomcat_sessions_active_current_sessions
# TYPE tomcat_sessions_active_current_sessions gauge
tomcat_sessions_active_current_sessions{application="springboot-actuator-prometheus-test",} 365.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads{application="springboot-actuator-prometheus-test",} 16.0
# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage{application="springboot-actuator-prometheus-test",} 0.0102880658436214
# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the young generation memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total{application="springboot-actuator-prometheus-test",} 9.13812704E8
# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{application="springboot-actuator-prometheus-test",id="mapped",} 0.0
jvm_buffer_count_buffers{application="springboot-actuator-prometheus-test",id="direct",} 10.0
...
- job_name: 'springboot-actuator-prometheus-test'
metrics_path: '/actuator/prometheus'
scrape_interval: 5s
basic_auth:
username: 'actuator'
password: 'actuator'
static_configs:
- targets: ['ip:8080']
curl -X POST http:``//ip:9090/-/reload
Counter:允许以固定的数值递增,该数值必须为正数;
Gauge:获取当前值的句柄。典型的例子是,获取集合、map、或运行中的线程数等;
Timer:Timer用于测量短时间延迟和此类事件的频率。所有Timer实现至少将总时间和事件次数报告为单独的时间序列;
LongTaskTimer:长任务计时器用于跟踪所有正在运行的长时间运行任务的总持续时间和此类任务的数量;
DistributionSummary:用于跟踪分布式的事件;
往期精彩回顾