Prometheus Alert Configuration


ํ™˜๊ฒฝ



Prometheus runs in a Docker container.
Cloud environment: Azure
Docker host: CentOS 7.3
Docker container: (Prometheus server) CentOS 7.3

<Monitoring targets>
Docker host: CentOS 7.3
Docker container: CentOS 7.3 (running Apache to stand in for a web server)

Prerequisites



· The Prometheus server is already installed
(see "Installing Prometheus on CentOS 7.3 & Docker")

Installing AlertManager



1. Copy the AlertManager URL

Download AlertManager from the official Prometheus site.

For this environment, select the following:
Operating system: linux
Architecture: amd64

Find alertmanager in the list and copy the link address.

2. Download
<Prometheus server>
## cd /usr/local/src
## wget https://github.com/prometheus/alertmanager/releases/download/v0.5.1/alertmanager-0.5.1.linux-amd64.tar.gz
## tar xfvz alertmanager-0.5.1.linux-amd64.tar.gz
## cd alertmanager-0.5.1.linux-amd64/
## cp -p alertmanager /usr/bin/.
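
The URL copied in step 1 appears to follow a regular pattern built from the version, operating system, and architecture. The sketch below rebuilds it from those three values. The pattern is inferred only from the single URL used above, and alertmanager_url is a hypothetical helper name, so check the result against the download page before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: rebuild the release URL from version, OS and architecture.
# The pattern is inferred from the one URL shown in step 2 (an assumption).
alertmanager_url() {
  version="$1"
  os="$2"
  arch="$3"
  printf 'https://github.com/prometheus/alertmanager/releases/download/v%s/alertmanager-%s.%s-%s.tar.gz\n' \
    "$version" "$version" "$os" "$arch"
}

# Values chosen in step 1: linux / amd64, version 0.5.1
alertmanager_url 0.5.1 linux amd64
```

The printed URL can then be passed to wget exactly as in the commands above.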

3. Place the configuration file
<Prometheus server>
## cd /etc/prometheus
## wget https://raw.githubusercontent.com/alerta/prometheus-config/master/alertmanager.yml
(default state)
## cat /etc/prometheus/alertmanager.yml
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'                  
  smtp_from: 'alertmanager@example.org'           

route:
  receiver: "alerta"
  group_by: ['alertname']
  group_wait:      30s
  group_interval:  5m
  repeat_interval: 2h

receivers:
- name: "alerta"
  webhook_configs:
  - url: 'http://localhost:8080/webhooks/prometheus'
    send_resolved: true

4. Configure automatic startup
Make sure AlertManager also starts automatically.
<Prometheus server>
## vi /etc/default/alertmanager
OPTIONS="-config.file /etc/prometheus/alertmanager.yml"

## vi /usr/lib/systemd/system/alertmanager.service

[Unit]
Description=Prometheus alertmanager Service
After=syslog.target network.target

[Service]
Type=simple
EnvironmentFile=-/etc/default/alertmanager
ExecStart=/usr/bin/alertmanager $OPTIONS
PrivateTmp=true

[Install]
WantedBy=multi-user.target


## systemctl enable alertmanager.service
Created symlink from /etc/systemd/system/multi-user.target.wants/alertmanager.service to /usr/lib/systemd/system/alertmanager.service.
## systemctl start alertmanager

5. Preparation before alert configuration (mail setup)



Set up mail delivery in whatever way suits your environment.
Because this environment is built on Azure VMs, mail sending was set up with reference to the following article.
Use SendGrid for sending mail on Azure

6. Alert configuration



Edit the config file from step 3 ("Place the configuration file").
This time we add email notification settings. The values are changed from the defaults, and the ★ notes explain each one.
<Prometheus server>
## cat alertmanager.yml
global:
# The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp.sendgrid.net:25'    # ★ SendGrid SMTP endpoint
  smtp_from: '****************@******'      # ★ Email address registered with SendGrid
  smtp_auth_username: '****@azure.com'      # ★ Username issued by SendGrid
  smtp_auth_password: '*******'             # ★ Password set in SendGrid (storing it in plaintext is not ideal)
  smtp_auth_secret: '*********'             # ★ API key issued by SendGrid

route:
  receiver: "mail"
  group_by: ['alertname', 'instance', 'severity']   # ★ Group alerts that share the same alert name, instance, and severity
  group_wait: 30s                                   # ★ Alerts arriving within 30 seconds are treated as one group
  group_interval: 10m                               # ★ Notify every 10 minutes
  repeat_interval: 1h                               # ★ Re-send an already-notified alert after 1 hour

#  receiver: "slack-notifications"
#  group_by: ['alertname', 'instance']

receivers:
 - name: 'mail'
   email_configs:
   - to: '*****@********,####@######'      # ★ Alert destination address (comma-separated when there are several)
                                           # ★ ... I tried hard, but could not get it to work...
                                           # ★ To use separate recipients, add another "- to:" entry in the same way

inhibit_rules:
 - source_match:
     severity: 'critical'                  # ★ When an alert's severity is critical,
   target_match:                           # ★ warning alerts with the same alert name are not notified.
     severity: 'warning'
   equal: ['alertname']
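
To make the inhibit rule above concrete, here is a small offline simulation of its effect. It is only a sketch of the matching logic (a warning is dropped when a critical alert with the same alertname is firing), not Alertmanager's actual implementation. The function name filter_inhibited and the sample alerts are hypothetical.

```shell
#!/bin/sh
# Hypothetical simulation of the inhibit rule: reads "alertname severity" pairs
# on stdin and prints only the alerts that would still be notified.
filter_inhibited() {
  awk '
    {
      name[NR] = $1
      sev[NR] = $2
      # remember every alertname that has a critical alert firing
      if ($2 == "critical") crit[$1] = 1
    }
    END {
      for (i = 1; i <= NR; i++) {
        # drop a warning when a critical alert shares its alertname
        if (!(sev[i] == "warning" && (name[i] in crit)))
          print name[i], sev[i]
      }
    }'
}

printf '%s\n' \
  'instance_down critical' \
  'instance_down warning' \
  'node_high_loadaverage warning' | filter_inhibited
```

In this sample the instance_down warning is suppressed because a critical instance_down alert is firing, while the node_high_loadaverage warning still goes out.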

7. Rule configuration



Work out the rules you need for your own environment.
<Prometheus server>
## cat /etc/prometheus/alert.rules
ALERT instance_down
  IF up == 0
  FOR 2m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes.",
  }

ALERT cpu_threshold_exceeded
  IF (100 * (1 - avg by(instance)(irate(node_cpu{job='node',mode='idle'}[5m])))) > THRESHOLD_CPU
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} CPU usage is dangerously high",
    description = "This device's cpu usage has exceeded the threshold with a value of {{ $value }}.",
  }

ALERT mem_threshold_exceeded
  IF (node_memory_MemFree{job='node'} + node_memory_Cached{job='node'} + node_memory_Buffers{job='node'})/1000000 < THRESHOLD_MEM
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} memory usage is dangerously high",
    description = "This device's memory usage has exceeded the threshold with a value of {{ $value }}.",
  }

ALERT filesystem_threshold_exceeded
  IF node_filesystem_avail{job='node',mountpoint='/'} / node_filesystem_size{job='node'} * 100 < THRESHOLD_FS
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} filesystem usage is dangerously high",
    description = "This device's filesystem usage has exceeded the threshold with a value of {{ $value }}.",
  }

ALERT node_high_loadaverage
  IF node_load1 > 2
  FOR 10s
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "High load average on {{$labels.instance}}",
    description = "{{$labels.instance}} has a high load average above 10s (current value: {{$value}})"
  }
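
The names THRESHOLD_CPU, THRESHOLD_MEM, and THRESHOLD_FS in the rules above look like placeholders rather than real values, so they probably need to be replaced with numbers before the rules can fire. The sketch below substitutes them with sed. The helper name render_rules and the values 90, 500, and 10 are assumptions for illustration. Going by the rule expressions, CPU is a percentage used, memory is in millions of bytes free, and filesystem is a percentage available.

```shell
#!/bin/sh
# Hypothetical helper: replace the THRESHOLD_* placeholders read from stdin.
# $1 = CPU usage limit (%), $2 = free memory floor (bytes / 1000000),
# $3 = filesystem available floor (%)
render_rules() {
  cpu="$1"
  mem="$2"
  fs="$3"
  sed -e "s/THRESHOLD_CPU/$cpu/g" \
      -e "s/THRESHOLD_MEM/$mem/g" \
      -e "s/THRESHOLD_FS/$fs/g"
}

# Example with made-up thresholds; real input would be the rules file itself,
# e.g. render_rules 90 500 10 < alert.rules.template > alert.rules
printf '%s\n' 'IF cpu_expr > THRESHOLD_CPU' 'IF mem_expr < THRESHOLD_MEM' | render_rules 90 500 10
```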


8. Integrate with Prometheus



Integrate Alertmanager with Prometheus.
Append the following to the end of /etc/prometheus/prometheus.yml.
alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets: ['<hostname>.japaneast.cloudapp.azure.com:9093']
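
Prometheus evaluates /etc/prometheus/alert.rules only if prometheus.yml lists that file under the rule_files key. This article does not show that part of the file, so it may already have been set during installation. The following sketch is a quick way to check. The helper name has_rule_file is hypothetical, and the grep patterns assume rule_files sits at the top level of the file with no leading spaces.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the config has a top-level rule_files key
# and mentions the given rule file path somewhere.
has_rule_file() {
  config="$1"
  rules="$2"
  grep -q '^rule_files:' "$config" && grep -qF -- "$rules" "$config"
}

# Self-contained demo against a temporary file instead of the real config
tmp=$(mktemp)
printf 'rule_files:\n  - /etc/prometheus/alert.rules\n' > "$tmp"
if has_rule_file "$tmp" /etc/prometheus/alert.rules; then
  echo "rule file is listed"
fi
rm -f "$tmp"
```

On the server itself the call would be has_rule_file /etc/prometheus/prometheus.yml /etc/prometheus/alert.rules.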

9. Finally



Check carefully that the configuration files are written correctly.
<Prometheus server>
## promtool check-config /etc/prometheus/prometheus.yml
## promtool check-config /etc/prometheus/alertmanager.yml


Restart alertmanager and Prometheus, and the setup is complete.
<Prometheus server>
## systemctl restart alertmanager
## systemctl restart prometheus


10. Verify operation



Stop one of the monitored servers to test.
An alert email should be sent.
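
If stopping a server is inconvenient, another option is to push a hand-made alert straight to Alertmanager. The sketch below only builds the JSON body, and the curl call is left commented out so the block runs offline. It assumes Alertmanager listens on localhost:9093 and accepts alerts on the v1 API path /api/v1/alerts. The helper name alert_payload and the sample label values are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build a one-alert JSON body from alertname, instance, severity.
alert_payload() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"%s"}}]\n' "$1" "$2" "$3"
}

# Print the payload for inspection
alert_payload test_alert web01:9100 warning

# To actually send it (assumption: Alertmanager on localhost:9093, v1 API):
# curl -s -XPOST -d "$(alert_payload test_alert web01:9100 warning)" http://localhost:9093/api/v1/alerts
```

With the route settings above, a notification for such an alert would be expected after the group_wait period rather than immediately.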



Reference sites



Tech-Sketch
Prometheus environment setup procedure
Use SendGrid for sending mail on Azure
