Related software for the Prometheus deployment
To demonstrate Prometheus more fully, the related components are installed as well, so that the whole environment can be monitored end to end. The software involved is listed below; the exact versions used appear in the installation steps that follow:
- prometheus
- node_exporter
- grafana
- alertmanager
1. Deploying Prometheus
1.1 Common deployment methods
- Package installation: on Ubuntu and Debian, Prometheus can be installed directly with apt (see the sketch after this list)
- Binary installation
- Running with Docker
- Installation with the Kubernetes Operator
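For the package-based option, a minimal sketch on Ubuntu/Debian (the package in the distribution archives is usually older than the upstream release, which is why the binary install is used in the rest of this section):
#Install from the distribution repositories (package name as in the Debian/Ubuntu archives)
apt update
apt -y install prometheus
systemctl enable --now prometheus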
1.2 Starting directly from the Docker image
[root@cs7 data]# docker run -d --name prometheus -p 9090:9090 \
-v /opt/prometheus/node_exporter/prometheus.yml:/etc/prometheus/conf/prometheus.yml \
-v /prometheus-data:/opt/bitnami/prometheus/data \
-v /etc/localtime:/etc/localtime \
--restart=always \
bitnami/prometheus:latest \
--storage.tsdb.retention=15d \
--config.file=/etc/prometheus/conf/prometheus.yml
#--storage.tsdb.retention=15d changes the default metrics retention time
#--config.file changes the path of the configuration file to load
#Browse to: http://<prometheus-server>:9090/
1.3 Installing Prometheus from the binary package
1.3.1 Download and unpack the binary package
Official downloads:
https://prometheus.io/download/
[root@prometheus local]#pwd
/usr/local/
[root@prometheus local]#wget https://github.com/prometheus/prometheus/releases/download/v2.19.2/prometheus-2.19.2.linux-amd64.tar.gz
[root@prometheus local]#ls -l prometheus-2.19.2.linux-amd64.tar.gz
-rw-r--r-- 1 root root 64152439 Jul 5 21:13 prometheus-2.19.2.linux-amd64.tar.gz
[root@prometheus local]#tar xvf prometheus-2.19.2.linux-amd64.tar.gz
[root@prometheus local]#ln -s prometheus-2.19.2.linux-amd64 prometheus
[root@prometheus local]#cd prometheus/
[root@prometheus prometheus]#mkdir bin conf data
[root@prometheus prometheus]#mv prometheus promtool bin/
[root@prometheus prometheus]#mv prometheus.yml conf/
[root@prometheus prometheus]#useradd -r -s /sbin/nologin prometheus
[root@prometheus prometheus]#chown -R prometheus:prometheus /usr/local/prometheus/
[root@prometheus ~]#cat /etc/profile.d/prometheus.sh
export PROMETHEUS_HOME=/usr/local/prometheus
export PATH=${PROMETHEUS_HOME}/bin:$PATH
[root@prometheus ~]#source /etc/profile.d/prometheus.sh
#Inspect the configuration file; the defaults can be left unchanged
[root@prometheus ~]#grep -Ev "^ *#|^$" /usr/local/prometheus/conf/prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
rule_files:
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
#Explanation: the file consists mainly of the global and scrape_configs sections; for now the defaults are kept
#Check that the configuration file is valid
[root@prometheus ~]#promtool check config /usr/local/prometheus/conf/prometheus.yml
Checking /usr/local/prometheus/conf/prometheus.yml
SUCCESS: 0 rule files found
1.3.2 Create the service unit file
[root@prometheus ~]#vim /lib/systemd/system/prometheus.service
[root@prometheus ~]#cat /lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
[Service]
Restart=on-failure
User=prometheus
Group=prometheus
WorkingDirectory=/usr/local/prometheus/
ExecStart=/usr/local/prometheus/bin/prometheus --config.file=/usr/local/prometheus/conf/prometheus.yml
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65535
[Install]
WantedBy=multi-user.target
#Explanation of the unit:
#The custom Prometheus configuration file is passed with --config.file; because WorkingDirectory is /usr/local/prometheus/, the default --storage.tsdb.path="data/" resolves to /usr/local/prometheus/data
#Further options can be listed with prometheus --help
[root@prometheus ~]#systemctl daemon-reload
[root@prometheus ~]#systemctl enable --now prometheus.service
[root@prometheus ~]#ss -tnlp |grep prometheus
LISTEN 0 128 *:9090 *:* users:(("prometheus",pid=84755,fd=10))
#The output shows that port 9090 is now listening on this host; the Prometheus web UI can be reached at http://<prometheus-server>:9090/
Example: view the prometheus command-line options
I have no name!@59b9dbdb09c4:/usr/sbin$ prometheus --help
usage: prometheus [<flags>]
The Prometheus monitoring server
Flags:
-h, --help Show context-sensitive help (also try --help-long and --help-man).
--version Show application version.
--config.file="prometheus.yml"
Prometheus configuration file path.
--web.listen-address="0.0.0.0:9090"
Address to listen on for UI, API, and telemetry.
--web.config.file="" [EXPERIMENTAL] Path to configuration file that can enable TLS or
authentication.
--web.read-timeout=5m Maximum duration before timing out read of the request, and closing idle
connections.
--web.max-connections=512 Maximum number of simultaneous connections.
--web.external-url=<URL> The URL under which Prometheus is externally reachable (for example, if
Prometheus is served via a reverse proxy). Used for generating relative and
absolute links back to Prometheus itself. If the URL has a path portion, it
will be used to prefix all HTTP endpoints served by Prometheus. If omitted,
relevant URL components will be derived automatically.
--web.route-prefix=<path> Prefix for the internal routes of web endpoints. Defaults to path of
--web.external-url.
--web.user-assets=<path> Path to static asset directory, available at /user.
--web.enable-lifecycle Enable shutdown and reload via HTTP request.
--web.enable-admin-api Enable API endpoints for admin control actions.
--web.console.templates="consoles"
Path to the console template directory, available at /consoles.
--web.console.libraries="console_libraries"
Path to the console library directory.
--web.page-title="Prometheus Time Series Collection and Processing Server"
Document title of Prometheus instance.
--web.cors.origin=".*" Regex for CORS origin. It is fully anchored. Example:
'https?://(domain1|domain2)\.com'
--storage.tsdb.path="data/"
Base path for metrics storage. Use with server mode only.
--storage.tsdb.retention=STORAGE.TSDB.RETENTION
[DEPRECATED] How long to retain samples in storage. This flag has been
deprecated, use "storage.tsdb.retention.time" instead. Use with server mode
only.
--storage.tsdb.retention.time=STORAGE.TSDB.RETENTION.TIME
How long to retain samples in storage. When this flag is set it overrides
"storage.tsdb.retention". If neither this flag nor "storage.tsdb.retention"
nor "storage.tsdb.retention.size" is set, the retention time defaults to 15d.
Units Supported: y, w, d, h, m, s, ms. Use with server mode only.
--storage.tsdb.retention.size=STORAGE.TSDB.RETENTION.SIZE
Maximum number of bytes that can be stored for blocks. A unit is required,
supported units: B, KB, MB, GB, TB, PB, EB. Ex: "512MB". Use with server mode
only.
--storage.tsdb.no-lockfile
Do not create lockfile in data directory. Use with server mode only.
--storage.tsdb.allow-overlapping-blocks
Allow overlapping blocks, which in turn enables vertical compaction and
vertical query merge. Use with server mode only.
--storage.agent.path="data-agent/"
Base path for metrics storage. Use with agent mode only.
--storage.agent.wal-compression
Compress the agent WAL. Use with agent mode only.
--storage.agent.retention.min-time=STORAGE.AGENT.RETENTION.MIN-TIME
Minimum age samples may be before being considered for deletion when the WAL
is truncated Use with agent mode only.
--storage.agent.retention.max-time=STORAGE.AGENT.RETENTION.MAX-TIME
Maximum age samples may be before being forcibly deleted when the WAL is
truncated Use with agent mode only.
--storage.agent.no-lockfile
Do not create lockfile in data directory. Use with agent mode only.
--storage.remote.flush-deadline=<duration>
How long to wait flushing sample on shutdown or config reload.
--storage.remote.read-sample-limit=5e7
Maximum overall number of samples to return via the remote read interface, in
a single query. 0 means no limit. This limit is ignored for streamed response
types. Use with server mode only.
--storage.remote.read-concurrent-limit=10
Maximum number of concurrent remote read calls. 0 means no limit. Use with
server mode only.
--storage.remote.read-max-bytes-in-frame=1048576
Maximum number of bytes in a single frame for streaming remote read response
types before marshalling. Note that client might have limit on frame size as
well. 1MB as recommended by protobuf by default. Use with server mode only.
--rules.alert.for-outage-tolerance=1h
Max time to tolerate prometheus outage for restoring "for" state of alert. Use
with server mode only.
--rules.alert.for-grace-period=10m
Minimum duration between alert and restored "for" state. This is maintained
only for alerts with configured "for" time greater than grace period. Use with
server mode only.
--rules.alert.resend-delay=1m
Minimum amount of time to wait before resending an alert to Alertmanager. Use
with server mode only.
--alertmanager.notification-queue-capacity=10000
The capacity of the queue for pending Alertmanager notifications. Use with
server mode only.
--query.lookback-delta=5m The maximum lookback duration for retrieving metrics during expression
evaluations and federation. Use with server mode only.
--query.timeout=2m Maximum time a query may take before being aborted. Use with server mode only.
--query.max-concurrency=20
Maximum number of queries executed concurrently. Use with server mode only.
--query.max-samples=50000000
Maximum number of samples a single query can load into memory. Note that
queries will fail if they try to load more samples than this into memory, so
this also limits the number of samples a query can return. Use with server
mode only.
--enable-feature= ... Comma separated feature names to enable. Valid options: agent,
exemplar-storage, expand-external-labels, memory-snapshot-on-shutdown,
promql-at-modifier, promql-negative-offset, remote-write-receiver,
extra-scrape-metrics, new-service-discovery-manager. See
https://prometheus.io/docs/prometheus/latest/feature_flags/ for more details.
--log.level=info Only log messages with the given severity or above. One of: [debug, info,
warn, error]
--log.format=logfmt Output format of log messages. One of: [logfmt, json]
1.3.3 Test access
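As a quick check that the service is up (a minimal sketch; 192.168.225.167 is the Prometheus server address used elsewhere in this document), Prometheus exposes built-in health and readiness endpoints:
curl -s http://192.168.225.167:9090/-/healthy
curl -s http://192.168.225.167:9090/-/ready
#Or simply open http://<prometheus-server>:9090/ in a browser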
1.3.4 Dashboard menu overview
#Top-level menus
Alerts   #Prometheus alerting information
Graph    #Prometheus graphing and query page; this is the default page
Status   #Prometheus status pages
Help     #Prometheus help page
#Status sub-menus; the Status menu contains several sub-items with the following names and functions:
Runtime & Build Information   Runtime status of the server and basic information about its internal metrics
Command-Line Flags            The command-line flags the server was started with
Configuration                 The content of the loaded configuration file (YAML)
Rules                         The recording and alerting rules used for queries, alerts and visualization
Targets                       The monitored targets (hosts, services, ...), listed as endpoints
Service Discovery             The target lists found by the service-discovery mechanisms
Click the top-level Graph menu; the basic layout of that page is shown below
Note: selecting a metric simply generates a query expression, and expressions can do far more than the metric picker. These ad-hoc expressions are lost when the page is refreshed, which is one reason a visualization tool is needed. Select the metric "scrape_duration_seconds" and click "Execute" to see the result.
1.3.5 Tuning the configuration
[root@prometheus ~]#/usr/local/prometheus/bin/prometheus --help
.......
--web.read-timeout=5m #Maximum duration before timing out read of the request, and closing idle connections. Prevents too many idle connections from tying up resources
--web.max-connections=512 #Maximum number of simultaneous connections
--storage.tsdb.retention=15d #How long to retain samples in storage. Once Prometheus starts collecting, data accumulates in memory and on disk; too long a retention overwhelms both, too short and historical data is lost. 15 days is a sensible value in production and is also the default retention time
--storage.tsdb.path="data/" #Base path for metrics storage. A dedicated partition is recommended so the root filesystem cannot fill up; the default is the data/ directory
--query.timeout=2m #Maximum time a query may take before being aborted. 2m is the default
--query.max-concurrency=20 #Maximum number of queries executed concurrently. 20 is the default
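A sketch of how these flags could be applied to the systemd unit created in 1.3.2; the retention time, data path and connection limit below are examples rather than values required here:
#Extend ExecStart in /lib/systemd/system/prometheus.service, for example:
ExecStart=/usr/local/prometheus/bin/prometheus \
  --config.file=/usr/local/prometheus/conf/prometheus.yml \
  --storage.tsdb.path=/usr/local/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --web.max-connections=512
#--storage.tsdb.retention is deprecated; --storage.tsdb.retention.time replaces it
#Apply the change:
systemctl daemon-reload && systemctl restart prometheus.service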
View the configuration
2. Installing node_exporter
Install Node Exporter to collect monitoring metrics on each node host; it listens on port 9100
Official downloads:
https://prometheus.io/download/
2.1 Download and unpack
Install it on every node host that needs to be monitored
[root@node1 ~]#wget -P /usr/local/ https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
[root@node1 ~]#cd /usr/local
[root@node1 local]#tar xvf node_exporter-1.2.2.linux-amd64.tar.gz
[root@node1 local]#ln -s node_exporter-1.2.2.linux-amd64 node_exporter
[root@node1 local]#cd node_exporter
[root@node1 node_exporter]#mv node_exporter /usr/local/bin/
2.2 Prepare the service unit file
[root@node1 ~]#vim /lib/systemd/system/node_exporter.service
[root@node1 ~]#cat /lib/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
Example: view the node_exporter command-line options
[root@cs7 node_exporter]# /usr/local/bin/node_exporter --help
usage: node_exporter [<flags>]
Flags:
-h, --help Show context-sensitive help (also try --help-long and --help-man).
--collector.arp.device-include=COLLECTOR.ARP.DEVICE-INCLUDE
Regexp of arp devices to include (mutually exclusive to device-exclude).
--collector.arp.device-exclude=COLLECTOR.ARP.DEVICE-EXCLUDE
Regexp of arp devices to exclude (mutually exclusive to device-include).
--collector.bcache.priorityStats
Expose expensive priority stats.
--collector.cpu.guest Enables metric node_cpu_guest_seconds_total
--collector.cpu.info Enables metric cpu_info
--collector.cpu.info.flags-include=COLLECTOR.CPU.INFO.FLAGS-INCLUDE
Filter the `flags` field in cpuInfo with a value that must be a regular
expression
--collector.cpu.info.bugs-include=COLLECTOR.CPU.INFO.BUGS-INCLUDE
Filter the `bugs` field in cpuInfo with a value that must be a regular
expression
--collector.diskstats.device-exclude="^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\\d+n\\d+p)\\d+$"
Regexp of diskstats devices to exclude (mutually exclusive to device-include).
--collector.diskstats.device-include=COLLECTOR.DISKSTATS.DEVICE-INCLUDE
Regexp of diskstats devices to include (mutually exclusive to device-exclude).
--collector.ethtool.device-include=COLLECTOR.ETHTOOL.DEVICE-INCLUDE
Regexp of ethtool devices to include (mutually exclusive to device-exclude).
--collector.ethtool.device-exclude=COLLECTOR.ETHTOOL.DEVICE-EXCLUDE
Regexp of ethtool devices to exclude (mutually exclusive to device-include).
--collector.ethtool.metrics-include=".*"
Regexp of ethtool stats to include.
--collector.filesystem.mount-points-exclude="^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)"
Regexp of mount points to exclude for filesystem collector.
--collector.filesystem.fs-types-exclude="^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
Regexp of filesystem types to exclude for filesystem collector.
--collector.ipvs.backend-labels="local_address,local_port,remote_address,remote_port,proto,local_mark"
Comma separated list for IPVS backend stats labels.
--collector.netclass.ignored-devices="^$"
Regexp of net devices to ignore for netclass collector.
--collector.netclass.ignore-invalid-speed
Ignore devices where the speed is invalid. This will be the default behavior
in 2.x.
--collector.netclass.netlink
Use netlink to gather stats instead of /proc/net/dev.
--collector.netclass_rtnl.with-stats
Expose the statistics for each network device, replacing netdev collector.
--collector.netdev.device-include=COLLECTOR.NETDEV.DEVICE-INCLUDE
Regexp of net devices to include (mutually exclusive to device-exclude).
--collector.netdev.device-exclude=COLLECTOR.NETDEV.DEVICE-EXCLUDE
Regexp of net devices to exclude (mutually exclusive to device-include).
--collector.netdev.address-info
Collect address-info for every device
--collector.netdev.enable-detailed-metrics
Use (incompatible) metric names that provide more detailed stats on Linux
--collector.netdev.netlink
Use netlink to gather stats instead of /proc/net/dev.
--collector.netstat.fields="^(.*_(InErrors|InErrs)|Ip_Forwarding|Ip(6|Ext)_(InOctets|OutOctets)|Icmp6?_(InMsgs|OutMsgs)|TcpExt_(Listen.*|Syncookies.*|TCPSynRetrans|TCPTimeouts)|Tcp_(ActiveOpens|InSegs|OutSegs|OutRsts|PassiveOpens|RetransSegs|CurrEstab)|Udp6?_(InDatagrams|OutDatagrams|NoPorts|RcvbufErrors|SndbufErrors))$"
Regexp of fields to return for netstat collector.
--collector.ntp.server="127.0.0.1"
NTP server to use for ntp collector
--collector.ntp.server-port=123
UDP port number to connect to on NTP server
--collector.ntp.protocol-version=4
NTP protocol version
--collector.ntp.server-is-local
Certify that collector.ntp.server address is not a public ntp server
--collector.ntp.ip-ttl=1 IP TTL to use while sending NTP query
--collector.ntp.max-distance=3.46608s
Max accumulated distance to the root
--collector.ntp.local-offset-tolerance=1ms
Offset between local clock and local ntpd time to tolerate
--path.procfs="/proc" procfs mountpoint.
--path.sysfs="/sys" sysfs mountpoint.
--path.rootfs="/" rootfs mountpoint.
--path.udev.data="/run/udev/data"
udev data path.
--collector.perf.cpus="" List of CPUs from which perf metrics should be collected
--collector.perf.tracepoint=COLLECTOR.PERF.TRACEPOINT ...
perf tracepoint that should be collected
--collector.powersupply.ignored-supplies="^$"
Regexp of power supplies to ignore for powersupplyclass collector.
--collector.qdisc.fixtures=""
test fixtures to use for qdisc collector end-to-end testing
--collector.qdisk.device-include=COLLECTOR.QDISK.DEVICE-INCLUDE
Regexp of qdisk devices to include (mutually exclusive to device-exclude).
--collector.qdisk.device-exclude=COLLECTOR.QDISK.DEVICE-EXCLUDE
Regexp of qdisk devices to exclude (mutually exclusive to device-include).
--collector.rapl.enable-zone-label
Enables service unit metric unit_start_time_seconds
--collector.runit.servicedir="/etc/service"
Path to runit service directory.
--collector.stat.softirq Export softirq calls per vector
--collector.supervisord.url="http://localhost:9001/RPC2"
XML RPC endpoint.
--collector.sysctl.include=COLLECTOR.SYSCTL.INCLUDE ...
Select sysctl metrics to include
--collector.sysctl.include-info=COLLECTOR.SYSCTL.INCLUDE-INFO ...
Select sysctl metrics to include as info metrics
--collector.systemd.unit-include=".+"
Regexp of systemd units to include. Units must both match include and not
match exclude to be included.
--collector.systemd.unit-exclude=".+\\.(automount|device|mount|scope|slice)"
Regexp of systemd units to exclude. Units must both match include and not
match exclude to be included.
--collector.systemd.enable-task-metrics
Enables service unit tasks metrics unit_tasks_current and unit_tasks_max
--collector.systemd.enable-restarts-metrics
Enables service unit metric service_restart_total
--collector.systemd.enable-start-time-metrics
Enables service unit metric unit_start_time_seconds
--collector.tapestats.ignored-devices="^$"
Regexp of devices to ignore for tapestats.
--collector.textfile.directory=""
Directory to read text files with metrics from.
--collector.vmstat.fields="^(oom_kill|pgpg|pswp|pg.*fault).*"
Regexp of fields to return for vmstat collector.
--collector.wifi.fixtures=""
test fixtures to use for wifi collector metrics
--collector.arp Enable the arp collector (default: enabled).
--collector.bcache Enable the bcache collector (default: enabled).
--collector.bonding Enable the bonding collector (default: enabled).
--collector.btrfs Enable the btrfs collector (default: enabled).
--collector.buddyinfo Enable the buddyinfo collector (default: disabled).
--collector.cgroups Enable the cgroups collector (default: disabled).
--collector.conntrack Enable the conntrack collector (default: enabled).
--collector.cpu Enable the cpu collector (default: enabled).
--collector.cpufreq Enable the cpufreq collector (default: enabled).
--collector.diskstats Enable the diskstats collector (default: enabled).
--collector.dmi Enable the dmi collector (default: enabled).
--collector.drbd Enable the drbd collector (default: disabled).
--collector.drm Enable the drm collector (default: disabled).
--collector.edac Enable the edac collector (default: enabled).
--collector.entropy Enable the entropy collector (default: enabled).
--collector.ethtool Enable the ethtool collector (default: disabled).
--collector.fibrechannel Enable the fibrechannel collector (default: enabled).
--collector.filefd Enable the filefd collector (default: enabled).
--collector.filesystem Enable the filesystem collector (default: enabled).
--collector.hwmon Enable the hwmon collector (default: enabled).
--collector.infiniband Enable the infiniband collector (default: enabled).
--collector.interrupts Enable the interrupts collector (default: disabled).
--collector.ipvs Enable the ipvs collector (default: enabled).
--collector.ksmd Enable the ksmd collector (default: disabled).
--collector.lnstat Enable the lnstat collector (default: disabled).
--collector.loadavg Enable the loadavg collector (default: enabled).
--collector.logind Enable the logind collector (default: disabled).
--collector.mdadm Enable the mdadm collector (default: enabled).
--collector.meminfo Enable the meminfo collector (default: enabled).
--collector.meminfo_numa Enable the meminfo_numa collector (default: disabled).
--collector.mountstats Enable the mountstats collector (default: disabled).
--collector.netclass Enable the netclass collector (default: enabled).
--collector.netdev Enable the netdev collector (default: enabled).
--collector.netstat Enable the netstat collector (default: enabled).
--collector.network_route Enable the network_route collector (default: disabled).
--collector.nfs Enable the nfs collector (default: enabled).
--collector.nfsd Enable the nfsd collector (default: enabled).
--collector.ntp Enable the ntp collector (default: disabled).
--collector.nvme Enable the nvme collector (default: enabled).
--collector.os Enable the os collector (default: enabled).
--collector.perf Enable the perf collector (default: disabled).
--collector.powersupplyclass
Enable the powersupplyclass collector (default: enabled).
--collector.pressure Enable the pressure collector (default: enabled).
--collector.processes Enable the processes collector (default: disabled).
--collector.qdisc Enable the qdisc collector (default: disabled).
--collector.rapl Enable the rapl collector (default: enabled).
--collector.runit Enable the runit collector (default: disabled).
--collector.schedstat Enable the schedstat collector (default: enabled).
--collector.selinux Enable the selinux collector (default: enabled).
--collector.slabinfo Enable the slabinfo collector (default: disabled).
--collector.sockstat Enable the sockstat collector (default: enabled).
--collector.softnet Enable the softnet collector (default: enabled).
--collector.stat Enable the stat collector (default: enabled).
--collector.supervisord Enable the supervisord collector (default: disabled).
--collector.sysctl Enable the sysctl collector (default: disabled).
--collector.systemd Enable the systemd collector (default: disabled).
--collector.tapestats Enable the tapestats collector (default: enabled).
--collector.tcpstat Enable the tcpstat collector (default: disabled).
--collector.textfile Enable the textfile collector (default: enabled).
--collector.thermal_zone Enable the thermal_zone collector (default: enabled).
--collector.time Enable the time collector (default: enabled).
--collector.timex Enable the timex collector (default: enabled).
--collector.udp_queues Enable the udp_queues collector (default: enabled).
--collector.uname Enable the uname collector (default: enabled).
--collector.vmstat Enable the vmstat collector (default: enabled).
--collector.wifi Enable the wifi collector (default: disabled).
--collector.xfs Enable the xfs collector (default: enabled).
--collector.zfs Enable the zfs collector (default: enabled).
--collector.zoneinfo Enable the zoneinfo collector (default: disabled).
--web.telemetry-path="/metrics"
Path under which to expose metrics.
--web.disable-exporter-metrics
Exclude metrics about the exporter itself (promhttp_*, process_*, go_*).
--web.max-requests=40 Maximum number of parallel scrape requests. Use 0 to disable.
--collector.disable-defaults
Set all collectors to disabled by default.
--runtime.gomaxprocs=1 The target number of CPUs Go will run on (GOMAXPROCS)
--web.systemd-socket Use systemd socket activation listeners instead of port listeners (Linux
only).
--web.listen-address=:9100 ...
Addresses on which to expose metrics and web interface. Repeatable for
multiple addresses.
--web.config.file="" [EXPERIMENTAL] Path to configuration file that can enable TLS or
authentication.
--log.level=info Only log messages with the given severity or above. One of: [debug, info,
warn, error]
--log.format=logfmt Output format of log messages. One of: [logfmt, json]
--version Show application version.
2.3 Start the node_exporter service
[root@node1 ~]#systemctl daemon-reload
[root@node1 ~]#systemctl enable --now node_exporter.service
[root@node1 ~]#systemctl is-active node_exporter
active
[root@node1 ~]#ss -ntlp|grep node_exporter
LISTEN 0 128 *:9100 *:* users:(("node_exporter",pid=33761,fd=3))
#The output shows that port 9100 is now listening on this host
2.4 Deploying node_exporter as a container
[root@cs7 ~]# docker run -itd --name node_exporter -p 9100:9100/tcp --restart=always bitnami/node-exporter:latest
78cacac1e53a83a20792e995cd9fec08f04d9088a8153fce04c468f53adfa1ed
2.5 Access the node_exporter web interface
Open it in a browser
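A quick command-line check (a sketch; substitute the address of the node being monitored) that the exporter is serving metrics:
curl -s http://192.168.225.167:9100/metrics | head
#Every node_exporter instance exposes its metrics at http://<node>:9100/metrics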
3. Collecting node_exporter data with Prometheus
Configure Prometheus to scrape the node-level monitoring metrics collected by the node_exporter component
3.1 Modify the Prometheus configuration file
scrape_interval: 15s    #Interval at which target hosts are scraped
scrape_timeout: 10s     #Scrape timeout; the default is 10s
evaluation_interval: 1m #Interval at which rules are evaluated; the default is 1m
[root@cs7 ~]# vim /opt/prometheus/node_exporter/prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
rule_files:
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'node_exporter'   #Add the following lines to define the node_exporter targets to monitor
    metrics_path: '/metrics'    #Path from which the metrics are scraped
    static_configs:
      - targets: ['192.168.225.167:9100','192.168.225.112:9100','192.168.225.113:9100','192.168.225.114:9100']
      #- targets:               #the list form below is also supported
      #  - "192.168.225.167:9100"
      #  - "192.168.225.112:9100"
      #  - "192.168.225.113:9100"
      #  - "192.168.225.114:9100"
#Explanation:
#A new job_name with its own static_configs is added
#targets are the instances introduced in the concepts section earlier, written as "ip:port"
#Check the syntax
[root@prometheus ~]#promtool check config /usr/local/prometheus/conf/prometheus.yml
#Restart the Prometheus service:
[root@prometheus ~]#systemctl restart prometheus
#If Prometheus runs as a container, the configuration file has to be mounted into the container:
[root@cs7 data]# docker run -d --name prometheus -p 9090:9090 \
-v /opt/prometheus/node_exporter/prometheus.yml:/etc/prometheus/conf/prometheus.yml \
-v /prometheus-data:/opt/bitnami/prometheus/data \
-v /etc/localtime:/etc/localtime \
bitnami/prometheus:latest \
--storage.tsdb.retention=15d \
--config.file=/etc/prometheus/conf/prometheus.yml
#--storage.tsdb.retention=15d changes the default metrics retention time
#--config.file changes the path of the configuration file to load
#Browse to: http://<prometheus-server>:9090/
3.2 Verify the node_exporter target status in Prometheus
Open the following address in a browser
http://192.168.225.167:9090/targets
#On this page, hovering over the Labels column shows that target's attributes; the attributes whose names start with "__" are the target's private attributes. Clicking the Endpoint link opens the full list of metrics exposed by that node.
A packet capture shows that the Prometheus server actively pulls the metrics from the node_exporter with HTTP GET requests
3.3 Verify the node_exporter metric data in Prometheus
Click a host link on the page above to jump directly to that node's metrics page
3.4 Using PromQL to get the node's used and free memory percentages
- Percentage of available memory: (available memory / total memory) * 100
  (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)*100
- Percentage of used memory: 100 - ((available memory / total memory) * 100)
  100-((node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)*100)
- Filtering the available memory of one specific host with a label (see the API sketch after this list):
  node_memory_MemAvailable_bytes{instance="192.168.225.167:9100"}
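The same expressions can also be run outside the web UI through the Prometheus HTTP API (a minimal sketch, assuming the Prometheus server address used above):
#Instant query via the HTTP API; curl URL-encodes the expression
curl -s 'http://192.168.225.167:9090/api/v1/query' \
  --data-urlencode 'query=(node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)*100'
#The JSON response contains one result per scraped instance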
4. Displaying Prometheus data with Grafana
4.1 Introduction to Grafana
Grafana is an open-source metrics analysis and visualization suite written in Go. It is commonly used to visualize time-series data from infrastructure and application analytics and has a very wide range of use cases.
Grafana supports many kinds of time-series data sources, such as Graphite, InfluxDB, Prometheus and Elasticsearch. Each data source has its own query language and capabilities, but Grafana's query editor supports them all and exposes each source's specific features, so data from different sources can be combined and displayed together on a single dashboard.
Grafana is best known for its polished visualizations; it ships ready-made dashboard templates that can be used directly with many kinds of software.
By default it listens on TCP port 3000, supports integration with external authentication services, and exposes its own metrics at /metrics. Dashboard templates for almost anything can be found at
https://grafana.com/dashboards/
4.2 Deploying Grafana
Official Grafana download page:
https://grafana.com/grafana/download
Grafana is installed on the same host as the Prometheus server
#Install the package
[root@grafana ~]#wget https://dl.grafana.com/enterprise/release/grafana-enterprise_8.2.1_amd64.deb
#Or download from a mirror inside China
[root@grafana ~]#wget https://mirrors.tuna.tsinghua.edu.cn/grafana/apt/pool/main/g/grafana/grafana_9.0.6_amd64.deb
[root@grafana ~]#apt update
[root@grafana ~]#apt -y install ./grafana-enterprise_8.2.1_amd64.deb
#Note: a local file is being installed, so the file path must be given
#If the installation fails, resolve the dependencies with
[root@grafana ~]#apt -y --fix-broken install
#List the available plugins (optional)
[root@grafana ~]#grafana-cli plugins list-remote
#Install a plugin; if it fails, retry a few times (optional)
[root@grafana ~]#grafana-cli plugins install grafana-piechart-panel
[root@grafana ~]#grafana-cli plugins ls
#Start the service
[root@grafana ~]#systemctl daemon-reload
[root@grafana ~]#systemctl enable --now grafana-server.service
#Check the result
[root@grafana ~]#systemctl status grafana-server.service
[root@grafana ~]#ss -ntulp|grep grafana
tcp LISTEN 0 4096 *:3000 *:* users:(("grafana-server",pid=35383,fd=8))
#The output shows that port 3000 is now listening on this host
Deploying Grafana with Docker
#1 Create the Grafana data, configuration and log directories
[root@cs7 prometheus-data]# mkdir /prometheus-data/grafana/{data,log,conf} -p
#2 Run a temporary Grafana container
[root@cs7 prometheus-data]# docker run -d --name=grafana \
--restart=always -p 3000:3000 \
grafana/grafana-enterprise
#3 Inspect the container's default data, configuration and log directories
[root@cs7 grafana]# docker inspect grafana
"Env": [
"PATH=/usr/share/grafana/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"GF_PATHS_CONFIG=/etc/grafana/grafana.ini", #配置文件
"GF_PATHS_DATA=/var/lib/grafana", #数据目录
"GF_PATHS_HOME=/usr/share/grafana",
"GF_PATHS_LOGS=/var/log/grafana", #日志目录
"GF_PATHS_PLUGINS=/var/lib/grafana/plugins", #插件目录
"GF_PATHS_PROVISIONING=/etc/grafana/provisioning"
]
#4 Copy the configuration and the other files from the container into the corresponding host directories
[root@cs7 grafana]# docker cp grafana:/etc/grafana/. /prometheus-data/grafana/conf/
[root@cs7 grafana]# docker cp grafana:/var/lib/grafana/. /prometheus-data/grafana/data/
[root@cs7 grafana]# docker cp grafana:/var/log/grafana/. /prometheus-data/grafana/log/
#5 Create the user that grafana runs as
[root@cs7 grafana]# useradd grafana
#6 Make sure the uid and gid are 472; by default the Grafana container runs with uid and gid 472
[root@cs7 grafana]# getent passwd | grep grafana
grafana:x:472:472::/home/grafana:/bin/bash
[root@cs7 grafana]# getent group | grep grafana
grafana:x:472:
#7 Change the ownership of the directories
[root@cs7 grafana]# chown -R grafana:grafana /prometheus-data/grafana/
#8 Remove the temporary grafana container
[root@cs7 grafana]# docker rm -f grafana
#9 Create a startup script for the grafana container
[root@cs7 grafana]# cat grafana.sh
docker run -d --name grafana -p 3000:3000 \
-v /prometheus-data/grafana/conf:/etc/grafana \
-v /prometheus-data/grafana/data:/var/lib/grafana \
-v /prometheus-data/grafana/log:/var/log/grafana \
--restart=always \
grafana/grafana-enterprise:latest
#10 Run the script to create the container
[root@cs7 grafana]# bash grafana.sh
bca57cce8e2f4c1853d6ce979272efaecb68c500b507484fdbb81001c2b143ec
4.3 Configure the Prometheus data source
4.3.1 Log in to the Grafana web UI
#Browse to http://master1.wang.org:3000/ and check the result
#Log in with the default user name and password admin/admin; Grafana then prompts for a new password
After entering the new password (for convenience the original password admin is reused here), click "Submit" to reach the home page
4.3.2 Add Prometheus as a data source
Add a data source: click "Add your first data source"
Select "Prometheus" to open the configuration form
Fill in the Settings page for Prometheus as shown below
#Note: the default address does not work; enter the Prometheus address (10.0.0.101:9090 or localhost:9090). Nothing else needs to be changed.
Leave the remaining fields unchanged and click "Save & Test" at the bottom to check the connection
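The data source can also be defined declaratively instead of through the UI, using Grafana's data-source provisioning (a sketch; the file name and the Prometheus address are assumptions to adapt to the actual environment):
#/etc/grafana/provisioning/datasources/prometheus.yml
#(for the container deployment above this directory is under /prometheus-data/grafana/conf/provisioning/datasources/)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://192.168.225.167:9090
    isDefault: true
#Restart grafana-server (or the grafana container) to load the file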
4.3.3 Use the dashboards bundled with the data-source plugin
On the Dashboards tab of the data source, click Import to load the three built-in templates. The built-in templates may fail to show data; in that case a suitable Grafana dashboard has to be imported instead
Click Import to import a template
4.4 Import a dedicated template to display the node_exporter data
The built-in templates above are not ideal, so import a well-regarded dashboard template from the Grafana site
4.4.1 Search for a template on the Grafana website
https://grafana.com/grafana/dashboards/?dataSource=prometheus
4.4.2 Import the template
Once the import completes, the monitoring dashboard is displayed
Result: as soon as node_exporter is deployed on a node, that node's data shows up on the dashboard.
If no data is shown, the usual causes are:
- node_exporter is not installed on the node (see the quick checks after this list)
- the clock of the browser host differs from the server clock by more than 3 seconds
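Two quick checks for these conditions (a sketch): on the monitored node, confirm node_exporter is running and serving metrics; on both the browser host and the server, compare the clocks:
#On the monitored node
systemctl is-active node_exporter
curl -s http://localhost:9100/metrics | head -n 3
#On both the browser host and the server: check the system clock and NTP synchronization status
timedatectl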
5. Monitoring ZooKeeper
Monitoring metrics can be provided in the following ways
- Built into Prometheus itself
- Built into the application; ZooKeeper, for example, has the exporter functionality built in, so no extra exporter needs to be installed
- A separate, third-party exporter
- Instrumentation code written into the application with a client library, which requires development work
5.1 Install and configure ZooKeeper
#Installing ZooKeeper itself is omitted here
#Enable the built-in Prometheus support
[root@node3 ~]#vim /usr/local/zookeeper/conf/zoo.cfg
#Append the following three lines; recent default configurations already contain them commented out, in which case it is enough to uncomment them
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
[root@node3 ~]#systemctl restart zookeeper.service
#Browse to: http://node1.wang.org:7000/metrics
5.2 Modify the Prometheus configuration
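A sketch of the scrape job to append under scrape_configs in prometheus.yml; the job name and the node addresses are examples, and port 7000 is the metricsProvider.httpPort configured above:
  - job_name: 'zookeeper'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['192.168.225.112:7000','192.168.225.113:7000','192.168.225.114:7000']
#Check the syntax and restart, as in 3.1:
#promtool check config /usr/local/prometheus/conf/prometheus.yml && systemctl restart prometheus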
5.3 View the monitoring data in Prometheus
5.4 View the monitoring data in Grafana
6. Collecting custom data with Pushgateway
6.1 Introduction to Pushgateway
Official documentation:
https://prometheus.io/docs/practices/pushing/
Although the many available exporters already provide rich data, production environments still need to collect user-defined data, and Pushgateway can be used for this
Pushgateway is another way to get monitoring data into Prometheus, one in which the client actively pushes its data; this is the opposite of an exporter, which is passively scraped
It is a stand-alone component that can run on any node (it does not have to run on the monitored client)
User-defined scripts or programs push the data to be monitored to Pushgateway, and the Prometheus server then pulls it from Pushgateway
Drawbacks of Pushgateway
- Pushgateway is a single point of failure: if many applications push to one Pushgateway process and that process dies, their monitoring data can no longer be collected
- Pushgateway does not validate the pushed data in any intelligent way: if a collection script produces bad values, Pushgateway passes them on to Prometheus unchanged
6.2 Installing Pushgateway
- Binary installation
[root@prometheus ~]#wget https://github.com/prometheus/pushgateway/releases/download/v1.4.3/pushgateway-1.4.3.linux-amd64.tar.gz
[root@prometheus ~]#tar xf pushgateway-1.4.3.linux-amd64.tar.gz -C /usr/local/
[root@prometheus ~]#ln -s /usr/local/pushgateway-1.4.3.linux-amd64 /usr/local/pushgateway
[root@prometheus ~]#ls /usr/local/pushgateway
LICENSE NOTICE pushgateway
[root@prometheus ~]#mkdir /usr/local/pushgateway/bin
[root@prometheus ~]#mv /usr/local/pushgateway/pushgateway /usr/local/pushgateway/bin
[root@prometheus ~]#ldd /usr/local/pushgateway/bin/pushgateway
not a dynamic executable
[root@prometheus ~]#ln -s /usr/local/pushgateway/bin/pushgateway /usr/local/bin/
[root@prometheus ~]#pushgateway --help
usage: pushgateway [<flags>]
The Pushgateway
Flags:
-h, --help Show context-sensitive help (also try --help-long and --help-man).
--web.config.file="" [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication.
--web.listen-address=":9091"
Address to listen on for the web interface, API, and telemetry.
--web.telemetry-path="/metrics"
Path under which to expose metrics.
--web.external-url= The URL under which the Pushgateway is externally reachable.
--web.route-prefix="" Prefix for the internal routes of web endpoints. Defaults to the path of
--web.external-url.
--web.enable-lifecycle Enable shutdown via HTTP request.
--web.enable-admin-api Enable API endpoints for admin control actions.
--persistence.file="" File to persist metrics. If empty, metrics are only kept in memory.
--persistence.interval=5m The minimum interval at which to write out the persistence file.
--push.disable-consistency-check
Do not check consistency of pushed metrics. DANGEROUS.
--log.level=info Only log messages with the given severity or above. One of: [debug, info, warn,
error]
--log.format=logfmt Output format of log messages. One of: [logfmt, json]
--version Show application version.
[root@prometheus ~]#pushgateway
ts=2022-08-06T14:55:02.911Z caller=main.go:85 level=info msg="starting
pushgateway" version="(version=1.4.3, branch=HEAD,
revision=f9dc1c8664050edbc75916c3664be1df595a1958)"
ts=2022-08-06T14:55:02.911Z caller=main.go:86 level=info build_context="
(go=go1.18.2, user=root@75e397dd33fe, date=20220530-19:02:00)"
ts=2022-08-06T14:55:02.913Z caller=main.go:139 level=info listen_address=:9091
ts=2022-08-06T14:55:02.914Z caller=tls_config.go:195 level=info msg="TLS is
disabled." http2=false
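Running pushgateway in the foreground like this is only useful for a quick test; a sketch of a systemd unit for it, modelled on the Prometheus unit in 1.3.2 (the user and the listen address are assumptions):
#/lib/systemd/system/pushgateway.service
[Unit]
Description=Prometheus Pushgateway
After=network.target
[Service]
Restart=on-failure
User=prometheus
ExecStart=/usr/local/pushgateway/bin/pushgateway --web.listen-address=:9091
[Install]
WantedBy=multi-user.target
#Enable it with: systemctl daemon-reload && systemctl enable --now pushgateway.service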
- Container deployment
[root@cs7 prometheus-data]# docker run -d --name pushgateway --restart=always -p 9091:9091/tcp bitnami/pushgateway:latest
- Log in to the Pushgateway web UI
6.3 Configure Prometheus to collect the Pushgateway data
[root@prometheus ~]#vim /usr/local/prometheus/conf/prometheus.yml
  - job_name: "pushgateway"
    static_configs:
      - targets:
        - "192.168.225.167:9091"
- Log in to the Prometheus server and check the new target
6.4 Configure clients to send data to Pushgateway
6.4.1 Collect the number of processes on the current host and push it to Pushgateway
#!/bin/bash
METRIC_NAME=process_number
VALUE=$[$(ps -ef | wc -l)-1]
PUSHGATEWAY_HOST='192.168.225.167:9091'
EXPORTER_JOB='pushgateway'
EXPORTER_HOST=$(hostname -I| cut -d " " -f 1)
CURL_URL="curl --data-binary @- http://${PUSHGATEWAY_HOST}/metrics/job/${EXPORTER_JOB}/instance/${EXPORTER_HOST}"
echo "${METRIC_NAME} ${VALUE}" | ${CURL_URL}
- Verify the data (see the sketch below)
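A sketch of how the pushed value can be checked, both on the Pushgateway itself and, once it has been scraped, through the Prometheus HTTP API (addresses as used above):
#Metrics currently held by the Pushgateway
curl -s http://192.168.225.167:9091/metrics | grep process_number
#The same metric queried through Prometheus
curl -s 'http://192.168.225.167:9090/api/v1/query' --data-urlencode 'query=process_number'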
6.4.2 Collect the number of logged-in users on the current host and push it to Pushgateway
#The command below pushes the data once (a test metric "age" with the value 8); for periodic pushes use cron or the loop script below
[root@ubuntu2004 ~]#echo "age 8" | curl --data-binary @- http://192.168.225.167:9091/metrics/job/pushgateway/instance/`hostname -I`
#Explanation
192.168.225.167:9091   #IP and port of the host running Pushgateway
pushgateway            #The Prometheus job name
`hostname -I`          #Uses the current host's IP as the instance name
#Script that pushes a custom metric
[root@ubuntu2004 ~]#vim pushgateway_metric.sh
#!/bin/bash
#********************************************************************
#Author: wangxiaochun
#QQ: 29308620
#Date: 2022-08-08
#FileName: pushgateway_metric.sh
#URL: http://www.wangxiaochun.com
#Description: The test script
#Copyright (C): 2022 All rights reserved
#********************************************************************
METRIC_NAME=login_number
METRIC_VALUE="who | wc -l"
PUSHGATEWAY_HOST=192.168.225.167:9091
EXPORTED_JOB=pushgateway
SLEEP_TIME=1
CURL_URL="curl --data-binary @- http://${PUSHGATEWAY_HOST}/metrics/job/${EXPORTED_JOB}/instance/`hostname -I`"
action() {
while true ;do
VALUE=`eval "$METRIC_VALUE"`
echo "${METRIC_NAME} ${VALUE}" | $CURL_URL
sleep $SLEEP_TIME
done
}
action
[root@ubuntu2004 ~]#bash pushgateway_metric.sh
- Verify the data (the same checks as in 6.4.1 apply, with login_number as the metric name)