Related software for the Prometheus deployment
To demonstrate Prometheus more fully, the related components are installed as well, so that the whole environment can be monitored end to end. The software involved is listed below; the exact versions used appear in the installation steps that follow:
- prometheus
- node_exporter
- grafana
- alertmanager
1. Deploying Prometheus
1.1 Common deployment methods
- Package installation: on Ubuntu and Debian, Prometheus can be installed directly with apt (see the sketch after this list)
- Binary installation
- Running with Docker
- Installation with the Kubernetes Operator
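For the package-based option, a minimal sketch on Ubuntu/Debian (the package in the distribution archives is usually older than the upstream release, which is why the binary install is used in the rest of this section):
#Install from the distribution repositories (package name as in the Debian/Ubuntu archives)
apt update
apt -y install prometheus
systemctl enable --now prometheus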
1.2 Starting directly from the Docker image
[root@cs7 data]# docker run -d --name prometheus -p 9090:9090 \
-v /opt/prometheus/node_exporter/prometheus.yml:/etc/prometheus/conf/prometheus.yml \
-v /prometheus-data:/opt/bitnami/prometheus/data \
-v /etc/localtime:/etc/localtime \
--restart=always \
bitnami/prometheus:latest \
--storage.tsdb.retention=15d \
--config.file=/etc/prometheus/conf/prometheus.yml
#--storage.tsdb.retention=15d changes the default metrics retention time
#--config.file changes the path of the configuration file to load
#Browse to: http://<prometheus-server>:9090/
1.3 Installing Prometheus from the binary package
1.3.1 Download and unpack the binary package
Official downloads:
https://prometheus.io/download/
[root@prometheus local]#pwd
/usr/local/
[root@prometheus local]#wget https://github.com/prometheus/prometheus/releases/download/v2.19.2/prometheus-2.19.2.linux-amd64.tar.gz
[root@prometheus local]#ls -l prometheus-2.19.2.linux-amd64.tar.gz
-rw-r--r-- 1 root root 64152439 Jul 5 21:13 prometheus-2.19.2.linux-amd64.tar.gz
[root@prometheus local]#tar xvf prometheus-2.19.2.linux-amd64.tar.gz
[root@prometheus local]#ln -s prometheus-2.19.2.linux-amd64 prometheus
[root@prometheus local]#cd prometheus/
[root@prometheus prometheus]#mkdir bin conf data
[root@prometheus prometheus]#mv prometheus promtool bin/
[root@prometheus prometheus]#mv prometheus.yml conf/
[root@prometheus prometheus]#useradd -r -s /sbin/nologin prometheus
[root@prometheus prometheus]#chown -R prometheus:prometheus /usr/local/prometheus/
[root@prometheus ~]#cat /etc/profile.d/prometheus.sh
export PROMETHEUS_HOME=/usr/local/prometheus
export PATH=${PROMETHEUS_HOME}/bin:$PATH
[root@prometheus ~]#source /etc/profile.d/prometheus.sh
#Inspect the configuration file; the defaults can be left unchanged
[root@prometheus ~]#grep -Ev "^ *#|^$" /usr/local/prometheus/conf/prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
rule_files:
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
#Explanation: the file consists mainly of the global and scrape_configs sections; for now the defaults are kept
#Check that the configuration file is valid
[root@prometheus ~]#promtool check config /usr/local/prometheus/conf/prometheus.yml
Checking /usr/local/prometheus/conf/prometheus.yml
SUCCESS: 0 rule files found
1.3.2 Create the service unit file
[root@prometheus ~]#vim /lib/systemd/system/prometheus.service
[root@prometheus ~]#cat /lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
[Service]
Restart=on-failure
User=prometheus
Group=prometheus
WorkingDirectory=/usr/local/prometheus/
ExecStart=/usr/local/prometheus/bin/prometheus --config.file=/usr/local/prometheus/conf/prometheus.yml
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65535
[Install]
WantedBy=multi-user.target
#Explanation of the unit:
#The custom Prometheus configuration file is passed with --config.file; because WorkingDirectory is /usr/local/prometheus/, the default --storage.tsdb.path="data/" resolves to /usr/local/prometheus/data
#Further options can be listed with prometheus --help
[root@prometheus ~]#systemctl daemon-reload
[root@prometheus ~]#systemctl enable --now prometheus.service
[root@prometheus ~]#ss -tnlp |grep prometheus
LISTEN 0 128 *:9090 *:* users:(("prometheus",pid=84755,fd=10))
#The output shows that port 9090 is now listening on this host; the Prometheus web UI can be reached at http://<prometheus-server>:9090/
Example: view the prometheus command-line options
I have no name!@59b9dbdb09c4:/usr/sbin$ prometheus --help
usage: prometheus [<flags>]
The Prometheus monitoring server
Flags:
-h, --help Show context-sensitive help (also try --help-long and --help-man).
--version Show application version.
--config.file="prometheus.yml"
Prometheus configuration file path.
--web.listen-address="0.0.0.0:9090"
Address to listen on for UI, API, and telemetry.
--web.config.file="" [EXPERIMENTAL] Path to configuration file that can enable TLS or
authentication.
--web.read-timeout=5m Maximum duration before timing out read of the request, and closing idle
connections.
--web.max-connections=512 Maximum number of simultaneous connections.
--web.external-url=<URL> The URL under which Prometheus is externally reachable (for example, if
Prometheus is served via a reverse proxy). Used for generating relative and
absolute links back to Prometheus itself. If the URL has a path portion, it
will be used to prefix all HTTP endpoints served by Prometheus. If omitted,
relevant URL components will be derived automatically.
--web.route-prefix=<path> Prefix for the internal routes of web endpoints. Defaults to path of
--web.external-url.
--web.user-assets=<path> Path to static asset directory, available at /user.
--web.enable-lifecycle Enable shutdown and reload via HTTP request.
--web.enable-admin-api Enable API endpoints for admin control actions.
--web.console.templates="consoles"
Path to the console template directory, available at /consoles.
--web.console.libraries="console_libraries"
Path to the console library directory.
--web.page-title="Prometheus Time Series Collection and Processing Server"
Document title of Prometheus instance.
--web.cors.origin=".*" Regex for CORS origin. It is fully anchored. Example:
'https?://(domain1|domain2)\.com'
--storage.tsdb.path="data/"
Base path for metrics storage. Use with server mode only.
--storage.tsdb.retention=STORAGE.TSDB.RETENTION
[DEPRECATED] How long to retain samples in storage. This flag has been
deprecated, use "storage.tsdb.retention.time" instead. Use with server mode
only.
--storage.tsdb.retention.time=STORAGE.TSDB.RETENTION.TIME
How long to retain samples in storage. When this flag is set it overrides
"storage.tsdb.retention". If neither this flag nor "storage.tsdb.retention"
nor "storage.tsdb.retention.size" is set, the retention time defaults to 15d.
Units Supported: y, w, d, h, m, s, ms. Use with server mode only.
--storage.tsdb.retention.size=STORAGE.TSDB.RETENTION.SIZE
Maximum number of bytes that can be stored for blocks. A unit is required,
supported units: B, KB, MB, GB, TB, PB, EB. Ex: "512MB". Use with server mode
only.
--storage.tsdb.no-lockfile
Do not create lockfile in data directory. Use with server mode only.
--storage.tsdb.allow-overlapping-blocks
Allow overlapping blocks, which in turn enables vertical compaction and
vertical query merge. Use with server mode only.
--storage.agent.path="data-agent/"
Base path for metrics storage. Use with agent mode only.
--storage.agent.wal-compression
Compress the agent WAL. Use with agent mode only.
--storage.agent.retention.min-time=STORAGE.AGENT.RETENTION.MIN-TIME
Minimum age samples may be before being considered for deletion when the WAL
is truncated Use with agent mode only.
--storage.agent.retention.max-time=STORAGE.AGENT.RETENTION.MAX-TIME
Maximum age samples may be before being forcibly deleted when the WAL is
truncated Use with agent mode only.
--storage.agent.no-lockfile
Do not create lockfile in data directory. Use with agent mode only.
--storage.remote.flush-deadline=<duration>
How long to wait flushing sample on shutdown or config reload.
--storage.remote.read-sample-limit=5e7
Maximum overall number of samples to return via the remote read interface, in
a single query. 0 means no limit. This limit is ignored for streamed response
types. Use with server mode only.
--storage.remote.read-concurrent-limit=10
Maximum number of concurrent remote read calls. 0 means no limit. Use with
server mode only.
--storage.remote.read-max-bytes-in-frame=1048576
Maximum number of bytes in a single frame for streaming remote read response
types before marshalling. Note that client might have limit on frame size as
well. 1MB as recommended by protobuf by default. Use with server mode only.
--rules.alert.for-outage-tolerance=1h
Max time to tolerate prometheus outage for restoring "for" state of alert. Use
with server mode only.
--rules.alert.for-grace-period=10m
Minimum duration between alert and restored "for" state. This is maintained
only for alerts with configured "for" time greater than grace period. Use with
server mode only.
--rules.alert.resend-delay=1m
Minimum amount of time to wait before resending an alert to Alertmanager. Use
with server mode only.
--alertmanager.notification-queue-capacity=10000
The capacity of the queue for pending Alertmanager notifications. Use with
server mode only.
--query.lookback-delta=5m The maximum lookback duration for retrieving metrics during expression
evaluations and federation. Use with server mode only.
--query.timeout=2m Maximum time a query may take before being aborted. Use with server mode only.
--query.max-concurrency=20
Maximum number of queries executed concurrently. Use with server mode only.
--query.max-samples=50000000
Maximum number of samples a single query can load into memory. Note that
queries will fail if they try to load more samples than this into memory, so
this also limits the number of samples a query can return. Use with server
mode only.
--enable-feature= ... Comma separated feature names to enable. Valid options: agent,
exemplar-storage, expand-external-labels, memory-snapshot-on-shutdown,
promql-at-modifier, promql-negative-offset, remote-write-receiver,
extra-scrape-metrics, new-service-discovery-manager. See
https://prometheus.io/docs/prometheus/latest/feature_flags/ for more details.
--log.level=info Only log messages with the given severity or above. One of: [debug, info,
warn, error]
--log.format=logfmt Output format of log messages. One of: [logfmt, json]
1.3.3 Test access
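As a quick check that the service is up (a minimal sketch; 192.168.225.167 is the Prometheus server address used elsewhere in this document), Prometheus exposes built-in health and readiness endpoints:
curl -s http://192.168.225.167:9090/-/healthy
curl -s http://192.168.225.167:9090/-/ready
#Or simply open http://<prometheus-server>:9090/ in a browser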
1.3.4 Dashboard menu overview
#Top-level menus
Alerts   #Prometheus alerting information
Graph    #Prometheus graphing and query page; this is the default page
Status   #Prometheus status pages
Help     #Prometheus help page
#Status sub-menus; the Status menu contains several sub-items with the following names and functions:
Runtime & Build Information   Runtime status of the server and basic information about its internal metrics
Command-Line Flags            The command-line flags the server was started with
Configuration                 The content of the loaded configuration file (YAML)
Rules                         The recording and alerting rules used for queries, alerts and visualization
Targets                       The monitored targets (hosts, services, ...), listed as endpoints
Service Discovery             The target lists found by the service-discovery mechanisms
Click the top-level Graph menu; the basic layout of that page is shown below
Note: selecting a metric simply generates a query expression, and expressions can do far more than the metric picker. These ad-hoc expressions are lost when the page is refreshed, which is one reason a visualization tool is needed. Select the metric "scrape_duration_seconds" and click "Execute" to see the result.
1.3.5 Tuning the configuration
[root@prometheus ~]#/usr/local/prometheus/bin/prometheus --help
.......
--web.read-timeout=5m #Maximum duration before timing out read of the request, and closing idle connections. Prevents too many idle connections from tying up resources
--web.max-connections=512 #Maximum number of simultaneous connections
--storage.tsdb.retention=15d #How long to retain samples in storage. Once Prometheus starts collecting, data accumulates in memory and on disk; too long a retention overwhelms both, too short and historical data is lost. 15 days is a sensible value in production and is also the default retention time
--storage.tsdb.path="data/" #Base path for metrics storage. A dedicated partition is recommended so the root filesystem cannot fill up; the default is the data/ directory
--query.timeout=2m #Maximum time a query may take before being aborted. 2m is the default
--query.max-concurrency=20 #Maximum number of queries executed concurrently. 20 is the default
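A sketch of how these flags could be applied to the systemd unit created in 1.3.2; the retention time, data path and connection limit below are examples rather than values required here:
#Extend ExecStart in /lib/systemd/system/prometheus.service, for example:
ExecStart=/usr/local/prometheus/bin/prometheus \
  --config.file=/usr/local/prometheus/conf/prometheus.yml \
  --storage.tsdb.path=/usr/local/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --web.max-connections=512
#--storage.tsdb.retention is deprecated; --storage.tsdb.retention.time replaces it
#Apply the change:
systemctl daemon-reload && systemctl restart prometheus.service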
View the configuration
2. Installing node_exporter
Install Node Exporter to collect monitoring metrics on each node host; it listens on port 9100
Official downloads:
https://prometheus.io/download/
2.1 Download and unpack
Install it on every node host that needs to be monitored
[root@node1 ~]#wget -P /usr/local/ https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
[root@node1 ~]#cd /usr/local
[root@node1 local]#tar xvf node_exporter-1.2.2.linux-amd64.tar.gz
[root@node1 local]#ln -s node_exporter-1.2.2.linux-amd64 node_exporter
[root@node1 local]#cd node_exporter
[root@node1 node_exporter]#mv node_exporter /usr/local/bin/
2.2 Prepare the service unit file
[root@node1 ~]#vim /lib/systemd/system/node_exporter.service
[root@node1 ~]#cat /lib/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
Example: view the node_exporter command-line options
[root@cs7 node_exporter]# /usr/local/bin/node_exporter --help
usage: node_exporter [<flags>]
Flags:
-h, --help Show context-sensitive help (also try --help-long and --help-man).
--collector.arp.device-include=COLLECTOR.ARP.DEVICE-INCLUDE
Regexp of arp devices to include (mutually exclusive to device-exclude).
--collector.arp.device-exclude=COLLECTOR.ARP.DEVICE-EXCLUDE
Regexp of arp devices to exclude (mutually exclusive to device-include).
--collector.bcache.priorityStats
Expose expensive priority stats.
--collector.cpu.guest Enables metric node_cpu_guest_seconds_total
--collector.cpu.info Enables metric cpu_info
--collector.cpu.info.flags-include=COLLECTOR.CPU.INFO.FLAGS-INCLUDE
Filter the `flags` field in cpuInfo with a value that must be a regular
expression
--collector.cpu.info.bugs-include=COLLECTOR.CPU.INFO.BUGS-INCLUDE
Filter the `bugs` field in cpuInfo with a value that must be a regular
expression
--collector.diskstats.device-exclude="^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\\d+n\\d+p)\\d+$"
Regexp of diskstats devices to exclude (mutually exclusive to device-include).
--collector.diskstats.device-include=COLLECTOR.DISKSTATS.DEVICE-INCLUDE
Regexp of diskstats devices to include (mutually exclusive to device-exclude).
--collector.ethtool.device-include=COLLECTOR.ETHTOOL.DEVICE-INCLUDE
Regexp of ethtool devices to include (mutually exclusive to device-exclude).
--collector.ethtool.device-exclude=COLLECTOR.ETHTOOL.DEVICE-EXCLUDE
Regexp of ethtool devices to exclude (mutually exclusive to device-include).
--collector.ethtool.metrics-include=".*"
Regexp of ethtool stats to include.
--collector.filesystem.mount-points-exclude="^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)"
Regexp of mount points to exclude for filesystem collector.
--collector.filesystem.fs-types-exclude="^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
Regexp of filesystem types to exclude for filesystem collector.
--collector.ipvs.backend-labels="local_address,local_port,remote_address,remote_port,proto,local_mark"
Comma separated list for IPVS backend stats labels.
--collector.netclass.ignored-devices="^$"
Regexp of net devices to ignore for netclass collector.
--collector.netclass.ignore-invalid-speed
Ignore devices where the speed is invalid. This will be the default behavior
in 2.x.
--collector.netclass.netlink
Use netlink to gather stats instead of /proc/net/dev.
--collector.netclass_rtnl.with-stats
Expose the statistics for each network device, replacing netdev collector.
--collector.netdev.device-include=COLLECTOR.NETDEV.DEVICE-INCLUDE
Regexp of net devices to include (mutually exclusive to device-exclude).
--collector.netdev.device-exclude=COLLECTOR.NETDEV.DEVICE-EXCLUDE
Regexp of net devices to exclude (mutually exclusive to device-include).
--collector.netdev.address-info
Collect address-info for every device
--collector.netdev.enable-detailed-metrics
Use (incompatible) metric names that provide more detailed stats on Linux
--collector.netdev.netlink
Use netlink to gather stats instead of /proc/net/dev.
--collector.netstat.fields="^(.*_(InErrors|InErrs)|Ip_Forwarding|Ip(6|Ext)_(InOctets|OutOctets)|Icmp6?_(InMsgs|OutMsgs)|TcpExt_(Listen.*|Syncookies.*|TCPSynRetrans|TCPTimeouts)|Tcp_(ActiveOpens|InSegs|OutSegs|OutRsts|PassiveOpens|RetransSegs|CurrEstab)|Udp6?_(InDatagrams|OutDatagrams|NoPorts|RcvbufErrors|SndbufErrors))$"
Regexp of fields to return for netstat collector.
--collector.ntp.server="127.0.0.1"
NTP server to use for ntp collector
--collector.ntp.server-port=123
UDP port number to connect to on NTP server
--collector.ntp.protocol-version=4
NTP protocol version
--collector.ntp.server-is-local
Certify that collector.ntp.server address is not a public ntp server
--collector.ntp.ip-ttl=1 IP TTL to use while sending NTP query
--collector.ntp.max-distance=3.46608s
Max accumulated distance to the root
--collector.ntp.local-offset-tolerance=1ms
Offset between local clock and local ntpd time to tolerate
--path.procfs="/proc" procfs mountpoint.
--path.sysfs="/sys" sysfs mountpoint.
--path.rootfs="/" rootfs mountpoint.
--path.udev.data="/run/udev/data"
udev data path.
--collector.perf.cpus="" List of CPUs from which perf metrics should be collected
--collector.perf.tracepoint=COLLECTOR.PERF.TRACEPOINT ...
perf tracepoint that should be collected
--collector.powersupply.ignored-supplies="^$"
Regexp of power supplies to ignore for powersupplyclass collector.
--collector.qdisc.fixtures=""
test fixtures to use for qdisc collector end-to-end testing
--collector.qdisk.device-include=COLLECTOR.QDISK.DEVICE-INCLUDE
Regexp of qdisk devices to include (mutually exclusive to device-exclude).
--collector.qdisk.device-exclude=COLLECTOR.QDISK.DEVICE-EXCLUDE
Regexp of qdisk devices to exclude (mutually exclusive to device-include).
--collector.rapl.enable-zone-label
Enables service unit metric unit_start_time_seconds
--collector.runit.servicedir="/etc/service"
Path to runit service directory.
--collector.stat.softirq Export softirq calls per vector
--collector.supervisord.url="http://localhost:9001/RPC2"
XML RPC endpoint.
--collector.sysctl.include=COLLECTOR.SYSCTL.INCLUDE ...
Select sysctl metrics to include
--collector.sysctl.include-info=COLLECTOR.SYSCTL.INCLUDE-INFO ...
Select sysctl metrics to include as info metrics
--collector.systemd.unit-include=".+"
Regexp of systemd units to include. Units must both match include and not
match exclude to be included.
--collector.systemd.unit-exclude=".+\\.(automount|device|mount|scope|slice)"
Regexp of systemd units to exclude. Units must both match include and not
match exclude to be included.
--collector.systemd.enable-task-metrics
Enables service unit tasks metrics unit_tasks_current and unit_tasks_max
--collector.systemd.enable-restarts-metrics
Enables service unit metric service_restart_total
--collector.systemd.enable-start-time-metrics
Enables service unit metric unit_start_time_seconds
--collector.tapestats.ignored-devices="^$"
Regexp of devices to ignore for tapestats.
--collector.textfile.directory=""
Directory to read text files with metrics from.
--collector.vmstat.fields="^(oom_kill|pgpg|pswp|pg.*fault).*"
Regexp of fields to return for vmstat collector.
--collector.wifi.fixtures=""
test fixtures to use for wifi collector metrics
--collector.arp Enable the arp collector (default: enabled).
--collector.bcache Enable the bcache collector (default: enabled).
--collector.bonding Enable the bonding collector (default: enabled).
--collector.btrfs Enable the btrfs collector (default: enabled).
--collector.buddyinfo Enable the buddyinfo collector (default: disabled).
--collector.cgroups Enable the cgroups collector (default: disabled).
--collector.conntrack Enable the conntrack collector (default: enabled).
--collector.cpu Enable the cpu collector (default: enabled).
--collector.cpufreq Enable the cpufreq collector (default: enabled).
--collector.diskstats Enable the diskstats collector (default: enabled).
--collector.dmi Enable the dmi collector (default: enabled).
--collector.drbd Enable the drbd collector (default: disabled).
--collector.drm Enable the drm collector (default: disabled).
--collector.edac Enable the edac collector (default: enabled).
--collector.entropy Enable the entropy collector (default: enabled).
--collector.ethtool Enable the ethtool collector (default: disabled).
--collector.fibrechannel Enable the fibrechannel collector (default: enabled).
--collector.filefd Enable the filefd collector (default: enabled).
--collector.filesystem Enable the filesystem collector (default: enabled).
--collector.hwmon Enable the hwmon collector (default: enabled).
--collector.infiniband Enable the infiniband collector (default: enabled).
--collector.interrupts Enable the interrupts collector (default: disabled).
--collector.ipvs Enable the ipvs collector (default: enabled).
--collector.ksmd Enable the ksmd collector (default: disabled).
--collector.lnstat Enable the lnstat collector (default: disabled).
--collector.loadavg Enable the loadavg collector (default: enabled).
--collector.logind Enable the logind collector (default: disabled).
--collector.mdadm Enable the mdadm collector (default: enabled).
--collector.meminfo Enable the meminfo collector (default: enabled).
--collector.meminfo_numa Enable the meminfo_numa collector (default: disabled).
--collector.mountstats Enable the mountstats collector (default: disabled).
--collector.netclass Enable the netclass collector (default: enabled).
--collector.netdev Enable the netdev collector (default: enabled).
--collector.netstat Enable the netstat collector (default: enabled).
--collector.network_route Enable the network_route collector (default: disabled).
--collector.nfs Enable the nfs collector (default: enabled).
--collector.nfsd Enable the nfsd collector (default: enabled).
--collector.ntp Enable the ntp collector (default: disabled).
--collector.nvme Enable the nvme collector (default: enabled).
--collector.os Enable the os collector (default: enabled).
--collector.perf Enable the perf collector (default: disabled).
--collector.powersupplyclass
Enable the powersupplyclass collector (default: enabled).
--collector.pressure Enable the pressure collector (default: enabled).
--collector.processes Enable the processes collector (default: disabled).
--collector.qdisc Enable the qdisc collector (default: disabled).
--collector.rapl Enable the rapl collector (default: enabled).
--collector.runit Enable the runit collector (default: disabled).
--collector.schedstat Enable the schedstat collector (default: enabled).
--collector.selinux Enable the selinux collector (default: enabled).
--collector.slabinfo Enable the slabinfo collector (default: disabled).
--collector.sockstat Enable the sockstat collector (default: enabled).
--collector.softnet Enable the softnet collector (default: enabled).
--collector.stat Enable the stat collector (default: enabled).
--collector.supervisord Enable the supervisord collector (default: disabled).
--collector.sysctl Enable the sysctl collector (default: disabled).
--collector.systemd Enable the systemd collector (default: disabled).
--collector.tapestats Enable the tapestats collector (default: enabled).
--collector.tcpstat Enable the tcpstat collector (default: disabled).
--collector.textfile Enable the textfile collector (default: enabled).
--collector.thermal_zone Enable the thermal_zone collector (default: enabled).
--collector.time Enable the time collector (default: enabled).
--collector.timex Enable the timex collector (default: enabled).
--collector.udp_queues Enable the udp_queues collector (default: enabled).
--collector.uname Enable the uname collector (default: enabled).
--collector.vmstat Enable the vmstat collector (default: enabled).
--collector.wifi Enable the wifi collector (default: disabled).
--collector.xfs Enable the xfs collector (default: enabled).
--collector.zfs Enable the zfs collector (default: enabled).
--collector.zoneinfo Enable the zoneinfo collector (default: disabled).
--web.telemetry-path="/metrics"
Path under which to expose metrics.
--web.disable-exporter-metrics
Exclude metrics about the exporter itself (promhttp_*, process_*, go_*).
--web.max-requests=40 Maximum number of parallel scrape requests. Use 0 to disable.
--collector.disable-defaults
Set all collectors to disabled by default.
--runtime.gomaxprocs=1 The target number of CPUs Go will run on (GOMAXPROCS)
--web.systemd-socket Use systemd socket activation listeners instead of port listeners (Linux
only).
--web.listen-address=:9100 ...
Addresses on which to expose metrics and web interface. Repeatable for
multiple addresses.
--web.config.file="" [EXPERIMENTAL] Path to configuration file that can enable TLS or
authentication.
--log.level=info Only log messages with the given severity or above. One of: [debug, info,
warn, error]
--log.format=logfmt Output format of log messages. One of: [logfmt, json]
--version Show application version.
2.3 Start the node_exporter service
[root@node1 ~]#systemctl daemon-reload
[root@node1 ~]#systemctl enable --now node_exporter.service
[root@node1 ~]#systemctl is-active node_exporter
active
[root@node1 ~]#ss -ntlp|grep node_exporter
LISTEN 0 128 *:9100 *:* users:(("node_exporter",pid=33761,fd=3))
#The output shows that port 9100 is now listening on this host
2.4 Deploying node_exporter as a container
[root@cs7 ~]# docker run -itd --name node_exporter -p 9100:9100/tcp --restart=always bitnami/node-exporter:latest
78cacac1e53a83a20792e995cd9fec08f04d9088a8153fce04c468f53adfa1ed
2.5 Access the node_exporter web interface
Open it in a browser
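A quick command-line check (a sketch; substitute the address of the node being monitored) that the exporter is serving metrics:
curl -s http://192.168.225.167:9100/metrics | head
#Every node_exporter instance exposes its metrics at http://<node>:9100/metrics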
3. Collecting node_exporter data with Prometheus
Configure Prometheus to scrape the node-level monitoring metrics collected by the node_exporter component
3.1 Modify the Prometheus configuration file
scrape_interval: 15s    #Interval at which target hosts are scraped
scrape_timeout: 10s     #Scrape timeout; the default is 10s
evaluation_interval: 1m #Interval at which rules are evaluated; the default is 1m
[root@cs7 ~]# vim /opt/prometheus/node_exporter/prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
rule_files:
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'node_exporter'   #Add the following lines to define the node_exporter targets to monitor
    metrics_path: '/metrics'    #Path from which the metrics are scraped
    static_configs:
      - targets: ['192.168.225.167:9100','192.168.225.112:9100','192.168.225.113:9100','192.168.225.114:9100']
      #- targets:               #the list form below is also supported
      #  - "192.168.225.167:9100"
      #  - "192.168.225.112:9100"
      #  - "192.168.225.113:9100"
      #  - "192.168.225.114:9100"
#Explanation:
#A new job_name with its own static_configs is added
#targets are the instances introduced in the concepts section earlier, written as "ip:port"
#Check the syntax
[root@prometheus ~]#promtool check config /usr/local/prometheus/conf/prometheus.yml
#Restart the Prometheus service:
[root@prometheus ~]#systemctl restart prometheus
#If Prometheus runs as a container, the configuration file has to be mounted into the container:
[root@cs7 data]# docker run -d --name prometheus -p 9090:9090 \
-v /opt/prometheus/node_exporter/prometheus.yml:/etc/prometheus/conf/prometheus.yml \
-v /prometheus-data:/opt/bitnami/prometheus/data \
-v /etc/localtime:/etc/localtime \
bitnami/prometheus:latest \
--storage.tsdb.retention=15d \
--config.file=/etc/prometheus/conf/prometheus.yml
#--storage.tsdb.retention=15d changes the default metrics retention time
#--config.file changes the path of the configuration file to load
#Browse to: http://<prometheus-server>:9090/
3.2 Verify the node_exporter target status in Prometheus
Open the following address in a browser
http://192.168.225.167:9090/targets
#On this page, hovering over the Labels column shows that target's attributes; the attributes whose names start with "__" are the target's private attributes. Clicking the Endpoint link opens the full list of metrics exposed by that node.
A packet capture shows that the Prometheus server actively pulls the metrics from the node_exporter with HTTP GET requests
3.3 Verify the node_exporter metric data in Prometheus
Click a host link on the page above to jump directly to that node's metrics page
3.4 Using PromQL to get the node's used and free memory percentages
- Percentage of available memory: (available memory / total memory) * 100
  (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)*100
- Percentage of used memory: 100 - ((available memory / total memory) * 100)
  100-((node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)*100)
- Filtering the available memory of one specific host with a label (see the API sketch after this list):
  node_memory_MemAvailable_bytes{instance="192.168.225.167:9100"}
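The same expressions can also be run outside the web UI through the Prometheus HTTP API (a minimal sketch, assuming the Prometheus server address used above):
#Instant query via the HTTP API; curl URL-encodes the expression
curl -s 'http://192.168.225.167:9090/api/v1/query' \
  --data-urlencode 'query=(node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)*100'
#The JSON response contains one result per scraped instance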
4. Displaying Prometheus data with Grafana
4.1 Introduction to Grafana
Grafana is an open-source metrics analysis and visualization suite written in Go. It is commonly used to visualize time-series data from infrastructure and application analytics and has a very wide range of use cases.
Grafana supports many kinds of time-series data sources, such as Graphite, InfluxDB, Prometheus and Elasticsearch. Each data source has its own query language and capabilities, but Grafana's query editor supports them all and exposes each source's specific features, so data from different sources can be combined and displayed together on a single dashboard.
Grafana is best known for its polished visualizations; it ships ready-made dashboard templates that can be used directly with many kinds of software.
By default it listens on TCP port 3000, supports integration with external authentication services, and exposes its own metrics at /metrics. Dashboard templates for almost anything can be found at
https://grafana.com/dashboards/
4.2 Deploying Grafana
Official Grafana download page:
https://grafana.com/grafana/download
Grafana is installed on the same host as the Prometheus server
#Install the package
[root@grafana ~]#wget https://dl.grafana.com/enterprise/release/grafana-enterprise_8.2.1_amd64.deb
#Or download from a mirror inside China
[root@grafana ~]#wget https://mirrors.tuna.tsinghua.edu.cn/grafana/apt/pool/main/g/grafana/grafana_9.0.6_amd64.deb
[root@grafana ~]#apt update
[root@grafana ~]#apt -y install ./grafana-enterprise_8.2.1_amd64.deb
#Note: a local file is being installed, so the file path must be given
#If the installation fails, resolve the dependencies with
[root@grafana ~]#apt -y --fix-broken install
#List the available plugins (optional)
[root@grafana ~]#grafana-cli plugins list-remote
#Install a plugin; if it fails, retry a few times (optional)
[root@grafana ~]#grafana-cli plugins install grafana-piechart-panel
[root@grafana ~]#grafana-cli plugins ls
#Start the service
[root@grafana ~]#systemctl daemon-reload
[root@grafana ~]#systemctl enable --now grafana-server.service
#Check the result
[root@grafana ~]#systemctl status grafana-server.service
[root@grafana ~]#ss -ntulp|grep grafana
tcp LISTEN 0 4096 *:3000 *:* users:(("grafana-server",pid=35383,fd=8))
#The output shows that port 3000 is now listening on this host
Deploying Grafana with Docker
#1 Create the Grafana data, configuration and log directories
[root@cs7 prometheus-data]# mkdir /prometheus-data/grafana/{data,log,conf} -p
#2 Run a temporary Grafana container
[root@cs7 prometheus-data]# docker run -d --name=grafana \
--restart=always -p 3000:3000 \
grafana/grafana-enterprise
#3 Inspect the container's default data, configuration and log directories
[root@cs7 grafana]# docker inspect grafana
"Env": [
"PATH=/usr/share/grafana/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"GF_PATHS_CONFIG=/etc/grafana/grafana.ini", #配置文件
"GF_PATHS_DATA=/var/lib/grafana", #数据目录
"GF_PATHS_HOME=/usr/share/grafana",
"GF_PATHS_LOGS=/var/log/grafana", #日志目录
"GF_PATHS_PLUGINS=/var/lib/grafana/plugins", #插件目录
"GF_PATHS_PROVISIONING=/etc/grafana/provisioning"
]
#4 Copy the configuration and the other files from the container into the corresponding host directories
[root@cs7 grafana]# docker cp grafana:/etc/grafana/. /prometheus-data/grafana/conf/
[root@cs7 grafana]# docker cp grafana:/var/lib/grafana/. /prometheus-data/grafana/data/
[root@cs7 grafana]# docker cp grafana:/var/log/grafana/. /prometheus-data/grafana/log/
#5 Create the user that grafana runs as
[root@cs7 grafana]# useradd grafana
#6 Make sure the uid and gid are 472; by default the Grafana container runs with uid and gid 472
[root@cs7 grafana]# getent passwd | grep grafana
grafana:x:472:472::/home/grafana:/bin/bash
[root@cs7 grafana]# getent group | grep grafana
grafana:x:472:
#7 Change the ownership of the directories
[root@cs7 grafana]# chown -R grafana:grafana /prometheus-data/grafana/
#8 Remove the temporary grafana container
[root@cs7 grafana]# docker rm -f grafana
#9 Create a startup script for the grafana container
[root@cs7 grafana]# cat grafana.sh
docker run -d --name grafana -p 3000:3000 \
-v /prometheus-data/grafana/conf:/etc/grafana \
-v /prometheus-data/grafana/data:/var/lib/grafana \
-v /prometheus-data/grafana/log:/var/log/grafana \
--restart=always \
grafana/grafana-enterprise:latest
#10 Run the script to create the container
[root@cs7 grafana]# bash grafana.sh
bca57cce8e2f4c1853d6ce979272efaecb68c500b507484fdbb81001c2b143ec
4.3 Configure the Prometheus data source
4.3.1 Log in to the Grafana web UI
#Browse to http://master1.wang.org:3000/ and check the result
#Log in with the default user name and password admin/admin; Grafana then prompts for a new password
After entering the new password (for convenience the original password admin is reused here), click "Submit" to reach the home page
4.3.2 Add Prometheus as a data source
Add a data source: click "Add your first data source"
Select "Prometheus" to open the configuration form
Fill in the Settings page for Prometheus as shown below
#Note: the default address does not work; enter the Prometheus address (10.0.0.101:9090 or localhost:9090). Nothing else needs to be changed.
Leave the remaining fields unchanged and click "Save & Test" at the bottom to check the connection
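The data source can also be defined declaratively instead of through the UI, using Grafana's data-source provisioning (a sketch; the file name and the Prometheus address are assumptions to adapt to the actual environment):
#/etc/grafana/provisioning/datasources/prometheus.yml
#(for the container deployment above this directory is under /prometheus-data/grafana/conf/provisioning/datasources/)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://192.168.225.167:9090
    isDefault: true
#Restart grafana-server (or the grafana container) to load the file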
4.3.3 Use the dashboards bundled with the data-source plugin
On the Dashboards tab of the data source, click Import to load the three built-in templates. The built-in templates may fail to show data; in that case a suitable Grafana dashboard has to be imported instead
Click Import to import a template
4.4 Import a dedicated template to display the node_exporter data
The built-in templates above are not ideal, so import a well-regarded dashboard template from the Grafana site
4.4.1 Search for a template on the Grafana website
https://grafana.com/grafana/dashboards/?dataSource=prometheus
4.4.2 Import the template
Once the import completes, the monitoring dashboard is displayed
Result: as soon as node_exporter is deployed on a node, that node's data shows up on the dashboard.
If no data is shown, the usual causes are:
- node_exporter is not installed on the node (see the quick checks after this list)
- the clock of the browser host differs from the server clock by more than 3 seconds
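Two quick checks for these conditions (a sketch): on the monitored node, confirm node_exporter is running and serving metrics; on both the browser host and the server, compare the clocks:
#On the monitored node
systemctl is-active node_exporter
curl -s http://localhost:9100/metrics | head -n 3
#On both the browser host and the server: check the system clock and NTP synchronization status
timedatectl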
5. Monitoring ZooKeeper
Monitoring metrics can be provided in the following ways
- Built into Prometheus itself
- Built into the application; ZooKeeper, for example, has the exporter functionality built in, so no extra exporter needs to be installed
- A separate, third-party exporter
- Instrumentation code written into the application with a client library, which requires development work
5.1 Install and configure ZooKeeper
#Installing ZooKeeper itself is omitted here
#Enable the built-in Prometheus support
[root@node3 ~]#vim /usr/local/zookeeper/conf/zoo.cfg
#Append the following three lines; recent default configurations already contain them commented out, in which case it is enough to uncomment them
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
[root@node3 ~]#systemctl restart zookeeper.service
#Browse to: http://node1.wang.org:7000/metrics
5.2 Modify the Prometheus configuration
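A sketch of the scrape job to append under scrape_configs in prometheus.yml; the job name and the node addresses are examples, and port 7000 is the metricsProvider.httpPort configured above:
  - job_name: 'zookeeper'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['192.168.225.112:7000','192.168.225.113:7000','192.168.225.114:7000']
#Check the syntax and restart, as in 3.1:
#promtool check config /usr/local/prometheus/conf/prometheus.yml && systemctl restart prometheus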
5.3 View the monitoring data in Prometheus
5.4 View the monitoring data in Grafana
6. Collecting custom data with Pushgateway
6.1 Introduction to Pushgateway
Official documentation:
https://prometheus.io/docs/practices/pushing/
Although the many available exporters already provide rich data, production environments still need to collect user-defined data, and Pushgateway can be used for this
Pushgateway is another way to get monitoring data into Prometheus, one in which the client actively pushes its data; this is the opposite of an exporter, which is passively scraped
It is a stand-alone component that can run on any node (it does not have to run on the monitored client)
User-defined scripts or programs push the data to be monitored to Pushgateway, and the Prometheus server then pulls it from Pushgateway
Drawbacks of Pushgateway
- Pushgateway is a single point of failure: if many applications push to one Pushgateway process and that process dies, their monitoring data can no longer be collected
- Pushgateway does not validate the pushed data in any intelligent way: if a collection script produces bad values, Pushgateway passes them on to Prometheus unchanged
6.2 Installing Pushgateway
- Binary installation
[root@prometheus ~]#wget https://github.com/prometheus/pushgateway/releases/download/v1.4.3/pushgateway-1.4.3.linux-amd64.tar.gz
[root@prometheus ~]#tar xf pushgateway-1.4.3.linux-amd64.tar.gz -C /usr/local/
[root@prometheus ~]#ln -s /usr/local/pushgateway-1.4.3.linux-amd64 /usr/local/pushgateway
[root@prometheus ~]#ls /usr/local/pushgateway
LICENSE NOTICE pushgateway
[root@prometheus ~]#mkdir /usr/local/pushgateway/bin
[root@prometheus ~]#mv /usr/local/pushgateway/pushgateway /usr/local/pushgateway/bin
[root@prometheus ~]#ldd /usr/local/pushgateway/bin/pushgateway
not a dynamic executable
[root@prometheus ~]#ln -s /usr/local/pushgateway/bin/pushgateway /usr/local/bin/
[root@prometheus ~]#pushgateway --help
usage: pushgateway [<flags>]
The Pushgateway
Flags:
-h, --help Show context-sensitive help (also try --help-long and --help-man).
--web.config.file="" [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication.
--web.listen-address=":9091"
Address to listen on for the web interface, API, and telemetry.
--web.telemetry-path="/metrics"
Path under which to expose metrics.
--web.external-url= The URL under which the Pushgateway is externally reachable.
--web.route-prefix="" Prefix for the internal routes of web endpoints. Defaults to the path of
--web.external-url.
--web.enable-lifecycle Enable shutdown via HTTP request.
--web.enable-admin-api Enable API endpoints for admin control actions.
--persistence.file="" File to persist metrics. If empty, metrics are only kept in memory.
--persistence.interval=5m The minimum interval at which to write out the persistence file.
--push.disable-consistency-check
Do not check consistency of pushed metrics. DANGEROUS.
--log.level=info Only log messages with the given severity or above. One of: [debug, info, warn,
error]
--log.format=logfmt Output format of log messages. One of: [logfmt, json]
--version Show application version.
[root@prometheus ~]#pushgateway
ts=2022-08-06T14:55:02.911Z caller=main.go:85 level=info msg="starting
pushgateway" version="(version=1.4.3, branch=HEAD,
revision=f9dc1c8664050edbc75916c3664be1df595a1958)"
ts=2022-08-06T14:55:02.911Z caller=main.go:86 level=info build_context="
(go=go1.18.2, user=root@75e397dd33fe, date=20220530-19:02:00)"
ts=2022-08-06T14:55:02.913Z caller=main.go:139 level=info listen_address=:9091
ts=2022-08-06T14:55:02.914Z caller=tls_config.go:195 level=info msg="TLS is
disabled." http2=false
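Running pushgateway in the foreground like this is only useful for a quick test; a sketch of a systemd unit for it, modelled on the Prometheus unit in 1.3.2 (the user and the listen address are assumptions):
#/lib/systemd/system/pushgateway.service
[Unit]
Description=Prometheus Pushgateway
After=network.target
[Service]
Restart=on-failure
User=prometheus
ExecStart=/usr/local/pushgateway/bin/pushgateway --web.listen-address=:9091
[Install]
WantedBy=multi-user.target
#Enable it with: systemctl daemon-reload && systemctl enable --now pushgateway.service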
- Container deployment
[root@cs7 prometheus-data]# docker run -d --name pushgateway --restart=always -p 9091:9091/tcp bitnami/pushgateway:latest
- Log in to the Pushgateway web UI
6.3 Configure Prometheus to collect the Pushgateway data
[root@prometheus ~]#vim /usr/local/prometheus/conf/prometheus.yml
  - job_name: "pushgateway"
    static_configs:
      - targets:
        - "192.168.225.167:9091"
- Log in to the Prometheus server and check the new target
6.4 Configure clients to send data to Pushgateway
6.4.1 Collect the number of processes on the current host and push it to Pushgateway
#!/bin/bash
METRIC_NAME=process_number
VALUE=$[$(ps -ef | wc -l)-1]
PUSHGATEWAY_HOST='192.168.225.167:9091'
EXPORTER_JOB='pushgateway'
EXPORTER_HOST=$(hostname -I| cut -d " " -f 1)
CURL_URL="curl --data-binary @- http://${PUSHGATEWAY_HOST}/metrics/job/${EXPORTER_JOB}/instance/${EXPORTER_HOST}"
echo "${METRIC_NAME} ${VALUE}" | ${CURL_URL}
- Verify the data (see the sketch below)
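A sketch of how the pushed value can be checked, both on the Pushgateway itself and, once it has been scraped, through the Prometheus HTTP API (addresses as used above):
#Metrics currently held by the Pushgateway
curl -s http://192.168.225.167:9091/metrics | grep process_number
#The same metric queried through Prometheus
curl -s 'http://192.168.225.167:9090/api/v1/query' --data-urlencode 'query=process_number'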
6.4.2 Collect the number of logged-in users on the current host and push it to Pushgateway
#The command below pushes the data once (a test metric "age" with the value 8); for periodic pushes use cron or the loop script below
[root@ubuntu2004 ~]#echo "age 8" | curl --data-binary @- http://192.168.225.167:9091/metrics/job/pushgateway/instance/`hostname -I`
#Explanation
192.168.225.167:9091   #IP and port of the host running Pushgateway
pushgateway            #The Prometheus job name
`hostname -I`          #Uses the current host's IP as the instance name
#Script that pushes a custom metric
[root@ubuntu2004 ~]#vim pushgateway_metric.sh
#!/bin/bash
#********************************************************************
#Author: wangxiaochun
#QQ: 29308620
#Date: 2022-08-08
#FileName: pushgateway_metric.sh
#URL: http://www.wangxiaochun.com
#Description: The test script
#Copyright (C): 2022 All rights reserved
#********************************************************************
METRIC_NAME=login_number
METRIC_VALUE="who | wc -l"
PUSHGATEWAY_HOST=192.168.225.167:9091
EXPORTED_JOB=pushgateway
SLEEP_TIME=1
CURL_URL="curl --data-binary @- http://${PUSHGATEWAY_HOST}/metrics/job/${EXPORTED_JOB}/instance/`hostname -I`"
action() {
while true ;do
VALUE=`eval "$METRIC_VALUE"`
echo "${METRIC_NAME} ${VALUE}" | $CURL_URL
sleep $SLEEP_TIME
done
}
action
[root@ubuntu2004 ~]#bash pushgateway_metric.sh
- Verify the data (the same checks as in 6.4.1 apply, with login_number as the metric name)