CentOS 7: installing and deploying Kubernetes 1.14.2 with kubeadm (latest as of 2019-05-28)
GPUs: 4 × RTX 2080 Ti per host
Kubernetes is generally installed one of two ways: with kubeadm, or from binaries (the latter is considerably more tedious).
Test environment:
- CentOS 7.6  master  192.168.0.108
- CentOS 7.6  node1   192.168.0.107
- CentOS 7.6  node2   192.168.0.109
Network plugin: Calico
Steps:
1. Install CentOS 7.6 (all hosts)
2. Install the NVIDIA driver (all hosts)
(If a driver was installed previously:)
yum list installed  (list every installed package)
Find the packages whose names begin with NVIDIA (or are related to it) and remove them: yum remove <package>...
Once they are removed, the old driver is gone.
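The package hunt above can be sketched as a small pipeline. `nvidia_pkgs` is a hypothetical helper, not part of yum; it just filters `yum list installed` output down to NVIDIA-related package names:

```shell
# nvidia_pkgs: read `yum list installed` text on stdin, print NVIDIA package names.
nvidia_pkgs() {
  grep -i '^nvidia' | awk '{print $1}'
}
# Intended usage on a real host:
#   yum list installed | nvidia_pkgs | xargs -r yum remove -y
```

The `xargs -r` keeps `yum remove` from running at all when nothing matches.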
(If no driver was ever installed, skip the steps above. Run the following as root.)
(1) Add the ELRepo repository
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
(2) Install the driver-detection tool
yum install -y nvidia-detect
(3) Run nvidia-detect -v; the output is the driver version we need to install
(4) Download the .run file for that version from:
https://www.nvidia.com/Download/index.aspx?lang=en-us
(5) Prepare for installation
yum -y update  (note: this upgrades the whole system; use at your discretion)
yum -y groupinstall "Development Tools"  (required build tools)
yum -y install kernel-devel
yum -y install epel-release
yum -y install dkms
(6) Edit the GRUB configuration
vi /etc/default/grub
Append the following to GRUB_CMDLINE_LINUX:
rd.driver.blacklist=nouveau nouveau.modeset=0
Then regenerate the config:
grub2-mkconfig -o /boot/grub2/grub.cfg
(7) Create the blacklist
vi /etc/modprobe.d/blacklist.conf
Add the line: blacklist nouveau
(8) Rebuild the initramfs
Back up the current initramfs, then build a new one without nouveau:
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
dracut /boot/initramfs-$(uname -r).img $(uname -r)
(9) Reboot
reboot
(10) Confirm nouveau is disabled
lsmod | grep nouveau  (no output means it was disabled successfully)
(11) Run the installer
sh NVIDIA-Linux-x86_64-418.74.run
3. Install CUDA (all hosts)
(1) Check whether the gcc and g++ compilers are installed
yum list installed | grep gcc
yum list installed | grep gcc-c++
(2) Install gcc and g++ if missing
yum install gcc
yum install gcc-c++
(3) Download CUDA from
https://developer.nvidia.com/cuda-downloads
(4) Install CUDA following the installer prompts
(5) vi .bashrc
Append the following (adjust cuda-<version> to the version you installed):
export PATH=$PATH:/usr/local/cuda-10.1/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.1/lib64
Save and exit, then apply it immediately:
source .bashrc
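The .bashrc edit can be made idempotent so repeated runs do not keep appending duplicate lines. A sketch; `add_cuda_paths` is a hypothetical helper and the CUDA path should match your installed version:

```shell
# add_cuda_paths FILE [CUDA_HOME]: append the two CUDA export lines to FILE,
# but only if they are not already there (safe to re-run).
add_cuda_paths() {
  profile="$1"
  cuda_home="${2:-/usr/local/cuda-10.1}"
  # -qs: quiet, and no error if the profile file does not exist yet
  grep -qs "$cuda_home/bin" "$profile" && return 0
  printf 'export PATH=$PATH:%s/bin\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:%s/lib64\n' \
    "$cuda_home" "$cuda_home" >> "$profile"
}
# Intended usage:  add_cuda_paths ~/.bashrc && source ~/.bashrc
```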
4. Install cuDNN
(1) Register an NVIDIA developer account and download cuDNN (the downloaded file may have an unfamiliar, long suffix; rename it to .tgz)
(2) Extract it
tar -xzvf cudnn-10.1-linux-x64-v7.5.1.10.tgz
(3) Copy the files into the CUDA directories
(adjust the version number to yours)
sudo cp cuda/include/cudnn.h /usr/local/cuda-10.1/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda-10.1/lib64
(4) Fix the file permissions
sudo chmod a+r /usr/local/cuda-10.1/include/cudnn.h /usr/local/cuda-10.1/lib64/libcudnn*
5. Install Kubernetes
(1) Environment presets (all hosts)
Disable firewalld:
systemctl stop firewalld && systemctl disable firewalld
Disable SELinux:
setenforce 0 && sed -i "s/SELINUX=enforcing/SELINUX=disabled/g" /etc/selinux/config
Disable swap:
swapoff -a && sed -i "s/\/dev\/mapper\/centos-swap/\#\/dev\/mapper\/centos-swap/g" /etc/fstab
Switch to the Aliyun yum mirror:
mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.bak  (back up the original)
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
Add the hosts to /etc/hosts:
192.168.0.108 master
192.168.0.107 node1
192.168.0.109 node2
(2) Install the Docker engine (all hosts)
Remove any old Docker version:
yum remove docker docker-common docker-selinux docker-engine
Install the dependencies:
yum install -y yum-utils device-mapper-persistent-data lvm2
Add the Aliyun Docker repository:
yum-config-manager \
--add-repo \
https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
Install docker-ce:
yum makecache fast
yum install -y docker-ce
Install nvidia-docker2.
First remove the old nvidia-docker 1.0 and its volumes, if present:
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo yum remove nvidia-docker
Add the nvidia-docker repository:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
sudo tee /etc/yum.repos.d/nvidia-docker.repo
Install it:
yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
Adjust the Docker daemon settings. Docker reads /etc/docker/daemon.json (not daemon.conf), so write a single file that sets the registry mirrors, the systemd cgroup driver, and the NVIDIA default runtime:
mkdir -p /etc/docker
tee /etc/docker/daemon.json <<-'EOF'
{
  "registry-mirrors": ["https://5twf62k1.mirror.aliyuncs.com", "https://docker.mirrors.ustc.edu.cn/"],
  "exec-opts": ["native.cgroupdriver=systemd"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
Restart Docker so the changes take effect (the cgroup-driver change in particular requires a full restart, not just a SIGHUP reload):
systemctl restart docker
Test with nvidia-smi:
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi  (should show the same output as running nvidia-smi on the host)
Check Docker's Cgroup Driver:
docker info | grep Cgroup  (a warning may appear; searching the message usually turns up an easy fix)
It should print: Cgroup Driver: systemd
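This check can also be scripted instead of eyeballed. A sketch that parses `docker info` text; `cgroup_driver` is a hypothetical helper:

```shell
# cgroup_driver: read `docker info` output on stdin, print the Cgroup Driver value.
cgroup_driver() {
  awk -F': ' '/Cgroup Driver/ {print $2}'
}
# Intended usage:
#   [ "$(docker info 2>/dev/null | cgroup_driver)" = "systemd" ] || echo "fix daemon.json"
```

Failing fast here saves debugging later, since a cgroup-driver mismatch between Docker and the kubelet is a common cause of node instability.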
(3) Install the Kubernetes bootstrap tools (all hosts)
Use the Aliyun Kubernetes repository:
cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64/
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
EOF
yum install -y kubelet kubeadm kubectl  (the latest version at this time is 1.14.2)
Enable and start the kubelet: systemctl enable kubelet && systemctl start kubelet
At this point systemctl status kubelet may report a failure; it will come up once the cluster is initialized.
(4) Pre-download the images (on master)
List the images (and versions) the cluster initialization needs:
kubeadm config images list  (these images live on k8s.gcr.io, which is blocked in China, so they must be downloaded separately before the cluster can be initialized)
……
k8s.gcr.io/kube-apiserver:v1.14.2
k8s.gcr.io/kube-controller-manager:v1.14.2
k8s.gcr.io/kube-scheduler:v1.14.2
k8s.gcr.io/kube-proxy:v1.14.2
k8s.gcr.io/pause:3.1
k8s.gcr.io/etcd:3.3.10
k8s.gcr.io/coredns:1.3.1
Download script:
#!/bin/bash
# Pull the required images from the Aliyun mirror, retag them as k8s.gcr.io,
# then remove the mirror tags.
set -e
KUBE_VERSION=v1.14.2
KUBE_PAUSE_VERSION=3.1
ETCD_VERSION=3.3.10
CORE_DNS_VERSION=1.3.1
GCR_URL=k8s.gcr.io
ALIYUN_URL=registry.cn-hangzhou.aliyuncs.com/google_containers
images=(kube-proxy:${KUBE_VERSION}
kube-scheduler:${KUBE_VERSION}
kube-controller-manager:${KUBE_VERSION}
kube-apiserver:${KUBE_VERSION}
pause:${KUBE_PAUSE_VERSION}
etcd:${ETCD_VERSION}
coredns:${CORE_DNS_VERSION})
for imageName in "${images[@]}" ; do
  docker pull $ALIYUN_URL/$imageName
  docker tag $ALIYUN_URL/$imageName $GCR_URL/$imageName
  docker rmi $ALIYUN_URL/$imageName
done
(5) Initialize the cluster (on master)
kubeadm init --kubernetes-version=v1.14.2 --pod-network-cidr=192.168.0.0/16
(If the output ends with a "kubeadm join 192.168.0.108:6443 --token …" line, initialization succeeded.)
The output also lists the necessary next steps and the command nodes use to join the cluster; follow it:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Check the running pods:
kubectl get pod -n kube-system -owide
Some pods will be stuck in Pending because no network plugin is installed yet; once Calico is installed in the next step they will become Running (give it a while, then check again).
(6) Install Calico (on master)
kubectl apply -f \
https://docs.projectcalico.org/v3.5/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml
Check the node status:
kubectl get node -owide
(7) Join the cluster (on the node hosts)
First download the required images on each node that will join. Download script:
#!/bin/bash
# Pull the images a worker node needs (kube-proxy and pause) from the Aliyun
# mirror and retag them as k8s.gcr.io.
set -e
KUBE_VERSION=v1.14.2
KUBE_PAUSE_VERSION=3.1
GCR_URL=k8s.gcr.io
ALIYUN_URL=registry.cn-hangzhou.aliyuncs.com/google_containers
images=(kube-proxy:${KUBE_VERSION}
pause:${KUBE_PAUSE_VERSION})
for imageName in "${images[@]}" ; do
  docker pull $ALIYUN_URL/$imageName
  docker tag $ALIYUN_URL/$imageName $GCR_URL/$imageName
  docker rmi $ALIYUN_URL/$imageName
done
Then take the join command from the master's initialization output and run it on each worker node:
kubeadm join 192.168.0.108:6443 --token …
(8) Check the node status on the master
kubectl get nodes
(9) Adding nodes later
The token generated at initialization expires after 24 hours (on master):
kubeadm token list  (expired tokens no longer appear here)
kubeadm token create  (generate a new token)
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'  (compute the CA cert hash, on master)
kubeadm join 192.168.0.108:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>  (on the node)
kubectl get nodes  (on master)
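The token and hash steps above can be combined into one helper on the master. This is a sketch under the assumption that openssl is available; `join_cmd` is a hypothetical name, and the token must come from `kubeadm token create`:

```shell
# join_cmd CA_CERT TOKEN: print the full `kubeadm join` command for a new node.
join_cmd() {
  ca="$1"
  token="$2"
  # Standard discovery-hash pipeline: SHA-256 of the DER-encoded CA public key.
  hash=$(openssl x509 -pubkey -in "$ca" \
    | openssl rsa -pubin -outform der 2>/dev/null \
    | openssl dgst -sha256 -hex | sed 's/^.* //')
  printf 'kubeadm join 192.168.0.108:6443 --token %s --discovery-token-ca-cert-hash sha256:%s\n' \
    "$token" "$hash"
}
# Intended usage on the master:
#   join_cmd /etc/kubernetes/pki/ca.crt "$(kubeadm token create)"
```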
(10) Configure nvidia-device-plugin on the nodes
scp -r /etc/kubernetes/admin.conf 192.168.0.107:/etc/kubernetes/  (on master; repeat for 192.168.0.109)
echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> ~/.bash_profile  (on each node)
source ~/.bash_profile  (on each node)
wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml  (on each node)
kubectl delete -f nvidia-device-plugin.yml  (on each node; ignore the error if it was never created)
kubectl create -f nvidia-device-plugin.yml  (on each node)
systemctl restart kubelet  (on each node)
kubectl describe node master/node1/node2 should now show nvidia.com/gpu: 4, which means it worked.
(If it does not appear, check whether the GPU driver has broken: run nvidia-smi.)
(Many image pulls will fail during installation; just retry a few times.)