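These are the notes I took while learning the U-Net model. The source code comes from an uploader on Bilibili, and this is the first model I have ever run myself.
The data comes from my advisor's project, in collaboration with a senior student from another department. The data I received was in JSON format, so:
Part 2 converts the JSON labels to PNG.
Part 3 deploys the model on my own machine (Windows 10 + PyCharm). At the very last step my GPU turned out to be too weak to allocate enough memory, so I moved to the university cluster and deployed directly into a senior labmate's existing environment, which made it easy to get running.
Part 4 covers the problems I ran into while setting up my own PyTorch environment. The college deployed a new cluster, so I still needed an environment of my own. I requested a PyTorch instance on the cluster, which comes with Anaconda preconfigured, so I only had to install the matching version of torch.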
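Part 1. Network preparation
Part 2. Dataset preparation
My advisor sent me the original images and the label data.
2.1 Converting the JSON labels to PNG
2.1.1 Installing labelme
In the conda prompt, run: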
pip install labelme==3.16.7
2.1.2 Creating the label images
Run labelme on the command line to open the program:
labelme
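Click Open Dir to open the folder.
Opening the files failed.
Cause: a version mismatch. I had installed 3.16.7, but the label files had been created with version 5.0.1.
Solution: uninstall labelme and install the newer version.
Then a new problem appeared:
'labelme' is not recognized as an internal or external command, operable program or batch file.
Solution: uninstall and reinstall.
Run labelme on the command line again.
It opens successfully.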
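Before choosing a labelme version, it can help to check which version wrote the label files. The following is a minimal sketch, assuming each labelme JSON file carries a top-level "version" field (treat that as an assumption about your files); `labelme_versions` is a hypothetical helper, not part of labelme.

```python
import json
from collections import Counter
from pathlib import Path


def labelme_versions(json_dir):
    """Count the "version" field across all labelme JSON files in json_dir."""
    counts = Counter()
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # Files without the field are grouped under "unknown".
        counts[data.get("version", "unknown")] += 1
    return dict(counts)
```

Calling `labelme_versions(...)` on the folder that holds the JSON files returns a dictionary mapping each version string to the number of files that report it.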
2.1.3 Converting the JSON files
Open json_to_dataset.py.
Change classes to your own classes, adding one extra class for the background:
background
After the change, run the script.
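To avoid typos when filling in classes, the label names can be read from the JSON files themselves. This sketch assumes the usual labelme layout, where every entry of "shapes" has a "label" string. `collect_classes` is a hypothetical helper, and the background name must match whatever your copy of json_to_dataset.py expects.

```python
import json
from pathlib import Path


def collect_classes(json_dir, background="background"):
    """Return [background, label1, label2, ...] found in labelme JSON files."""
    labels = set()
    for path in Path(json_dir).glob("*.json"):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        for shape in data.get("shapes", []):
            labels.add(shape["label"])
    labels.discard(background)
    # Background goes first so that it maps to class index 0.
    return [background] + sorted(labels)
```

The length of the returned list is the number of real classes plus one for the background, which is the value num_classes needs later in train.py.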
It fails with:
Traceback (most recent call last):
File "F:/pytorch_project/unet-pytorch-main/unet-pytorch-main/json_to_dataset.py", line 71, in <module>
utils.lblsave(osp.join(pngs_path, count[i].split(".")[0]+'.png'), new)
File "C:\Users\zhw\.conda\envs\pytorch01\lib\site-packages\labelme\utils\_io.py", line 15, in lblsave
lbl_pil = PIL.Image.fromarray(lbl.astype(np.uint8), mode="P")
File "C:\Users\zhw\.conda\envs\pytorch01\lib\site-packages\PIL\Image.py", line 2965, in fromarray
raise ValueError(f"Too many dimensions: {ndim} > {ndmax}.")
ValueError: Too many dimensions: 3 > 2.
Cause: the labelme version is too new.
Reference: https://blog.csdn.net/m0_48095841/article/details/123998484
Solution: uninstall 5.0.1 and install 3.16.7:
pip uninstall labelme
pip install labelme==3.16.7
Run json_to_dataset.py again.
Finally, copy JPEGImages and SegmentationClass into the VOC2007 folder.
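After copying, it is worth confirming that every image has a matching label PNG. A small sketch, assuming the images are .jpg files and that image and mask share the same file stem; `unpaired_images` is a hypothetical helper.

```python
from pathlib import Path


def unpaired_images(voc_root):
    """Return stems of images in JPEGImages with no PNG in SegmentationClass."""
    root = Path(voc_root)
    images = {p.stem for p in (root / "JPEGImages").glob("*.jpg")}
    masks = {p.stem for p in (root / "SegmentationClass").glob("*.png")}
    return sorted(images - masks)
```

An empty list means every image has a mask. Any names that come back point at JSON files that were not converted.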
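Part 3. Training the model on Windows 10 + PyCharm
3.1 Splitting the data into training and validation sets
Run voc_annotation.py to do the split.
Because I had very little data, I used all of it for training.
Result:
train: used for training
val: used for validation
3.2 Changing the parameters
In train.py, set num_classes to the number of classes to distinguish plus 1.
Following the README, download the pretrained weights for model_path.
Running it fails with: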
initialize network with normal type
Load weights model_data/resnet50-19c8e357.pth.
Traceback (most recent call last):
File "F:/pytorch_project/unet-pytorch-main/unet-pytorch-main/train.py", line 284, in <module>
pretrained_dict = torch.load(model_path, map_location = device)
File "C:\Users\zhw\.conda\envs\pytorch01\lib\site-packages\torch\serialization.py", line 795, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "C:\Users\zhw\.conda\envs\pytorch01\lib\site-packages\torch\serialization.py", line 987, in _legacy_load
return legacy_load(f)
File "C:\Users\zhw\.conda\envs\pytorch01\lib\site-packages\torch\serialization.py", line 924, in legacy_load
storage._storage, storage_offset, numel, stride)
RuntimeError: Attempted to set the storage of a tensor on device "cpu" to a storage on different device "cuda:0". This is no longer allowed; the devices must match.
Process finished with exit code 1
Cause: possibly a version issue.
Solution: download a different weights file and use that instead.
Running again hits a new problem:
Traceback (most recent call last):
File "F:/pytorch_project/unet-pytorch-main/unet-pytorch-main/train.py", line 410, in <module>
raise ValueError("数据集过小,无法继续进行训练,请扩充数据集。")
ValueError: 数据集过小,无法继续进行训练,请扩充数据集。
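The message reads "The dataset is too small to continue training; please enlarge the dataset." The relevant code: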
if epoch_step == 0 or epoch_step_val == 0:
    raise ValueError("数据集过小,无法继续进行训练,请扩充数据集。")
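Reading the code showed that the dataset split was the problem: no validation set had been split off.
So I changed the split code to set aside a validation set.
Run voc_annotation.py again.
The split succeeds.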
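The check above fails whenever either set yields zero batches. The sketch below shows one way to split a list of sample IDs so that the validation set is never empty. It is not the code from voc_annotation.py: `split_ids` and `steps_per_epoch` are hypothetical helpers, and the integer division is my assumption about how the number of steps is derived from the sample count and the batch size.

```python
import random


def split_ids(ids, val_ratio=0.1, seed=0):
    """Shuffle ids and split them into (train, val), keeping val non-empty."""
    ids = list(ids)
    if len(ids) < 2:
        raise ValueError("need at least two samples to split")
    random.Random(seed).shuffle(ids)
    # At least one validation sample, and at least one training sample.
    n_val = min(max(1, round(len(ids) * val_ratio)), len(ids) - 1)
    return ids[n_val:], ids[:n_val]


def steps_per_epoch(num_samples, batch_size):
    """Number of full batches per epoch (assumed to be integer division)."""
    return num_samples // batch_size
```

Under that assumption a non-empty validation set is still not enough on its own: with a batch size of 2, a single validation image gives `steps_per_epoch(1, 2) == 0`, so the validation set needs at least one full batch.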
train.py now runs.
Halfway through training, a new problem appears:
Total Loss: 0.359 || Val Loss: 1.937
Epoch 51/100: 0%| | 0/7 [00:00<?, ?it/s<class 'dict'>]Start Train
Epoch 51/100: 29%|██▊ | 2/7 [00:08<00:18, 3.75s/it, f_score=0.837, lr=4.88e-5, total_loss=0.555]Traceback (most recent call last):
File "F:/pytorch_project/unet-pytorch-main/unet-pytorch-main/train.py", line 491, in <module>
epoch_step, epoch_step_val, gen, gen_val, UnFreeze_Epoch, Cuda, dice_loss, focal_loss, cls_weights, num_classes, fp16, scaler, save_period, save_dir, local_rank)
File "F:\pytorch_project\unet-pytorch-main\unet-pytorch-main\utils\utils_fit.py", line 58, in fit_one_epoch
loss.backward()
File "C:\Users\zhw\.conda\envs\pytorch01\lib\site-packages\torch\_tensor.py", line 488, in backward
self, gradient, retain_graph, create_graph, inputs=inputs
File "C:\Users\zhw\.conda\envs\pytorch01\lib\site-packages\torch\autograd\__init__.py", line 199, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 2.00 GiB total capacity; 1.37 GiB already allocated; 0 bytes free; 1.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 51/100: 29%|██▊ | 2/7 [00:09<00:23, 4.74s/it, f_score=0.837, lr=4.88e-5, total_loss=0.555]
Process finished with exit code 1
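Solution: my own machine is not up to it (the GPU has only 2 GiB of memory); the training ran successfully on the college cluster.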
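The error message itself suggests trying max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of setting it from Python follows; the value 128 is an arbitrary example of mine, not something from the source. In this log the reserved memory (1.66 GiB) is not much larger than the allocated memory (1.37 GiB), so fragmentation is probably not the main issue, and on a 2 GiB card a smaller batch size or input resolution is more likely to matter.

```python
import os

# Set this before importing torch so the allocator sees it on first use.
# 128 is an arbitrary example value, not a recommendation from the source.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```

Using `setdefault` leaves any value already exported in the shell untouched.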
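3.3 Prediction
Update the settings in unet.py.
The video says you can run predict.py directly to predict,
but I changed num_classes first.
Run it.
Part 4. Problems deploying the project in a cluster instance
1. No module named 'cv2'
Solution: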
pip install opencv-python
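Missing modules tend to surface one at a time, each costing a failed run. The sketch below checks several import names up front using the standard library. `missing_modules` and the `PIP_NAMES` table are hypothetical; only the cv2 and scipy entries come from this post, and the others are the usual package names.

```python
import importlib.util

# Import name -> pip package name. Only the cv2 and scipy pairs come from
# this post; the rest are the usual package names.
PIP_NAMES = {
    "cv2": "opencv-python",
    "scipy": "scipy",
    "PIL": "Pillow",
    "matplotlib": "matplotlib",
    "torch": "torch",
}


def missing_modules(names):
    """Return the top-level import names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    for name in missing_modules(PIP_NAMES):
        print(f"missing {name}: try `pip install {PIP_NAMES[name]}`")
```

For torch, a plain pip install may pull in a build that does not match the GPU driver, so choose the build deliberately rather than relying on the hint printed here.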
2. Torch not compiled with CUDA enabled
Solution: uninstall torch:
pip uninstall torch
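The installed torch build has no CUDA support, so it has to be replaced with a build that matches the server. The server's nvidia-smi output: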
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 308... Off | 00000000:50:00.0 Off | N/A |
| 30% 45C P2 99W / 350W | 1030MiB / 12053MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
First attempt: install the PyTorch build for CUDA 11.2 (this did not work).
Run:
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.2 -c pytorch -c conda-forge
Reference: "Download CUDA 11.2".
The installation succeeded, but when I tested it:
>>> torch.cuda.is_available()  # is CUDA available?
False
>>> torch.cuda.get_device_name(0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jessy/.conda/envs/pytorch01/lib/python3.7/site-packages/torch/cuda/__init__.py", line 276, in get_device_name
return get_device_properties(device).name
File "/home/jessy/.conda/envs/pytorch01/lib/python3.7/site-packages/torch/cuda/__init__.py", line 306, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/home/jessy/.conda/envs/pytorch01/lib/python3.7/site-packages/torch/cuda/__init__.py", line 164, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
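Hmm... still broken, and apparently for the same reason.
I kept looking for a suitable version, this time following another blog post.
Uninstall the previous torch version,
then install: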
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
Start python and test:
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.get_device_name(0)
'GeForce RTX 3080 Ti'
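Why a cu111 wheel works on a server that reports CUDA 11.2: the "CUDA Version" shown by nvidia-smi is the newest CUDA runtime the driver supports, and the pip wheels ship their own runtime, so a wheel built for an equal or older CUDA version can run. The sketch below encodes that as a conservative rule of thumb. All three functions are hypothetical helpers, and the list of available tags has to be supplied by you from the PyTorch download page.

```python
import re


def driver_cuda_version(nvidia_smi_text):
    """Extract the 'CUDA Version: X.Y' value from nvidia-smi output as (X, Y)."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_text)
    if match is None:
        raise ValueError("no CUDA version found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def tag_to_version(tag):
    """Convert a wheel tag such as 'cu111' or 'cu102' to (major, minor).

    Assumes the minor version is a single digit.
    """
    digits = tag[2:]
    return int(digits[:-1]), int(digits[-1])


def newest_usable_tag(nvidia_smi_text, available_tags):
    """Pick the newest wheel tag whose CUDA version does not exceed the driver's."""
    limit = driver_cuda_version(nvidia_smi_text)
    usable = [t for t in available_tags if tag_to_version(t) <= limit]
    if not usable:
        return None
    return max(usable, key=tag_to_version)
```

Fed the nvidia-smi header above and the tags cu102 and cu111, the function picks cu111, which is the build that worked here.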
Then start the model.
3. AttributeError: module 'distutils' has no attribute 'version'
(pytorch01) jessy@n1:~/unet-pytorch-main/unet-pytorch-main$ python train.py
Traceback (most recent call last):
File "train.py", line 13, in <module>
from utils.callbacks import LossHistory, EvalCallback
File "/home/jessy/unet-pytorch-main/unet-pytorch-main/utils/callbacks.py", line 17, in <module>
from torch.utils.tensorboard import SummaryWriter
File "/home/jessy/.conda/envs/pytorch01/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
LooseVersion = distutils.version.LooseVersion
AttributeError: module 'distutils' has no attribute 'version'
(pytorch01) jessy@n1:~/unet-pytorch-main/unet-pytorch-main$
Solution:
pip uninstall setuptools
pip install setuptools==59.5.0  # must be lower than the version you had before
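(Taken from a reference blog post.)
Phew, it finally runs. Version conflicts really are the hardest part.
Part 5. Running the U-Net model on the shared cluster
Before running, the environment has to be set up.
First set up Anaconda and PyTorch: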
https://blog.csdn.net/qq_43718758/article/details/128106044
Request a GPU:
https://blog.csdn.net/qq_43718758/article/details/128129733
Problem 1: ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found
(pytorch01) [jessy@workstation unet-pytorch-main]$ python train.py
Traceback (most recent call last):
File "train.py", line 13, in <module>
from utils.callbacks import LossHistory, EvalCallback
File "/home/jessy/torch_project/unet-pytorch-main/utils/callbacks.py", line 3, in <module>
import matplotlib
File "/home/jessy/.conda/envs/pytorch01/lib/python3.7/site-packages/matplotlib/__init__.py", line 109, in <module>
from . import _api, _version, cbook, docstring, rcsetup
File "/home/jessy/.conda/envs/pytorch01/lib/python3.7/site-packages/matplotlib/rcsetup.py", line 27, in <module>
from matplotlib.colors import Colormap, is_color_like
File "/home/jessy/.conda/envs/pytorch01/lib/python3.7/site-packages/matplotlib/colors.py", line 51, in <module>
from PIL import Image
File "/home/jessy/.conda/envs/pytorch01/lib/python3.7/site-packages/PIL/Image.py", line 100, in <module>
from . import _imaging as core
ImportError: /lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by /home/jessy/.conda/envs/pytorch01/lib/python3.7/site-packages/PIL/../../.././libLerc.so)
One solution I found:
Link:
https://blog.csdn.net/qq_41475067/article/details/116895497
Link:
https://blog.csdn.net/weixin_43770077/article/details/109532739
This is the most common fix that turns up in searches. My paths differ from those in the posts, so the commands below are adjusted for my paths.
Run:
strings /usr/lib64/libstdc++.so.6 | grep CXXABI
Run:
strings /opt/app/anaconda3/lib/libstdc++.so.6.0.26 | grep 'CXXABI'
Run:
sudo cp /opt/app/anaconda3/lib/libstdc++.so.6.0.26 /usr/lib64/libstdc++.so.6
But this did not solve my problem: this is the college cluster, and I have no permission to write that file:
(pytorch01) [jessy@workstation unet-pytorch-main]$ sudo cp /opt/app/anaconda3/lib/libstdc++.so.6.0.26 /usr/lib64/libstdc++.so.6
[sudo] password for jessy:
jessy is not in the sudoers file. This incident will be reported.
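The `strings ... | grep CXXABI` check used above can also be done from Python, which is handy when comparing several candidate libraries. This is a sketch that scans the raw bytes for version tags; `cxxabi_versions` is a hypothetical helper and reads the whole file into memory.

```python
import re


def cxxabi_versions(lib_path):
    """Return the sorted CXXABI_x.y.z version strings embedded in a library."""
    with open(lib_path, "rb") as f:
        data = f.read()
    found = set(re.findall(rb"CXXABI_\d+(?:\.\d+)*", data))

    def key(tag):
        # "CXXABI_1.3.9" -> (1, 3, 9) so that 1.3.10 sorts after 1.3.9.
        return tuple(int(x) for x in tag.split("_")[1].split("."))

    return sorted((t.decode("ascii") for t in found), key=key)
```

For example, `"CXXABI_1.3.9" in cxxabi_versions("/usr/lib64/libstdc++.so.6")` tells you whether the system library provides the version that the ImportError asks for.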
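So I looked for a second solution.
Reference:
https://www.jianshu.com/p/bbd19ff140bd
I had actually found and tried this method before, but I forgot to change the path, so it had no effect at the time.
I even asked in the university-wide group chat how to fix it, which was a bit embarrassing. On a whim I tried the method again, noticed the path in it, changed it, and it worked.
First, check where Anaconda is installed: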
(pytorch01) [jessy@workstation unet-pytorch-main]$ conda env list
# conda environments:
#
pytorch01 * /home/jessy/.conda/envs/pytorch01
base /opt/app/anaconda3
py27 /opt/app/anaconda3/envs/py27
py36 /opt/app/anaconda3/envs/py36
The Anaconda path is:
/opt/app/anaconda3
Next, run:
vim ~/.bash_profile
Add the following to the file.
The path in the first line must be changed to your own Anaconda installation path!
LD_LIBRARY_PATH=~/anaconda3/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
The article does not say where in the file to add it, so I just tried a spot:
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/.local/bin:$HOME/bin
LD_LIBRARY_PATH=/opt/app/anaconda3/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
export PATH
After editing, run:
(pytorch01) [jessy@workstation unet-pytorch-main]$ source ~/.bash_profile
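That is all.
Then re-activate the pytorch environment and run again. (Running directly reported that torch was missing even though it was installed, so the environment has to be activated again.)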
(pytorch01) [jessy@workstation unet-pytorch-main]$ python train.py
Traceback (most recent call last):
File "train.py", line 13, in <module>
from utils.callbacks import LossHistory, EvalCallback
File "/home/jessy/torch_project/unet-pytorch-main/utils/callbacks.py", line 9, in <module>
import scipy.signal
ModuleNotFoundError: No module named 'scipy'
(pytorch01) [jessy@workstation unet-pytorch-main]$ pip install scipy
A different error now, which means the previous problem is solved. Hahaha!
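To double-check which directory the LD_LIBRARY_PATH setting will supply libstdc++.so.6 from, the path can be walked in order. Directories on LD_LIBRARY_PATH are normally searched before the system defaults, which is why prepending the Anaconda lib directory works without sudo. `first_provider` is a hypothetical helper, and it only looks at the path string you give it, not at the loader's other search rules.

```python
import os


def first_provider(lib_name, search_path):
    """Return the first directory in a colon-separated path containing lib_name."""
    for directory in search_path.split(":"):
        # Empty entries are skipped here for simplicity.
        if directory and os.path.exists(os.path.join(directory, lib_name)):
            return directory
    return None


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print(first_provider("libstdc++.so.6", path))
```

If this prints None after `source ~/.bash_profile`, the path added to the file is probably wrong, which is the mistake I made the first time.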