
21.12.29: Unsupervised SemSeg

1) Distributed Training

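The code was written on the assumption that DDP is always used, which made it tricky to swap in my own model.
Along the way, I found an excellent post that helped me understand it.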

2) GPU Lost Problem

When I ran the DDP training code, the GPU was lost and the process was forcibly terminated.
Exception in thread StatsThr:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/stats.py", line 133, in _thread_body
    stats = self.stats()
  File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/stats.py", line 179, in stats
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
  File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/pynvml/pynvml.py", line 819, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/pynvml/pynvml.py", line 310, in _nvmlCheckReturn
    raise NVMLError(ret)
wandb.vendor.pynvml.pynvml.NVMLError_GpuIsLost: GPU is lost
Once this happens, there is no way around it: the machine has to be rebooted.
It was genuinely infuriating.
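Since the error only shows up once wandb's stats thread polls NVML, a small pre-flight check before launching a long DDP run might save some time. The following is a minimal sketch, not something from the original code: the helper names `looks_like_gpu_lost` and `gpus_look_healthy` are hypothetical, and it assumes that `nvidia-smi -L` either exits with a non-zero code or prints the "GPU is lost" text when a device is in this state. It does not recover the GPU; it only reports the problem early.

```python
import shutil
import subprocess

# Substring taken from the NVML error in the traceback above.
GPU_LOST_MARKER = "gpu is lost"


def looks_like_gpu_lost(text: str) -> bool:
    """Return True if the text contains the 'GPU is lost' signature."""
    return GPU_LOST_MARKER in text.lower()


def gpus_look_healthy() -> bool:
    """Run `nvidia-smi -L` and report whether it lists GPUs without errors.

    Assumption: a lost GPU makes nvidia-smi fail or print the marker text.
    """
    if shutil.which("nvidia-smi") is None:
        return False  # driver tools not on PATH, so nothing can be verified
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=30,  # avoid hanging if the driver is unresponsive
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    output = result.stdout + result.stderr
    return result.returncode == 0 and not looks_like_gpu_lost(output)


if __name__ == "__main__":
    if gpus_look_healthy():
        print("GPUs look OK")
    else:
        print("GPU problem detected; a reboot may be needed")
```

Calling `gpus_look_healthy()` at the top of a launch script would let the run abort immediately instead of failing partway through training. The same `looks_like_gpu_lost` helper could also be applied to a saved training log to confirm why a run died.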