Containerization technology has greatly simplified GPU computations for machine learning: by wrapping the software stack above the kernel level in containers, it allows juggling different combinations of frameworks, low-level libraries, and hardware drivers. Technologies like nvidia-docker have even unlocked new stack combinations (driver plus low-level CUDA libraries) that were not feasible before. This, however, adds another dimension to the performance optimization problem: now you not only need to choose the optimal hardware for your machine learning task but also vary the driver-CUDA combination to fine-tune performance further.
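As a minimal sketch of the stack swap described above (the image tags and flags here are illustrative; `--gpus` assumes Docker 19.03+ with the NVIDIA Container Toolkit installed, and the exact CUDA tags available on Docker Hub may differ):

```shell
# The host keeps only the NVIDIA kernel driver; the CUDA user-space
# libraries live inside the container image. With the same host driver,
# two different CUDA stacks can be tried without reinstalling anything:
docker run --rm --gpus all nvidia/cuda:10.2-runtime-ubuntu18.04 nvidia-smi
docker run --rm --gpus all nvidia/cuda:11.0-runtime-ubuntu18.04 nvidia-smi
```

Benchmarking the same training job under each image is then enough to compare driver-CUDA combinations on identical hardware. (This fragment requires GPU hardware and is shown as configuration only, so no automated test is attached.)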
The goal of this paper is to shed some light on the latter problem. We compared how different single-GPU VMs performed when training the Mask R-CNN neural network across various combinations of Nvidia driver, CUDA libraries, and GPU family. The results show a meaningful difference in training performance across driver-CUDA toolkit combinations; this difference depends on the GPU family and can reach double-digit percentage points in training time. This indicates that low-level software stack optimization of the host system for containerized ML workloads is feasible and can provide meaningful time or cost savings.