Pod Placement Techniques to Avoid Job Failures Due to Low GPU Memory in a Kubernetes Environment with Shared GPUs