HPC Engineer📣 Job Ad

in Penta Consulting

20 days ago

	Contract Type	Full-time
	Workplace type	On-site
	Location	Riyadh

Job Description

About the Role

Penta Consulting, a technology service provider delivering professional and managed solutions across EMEA, is seeking an experienced HPC Engineer for a full-time position in Riyadh, Saudi Arabia. This role is essential for the design, deployment, configuration, and operation of large-scale high-performance computing (HPC) environments. The ideal candidate will possess a deep, hands-on understanding of all components within an HPC ecosystem and a proven track record of successfully implementing and managing complex infrastructure.

Key Responsibilities

Design, deploy, and maintain HPC clusters end-to-end, including compute nodes, storage tiers, high-speed networking (InfiniBand / RoCE), and management fabric.
Provision and administer NVIDIA Base Command Manager (BCM) for bare-metal cluster imaging, OS lifecycle management, and GPU fleet health monitoring.
Deploy and manage the NVIDIA AI Enterprise Suite, covering installation, licensing, updates, and integration with MLOps pipelines (NeMo, Triton, RAPIDS).
Deploy and operate NVIDIA GPU Operator and Network Operator on Kubernetes to automate driver and CUDA lifecycle, DCGM exporter, and MIG configuration.
Configure and serve NVIDIA NIM inference endpoints, and implement NVIDIA Blueprint reference architectures for production AI workloads.
Install, administer, and tune Slurm, including partitions, QOS, fair-share policies, node accounting, MPI integration, and Slurm-on-Kubernetes hybrid scheduling.
Bootstrap and operate Kubernetes clusters using kubeadm, ensuring control plane High Availability (HA), etcd backup, and zero-downtime upgrades.
Administer RHEL and Canonical Ubuntu across all cluster nodes.
Build and maintain CI/CD pipelines using GitLab CI or GitHub Actions for infrastructure provisioning and HPC software delivery.
Profile and tune GPU and CPU workload performance, resolving bottlenecks across hardware, drivers, MPI fabric, and application layers.
Implement cluster monitoring solutions using Prometheus, Grafana, and DCGM, defining alerting and capacity planning thresholds.
Enforce security best practices, including node hardening, kernel patching, Role-Based Access Control (RBAC), and compliance audits across the HPC environment.

Required Qualifications

Proven experience in designing, deploying, configuring, and operating all components of a large-scale high-performance computing environment.
Hands-on experience with RHEL and Canonical Ubuntu administration.
Experience with CI/CD pipelines for infrastructure provisioning and software delivery.
Proficiency in performance profiling and tuning for GPU and CPU workloads.
Experience with cluster monitoring tools and capacity planning.
Strong understanding of security best practices for HPC environments.
Excellent communication and problem-solving skills.

Technical Skills

HPC Clusters
Compute Nodes
Storage Tiers
High-Speed Networking (InfiniBand, RoCE)
Management Fabric
NVIDIA Base Command Manager (BCM)
NVIDIA AI Enterprise Suite
MLOps Pipelines (NeMo, Triton, RAPIDS)
NVIDIA GPU Operator
NVIDIA Network Operator
Kubernetes
NVIDIA NIM
NVIDIA Blueprint
Slurm
MPI Integration
kubeadm
RHEL
Canonical Ubuntu
CI/CD Pipelines (GitLab CI, GitHub Actions)
Infrastructure Provisioning
GPU and CPU Workload Performance Tuning
Prometheus
Grafana
DCGM
Capacity Planning
Node Hardening
Kernel Patching
RBAC
Compliance Audits

Work Environment and Details

This is a full-time position for an HPC Engineer located in Riyadh, Saudi Arabia. The role requires 5-10 years of relevant experience in high-performance computing environments.