Contract Type: Full-time
Workplace Type: On-site
Location: Riyadh

Job Description

About the Role

Penta Consulting, a technology service provider delivering professional and managed solutions across EMEA, is seeking an experienced HPC Engineer for a full-time position in Riyadh, Saudi Arabia. This role is essential for the design, deployment, configuration, and operation of large-scale high-performance computing (HPC) environments. The ideal candidate will possess a deep, hands-on understanding of all components within an HPC ecosystem and a proven track record of successfully implementing and managing complex infrastructure.

Key Responsibilities

  • Design, deploy, and maintain HPC clusters end-to-end, including compute nodes, storage tiers, high-speed networking (InfiniBand / RoCE), and management fabric.
  • Provision and administer NVIDIA Base Command Manager (BCM) for bare-metal cluster imaging, OS lifecycle management, and GPU fleet health monitoring.
  • Deploy and manage the NVIDIA AI Enterprise Suite, covering installation, licensing, updates, and integration with MLOps pipelines (NeMo, Triton, RAPIDS).
  • Deploy and operate NVIDIA GPU Operator and Network Operator on Kubernetes to automate driver and CUDA lifecycle, DCGM exporter, and MIG configuration.
  • Configure and serve NVIDIA NIM inference endpoints, and implement NVIDIA Blueprint reference architectures for production AI workloads.
  • Install, administer, and tune Slurm, including partitions, QOS, fair-share policies, node accounting, MPI integration, and Slurm-on-Kubernetes hybrid scheduling.
  • Bootstrap and operate Kubernetes clusters using kubeadm, ensuring control plane High Availability (HA), etcd backup, and zero-downtime upgrades.
  • Administer RHEL and Canonical Ubuntu across all cluster nodes.
  • Build and maintain CI/CD pipelines using GitLab CI or GitHub Actions for infrastructure provisioning and HPC software delivery.
  • Profile and tune GPU and CPU workload performance, resolving bottlenecks across hardware, drivers, MPI fabric, and application layers.
  • Implement cluster monitoring solutions using Prometheus, Grafana, and DCGM, defining alerting and capacity planning thresholds.
  • Enforce security best practices, including node hardening, kernel patching, Role-Based Access Control (RBAC), and compliance audits across the HPC environment.

Required Qualifications

  • Proven experience in designing, deploying, configuring, and operating all components of a large-scale high-performance computing environment.
  • Hands-on experience with RHEL and Canonical Ubuntu administration.
  • Experience with CI/CD pipelines for infrastructure provisioning and software delivery.
  • Proficiency in performance profiling and tuning for GPU and CPU workloads.
  • Experience with cluster monitoring tools and capacity planning.
  • Strong understanding of security best practices for HPC environments.
  • Excellent communication and problem-solving skills.

Technical Skills

  • HPC Clusters
  • Compute Nodes
  • Storage Tiers
  • High-Speed Networking (InfiniBand, RoCE)
  • Management Fabric
  • NVIDIA Base Command Manager (BCM)
  • NVIDIA AI Enterprise Suite
  • MLOps Pipelines (NeMo, Triton, RAPIDS)
  • NVIDIA GPU Operator
  • NVIDIA Network Operator
  • Kubernetes
  • NVIDIA NIM
  • NVIDIA Blueprint
  • Slurm
  • MPI Integration
  • kubeadm
  • RHEL
  • Canonical Ubuntu
  • CI/CD Pipelines (GitLab CI, GitHub Actions)
  • Infrastructure Provisioning
  • GPU and CPU Workload Performance Tuning
  • Prometheus
  • Grafana
  • DCGM
  • Capacity Planning
  • Node Hardening
  • Kernel Patching
  • RBAC
  • Compliance Audits

Work Environment and Details

This is a full-time position for an HPC Engineer located in Riyadh, Saudi Arabia. The role requires 5-10 years of relevant experience in high-performance computing environments.


Requirements

  • Requires 5-10 years of experience
