HPC Senior Systems Administrator
in KAUST
about 13 hours ago
| Contract Type | Full-time |
| Workplace type | On-site |
| Location | Makkah |
Job Description
About the Role
KAUST is seeking a Senior HPC Systems Administrator to join the KAUST Supercomputing Laboratory (KSL). This full-time position is based in Makkah, Saudi Arabia, and requires 5-10 years of experience. The role focuses on managing and supporting a large-scale High-Performance Computing (HPC) environment for researchers.
Role Overview
The Senior HPC Systems Administrator will be responsible for the operational management of an HPC cluster comprising approximately 600 CPU and GPU nodes, associated storage systems, and high-speed networks (InfiniBand and Ethernet). This role provides essential support to researchers across computational science, engineering, big data analysis, and artificial intelligence/machine learning workloads.
Key Responsibilities
- Provide timely and effective user support via multiple channels, maintaining high customer service standards.
- Install, configure, and manage HPC subsystems including compute nodes, high-performance storage, InfiniBand, Ethernet, and configuration management tools (e.g., Ansible, Puppet).
- Deploy and manage cluster management software, monitoring tools, and supporting services for HPC clusters.
- Administer the Slurm workload manager, including QOS policies, accounts, accounting, and related automation scripts (Python and C++).
- Develop and maintain automation scripts in Bash and Python to streamline system administration tasks.
- Deploy and manage container environments (Singularity/Apptainer, Docker) for HPC workloads.
- Periodically benchmark HPC system components (CPU, memory, InfiniBand, storage) to ensure optimal performance and identify tuning opportunities.
- Enforce security best practices, including node hardening, kernel patching, and compliance across all systems.
- Manage parallel file systems (e.g., Lustre, GPFS, Weka, Vast), including performance tuning and capacity planning.
- Directly support research activities by collaborating with faculty, researchers, and partners, in conjunction with application support teams.
- Develop software tools and utilities to support research projects on cluster systems.
- Drive proof-of-concept projects and technology evaluations, and research industry best practices.
- Coordinate with vendors and service providers to resolve issues promptly.
- Develop and maintain user documentation, standard operating procedures, and training materials.
- Stay current with HPC advancements through continuous learning and professional collaboration, contributing to future hardware procurement decisions.
Required Qualifications and Experience
- Bachelor’s or Master’s degree in Computer Science/Engineering, Information Systems, or a related field.
- Minimum of five years of experience supporting large-scale computing platforms and related subsystems.
- Experience troubleshooting complex hardware issues and documenting root cause analysis.
- Experience managing parallel storage systems (e.g., Lustre, GPFS, Weka, Vast).
- Experience benchmarking HPC system components (CPU, memory, InfiniBand, storage).
- Experience administering workload managers/schedulers (e.g., Slurm, LSF, PBS).
- Strong Linux system administration experience (RHEL, Rocky Linux, or CentOS).
- Experience with configuration management tools such as Ansible or Puppet.
- Ability to coordinate with researchers, application support teams, and vendors to resolve complex issues.
- Proven ability to collaborate cross-functionally and drive initiatives to completion.
- Familiarity with Kubernetes and container orchestration platforms is desirable.
Essential Skills and Competencies
- Expertise in supporting users of computational science and engineering, data analysis, and artificial intelligence applications and libraries in HPC environments.
- Strong proficiency with HPC applications and programming models (Fortran, C/C++, Python, MPI, OpenMP, CUDA, OpenACC).
- Demonstrated track record of managing complex HPC systems, including parallel file systems, job schedulers, InfiniBand/Ethernet networks, and monitoring systems.
- Familiarity with computational science, data analysis, and AI/ML applications and libraries used in HPC environments.
- Knowledge of project management principles and practices.
- Strong analytical, problem-solving, and decision-making skills.
- Proactive approach to identifying and implementing system improvements.
- Ability to manage multiple concurrent projects and deliver high-quality results within deadlines.
- Effective collaboration skills in multi-cultural, international work environments.
- Excellent verbal and written communication skills in English.
Requirements
- Requires 5-10 years of experience