HPC Senior Systems Administrator
at KAUST
Posted about 13 hours ago
| Contract type | Full-time |
| Work arrangement | On-site |
| Location | Makkah |
Job Description
About the Role
KAUST is seeking a Senior HPC Systems Administrator to join the KAUST Supercomputing Laboratory (KSL). This full-time position is based in Makkah, Saudi Arabia, and requires 5-10 years of experience. The role focuses on managing and supporting a large-scale High-Performance Computing (HPC) environment for researchers.
Role Overview
The Senior HPC Systems Administrator will be responsible for the operational management of an HPC cluster comprising approximately 600 CPU and GPU nodes, associated storage systems, and high-speed networks (InfiniBand and Ethernet). This role provides essential support to researchers across computational science, engineering, big data analysis, and artificial intelligence/machine learning workloads.
Key Responsibilities
- Provide timely and effective user support via multiple channels, maintaining high customer service standards.
- Install, configure, and manage HPC subsystems including compute nodes, high-performance storage, InfiniBand, Ethernet, and configuration management tools (e.g., Ansible, Puppet).
- Deploy and manage cluster management software, monitoring tools, and supporting services for HPC clusters.
- Administer the Slurm workload manager, including QOS policies, accounts, accounting, and related automation scripts (Python and C++).
- Develop and maintain automation scripts in Bash and Python to streamline system administration tasks.
- Deploy and manage container environments (Singularity/Apptainer, Docker) for HPC workloads.
- Periodically benchmark HPC system components (CPU, memory, InfiniBand, storage) to ensure optimal performance and identify tuning opportunities.
- Enforce security best practices, including node hardening, kernel patching, and compliance across all systems.
- Manage parallel file systems (e.g., Lustre, GPFS, Weka, Vast), including performance tuning and capacity planning.
- Directly support research activities by collaborating with faculty, researchers, and partners, in conjunction with application support teams.
- Develop software tools and utilities to support research projects on cluster systems.
- Drive proof-of-concept projects and technology evaluations, and research industry best practices.
- Coordinate with vendors and service providers to resolve issues promptly.
- Develop and maintain user documentation, standard operating procedures, and training materials.
- Stay current with HPC advancements through continuous learning and professional collaboration, contributing to future hardware procurement decisions.
Required Qualifications and Experience
- Bachelor’s or Master’s degree in Computer Science/Engineering, Information Systems, or a related field.
- Minimum of five years of experience supporting large-scale computing platforms and related subsystems.
- Experience troubleshooting complex hardware issues and documenting root cause analysis.
- Experience managing parallel storage systems (e.g., Lustre, GPFS, Weka, Vast).
- Experience benchmarking HPC system components (CPU, memory, InfiniBand, storage).
- Experience administering workload managers/schedulers (e.g., Slurm, LSF, PBS).
- Strong Linux system administration experience (RHEL, Rocky Linux, or CentOS).
- Experience with configuration management tools such as Ansible or Puppet.
- Ability to coordinate with researchers, application support teams, and vendors to resolve complex issues.
- Proven ability to collaborate cross-functionally and drive initiatives to completion.
- Familiarity with Kubernetes and container orchestration platforms is desirable.
Essential Skills and Competencies
- Expertise in supporting users of computational science and engineering, data analysis, and artificial intelligence applications and libraries in HPC environments.
- Strong proficiency with HPC applications and programming models (Fortran, C/C++, Python, MPI, OpenMP, CUDA, OpenACC).
- Demonstrated track record of managing complex HPC systems, including parallel file systems, job schedulers, InfiniBand/Ethernet networks, and monitoring systems.
- Familiarity with computational science, data analysis, and AI/ML applications and libraries used in HPC environments.
- Knowledge of project management principles and practices.
- Strong analytical, problem-solving, and decision-making skills.
- Proactive approach to identifying and implementing system improvements.
- Ability to manage multiple concurrent projects and deliver high-quality results within deadlines.
- Effective collaboration skills in multi-cultural, international work environments.
- Excellent verbal and written communication skills in English.
Job Requirements
- Requires 5-10 years of experience