Infrastructure & Site Reliability Engineer, Datacentre AI Engineering, SA

Posted 21 days ago

	Contract type	Full-time
	Work arrangement	On-site
	Location	Riyadh

Job Description

About the Role

Qualcomm is expanding its presence in Riyadh and is hiring an Infrastructure & Site Reliability Engineer for its Datacentre AI Engineering team. This full-time position is based in Riyadh, Saudi Arabia, and requires 2–5 years of experience. The role focuses on supporting Qualcomm's growing infrastructure and critical AI use cases as Saudi Arabia advances its digital transformation.

Role Overview

This position involves the design, operation, and continuous improvement of large-scale AI inference systems within a datacenter environment. The engineer will ensure Qualcomm's AI infrastructure is reliable, scalable, and production-ready for advanced machine-learning workloads. The role demands strong systems and software engineering fundamentals, hands-on execution, and the ability to work independently while collaborating with cross-functional teams.

Key Responsibilities

Design, deploy, and operate large-scale AI inference systems for critical AI workloads.
Ensure the reliability, availability, and scalability of Qualcomm datacenter AI clusters.
Develop and maintain software tools and support infrastructure for AI software stacks.
Analyze software requirements and collaborate with architecture and hardware engineers.
Build, deploy, and operate components supporting LLM inference, agentic AI workflows, and AI services.
Improve model performance on AI100 deployments by working with models, systems, and software teams.
Identify and implement optimizations for workloads on multi-SoC and multi-card systems.
Apply Site Reliability Engineering (SRE) fundamentals including monitoring, alerting, incident response, and performance optimization.
Support production ML systems using MLOps tools and operational best practices.
Contribute to incident reviews, operational documentation, and continuous reliability improvements.
Build and maintain observability tools, dashboards, and alerts.
Monitor infrastructure and services using tools like Prometheus, Grafana, CloudWatch, and custom telemetry.
Create and maintain technical documentation, runbooks, and knowledge-base articles.
Develop automation to reduce manual operational tasks and improve system reliability.
Support CI/CD pipelines for AI service and agent deployment.
Apply Infrastructure-as-Code practices using tools such as Terraform and Ansible.

Required Qualifications and Skills

Bachelor's or Master's degree in Engineering, Computer Science, AI/ML, or a related field.
2–8 years of software, systems, or infrastructure engineering experience, preferably in production or datacenter environments.
Experience with AI/ML workloads such as LLMs, NLP, Vision, Audio, or Recommendation systems.
Understanding of ML inference concepts including batching, token streaming, and performance considerations.
Hands-on experience with PyTorch and familiarity with modern ML frameworks.
Familiarity with distributed inference, checkpointing, and accelerator-based compute environments.
Experience supporting AI or ML applications in production environments.
Familiarity with LLM inference pipelines and AI service operations.
Strong programming skills in Python with experience building and supporting production systems.
Experience with scripting and automation using Python and Bash.
Familiarity with configuration management and orchestration tools.
Strong Linux fundamentals including shell, containers, system services, and networking basics (DNS, TLS, HTTP/gRPC).
Experience working with cluster schedulers such as Slurm or equivalent systems.
Experience operating distributed systems with high availability and fault tolerance.
Hands-on experience with monitoring and logging tools such as Prometheus, Grafana, ELK, or Loki.
Understanding of incident management, service health metrics, and system reliability monitoring.
Solid understanding of SDLC, release processes, and operational reliability practices.
Familiarity with CI/CD pipelines and Infrastructure-as-Code tools.

Preferred Skills

Experience with GenAI, Agentic AI systems, or LLM orchestration frameworks.
Exposure to LangChain, AutoGen, or RAG-based systems.
Experience with additional ML frameworks such as TensorFlow, JAX, or Ray.
Knowledge of GPU/accelerator-based systems and high-performance networking (RDMA, InfiniBand, RoCE).
Experience with advanced MLOps workflows or large-scale AI platform operations.

Work Environment and Benefits

This is a full-time role based in Riyadh, Saudi Arabia. Qualcomm offers a competitive compensation package that includes salary, housing and transport allowances, stock (RSUs), and a performance-related bonus. Additional benefits include paid maternity and paternity leave, an employee stock purchase scheme, a child education allowance, relocation and immigration support, and life and medical insurance. A Live+Well reimbursement is also provided for health and recreational membership fees.