ML Operations & Customer Support Engineer, Staff/Senior Staff level - Riyadh, KSA

in Qualcomm

20 days ago

	Contract Type	Full-time
	Workplace type	On-site
	Location	Riyadh

Job Description

About the Role

Qualcomm Middle East Information Technology Company LLC is seeking an experienced ML Operations & Customer Support Engineer to join its Customer Engineering team in Riyadh, KSA. This customer-facing role focuses on supporting strategic customers in deploying AI inference workloads on advanced Qualcomm AI inference accelerators. These accelerators draw on Qualcomm's expertise in hardware-accelerated AI to provide high-performance, energy-efficient generative AI and computer vision inference solutions for modern data centers.

The position requires a strong background in ML model deployment, systems engineering, rack-scale management software, DevOps/MLOps automation, and cross-functional collaboration. The engineer is expected to ensure system uptime, reliability, and performance while resolving customer support cases within defined SLAs and KPIs. The role is central to customer success with Qualcomm's AI technology and involves deep dives into ML inference pipelines, systems troubleshooting, and data center operations, in collaboration with customers and internal teams.

Key Responsibilities

Serve as the primary technical escalation point for customer issues related to AI inference workloads.
Manage end-to-end case resolution, ensuring adherence to Service Level Agreements (SLAs) and Key Performance Indicators (KPIs).
Lead incident response, triage, and root cause analysis (RCA) for critical issues.
Provide timely and transparent communication to customers regarding issue status and resolution progress.
Maintain high levels of customer satisfaction and service reliability.
Ensure high availability and uptime of customer AI deployments, particularly rack-scale systems.
Monitor system health, performance metrics, and workload behavior to proactively identify potential issues.
Implement and manage failover, redundancy, and resiliency mechanisms for continuous operation.
Proactively identify operational risks and implement preventative actions.
Support the deployment, optimization, and troubleshooting of ML inference pipelines.
Debug issues across model, runtime, system, and hardware layers.
Analyze model performance, including latency, throughput, and accuracy tradeoffs, in production environments.
Support various ML frameworks such as PyTorch, TensorFlow, and ONNX, and model conversion flows.
Assist in applying model optimization techniques including quantization, batching, compilation, and runtime tuning.
Support AI workloads in bare-metal and virtualized environments.
Troubleshoot issues across Linux operating systems, drivers, firmware, and the networking stack.
Support deployment and maintenance using Infrastructure as Code (IaC) and automation tools.
Work with Data Center Infrastructure Management (DCIM) tools and monitoring systems.
Coordinate with hardware vendors for accelerator, server, and networking-related issues.
Implement and manage monitoring systems, including logs, metrics, and traces.
Build dashboards to track uptime, SLA adherence, performance, and utilization metrics.
Automate repetitive operational tasks using scripts and workflows.
Establish and enforce runbooks and standard operating procedures (SOPs).
Collaborate closely with Customer Engineering, Product, Engineering, and Support teams.
Provide structured feedback to engineering teams for product improvements and defect resolution.
Support customer onboarding, deployment readiness, and operational handover processes.
Participate in customer reviews, escalations, and technical deep dives.

Qualifications and Experience

Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field.
10-15+ years of experience in ML operations, systems engineering, or customer support engineering.
Proven experience in customer-facing technical roles with SLA-driven support models.
Strong experience with AI/ML inference workloads in production environments.
Deep understanding of end-to-end ML inference pipelines.
Hands-on experience with Linux systems, system bring-up, drivers, and debugging tools.
Strong understanding of AI accelerator architecture and system bottlenecks.
Experience with model deployment, optimization, and performance tuning.
Experience with data center operations and rack-scale deployments.
Familiarity with bare-metal, virtualization, and containerization technologies such as Docker and Kubernetes.
Knowledge of networking concepts including TCP/IP, RDMA, and storage systems.
Experience with cloud and hybrid environments.
Experience with monitoring and observability tools like Prometheus, Grafana, and ELK stack.
Strong skills in incident management, RCA, and production operations.
Experience defining and tracking SLAs, KPIs, and operational metrics.
Proficiency in Python, Bash, or similar scripting languages.
Experience in automation, DevOps, and MLOps tooling.
Strong problem-solving and diagnostic skills.
Excellent communication and customer engagement skills.
Ability to operate effectively in high-pressure, mission-critical environments.
High attention to detail with a focus on quality, reliability, and accountability.
Experience with Qualcomm Cloud AI or similar AI accelerator platforms.
Experience supporting large-scale AI deployments (LLMs, CV pipelines, generative AI).
Familiarity with inference runtimes (TensorRT, ONNX Runtime, custom runtimes).
Experience with CI/CD pipelines for ML deployment.

Required Skills and Competencies

ML inference pipelines
Systems troubleshooting
Data center operations
ML model deployment
Systems engineering
Rack-scale management software
DevOps/MLOps automation
Cross-functional collaboration
Customer Support
SLA Ownership
Incident response
Triage
Root cause analysis (RCA)
Customer satisfaction
Service reliability
High availability
System health monitoring
Performance metrics
Failover, redundancy, and resiliency mechanisms
Risk identification and preventative actions
AI inference workload support
ML inference pipeline optimization
Model performance analysis
PyTorch, TensorFlow, ONNX
Model conversion flows
Model optimization techniques (quantization, batching, compilation, runtime tuning)
Bare-metal and virtualized environments
Linux OS, drivers, firmware, and networking stack
Infrastructure as Code (IaC) and automation tools
DCIM tools and monitoring systems
Logs, metrics, and traces
Dashboards for uptime, SLA adherence, performance, and utilization
Automating repetitive operational tasks
Scripts and workflows
Runbooks and Standard Operating Procedures (SOPs)
Collaboration with Customer Engineering, Product, and Support teams
Customer onboarding, deployment readiness, and operational handover
Customer reviews and technical deep dives
AI/ML inference workloads
Linux systems, system bring-up, and debugging tools
AI accelerator architecture and system bottlenecks
Model performance tuning
Rack-scale deployments
Virtualization and containerization technologies (Docker, Kubernetes)
Networking concepts (TCP/IP, RDMA, storage systems)
Cloud and hybrid environments
Monitoring/observability tools (Prometheus, Grafana, ELK)
Incident management and production operations
Operational metrics definition and tracking
Python, Bash, or similar scripting languages
DevOps and MLOps tooling
Problem-solving and diagnostic skills
Communication and customer engagement
High-pressure and mission-critical environments
Attention to detail, quality, reliability, and accountability