Every year, we help hundreds of thousands of people find rewarding jobs in the ever-changing world of work.
We understand the importance of a job in people’s lives, and we want to help them find work that feels good. We’ll also help them continue to grow as their needs and ambitions change.
At Randstad, our value comes from our people, and that is why we put them first. We are proud of our learning culture and career architecture framework that encourages our team to develop both personally and professionally.
We believe that talent grows when presented with opportunity and this is why we encourage our people to think beyond their role. We have created a culture that enables talent to flourish, encouraging entrepreneurship, fostering team spirit, and continually building mutual trust.
Job Title: Production Support (AI Platform)
Experience Level: 3–4 Years
Location: Hyderabad
Role Overview:
We are seeking a highly skilled Production Support Engineer to ensure the reliability and performance of our AI platform and agentic frameworks. You will be responsible for L1/L2 support, utilizing deep technical expertise in SQL and monitoring tools to maintain high availability for critical AI-driven applications.
Key Responsibilities:
AI Platform & Agentic Framework Support: Provide L2/L3 production support for critical AI applications and agentic frameworks to ensure continuous system reliability.
Incident Management: Resolve 25–30 production incidents daily by utilizing SQL queries, log analysis, and advanced debugging techniques to minimize downtime.
Advanced Monitoring: Build and maintain proactive monitoring dashboards and alerts using tools like AWS CloudWatch and Splunk to detect performance bottlenecks in AI models and workflows.
Root Cause Analysis (RCA): Conduct deep-dive RCAs for recurring platform issues or failures within agentic workflows and implement permanent fixes to reduce recurrence.
Operational Excellence: Manage incident, problem, and change tickets within ServiceNow, consistently maintaining 95%+ SLA compliance.
System Reliability & Observability: Focus on improving platform availability and observability through proactive monitoring of batch jobs (Autosys) and business-critical AI processes.
Deployment & Validation: Support production deployments and perform post-release validation to ensure stable releases of AI updates with minimal business impact.
Collaborative Troubleshooting: Partner with development, infrastructure, and database teams as a Subject Matter Expert (SME) to resolve complex issues within the AI stack.
On-Call Support: Provide on-call assistance for high-severity incidents to ensure rapid recovery of AI services.
Required Skills & Qualifications:
Technical Expertise: Strong proficiency in SQL, AI agents, and GCP cloud environments.
Monitoring Tools: Extensive experience with AWS CloudWatch, Splunk, and other application monitoring tools.
Process Knowledge: Solid understanding of the ITIL Framework, SLA management, and Site Reliability Engineering (SRE) principles.
Troubleshooting: Proven ability in technical support, application support, and performing thorough RCAs.
Tools: Familiarity with ServiceNow for incident tracking.
Is this the job for you? We would love to hear from you! Please apply directly to the role and we will get in touch with you.
...