We are looking for a proactive and experienced Monitoring Engineer to lead the design, implementation, and
optimization of monitoring solutions for both on-premises and cloud-based infrastructure and application services.
The ideal candidate has hands-on experience with tools such as SolarWinds, Prometheus, Grafana, and other
open-source monitoring platforms. This role also involves building custom dashboards and alerts for
operational visibility, enhancing observability, and collaborating with application and infrastructure teams to
improve service KPIs, availability, and incident response.
Key Responsibilities:
• Design, deploy, and maintain monitoring solutions for infrastructure (servers, network, storage) and
application services across on-prem and cloud environments.
• Configure and optimize monitoring platforms including SolarWinds, Prometheus, Grafana, and other
open-source tools.
• Develop custom dashboards and alerting mechanisms that provide real-time visibility into system health
and performance.
• Collaborate with NOC teams to ensure effective use of monitoring dashboards for incident detection,
troubleshooting, and root cause analysis.
• Define and track key performance indicators (KPIs) and service-level objectives (SLOs) for infrastructure
and applications.
• Integrate monitoring tools with ITSM platforms (e.g., ServiceNow) to support incident, change, and
availability management processes.
• Work closely with application owners, infrastructure, and DevOps teams to ensure comprehensive
observability coverage and improved service reliability.
• Participate in incident post-mortems, identifying monitoring gaps and implementing improvements.
• Stay current with industry trends in observability, AIOps, and monitoring automation.
Required Qualifications:
• 5+ years of experience working with monitoring and observability platforms.
• Strong hands-on experience with SolarWinds, Grafana, Prometheus, and open-source tools such as
Zabbix, Nagios, or InfluxDB/Telegraf.
• Experience building custom dashboards, complex alert conditions, and data visualizations.
• Good understanding of infrastructure components (network, compute, storage), application services, and
how to monitor them effectively.
• Solid grasp of ITSM practices, including incident, change, and availability management.
• Ability to interpret logs and metrics to identify performance bottlenecks or failures.
• Familiarity with APIs and scripting (e.g., Python, Bash) for monitoring automation or tool integration.
Preferred Qualifications:
• Experience with AIOps or machine learning-based anomaly detection tools.
• Knowledge of Kubernetes, Docker, and observability in containerized environments.
• Experience integrating with ticketing systems like ServiceNow or Jira.
• Exposure to SRE principles, SLI/SLO/SLA tracking, and observability strategy design.
• ITIL Foundation Certification or relevant training.