Site Reliability Engineer Hybrid, Nuñez CABA ID #00306
Role Overview
We are seeking two experienced Site Reliability Engineers (SREs) to support the transition and operational uplift of a key application being handed off to our team. This application underpins critical workflows for our Global Markets business, and your work will directly impact the stability, reliability, and efficiency of our client-facing technology.
You’ll be joining a high-performing production management team responsible for ensuring operational excellence across the platforms that power our Sales and Research professionals. This is a hands-on role with a clear mandate: take ownership of the application, drive improvements in supportability and observability, and ensure a seamless handover to the permanent production management team.
Key Responsibilities
- BAU Support & Ownership
  - Provide day-to-day (BAU) support for the application’s processes and workflows, ensuring stability, availability, and swift response to end-user issues.
  - Act as the primary point of contact for all production support matters related to the application.
- Application Maturity Assessment
  - Evaluate the current state of the application with respect to supportability, reliability, and observability.
  - Identify gaps and areas for improvement, documenting findings and recommendations.
- Observability Integration & Enhancement
  - Remediate observability gaps by integrating the application’s processes with our standard monitoring and alerting tools, including Dynatrace, Splunk, Grafana, and Geneos.
  - Ensure robust monitoring coverage and actionable alerting for all critical workflows.
- Operational Toil Reduction & Automation
  - Identify and remediate sources of operational toil and manual intervention.
  - Build automation solutions or integrate with existing in-house platforms to streamline support activities and improve operational efficiency.
- Documentation & Handover
  - Develop comprehensive support documentation and runbooks, ensuring all procedures, troubleshooting steps, and escalation paths are clearly captured.
  - Prepare and execute a structured handover to the permanent production management team at the end of the engagement.
Technical Environment
- Hosting: AWS and internal cloud platforms
- Orchestration: Astronomer (Airflow jobs)
- Monitoring & Observability: Dynatrace, Splunk, Grafana, Geneos
Required Skills & Experience
- SRE & Production Support: Proven experience in SRE, production management, or application support roles within large-scale, mission-critical environments.
- Cloud Platforms: Hands-on expertise with AWS and internal cloud platforms.
- Programming: Proficiency in at least one programming language such as Python or Java/Spring Boot, with the ability to script, automate, and troubleshoot application workflows.
- CI/CD Tools: Experience with continuous integration and continuous delivery tools such as Jules, Jenkins, or GitLab, and with infrastructure-as-code tools such as Terraform, to support automated build, deployment, and infrastructure management.
- Observability Tooling: Strong background in observability, including white-box and black-box monitoring, service level objective (SLO) alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
- Workflow Orchestration: Experience with workflow orchestration tools, ideally Astronomer/Airflow.
- Containers & Orchestration: Familiarity with container technologies and orchestration platforms such as ECS and Kubernetes, including deployment and operational best practices.
- Automation & Toil Reduction: Demonstrated ability to automate operational tasks and reduce manual toil, preferably using in-house or open-source solutions.
- Documentation: Excellent documentation skills and experience creating runbooks for production support teams.
- Communication: Strong communication and stakeholder management skills, with a collaborative and proactive approach.
What Success Looks Like
- The application is fully integrated with our monitoring and automation platforms, with clear, actionable alerting and minimal manual intervention required.
- All support documentation and runbooks are complete, accurate, and enable a seamless transition to the permanent team.
- The application’s operational maturity is uplifted, with measurable improvements in reliability, supportability, and efficiency.


