Technical Operations Engineer II/SRE – ID #00050
The individual will collaborate closely with the Infrastructure Architecture and Operations, Security, Development, and Business unit teams throughout the Hybrid Cloud GitOps Platform journey to execute large-scale projects, operationalize technologies, establish technology roadmaps, and help set standards and processes. The role covers the design, implementation, and management of internal and perimeter networks, including end-to-end networking, load balancing, and firewall automation. It also involves evaluating and selecting technologies and establishing governance, standards, and deployment methodologies for consistent delivery of network infrastructure service offerings.
The company is a cloud-based infrastructure company that powers the blockchain ecosystem. Its mission is to be the indispensable utility that empowers companies and innovators globally to build next-generation, Web3-enabled businesses and applications using blockchain technology. The company is backed by some of the world’s best investors.
The team has over 120 people maintaining high-performance global data infrastructure for customers serving billions of requests daily. Our client is a global remote/hybrid company headquartered in Miami, Florida.
We are looking for an experienced Technical Operations Engineer II to join our team.
The Technical Operations team is responsible for ensuring the stability, reliability, and performance of our production systems. In this pivotal role as a Technical Operations Engineer II, you will be instrumental in maintaining, troubleshooting, and enhancing the performance and reliability of our production systems, with a special emphasis on complex Web3 technologies. This position demands a unique blend of Web3 expertise, technical skills, analytical prowess, and a collaborative approach to ensure operational excellence.
Hours: 6AM – 2PM EST.
Responsibilities:
● Advanced Web3 Support: Tackle complex Web3 issues, including conducting thorough post-mortems, providing technical direction in customer calls, and leading remediation efforts during incidents.
● System Monitoring and Analytical Skills: Use advanced analytical and dashboard skills, including proficiency in tools like Grafana or Datadog, to monitor system performance and health.
● Process and SLA/SLO Management: Define and enforce stringent SLA/SLO objectives and ensure the organization meets them, contributing to the overall health and reliability of our platform services.
● Hands-On Problem Solving: Innovate to solve complex challenges quickly and efficiently, while being disciplined in task follow-through to minimize technical debt and implement preventive measures.
● Collaboration and Support: Work closely with the Support L1 team and other cross-functional teams, fostering a collaborative environment to maintain system efficiency.
Requirements:
● Proven Experience: At least 5 years in Technical Operations, SRE, or a similar role, with a deep understanding of Linux/Unix systems.
● Deep Blockchain / Web3 Expertise: Ability to handle complex Web3-related issues with proficiency, including troubleshooting JSON-RPC responses, analyzing validator logs, and working directly with chain foundations to improve network performance.
● DevOps Experience: Proficiency in automation and configuration management tools (e.g., Ansible, Terraform, Consul), and in programming languages such as Python, Go, or JavaScript. Familiarity with containerization technologies like Docker and Kubernetes.
● System Optimization: Skilled in system optimization, including benchmarking using in-house tools, cost analysis and optimization, and system-level tuning by comparing various cloud providers, hardware configurations, and kernel parameters.
● Analytical and Dashboard Proficiency: Demonstrated expertise in using tools like Grafana or Datadog for detailed system analysis and monitoring, essential for proactive system management and data-driven decision-making.
● SLA/SLO and Incident Management Expertise: Proven ability in defining and adhering to SLA/SLO objectives, coupled with efficient incident management using tools like PagerDuty, ensuring operational reliability and customer satisfaction.
● Proactive and Hands-On Approach: A proactive mindset with a hands-on approach to problem-solving, capable of innovating under pressure and committed to reducing risks and technical debt.
● Communication and Collaboration: Excellent communication skills, with the ability to collaborate effectively across teams and with various stakeholders.