SRE, Chaos Engineering, Search Resilience | Amazon | InterviewCat Jobs - Curated Job Search for Tech Company Engineers

Job Description

Design, implement, execute, and automate chaos experiments to continuously test Amazon Search resilience against hardware failures, dependency outages, traffic spikes, and other disruption scenarios.
Develop and maintain a chaos experiment orchestrator to enable repeatable, scalable testing in production environments.
Develop and maintain a distributed load generator and a petabyte-scale log archival and query service to feed resilience experiments and incident analyses.
Collaborate with service owners to remediate vulnerabilities, minimize blast radius, and harden Amazon Search.
Join a 12/12 on-call rotation for incident response and mitigation, contributing to post-incident reviews and improvements.

Experience programming with at least one modern language such as Python, Ruby, Golang, Java, C++, C#, or Rust

Experience with Linux/Unix
Experience in networking, storage systems, operating systems and hands-on systems engineering
Experience with distributed operational health and performance monitoring systems