Design, implement, execute, and automate chaos experiments to continuously test Amazon Search resilience against hardware failures, dependency outages, traffic spikes, and other disruption scenarios.
Develop and maintain chaos experiment orchestrator to enable repeatable, scalable testing in production environments.
Develop and maintain distributed load generator and petabyte-scale log archival and query service to feed resilience experiments and incident analyses.
Collaborate with service owners to remediate vulnerabilities, minimize blast radius, and harden Amazon Search.
Join a 12/12 on-call rotation for incident response and mitigation, contributing to post-incident reviews and improvements.
技術スタック
必須スキル
Experience programming with at least one modern language such as Python, Ruby, Golang, Java, C++, C#, Rust
歓迎スキル(該当する場合)
Experience with Linux/Unix
Experience in networking, storage systems, operating systems and hands-on systems engineering
Experience with distributed operational health and performance monitoring systems