Ensure availability, scalability, and operational excellence of Oracle Cloud Infrastructure (OCI) Japan Sovereign Cloud services; translate operational and business requirements into reliability plans and execute improvements via tooling, automation, runbooks, and process changes.
Design and implement automation to reduce toil and improve MTTR; own and prioritize the SRD backlog based on shift feedback, incident reviews, alert quality reviews, and business reliability needs.
Lead complex incident investigations, perform root-cause analysis, and drive preventive actions; coordinate cross-team response and communicate findings.
Collaborate with development teams to improve operational readiness and reliability of services; mentor less experienced engineers and contribute to continuous improvement initiatives.
Participate in a 24x7 shift rotation, providing technical leadership during critical service events and ensuring timely incident response and documentation.
Improve alert quality, reduce noise, and maintain robust runbooks and reliability-related documentation; balance business requirements with technical feasibility and risk.
Technical Stack
Required Skills
Linux system administration and performance optimization
Proficiency in one or more programming languages (Java, Python, Go, C++, or similar)
Experience with cloud platforms, infrastructure automation, observability/monitoring, and incident response practices
Troubleshooting of cross-functional production issues and root-cause analysis
Networking and storage fundamentals relevant to cloud infrastructure
Preferred Skills (if applicable)
Experience leading 24x7 on-call rotations and incident management
Technical mentorship and cross-team collaboration across JP/EU Sovereign Cloud teams
Familiarity with alerting improvements, runbook automation, and documentation standards
Ability to translate business needs into reliable, scalable solutions; strong communication in bilingual environments (Japanese/English)
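Career Growth Perspective
An opportunity to play a central role in large-scale cloud operations and to lead reliability design and implementation for Japan's Sovereign Cloud.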