Operate and improve the reliability, scalability, and performance of the Japan Sovereign Cloud platform in Oracle Cloud Infrastructure (OCI).
Participate in a 24x7 shift rotation: manage alerts, incidents, escalation paths, runbooks, and customer-impacting reliability risks; collaborate with Japanese and global teams.
Partner with software engineering and cloud operations to automate tasks, implement practical fixes, and enhance service resiliency.
Capture recurring operational issues, improve alert actionability, and maintain up-to-date runbooks and documentation.
Contribute tooling, automation, and process improvements to reduce toil and prevent incidents.
Tech Stack
Required Skills
Linux-based production environments
Python
Reliability Engineering / SRE practices
Scripting/Programming: Python, Java, Go, Shell
Cloud concepts (OCI), networking, distributed systems, and automation
Native-level Japanese and business-level English
Preferred Skills (if applicable)
Java or Go programming
Shell scripting and automation tooling
Experience with on-call incident handling and runbook improvements