Lead the operation and evolution of production-grade Kubernetes clusters across cloud environments, making architectural decisions on upgrades, scaling, disaster recovery, and reliability improvements.
Define and drive GitOps strategy and standards across the organization, owning ArgoCD-based workflows (ApplicationSets, sync policies, deployment standards) and mentoring teams on GitOps best practices.
Architect and establish Terraform-based infrastructure-as-code standards, building reusable modules, managing state, and implementing drift detection for safe provisioning.
Lead platform observability strategy and incident response processes, standardizing monitoring and post-incident reviews to improve availability, performance, and resilience.
Partner with and mentor application teams to onboard services onto the platform, creating documentation, runbooks, and self-service tooling that scale onboarding and boost developer productivity.
Design and implement security control standards (RBAC, network policies, secrets management via Vault/Sealed Secrets/External Secrets Operator) that meet compliance requirements at scale.
Drive integration of platform capabilities with CI pipelines (GitHub Actions, GitLab CI, Tekton) to establish end-to-end delivery workflows and standards.
Tech Stack
Required Skills
Operation of production Kubernetes clusters across cloud environments (e.g., EKS, GKE, AKS).
GitOps-based CD workflows with ArgoCD, Flux, or similar; establishing deployment standards across environments.
Infrastructure as Code (Terraform or equivalent), including reusable modules, state management, and drift detection.
Automation scripting (Python, Bash, or Go) and coaching others on best practices.
Networking fundamentals (DNS, load balancing, ingress) and platform patterns (service mesh) for reliable architectures.
Strong written/verbal communication, mentoring, documentation, and runbook creation.
Preferred Skills (if applicable)
Practical secret management experience with Vault, Sealed Secrets, or External Secrets Operator.
Experience extending GitOps with advanced ArgoCD features or similar tooling; multi-cloud/multi-cluster exposure.
Experience with observability/monitoring tooling and SRE-style incident response practices across large platforms.