Senior Site Reliability Engineer (Australian Project)
1
2026-04-30
Responsibilities
- Operate and improve monitoring, alerting and dashboards to quickly detect and respond to production incidents.
- Join the on‑call roster and be ready to handle and coordinate production issues during assigned shifts.
- Lead or contribute to incident response and post‑incident reviews, driving long‑term reliability improvements.
- Define and maintain SLOs/SLIs and error budgets for key services; continuously improve system health visibility.
- Partner with development teams to ensure production‑ready services (observability, deployment, rollback, performance).
- Work with engineering and QA/Automation teams to embed observability into CI/CD pipelines and maintain an accurate service catalogue and service scorecards for key services.
- Automate recurring operational tasks and support issues; build self‑healing workflows where appropriate.
- Collaborate with infrastructure/platform teams to operate auto‑scaling, highly available cloud infrastructure (e.g. AWS/Azure/GCP) using Infrastructure as Code.
- Apply FinOps thinking to optimise telemetry and platform costs (sampling, retention, storage strategies).
Requirements
Must have:
- At least 5 years of experience working as a Site Reliability, DevOps or Software Engineer with strong production operations responsibilities.
- Hands‑on experience running workloads on at least one public cloud (AWS, Azure or GCP).
- Solid skills in scripting/programming (e.g. Python, Bash, Go) and working with Linux.
- Experience with observability tools (e.g. Prometheus/Grafana, ELK/APM, Datadog, New Relic or similar).
- Experience with containers and orchestration (Docker, Kubernetes) and CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions or similar).
- Understanding of high‑availability and resilience patterns (load balancing, auto‑scaling, blue‑green/canary, rollback).
- Willingness to participate in on‑call and occasional out‑of‑hours incident / release support.
- Ability to contribute to system resilience through practices such as chaos engineering and resilience testing, working closely with security/risk teams where appropriate.
- Ability to leverage AI/ML‑assisted diagnostics and automation tools where appropriate to improve incident response and support workflows.
- Good English communication skills for daily work with international stakeholders.
Nice to have:
- Experience with OpenTelemetry or similar observability frameworks.
- Experience with self‑healing / event‑driven automation, AIOps or low‑code automation.
- Fintech / financial services background, or experience in regulated environments.
- DevSecOps exposure and relevant cloud/SRE certifications.
- Strong ownership, problem‑solving and collaboration in cross‑functional teams.
- Clear, structured communication; calm under pressure during incidents.
- Curious, continuous‑learning mindset, open to new tools and practices.
What we offer
- Attractive and competitive performance-based compensation package.
- Full gross salary during probation.
- Generous 13th-month salary and dedication bonus.
- Comprehensive healthcare insurance package and annual health check-ups.
- Flexible check-in time before 10:00 AM on weekdays.
- One day of remote work per week.
- 12 annual leave days, 5 sick leave days, 11 public holidays as required by Vietnamese Labor Law, plus one extra day off for Christmas.
- Opportunity to work on global projects, collaborate with international teams, and take business trips to Australia.
- Daily breakfast and Happy Thursday gatherings to connect with colleagues.
- Active sports and hobby clubs, such as badminton, running, football and music.
- Team-building activities, annual company trips, and a year-end party.
- Continuous learning opportunities through technical & soft skills training, English classes, and internal communities.
- Financial assistance for important life events, including marriage, childbirth, and bereavement, ensuring support at every stage of life.