Senior Site Reliability Engineer (Australian Project)

1 2026-04-30

Responsibilities

Operate and improve monitoring, alerting and dashboards to quickly detect and respond to production incidents.
Join the on‑call roster and be ready to handle and coordinate production issues during assigned shifts.
Lead or contribute to incident response and post‑incident reviews, driving long‑term reliability improvements.
Define and maintain SLOs/SLIs and error budgets for key services; continuously improve system health visibility.
Partner with development teams to ensure production‑ready services (observability, deployment, rollback, performance).
Work with engineering and QA/Automation teams to embed observability into CI/CD pipelines and maintain an accurate service catalogue and service scorecards for key services.
Automate recurring operational tasks and support issues; build self‑healing workflows where appropriate.
Collaborate with infrastructure/platform teams to operate auto‑scaling, highly available cloud infrastructure (e.g. AWS/Azure/GCP) using Infrastructure as Code.
Apply FinOps thinking to optimise telemetry and platform costs (sampling, retention, storage strategies).

Requirements

At least 5 years of experience working as a Site Reliability, DevOps or Software Engineer with strong production operations responsibilities.
Hands‑on experience running workloads on at least one public cloud (AWS, Azure or GCP).
Solid skills in scripting/programming (e.g. Python, Bash, Go) and working with Linux.
Experience with observability tools (e.g. Prometheus/Grafana, ELK/APM, Datadog, New Relic or similar).
Experience with containers and orchestration (Docker, Kubernetes) and CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions or similar).
Understanding of high‑availability and resilience patterns (load balancing, auto‑scaling, blue‑green/canary, rollback).
Willingness to participate in on‑call duty and occasional out‑of‑hours incident or release support.
Contribute to improving system resilience through practices such as chaos engineering, resilience testing and close collaboration with security/risk teams where appropriate.
Leverage AI/ML‑assisted diagnostics and automation tools where appropriate to improve incident response and support workflows.
Good English communication skills for daily work with international stakeholders.

Nice to Have

Experience with OpenTelemetry or similar observability frameworks.
Experience with self‑healing / event‑driven automation, AIOps or low‑code automation.
Fintech / financial services background, or experience in regulated environments.
DevSecOps exposure and relevant cloud/SRE certifications.
Strong ownership, problem‑solving and collaboration in cross‑functional teams.
Clear, structured communication; calm under pressure during incidents.
Curious, continuous‑learning mindset, open to new tools and practices.

Benefits

Attractive and competitive performance-based compensation package.
Full gross salary during probation.
Generous 13th-month salary and dedication bonus.
Comprehensive healthcare insurance package and annual health check-ups.
Flexible check-in time before 10:00 AM on weekdays.
One day of remote work per week.
12 annual leave days, 5 sick leave days, 11 public holidays as required by Vietnamese Labor Law, plus one extra day off for Christmas.
Opportunity to work on global projects, collaborate with international teams, and take business trips to Australia.
Daily breakfast and Happy Thursday gatherings to connect with colleagues.
Active sports and music clubs, including badminton, running and football.
Team-building activities, annual company trips, and a year-end party.
Continuous learning opportunities through technical & soft skills training, English classes, and internal communities.
Financial assistance for important life events, including marriage, childbirth, and bereavement, ensuring support at every stage of life.
