System Reliability Engineer (SRE), Intermediate-to-Senior

littlefish

Marketing & Communications

Sterling, VA, USA · Remote

Posted on May 11, 2026

About littlefish

littlefish is a Unified Commerce Platform for small and medium businesses across Africa, distributed through Financial Institution partners — banks, telcos, and insurers. We are live with 3 of the 5 major South African banks and onboarding 10–12 FI clients across the continent. The platform gives SMEs access to payments, invoicing, stock management, and expense management — tools previously only available to large corporations.

We are an AI-native engineering team. Every engineer operates as an Intent Architect: directing AI with precision, owning every decision about correctness, security, and quality, and building software that impacts real businesses across Africa. We are not vibe coders. The AI accelerates execution. The engineering judgment is yours.

Role Summary

When the platform is down, SMEs cannot trade. When security posture slips, FI relationships are at risk. The Senior SRE owns both — production health and platform security posture across a growing estate of live FI client environments on GKE / GCP, fronted by Cloudflare.

You bring the same Intent Architect discipline to infrastructure and reliability that Platform Engineers bring to feature development — defining system behaviour and failure models before directing AI to generate configuration or runbooks, and validating the output against real production constraints. You think architecturally before you build, direct AI with precision, and review every change against your intent, the existing system, and the security requirements of a regulated fintech.

Specialist security work (penetration testing, compliance audits, FI vendor risk packages) is managed through external providers — you own the day-to-day posture and the remediation roadmap.

Key Responsibilities

Reliability and SLOs

Define and own SLOs and SLIs across all live FI environments — establishing what reliability means per client and making it measurable.

Drive error budget discipline — use it as the mechanism that balances reliability work against feature velocity.

Own the production health of GKE workloads, MongoDB Atlas, PostgreSQL, and Redis across all FI environments.

Decompose reliability problems and articulate the correct implementation approach before a single prompt is written.

Observability

Build and maintain the observability stack: OpenTelemetry → Dash0 / GCP Cloud Operations — distributed tracing, log-based alerting, structured metrics.

Design per-tenant dashboards that surface issues before clients report them.

Tune alerts ruthlessly for signal over noise — every page should be actionable.

Make production behaviour debuggable from telemetry alone, not from luck.

Incident Response

Lead incident response — on-call triage, cross-service diagnosis, resolution, and clear communication during the event.

Run blameless post-mortems that produce actionable improvements, not assigned blame.

Own follow-up actions through to closure: the post-mortem is not done when the document is written; it is done when the system is better.

Carry primary on-call for the platform and rotate with the DevOps engineer.

Platform Security

Own the platform security posture — Cloudflare edge, GCP IAM, network policies, secrets management, access controls.

Coordinate external security specialists for penetration testing and compliance audits (PCI-DSS, POPIA, SOC 2 where applicable), then own the remediation roadmap.

Review auth patterns, network exposure, and secrets handling on new services as a first-class part of the build process.

Maintain the security audit trail required for FI vendor risk packages.

Build Integration

Work with Platform Engineers to embed reliability and security requirements during the build, not after it: failure modes, rollback strategy, observability instrumentation, and auth review are part of development.

Direct AI to generate configuration, runbooks, and automation; validate output against real production constraints, not just syntactic correctness.

Champion engineering standards: idempotency at the infrastructure layer, blast radius containment, principle of least privilege.

Automation and Toil Reduction

Eliminate toil programmatically — if it has been done twice manually, it should be automated by the third time.

Build and maintain platform tooling in Python or Go that compounds team leverage.

Maintain Terraform as the source of truth for infrastructure across all environments.

AI-Native Engineering Practice

Direct AI with precise, contextual prompts that encode infrastructure knowledge and the constraints of the regulated environment.

Review all AI-generated output critically — correctness, blast radius, security implications, and edge cases that only show up in production.

Contribute to and uphold the team's evolving AI tooling standards, prompt review practices, and output quality gates.

Collaboration & Delivery

Work alongside Platform Engineers, the DevOps engineer, the Principal Architect, and Delivery Engineers managing FI environments.

Mentor junior SREs and the broader engineering team on reliability patterns, observability, and operational discipline.

Contribute to internal documentation, runbooks, and incident post-mortems.

Ask about the FI requirement, the SME use case, and what production health actually means for a client before you act.

Skills and Qualifications

Experience

4–10 years of experience in SRE, production engineering, or platform / infrastructure engineering on a distributed system.

Proven ownership of SLOs and incident response at scale — you have been the engineer the page goes to, and you have the post-mortems to show what came of it.

Deep observability expertise — distributed tracing, structured logging, alerting design, reasoning about system behaviour from telemetry data.

Strong working knowledge of GKE / Kubernetes and GCP at production scale.

Working knowledge of cloud and application security — GCP IAM, network security, secrets management, auth pattern review.

Scripting and automation ability in Python or Go — you eliminate toil programmatically, not by working harder.

Experience with Terraform at production scale.

Multi-tenant or multi-environment platform experience strongly preferred.

Fintech, payments, or regulated industry experience strongly preferred.

Technical Skills

Cloud and orchestration: GKE, GCP (compute, networking, IAM, Cloud Operations), Docker, Kubernetes at production depth.

Edge and networking: Cloudflare, DNS, TLS, network policies, WAF rules.

Observability: OpenTelemetry, Dash0, GCP Cloud Operations, structured logging, distributed tracing, SLO tooling.

Data layer operations: MongoDB Atlas, PostgreSQL, Redis — reasoning about reliability, backup, failover, and performance.

Infrastructure as code: Terraform at scale, modular design, drift management.

CI/CD: Azure DevOps pipelines, deployment strategies (blue/green, canary, progressive rollout).

Security: GCP IAM, Secret Manager, certificate management, network policies, principle of least privilege.

Compliance frameworks (bonus): PCI-DSS, POPIA, SOC 2 — enough to drive remediation, not necessarily lead the audit.

Automation: Python or Go for tooling and platform work.

AI tooling: practical fluency with AI-assisted development workflows — prompt design, output review, and the discipline of directing AI rather than being directed by it.

Soft Skills & Leadership

Architectural thinking — decompose, identify failure modes, define blast radius, and articulate the correct approach before building.

Genuine review rigour — substantive checks against intent, existing patterns, and security implications.

Calm under pressure — incident leadership requires staying measured when others are not.

Ownership — reliability, security posture, and the quality of post-incident learning are your problem to solve.

Curiosity about the business — you ask about the FI context and the SME use case before you act.

You feel the weight of what the software does — when the platform is down, SMEs cannot trade.

Clarity — you can explain reliability and risk to engineers, leadership, and FI stakeholders alike.

Constructive challenge — you raise risks, say no with reasoning, defend decisions on substance.

Senior bias for finishing and mentoring — you leave systems better than you found them and contribute to a culture of operational excellence.

To apply, please send your CV and a short note on why littlefish to hiring@littlefishapp.com. We do not require cover letters, but we do want to know what draws you to this role specifically — curiosity about the problem is part of what we are hiring for.
