What is Platform Engineering?
- Mark
- Sep 1
- 11 min read

Platform Engineering is a sociotechnical approach that addresses the complexity of scaling applications within organisations. It empowers developers by providing a robust support ecosystem, which in turn accelerates innovation and developer productivity across the organisation.
DevOps is Not Dead; It Has Evolved
First and foremost, let’s address the main issue: DevOps is not dead; it has simply evolved. A common mistake we observed a few years ago was that many organisations attempted to "achieve DevOps" by merely renaming system administrators as DevOps engineers or by forming DevOps teams without fully embracing the underlying culture.
This is where they went wrong. DevOps was never meant to be a job title or a specific team; it's a mindset. At its core, it represents a cultural shift that emphasises collaboration between application development teams and operations teams.
Enter Platform Engineering

Platform Engineering is founded on the principles of DevOps. It incorporates key DevOps practices, including automation, continuous integration/continuous deployment (CI/CD), infrastructure as code, observability, and accelerated feedback loops. The goal of Platform Engineering is to operationalise DevOps by scaling, productising, and ensuring the reliability of these practices across multiple teams.
A Platform Engineering team is essential in complex organisations where developer experience, standardisation, and security are critical and increasingly complex to manage. At Savannah Informatics, for example, the core pillars of the Platform Engineering team include:
Site Reliability Engineering (SRE)
Operational Excellence
Security
Orchestration
Modernisation/Containerisation
CI/CD Pipelines
Observability/Monitoring
Connectivity/Networking
Provisioning Infrastructure
Cost Management
Developer Productivity/Experience
While these are typical, widely adopted DevOps practices, the key distinction of Platform Engineering is its ability to integrate the complexities of each of these pillars into a coherent, developer-centric experience. This integration significantly accelerates innovation across the organisation. Below, we explore the principles that make the Platform Engineering team vital to the modern software development lifecycle, noting that in practice these pillars often overlap and intersect.
Site Reliability Engineering: Building Observability into the Organisation’s DNA
Site Reliability Engineering (SRE), originally developed at Google, is a cornerstone of the platform engineering team. SRE describes the stability (resilience and reliability) and quality of service that an application offers once it has been made available to end-users. Chris Jones, one of the editors of the popular SRE Book, notes that the bulk of a software system’s cost lies in its upkeep, support and maintenance. A study conducted by the School of Computing at the University of Eastern Finland estimates that 60–75% of a software system’s total lifetime cost arises after its initial development, with roughly one-quarter of that spent specifically on migrations and other forms of “adaptive” maintenance.
As resilience and reliability are core responsibilities of product development teams, the platform engineering team also plays a crucial part in this. How? Well, by designing and creating underlying systems that are inherently reliable and resilient, we, the platform engineering team, enable product development teams to inherit these capabilities automatically. With centralised metrics, logs and tracing tools, we provide comprehensive observability that gives teams deep visibility into their application performance, user experience and system health across the entire stack. Product teams gain immediate access to golden signals (latency, traffic, errors and saturation), business metrics and infrastructure health indicators without needing to build these capabilities from scratch.
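To make the golden signals concrete, here is a minimal sketch of deriving three of them from raw request samples. The data, window and percentile choice are illustrative; in practice a centralised observability stack computes these continuously.

```python
# Hypothetical request samples: (latency_ms, http_status)
requests = [
    (120, 200), (95, 200), (310, 500), (88, 200),
    (150, 200), (420, 503), (101, 200), (99, 200),
]

def golden_signals(samples, window_seconds=60):
    """Derive three of the four golden signals from raw request samples.
    Saturation, the fourth signal, comes from resource metrics
    (CPU, memory, queue depth) rather than request logs."""
    latencies = sorted(l for l, _ in samples)
    errors = sum(1 for _, status in samples if status >= 500)
    return {
        # Latency: 95th percentile over the window (nearest-rank style)
        "latency_p95_ms": latencies[int(0.95 * (len(latencies) - 1))],
        # Traffic: requests per second
        "traffic_rps": len(samples) / window_seconds,
        # Errors: fraction of 5xx responses
        "error_rate": errors / len(samples),
    }

print(golden_signals(requests))
```

With signals like these exposed centrally, product teams inherit dashboards and alerting instead of building them per service.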
When both the platform team and product development teams are involved in incidents, we can collaboratively identify and resolve platform and application issues more quickly. Each group, however, still maintains its own focus and priorities and does not do the bidding of the other. By fostering a blameless culture where postmortems are not about pointing fingers, but about surfacing insights, we improve our processes and ensure continuous learning to prevent recurrence.
The product development teams have a stake in the success of the platform engineering team, just as the platform engineering team has a stake in the success of the product development teams.
Operational Excellence
Operational excellence in platform engineering goes beyond simply "keeping the lights on." It involves fostering a culture and implementing practices that ensure system platforms are reliable, trustworthy, adaptable and transparent.
Key aspects of operational excellence include maintaining data confidentiality, integrity and availability (the CIA triad). In today’s digital landscape, platforms not only deploy code but also manage sensitive data. Protecting this data requires strict enforcement of privacy policies covering access control, encryption, data retention and disaster recovery, among other measures.
Moreover, effective change management, user acceptance testing (UAT) and release management are essential components of the operational excellence framework. No one enjoys chaotic releases, and nothing erodes trust faster than downtime resulting from poorly tested changes. Operational excellence here entails establishing structured, repeatable processes for testing in controlled environments and safely rolling out updates. When change management is executed effectively, it empowers teams to deliver enhancements more quickly without compromising stability.
Additionally, clear internal communication keeps teams aligned, while transparent external communication during incidents helps maintain trust with users and stakeholders. When incidents arise, a well-defined incident response process, coupled with blameless postmortems, transforms failures and defects into valuable learning opportunities.
Finally, reporting and metrics are crucial for tying all these elements together. Daily health checks reinforce operational excellence in even the mundane tasks. Having a standardised checklist for monitoring dashboards, error logs and system metrics allows us to identify issues early before they impact users. Metrics related to uptime, infrastructure costs, release frequency, incident resolution and operational risk provide the transparency necessary for leadership to make informed decisions. Reporting is not merely a scoreboard; it is how the platform engineering team demonstrates its value to the broader organisation.
Security
If operational excellence is the backbone of platform engineering, then security serves as its immune system. A platform can be reliable, scalable and fast, but without security embedded in its design, all that value can be undone in an instant. In today’s world of sophisticated attacks and strict compliance expectations, we don’t just enable continuous delivery; we are custodians of trust.
One crucial security aspect to address is secret management. Hardcoding credentials or distributing API keys across different systems creates an invisible web of vulnerabilities. Centralised secret management solutions not only protect sensitive tokens, passwords and certificates but also automate the rotation of these secrets and enforce the principle of least privilege. This approach reduces the risk of human error and minimises exposure to security threats.
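To illustrate the rotation idea, here is a toy sketch in Python. The `SecretStore` class and its 30-day policy are invented for this example; a real platform delegates this to a managed secret manager such as Vault or a cloud provider's equivalent.

```python
import secrets
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)  # illustrative rotation policy

class SecretStore:
    """Toy centralised secret store with transparent rotation.
    Sketch only: production systems use a managed service with
    audit logging, access policies and encrypted storage."""

    def __init__(self):
        self._store = {}  # name -> (value, created_at)

    def put(self, name):
        # Generate a fresh high-entropy secret and record its age
        value = secrets.token_urlsafe(32)
        self._store[name] = (value, datetime.now(timezone.utc))
        return value

    def get(self, name):
        value, created = self._store[name]
        # Rotate automatically once the secret exceeds its max age,
        # so no caller ever holds a stale credential
        if datetime.now(timezone.utc) - created > MAX_AGE:
            return self.put(name)
        return value
```

The point of the sketch is that rotation lives in the store, not in each consuming application, which is what removes the human-error surface.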
Another critical area of security is access control. Tools like Teleport are redefining secure access by providing audited, role-based and ephemeral connections to infrastructure, thus enhancing security while reducing complexity. When combined with multi-factor authentication (MFA), this approach lessens the impact of compromised accounts and increases the cost of cyberattacks. Additionally, well-designed identity and access management (IAM) policies enforce the principle of least privilege across cloud and platform resources, lowering the risk of unauthorised access.
At the application level, implementing an API gateway enhances security by managing traffic, streamlining authentication and authorisation, and providing throttling and monitoring across every service interaction. Think of it as a front door that determines who gets in, how often and under what conditions. When paired with a centralised authentication tool, the platform achieves a single source of truth for identity. This not only simplifies processes for developers but also strengthens security by ensuring that every request is consistently validated against the same set of rules.
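The front-door logic can be sketched with token validation plus a token-bucket rate limiter. The `Gateway` class, status codes and limits below are illustrative, not any real product's API.

```python
import time

class Gateway:
    """Toy API gateway: validates a token, then throttles per client."""

    def __init__(self, valid_tokens, rate_per_sec=5, burst=5):
        self.valid_tokens = valid_tokens
        self.rate = rate_per_sec      # bucket refill rate
        self.burst = burst            # bucket capacity
        self.buckets = {}             # client -> (tokens, last_refill)

    def _allow(self, client):
        # Classic token bucket: refill proportionally to elapsed time,
        # then spend one token per request
        tokens, last = self.buckets.get(client, (self.burst, time.monotonic()))
        now = time.monotonic()
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[client] = (tokens, now)
            return False
        self.buckets[client] = (tokens - 1, now)
        return True

    def handle(self, client, token):
        if token not in self.valid_tokens:
            return 401  # authentication checked once, at the front door
        if not self._allow(client):
            return 429  # throttled
        return 200      # forward to the upstream service
```

Because every request passes through the same two checks, identity and rate policy stay consistent across services instead of being re-implemented per team.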
CI/CD Pipeline Standardization: Enabling Consistency Without Compromising Autonomy
Imagine an organisation where multiple product development teams create their own isolated CI/CD (Continuous Integration/Continuous Deployment) pipelines. Each team develops its unique method for building, testing, and deploying applications. This could lead to significant chaos, characterised by inconsistent deployments, duplicated efforts and entrenched knowledge silos, making it difficult for developers to work effectively.
To prevent this scenario, the Platform Engineering team follows the principle of subsidiarity. This means that while product development teams can make decisions within their own areas, they must do so within the framework of platform-wide guidelines.

Think of it like a well-planned town. Town planners designate specific areas for either residential or industrial activities and provide essential features tailored to these areas, such as bike lanes, well-structured road networks, and reliable electricity.
This is similar to how our platform team enables decentralised governance. We manage shared utilities for all product development teams (which we can think of as the tenants) within our development ecosystem (similar to the town plan). Our primary focus is on how these systems interact and the domain-specific application protocols in use, rather than the content of each service.
To ensure a smooth process, we establish guardrails and principles that allow teams to work independently while preserving overall system coherence. These guardrails are implemented across various platforms: the language platform, infrastructure platform, data platform, product platform, core platform, and CI/CD platform.
Taking our CI/CD platform as an example, we provide base templates that are used consistently across all pipelines. These templates are designed with modularity and extensibility in mind, allowing development teams to easily override defaults and customise their pipelines as needed. This template system also abstracts much of the complexity that product development teams do not need to understand to contribute to and maintain their own pipelines. This approach not only empowers product development teams to take ownership of their pipelines and troubleshoot any CI/CD issues independently but also reduces their reliance on us, the platform team.
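The override mechanism can be illustrated with a deep merge of a shared base template and a team's customisations. The pipeline keys here are invented for the example; real templates live in the CI system's own configuration format.

```python
def merge(base, override):
    """Recursively overlay team-specific settings on the base template:
    nested dicts merge key by key, everything else is replaced."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

# Illustrative base pipeline shared by every team
BASE_PIPELINE = {
    "stages": ["lint", "test", "build", "deploy"],
    "build": {"image": "builder:stable", "cache": True},
    "deploy": {"strategy": "rolling", "approval": "required"},
}

# A team overrides only what it needs; everything else is inherited
team_pipeline = merge(BASE_PIPELINE, {"build": {"image": "builder:gpu"}})
```

The team changes one build image and still inherits caching, the stage order and the deployment guardrails, which is exactly the "override defaults, keep coherence" balance described above.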
Another advantage of standardisation is that it allows any developer from one product development team to seamlessly contribute to another team without needing to relearn core practices or adjust to an entirely new environment. This strategy has enabled us to become enablers of autonomous product development teams, rather than bottlenecks.
Developer Productivity and Experience
Developer productivity is a key driver of business value. The less time engineers spend dealing with infrastructure, approvals or tooling, the more time they can dedicate to building valuable features. Platform engineering facilitates this by optimising the developer experience.
By implementing golden paths and self-service tools, platforms eliminate repetitive tasks and reduce cognitive load. Developers no longer need to reinvent processes for deploying, testing or securing services; instead, they automatically adopt best practices. Automated pipelines, service catalogues and infrastructure as code accelerate delivery while ensuring consistency and safety.
Additionally, platforms instil confidence in developers. With integrated testing, observability and standardised release processes, teams can experiment and deploy features without the fear of disrupting production. Unified tools for authentication, monitoring and communication enhance collaboration among development, operations and security teams. By prioritising the developer experience, platforms enable faster delivery, safer releases and greater innovation.
The Complexity Crisis: Why Product Teams Cannot Drive Modernization Alone
Modern software systems have become so complex that it is impractical for product development teams to simultaneously manage cloud-native orchestration, containerisation and system modernisation while also focusing on delivering business value. The technical expertise required encompasses multiple domains, including:
Container orchestration tooling such as Kubernetes, service mesh architectures and workload management
Multi-cloud configuration and management across multiple environments, regions and hybrid infrastructures
GitOps workflows for reliable, declarative deployments and infrastructure as code
Event-driven architectures with autoscaling, observability and traffic management capabilities
Service mesh implementation for security, traffic policies and inter-service communication
API gateway management for modern ingress patterns and protocol handling
This challenge extends beyond technical issues, significantly impacting business agility, time-to-market, and the rate of innovation. By having a dedicated platform engineering team to manage cloud-native foundations, product development teams can focus on delivering user-facing features without being bogged down by the complexities of orchestration. This clear division of responsibilities also ensures that modernisation efforts are carried out proactively and intentionally, aligning with long-term architectural goals. As a result, it minimises the risk of fragmented or abandoned cloud-native initiatives.
By delegating platform-related orchestration tasks to the platform team, we have unlocked an unprecedented capacity for innovation within our product development teams. This strategic separation has allowed us to adopt cutting-edge cloud-native technologies, including:
ArgoCD for GitOps-driven continuous deployment, featuring automated rollbacks and drift detection.
Knative for orchestrating serverless workloads and enabling event-driven scaling.
Istio Ambient for zero-trust service mesh security without the overhead of sidecars.
Kubernetes Gateway API for modern ingress management and efficient traffic routing.
KEDA for event-driven autoscaling based on external metrics and message queues.
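Of these, KEDA's core scaling decision is the easiest to make concrete: the desired replica count is the pending event backlog divided by a per-replica target, rounded up and clamped to configured bounds. The sketch below mirrors that formula; the numbers are illustrative.

```python
import math

def desired_replicas(queue_length, target_per_replica, min_r=0, max_r=20):
    """KEDA-style scaling decision: one replica per `target_per_replica`
    pending messages, clamped to the configured min/max bounds."""
    wanted = math.ceil(queue_length / target_per_replica)
    return max(min_r, min(max_r, wanted))
```

For example, a backlog of 12 messages with a target of 5 per replica yields 3 replicas, and an empty queue scales the workload to zero, which is what makes event-driven autoscaling cost-efficient.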
These technologies would have been overly complex for individual product teams to evaluate, implement, and maintain. Instead, we conduct thorough evaluations, implement best practices and provide these capabilities as seamless abstractions.
A compelling example of this approach was evident during our organisation-wide transition to cloud-native orchestration patterns. Rather than expecting product teams to navigate this transformation independently, the platform team led a systematic modernisation initiative. We first integrated cloud-native orchestration into our own workflows, documenting every step from containerisation strategies and Kubernetes resource patterns to service mesh configuration and autoscaling policies. After extensively testing ArgoCD for GitOps, Knative for serverless workloads and Istio Ambient for service mesh across staging and production-like environments, the team created standardised templates, comprehensive onboarding guides, and self-service tools to facilitate seamless adoption.
By thoroughly validating these cloud-native platforms and establishing proven patterns before the organisational rollout, we ensured that the solutions were production-ready, performant and capable of functioning at an enterprise scale. This approach significantly reduced the risk of abandoning promising initiatives due to implementation challenges or operational complexity.
As a result, product development teams could immediately leverage enterprise-grade container orchestration, event-driven scaling, and service mesh capabilities without needing to understand the underlying complexities of Kubernetes, GitOps workflows, or traffic management policies. This transformation revamped our organisation’s delivery model, introducing cloud-native consistency, automated resilience and facilitating faster, safer releases through modern orchestration patterns across all product teams.
FinOps & Cost Control: The Business Imperative
Provisioning infrastructure through product development teams can often result in misallocated cloud resources and unnecessary expenses. This issue is not due to negligence; rather, it stems from teams prioritising feature delivery over cost management. That's where we come in. We implement cost-aware workload placement and continuously monitor resource usage across all environments to ensure optimal utilisation. By providing a centralised view of resource utilisation, we enable informed decisions that allow the organisation to achieve more with the same or even fewer resources. Ultimately, this approach reduces unnecessary costs and ensures that cloud spending aligns with business objectives.
In early 2024, shortly after migrating all our Kubernetes workloads from Google Kubernetes Engine (GKE) to Amazon Elastic Kubernetes Service (EKS), we had around 40 applications across all our environments. Within six months of this migration, the number of workloads running on Kubernetes quadrupled, resulting in over 160 applications in our EKS clusters. These include several key components of our products that are now running in production.
At that moment, we realised that our infrastructure provisioning was becoming a critical bottleneck. The rapid growth in workloads meant that our traditional Cluster Autoscaler was struggling to efficiently provision nodes. This led to over-provisioned resources, delays in pod scheduling, and escalating costs that threatened our ability to scale sustainably. We frequently faced situations where nodes were under-utilised due to rigid node group configurations, while applications had to wait extended periods for the necessary compute resources to become available.
Due to these challenges, we needed to implement an intelligent node autoscaling solution that could effectively handle our diverse workload requirements while optimising costs and maintaining performance. Our goal was to replace the rigid, predefined node group approach with a dynamic provisioning system. This system would select optimal instance types, availability zones, and capacity types based on actual workload characteristics, rather than relying on static configurations.
To address this issue, we implemented Karpenter as our advanced node autoscaling solution. Karpenter is a cloud-native tool that automatically provisions the right compute resources to support our cluster's applications. Unlike the standard Cluster Autoscaler, Karpenter enables us to create nodes based on the actual requirements of our pods, rather than relying on predefined node groups. We configured Karpenter to intelligently select the optimal instance types, availability zones, and capacity types based on the characteristics of our workloads, including CPU and memory requirements, networking needs, and cost constraints.
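The essence of that approach — pick the cheapest node that actually fits the aggregate pending pod requests — can be sketched as follows. The instance catalogue and prices are illustrative, not live AWS rates, and real Karpenter also weighs zones, capacity types and consolidation.

```python
# Hypothetical instance catalogue: (name, vcpu, memory_gib, hourly_usd)
CATALOGUE = [
    ("m5.large",   2,  8, 0.096),
    ("m5.xlarge",  4, 16, 0.192),
    ("c5.2xlarge", 8, 16, 0.340),
    ("r5.xlarge",  4, 32, 0.252),
]

def pick_node(pending_cpu, pending_mem_gib, catalogue=CATALOGUE):
    """Choose the cheapest instance that fits the pending pod requests,
    the bin-packing idea behind demand-driven node provisioning.
    Returns None when nothing in the catalogue fits."""
    fits = [i for i in catalogue
            if i[1] >= pending_cpu and i[2] >= pending_mem_gib]
    return min(fits, key=lambda i: i[3], default=None)

print(pick_node(3, 12))  # cheapest node that fits 3 vCPU, 12 GiB
```

Contrast this with a static node group, which would hand every pending pod the same predefined instance shape regardless of what the pods actually request.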
We configured Karpenter node classes to create nodes with detailed tags, which helps us easily categorise billing based on the specific environments and workloads running on those nodes. This granular tagging strategy enables accurate cost allocation across development, staging and production environments. Additionally, it provides insight into the costs associated with each product team, application type and resource usage patterns.
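With per-node tags in place, chargeback becomes a simple roll-up by any tag dimension. Here is a toy sketch with invented cost records; in practice the records come from the cloud provider's billing export.

```python
from collections import defaultdict

# Hypothetical per-node cost records enriched with node tags
node_costs = [
    {"env": "production", "team": "claims",   "usd": 412.50},
    {"env": "production", "team": "payments", "usd": 388.10},
    {"env": "staging",    "team": "claims",   "usd": 96.40},
]

def cost_by(records, tag):
    """Roll up node spend by a tag dimension for chargeback reports."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[tag]] += rec["usd"]
    return dict(totals)

print(cost_by(node_costs, "env"))
```

The same function answers per-environment and per-team questions, which is what makes granular tagging the foundation of cost allocation.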
Adopting Karpenter significantly improved resource utilisation by removing the waste tied to predefined node groups: nodes now precisely match workload needs, and spot instances are used where appropriate.
Platform Engineering is an Evolution, Not a Revolution

Our experience with implementing modern platform engineering in our organisation has taught us that it is not a one-size-fits-all solution. However, when done effectively, it becomes a powerful enabler by:
Reducing cognitive load
Improving reliability and resilience
Accelerating product development and delivery
Minimising the risk of security, regulatory and functional issues
Supporting cost-effective and productive utilisation of infrastructure
In summary, the Platform Engineering Team serves as the Kwisatz Haderach (Shortening of the Way) for product development teams, enabling them to operate quickly and confidently without compromising security, compliance or system reliability.
Acknowledgments
This wouldn’t have been possible without the support of multiple members of the Platform Pillar. We’d like to thank Mark Kibara, Grace Musenyia, Brian Mumo, Kennedy Kori, Allan Sifuna, Arnold Kisuri and Ronny Mokaya.
Cover photo attribution: Image created with an AI-generated design from Freepik.
