Six pillars of well-architected framework and resilient multi-agent systems

I recently attended a fascinating talk by Andrey Nosov, an AI Architect at Raft LLC with a PhD in Communication Science from Tampere University, titled “How to Join the Swarm: Architecture of High-Performance and Fault-Tolerant Multi-Agent Systems”. Andrey dove into the challenges of scaling AI from single agents to enterprise-grade “swarms”—networks of interconnected agents that handle complex tasks with resilience and efficiency.

As someone who’s been in software engineering for years, what struck me most wasn’t the novelty of his ideas, but how they echoed timeless principles we’ve long relied on in “old-school” software engineering. In the rush to embrace AI, though, it feels like many have forgotten the fundamentals of building reliable systems.

Let me explain my takeaway: with the explosion of AI and agentic systems, there’s a tendency to treat everything as a shiny new problem. But the issues Andrey highlighted in production AI systems were nothing new. They’re the same issues we’ve been fighting for decades.

The Familiar Failures in AI Production

Andrey shared some stats on why multi-agent systems fail in real-world deployments:

  • External API failures: 35%
  • Incorrect data formats: 25%
  • Logical errors and edge cases: 20%
  • Resource limit exceedances: 15%
  • Other bugs: 5%

If you’re a software engineer and you forget for a second that we’re talking about AI systems, this list reads like a summary of problems you’ve been solving your entire career.

API downtime? We’ve handled that with retries, circuit breakers, and fallback mechanisms.
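Those two patterns are decades old and fit in a few lines. A minimal sketch (not any particular library’s API, just the idea) of retry-with-backoff plus a consecutive-failure circuit breaker:

```python
import time


class CircuitOpen(Exception):
    """Raised when the circuit breaker refuses further calls."""


def call_with_retries(fn, max_attempts=3, base_delay=0.1):
    """Retry fn with exponential backoff; re-raise after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


class CircuitBreaker:
    """Open the circuit after `threshold` consecutive failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            raise CircuitOpen("too many consecutive failures")
        try:
            result = fn()
        except Exception:
            self.failures += 1  # one step closer to tripping the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```

In a production system you’d reach for a battle-tested library instead, but the point stands: nothing here is AI-specific, and flaky LLM or tool APIs deserve exactly this treatment.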

Data format mismatches? Schema validation and contract testing have been our go-to for years.
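The same applies to validating what an agent hands to the next one. A toy validator (the `PATIENT_SCHEMA` fields are made-up, loosely echoing the pharma case study later in this post):

```python
def validate(record, schema):
    """Return a list of problems: missing fields or wrong types."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors


# Hypothetical contract between a PDF-parsing agent and a screening agent.
PATIENT_SCHEMA = {"patient_id": str, "age": int, "consent": bool}
```

Rejecting a malformed record at the boundary is far cheaper than letting a downstream agent hallucinate around it.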

Logical errors? Unit tests, integration tests, and code reviews.

Resource limits? Just throw more hardware at it! Nah, I kid, it’s mostly about monitoring, autoscaling, and resource quotas.

We’ve built entire frameworks around these in distributed systems, microservices, and cloud architectures. Yet in the AI world, people are reinventing the wheel, often poorly: slapping together agents without these safeguards and ending up with brittle “distributed monoliths”, as Andrey aptly called them.

Frameworks like CrewAI, AutoGen, and LangGraph, while powerful, often exacerbate this by encouraging synchronous calls, unchecked state growth, and loose integrations that ignore distributed systems best practices. The result? Deadlocks, skyrocketing costs, and security holes.

Four Pillars: A Solid Foundation, But Let’s Go Further

Andrey proposed four pillars to build reliable AI swarms, drawing from proven engineering patterns:

  1. Asynchronicity and Event-Driven Model (e.g., Apache Kafka): Decouple agents via an event bus for loose coupling, scalability, and resilience. No blocking calls! 👏 Just events that allow independent operation and easy recovery.

  2. Persistent, Externalized State (e.g., PostgreSQL + Redis): Store states outside agents using cache-aside patterns. This ensures fault tolerance, with Redis for speed and Postgres for durability.

  3. Hybrid Intelligence Model (LLM + SLM with Fine-Tuning): Use heavy-hitting LLMs for planning and lightweight small language models (SLMs) for execution, slashing costs by up to 90% while boosting speed and predictability.

  4. Security Through Isolation (e.g., Docker, OPA, Sandboxing): Enforce Zero Trust with containers, policy agents like Open Policy Agent (OPA), and sandboxes like gVisor. Minimal privileges, schema-validated contracts, and auditing at every step.
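Pillar 2’s cache-aside pattern is worth spelling out, since it’s what lets an agent crash mid-task without losing anything. A sketch with in-memory stand-ins (the dicts below are placeholders for real Redis and PostgreSQL clients, not actual connections):

```python
class AgentStateStore:
    """Cache-aside: read through the cache, fall back to durable storage."""

    def __init__(self):
        self.cache = {}    # stand-in for Redis (fast, ephemeral)
        self.durable = {}  # stand-in for PostgreSQL (source of truth)

    def save(self, agent_id, state):
        self.durable[agent_id] = state  # write to the source of truth first
        self.cache[agent_id] = state    # keep the cache warm

    def load(self, agent_id):
        if agent_id in self.cache:          # cache hit: fast path
            return self.cache[agent_id]
        state = self.durable.get(agent_id)  # cache miss: go to Postgres
        if state is not None:
            self.cache[agent_id] = state    # repopulate for next time
        return state

    def evict(self, agent_id):
        self.cache.pop(agent_id, None)  # simulate a cache flush or crash
```

Because the durable store is authoritative, losing the cache (or the agent holding it) only costs a slower read, never data loss—exactly the fault-tolerance property this pillar is after.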

These are spot-on, and Andrey illustrated them beautifully in a pharma case study: a swarm of 6+1 agents screening patients for clinical trials, processing PDFs, querying EHRs, and ensuring compliance in under a minute—all with full audit trails and fault tolerance.

But as I listened, I couldn’t help thinking: this maps almost perfectly to established frameworks like AWS Well-Architected. AWS outlines six pillars for cloud architectures, not four or five. Andrey’s ideas align with four of them, reminding us that AI engineering isn’t a silo - it’s software engineering with a twist:

  • Hybrid Intelligence Model → Cost Optimization Pillar: By mixing LLMs for complex tasks and fine-tuned SLMs for routine ones, you’re optimizing resource use and reducing token costs dramatically. AWS emphasizes rightsizing instances, using managed services, and analyzing spend—principles that apply directly to AI inference costs.

  • Asynchronicity and Event-Driven Model → Performance Efficiency Pillar: Event buses like Kafka enable efficient, scalable processing. AWS talks about selecting optimal compute (e.g., serverless for bursts), reviewing architectures regularly, and using data to drive efficiency—perfect for handling AI’s variable workloads without overprovisioning.

  • Persistent, Externalized State → Reliability Pillar: External storage prevents data loss in ephemeral agents, supporting recovery and high availability. AWS’s reliability guidance includes designing for failure, testing recovery, and distributing workloads - essentials for AI swarms where agents might crash mid-task.

  • Security Through Isolation → Security Pillar: Zero Trust, isolation, and policy enforcement mirror AWS’s focus on identity management, data protection, and detective controls. In AI, this means sandboxing untrusted code execution and validating agent interactions to prevent breaches.
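The hybrid LLM/SLM mapping above is, at its core, a cost-aware router. A minimal sketch (the model names and per-token prices below are made-up placeholders, not real pricing):

```python
MODELS = {
    # name: (cost per 1K tokens in USD, role) -- illustrative numbers only
    "small-slm": (0.0002, "routine execution"),
    "big-llm": (0.01, "planning"),
}


def route(task_kind, tokens):
    """Send planning tasks to the LLM, everything else to the cheap SLM."""
    name = "big-llm" if task_kind == "planning" else "small-slm"
    cost_per_1k, _ = MODELS[name]
    return name, tokens / 1000 * cost_per_1k
```

With these toy numbers, routing a routine task to the SLM is 50x cheaper per token than sending it to the LLM—which is the whole argument behind the “up to 90%” savings claim: most swarm traffic is routine.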

The Missing Pillars: Operational Excellence and Sustainability in AI

Andrey’s framework is robust, but it overlooks two critical AWS pillars that are even more vital in AI’s fast-paced world. Let’s apply them to AI engineering.

Operational Excellence: Building and Running AI Like a Well-Oiled Machine

AWS defines Operational Excellence as organizing teams around business outcomes, implementing observability, automating safely, making small reversible changes, refining procedures, anticipating failure, learning from events, and using managed services.

In AI engineering, this pillar is a game-changer. AI development often feels chaotic: models hallucinate, agents go rogue, and deployments break unpredictably. Apply OE like this:

  • Organize Teams and Observability: Align AI teams with business KPIs (e.g., response accuracy, latency). Use tools like Prometheus or AWS CloudWatch for telemetry on agent performance, error rates, and hallucinations. This turns opaque AI into actionable insights.

  • Automate and Small Changes: CI/CD pipelines for model training and deployment (e.g., via AWS SageMaker). Push frequent, incremental updates to agents. Test in shadow mode before production to minimize risks.

  • Anticipate and Learn from Failures: Run chaos engineering on swarms (e.g., simulate API failures). Post-mortems on incidents should feed back into fine-tuning, just as Andrey’s event logs enable auditing.

  • Managed Services: Leverage managed services to reduce toil, focusing engineers on AI logic rather than infra. Unless, of course, you have your own DevOps team. 🤷‍♂️
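The “anticipate failure” point is easy to prototype. A fault-injecting wrapper can randomly fail a configurable fraction of calls, so the swarm’s retry and recovery paths get exercised before production exercises them for you. A sketch (the injected `RuntimeError` stands in for a failing external API):

```python
import random


def chaos_wrap(fn, failure_rate=0.2, rng=None):
    """Return a version of fn that fails randomly, for resilience testing."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise RuntimeError("chaos: injected API failure")
        return fn(*args, **kwargs)

    return wrapped
```

Wrap an agent’s tool calls with this in a staging run; if the swarm deadlocks or drops tasks instead of recovering, you’ve found the bug on your schedule rather than an incident’s.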

Without this, AI projects stall in “experiment mode” and never scale reliably.

Sustainability: Greening AI’s Massive Footprint

AWS’s Sustainability Pillar addresses environmental impacts, focusing on energy efficiency, reducing waste, and quantifying emissions through scopes (direct, indirect, and supply chain).

This pillar is new even to me; I think there were just five pillars when I first learned about the framework. Sustainability was only added in December 2021.

AI is notoriously energy-hungry! Training a single LLM can emit as much CO2 as five cars over their lifetimes. In multi-agent systems, inefficient inference multiplies this.

Because this pillar is new to me, I honestly hadn’t paid much attention to it in my usual workflow. Here are the principles I could find (with the help of Grok):

  • Understand and Quantify Impacts: Track your AI workload’s carbon footprint using the AWS Customer Carbon Footprint Tool. Factor in Scope 3 emissions from data centers.

  • Design for Efficiency: Optimize models (e.g., Andrey’s hybrid LLM/SLM approach reduces energy by using smaller models for most tasks). Use serverless AI services like AWS Lambda for inference to scale down to zero when idle.

  • Reduce Resource Usage: Context engineering (e.g., compressing prompts) cuts token consumption. Schedule training during low-carbon grid hours or in renewable-powered regions.

  • Shared Responsibility: AWS handles Scope 1/2 emissions with renewable energy procurement; you minimize Scope 3 by efficient design.
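The “reduce resource usage” point is the most concrete of these: even naive context engineering (deduplicating lines and collapsing whitespace before a prompt is sent) cuts tokens, and therefore energy and cost, on every single call. A rough sketch, using a crude characters-divided-by-four token estimate rather than a real tokenizer:

```python
def compress_prompt(prompt):
    """Collapse whitespace runs and drop duplicate lines, preserving order."""
    seen = set()
    lines = []
    for line in prompt.splitlines():
        line = " ".join(line.split())  # collapse runs of whitespace
        if line and line not in seen:  # skip blanks and exact repeats
            seen.add(line)
            lines.append(line)
    return "\n".join(lines)


def rough_tokens(text):
    """Very rough token estimate: ~4 characters per token for English."""
    return max(1, len(text) // 4)
```

Multiply a few percent of savings per call by millions of agent invocations and it stops being a rounding error—in the bill and in the footprint alike.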

All in all, by aligning with the sustainability pillar, we’re not only benefiting the environment but also keeping all sorts of regulators at bay and lowering our costs.

Don’t forget what we already know

The main lesson here is to not throw away decades of software engineering knowledge in the AI gold rush. Instead, repurpose it.

If you’re building AI, start with these pillars. Go through AWS’s Well-Architected Framework again and give yourself a refresher. 😁