## Problem
Development velocity is throttled by rising operational overhead. Teams burn cycles maintaining systems instead of shipping features: monitoring gaps stretch incident timelines, cross-cutting concerns (logging, metrics, auth, caching) fragment across codebases, and docs drift within days of changes, creating tribal knowledge and onboarding drag. Meanwhile, AI tooling promises leverage, but most architectures aren’t designed for agent/LLM collaboration. Without structural fixes, technical debt compounds faster than we can service it, operational excellence devolves into firefighting, and we miss the AI-native development shift that could change how we build and run software.
---
## Principles
**1. Code Maintainability + Best Practices:** “**Slice boundaries are change boundaries**” — organize by feature/capability (vertical slices), not technical layers, to minimize ripple effects and cognitive load.
**2. AI Integration + Tooling:** “**Automate with agency, not dependency**” — use AI agents to accelerate work while keeping tools portable and avoiding vendor lock-in.
**3. Monitoring + Alarming:** “**Observable by contract**” — every slice implements standard telemetry & alerts; observability is non-negotiable.
**4. Documentation:** “**Living docs, minimal drift**” — docs co-evolve with code via automation/agents so spec and reality stay aligned.
---
# 1. Code Maintainability + Best Practices
## Vertical Slices Architecture
**Purpose:** Self-contained feature units owning their API boundary, domain logic, data access, tests, and docs.
- **Change isolation:** Most edits touch one slice.
- **Cognitive load:** One spec per slice; agents ingest focused context.
- **Deployability:** Ship behind flags and roll back independently.
- **Quality:** Tests and observability travel with the slice.
Reference: [Vertical Slice Architecture](https://www.jimmybogard.com/vertical-slice-architecture/) (Jimmy Bogard).
**Package Example**
```
/<slice-name>/
  build.*          # build file(s) for this slice
  /application/    # endpoints/handlers, DTOs, mappers
  /domain/         # entities, services, policies
  /data/           # repositories, gateways (SQL/NoSQL/cache)
  /tests/          # unit, integration, e2e (slice-scoped)
  /docs/
    spec.md        # LLD: contracts, sequences, schemas
    runbook.md     # alarms, on-call, SLOs, rollback
    adr/*.md       # Architecture Decision Records
```
**Note:** Keep infra and UI in their own packages to avoid tangled layer merges, but follow the same slice pattern.
**Infra Example (IaC-agnostic)**
```
/<slice-name>-infra/
  main.*           # Terraform/Pulumi/CDK module(s)
  /docs/
    spec.md        # infra design, topology, policies
    runbook.md     # incident playbooks
    adr/*.md
```
**UI Example (framework-agnostic)**
```
/<slice-name>-ui/
  Component.*      # React/Vue/Svelte/etc.
  /docs/
    spec.md        # flows, states, accessibility notes
    runbook.md
    adr/*.md
```
## Supporting Packages
### Alignment Package (Platform Alignment Library)
**Purpose:** The “north star” for consistency. Ships steering docs, hook contracts, templates, and quality rules consumed by slices.
```
/platform/
  /alignment/
    /steering/     # standards: naming, boundaries, SLOs, security
    /hooks/        # editor/CI agent hook definitions
    /templates/    # spec, runbook, ADR templates
    /quality/      # lint/QA rulepacks, policy-as-code
```
### Aspects Package (Cross-Cutting Behaviors)
**Purpose:** Small, explicit library of approved behaviors (observability, resilience, caching, auth). Enforce 1:1 annotation/config ↔ interceptor/middleware mapping so coupling stays visible and reviewable.
```
/platform/
  /aspects/
    /annotations-or-config/
    /interceptors-or-middleware/
```
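A minimal Python sketch of the 1:1 mapping, under stated assumptions: `register`, `aspect`, `INTERCEPTORS`, and `CALL_LOG` are hypothetical names, not part of any library named in this document. Each annotation name resolves to exactly one interceptor, and an unapproved annotation fails when the function is decorated, not at call time.

```python
from functools import wraps
from typing import Any, Callable, Dict, List

# Hypothetical registry: one annotation name maps to exactly one interceptor.
Interceptor = Callable[[Callable[..., Any]], Callable[..., Any]]
INTERCEPTORS: Dict[str, Interceptor] = {}
CALL_LOG: List[str] = []  # stand-in for a real telemetry sink


def register(name: str) -> Callable[[Interceptor], Interceptor]:
    """Register an interceptor; duplicate names are rejected to keep the mapping 1:1."""
    def decorator(interceptor: Interceptor) -> Interceptor:
        if name in INTERCEPTORS:
            raise ValueError(f"aspect '{name}' already has an interceptor")
        INTERCEPTORS[name] = interceptor
        return interceptor
    return decorator


def aspect(name: str) -> Interceptor:
    """Annotation used by slices; fails fast if no interceptor is approved for it."""
    if name not in INTERCEPTORS:
        raise KeyError(f"no approved interceptor for aspect '{name}'")
    return INTERCEPTORS[name]


@register("observed")
def observed(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        CALL_LOG.append(f"enter {fn.__name__}")
        try:
            return fn(*args, **kwargs)
        finally:
            CALL_LOG.append(f"exit {fn.__name__}")
    return wrapper


@aspect("observed")
def place_order(order_id: str) -> str:
    return f"accepted:{order_id}"
```

Because the registry is the only way to obtain an interceptor, a reviewer can list every approved behavior by reading one module.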
### Golden Path Libraries (App & Infra)
**Purpose:** Shared, well-tested abstractions that speed delivery (HTTP clients, data mappers, idempotency, retry utilities; infra modules for queues, topics, jobs, storage). Changes reviewed by an Alignment/Architecture group.
## Framework & Patterns (portable)
### Dependency Injection
**What:** Declarative wiring and scoping for components.
**Options:** Dagger/Hilt, Guice, Spring/.NET DI, Micronaut, NestJS, FastAPI/Depends. Choose per language, and prefer compile-time wiring where the framework supports it to keep startup fast.
**Guidelines:** Constructor injection; explicit scopes; configuration via typed config. Keep DI boundaries thin for FaaS/containers.
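The guidelines can be illustrated without any DI framework. The sketch below assumes hypothetical names (`PaymentConfig`, `PaymentGateway`, `CheckoutService`, `build_checkout`) and a placeholder endpoint; it shows constructor injection, typed config, and a single composition root.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class PaymentConfig:
    """Typed configuration; field names are illustrative."""
    endpoint: str
    timeout_seconds: float


class PaymentGateway(Protocol):
    def charge(self, amount_cents: int) -> str: ...


class HttpPaymentGateway:
    def __init__(self, config: PaymentConfig) -> None:
        self._config = config

    def charge(self, amount_cents: int) -> str:
        # A real implementation would call self._config.endpoint here.
        return f"charged {amount_cents} via {self._config.endpoint}"


class CheckoutService:
    # Constructor injection: the dependency is explicit and easy to fake in tests.
    def __init__(self, gateway: PaymentGateway) -> None:
        self._gateway = gateway

    def checkout(self, amount_cents: int) -> str:
        return self._gateway.charge(amount_cents)


def build_checkout(config: PaymentConfig) -> CheckoutService:
    """Composition root: the only place that knows concrete types."""
    return CheckoutService(HttpPaymentGateway(config))
```

Keeping the wiring in one small function is one way to keep the DI boundary thin for FaaS handlers, where startup cost matters.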
### Data Access
**What:** Repository/gateway interfaces with generated or minimal implementations.
**Options:** SQL (Postgres/MySQL/SQL Server) via ORM or lightweight query builders; NoSQL (DynamoDB/Cosmos/Firestore/Mongo) via thin adapters; event stores; caches (Redis/Memcached). Keep call sites stable behind interfaces so storage changes don’t leak.
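A sketch of the "stable call sites behind interfaces" idea, assuming a hypothetical `Order` entity and `OrderRepository` interface. The in-memory class stands in for a real SQL or NoSQL adapter.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int


class OrderRepository(Protocol):
    def get(self, order_id: str) -> Optional[Order]: ...
    def save(self, order: Order) -> None: ...


class InMemoryOrderRepository:
    """Test double; a SQL or NoSQL adapter would expose the same two methods."""
    def __init__(self) -> None:
        self._rows: Dict[str, Order] = {}

    def get(self, order_id: str) -> Optional[Order]:
        return self._rows.get(order_id)

    def save(self, order: Order) -> None:
        self._rows[order.order_id] = order


def order_total(repo: OrderRepository, order_id: str) -> int:
    """Call site depends only on the interface, never on the storage engine."""
    order = repo.get(order_id)
    if order is None:
        raise LookupError(f"order {order_id} not found")
    return order.total_cents
```

Swapping Postgres for DynamoDB would then mean writing a new adapter with the same two methods; `order_total` would not change.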
### Resilience
**What:** Retries, exponential backoff, timeouts, circuit breakers, bulkheads. Implement as middleware/aspects, not business logic.
**Options:** Resilience4j, Envoy/sidecar policies, service-mesh policies, language runtime libraries.
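To show what "middleware, not business logic" can look like, here is a hand-rolled retry decorator with exponential backoff. It is a sketch, not a substitute for the libraries listed above; the `retry` name, its parameters, and the injectable `sleep` are assumptions made for illustration and testability.

```python
import time
from functools import wraps
from typing import Callable, Tuple, Type


def retry(
    attempts: int = 3,
    base_delay: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
):
    """Retry with exponential backoff: base_delay, 2x, 4x, ... between attempts."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == attempts - 1:
                        raise  # retries exhausted: surface the last error
                    sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator


@retry(attempts=3, base_delay=0.1)
def fetch_inventory(sku: str) -> int:
    # Business logic stays free of retry loops; this body is a placeholder.
    return 42
```

Only the exception types in `retry_on` are retried, so programming errors such as `ValueError` still fail immediately. Production use would usually add jitter and an overall timeout.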
### Telemetry (Logging, Metrics, Tracing)
**What:** Structured logs, metrics, and traces with shared schemas/names.
**Options:** OpenTelemetry + Prometheus/Grafana, ELK/OpenSearch, Datadog/New Relic/Honeycomb. Mandate a common metric namespace and log fields so dashboards compose across slices.
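One way to enforce shared log fields and a common metric namespace, using only the Python standard library. The field list in `REQUIRED_FIELDS`, the `JsonFormatter` class, and the `metric_name` helper are assumptions; the real field set would come from the Alignment package.

```python
import json
import logging
from typing import Any, Dict

# Assumed shared field set; the real list would live in the Alignment package.
REQUIRED_FIELDS = ("service", "slice", "env", "trace_id")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record with the shared fields always present."""
    def format(self, record: logging.LogRecord) -> str:
        payload: Dict[str, Any] = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in REQUIRED_FIELDS:
            # Missing fields are emitted as null so gaps are visible in queries.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload, sort_keys=True)


def metric_name(namespace: str, slice_name: str, signal: str) -> str:
    """Build a metric name under a common namespace, e.g. 'acme.checkout.latency_ms'."""
    return ".".join((namespace, slice_name, signal))
```

Callers would attach the fields through the standard `extra=` argument, for example `logger.info("order placed", extra={"slice": "checkout", "trace_id": "abc123"})`.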
---
# 2. AI Integration + Tooling
## Agentic Development
**What:** Agents that understand alignment standards and slice context to scaffold, implement, test, and document.
**How:**
- Agents read `/docs/spec.md`, ADRs, tests, and alignment rules.
- Tooling stays portable (no lock-in): run locally/CI; swap providers (OpenAI/Anthropic/local).
- Lifecycle coverage: design → code → test → deploy → maintain.
**Capabilities:**
- **Scaffold:** Generate slice layout, contracts, tests, CI, baseline telemetry.
- **Implement:** Follow patterns for endpoints, services, data.
- **Test:** Happy paths, edge cases, property tests, contract tests.
- **Docs:** Keep spec/runbook/ADRs updated from diffs.
- **Refactor:** Propose safe changes with tests and measurable risk.
## LLM Tooling Stack (example options)
- **Agentic IDE:** Cursor, VS Code agents, Cline, Windsurf.
- **Policy & QA:** Non-blocking “AutoQA” bot that flags logic smells, perf issues, security gaps, and doc/code drift.
- **Hooks:** Pre-commit/CI/CD hooks to auto-sync docs, verify contracts, and surface missing telemetry.
## Agent Hooks (event-driven)
- **On create slice:** scaffold structure, seed `spec.md`, tests, interceptors/middleware.
- **On save/PR:** sync public API diffs & schema hashes into docs; lint; propose missing metrics/alerts.
- **On review:** verify “evidence” exists (logs, metrics, tests) for scenarios in `spec.md`.
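The "sync schema hashes into docs" step of the save/PR hook could work roughly as follows. This sketch assumes the hash is stored in `spec.md` as an HTML comment marker; that marker format and the function names are hypothetical.

```python
import hashlib
import json
import re
from typing import Any, Dict, Optional

# Assumed marker format inside spec.md; not defined by this document.
HASH_MARKER = re.compile(r"<!-- schema-hash: ([0-9a-f]{64}) -->")


def schema_hash(schema: Dict[str, Any]) -> str:
    """Hash a schema in canonical JSON form so key order does not matter."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def recorded_hash(spec_text: str) -> Optional[str]:
    """Return the hash recorded in the spec, or None if no marker is present."""
    match = HASH_MARKER.search(spec_text)
    return match.group(1) if match else None


def sync_spec(spec_text: str, schema: Dict[str, Any]) -> str:
    """Return spec text whose marker matches the current schema."""
    marker = f"<!-- schema-hash: {schema_hash(schema)} -->"
    if HASH_MARKER.search(spec_text):
        return HASH_MARKER.sub(marker, spec_text)
    return spec_text.rstrip("\n") + "\n\n" + marker + "\n"


def has_drift(spec_text: str, schema: Dict[str, Any]) -> bool:
    """True when the spec is missing a marker or records a stale hash."""
    return recorded_hash(spec_text) != schema_hash(schema)
```

A review hook could call `has_drift` and comment on the PR, while a save hook could call `sync_spec` and write the result back.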
## Context Management
**Goal:** Make each slice “LLM-ready.” Ship compact, canonical context: spec, ADRs, contracts, tests. Keep prompts within bounded, relevant windows.
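A sketch of bounded context assembly, assuming documents arrive in priority order and that roughly four characters equal one token. That ratio is a rough heuristic, not a tokenizer, and both function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude size estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def build_context(
    documents: Iterable[Tuple[str, str]], budget_tokens: int
) -> List[str]:
    """Select documents in priority order until the token budget is spent.

    `documents` holds (name, text) pairs, highest priority first.
    Documents that do not fit are skipped rather than truncated.
    """
    selected: List[str] = []
    remaining = budget_tokens
    for name, text in documents:
        cost = estimate_tokens(text)
        if cost <= remaining:
            selected.append(name)
            remaining -= cost
    return selected
```

Skipping oversized documents instead of truncating them keeps every included file whole, which tends to matter for specs and contracts.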
## Quality Assurance
**AutoQA (non-blocking by default):**
Flags logic errors, anti-patterns, performance smells, security misses, missing telemetry, and doc divergence. Rules live in the repo under `/platform/alignment/quality`. Findings are advisory unless the rule is marked as blocking; blocking findings require a fix or an ADR exception.
---
# 3. Monitoring + Alarming
## Observable-by-Contract
**Thesis:** Every infra module/component implements a shared **MonitoringContract** guaranteeing: (1) standard metrics, (2) required alarms with sane defaults, (3) ticket creation/update with severity & runbook links, and (4) dashboard widgets for composable service views.
### Goals
- **Uniformity:** Golden signals (latency, errors, traffic, saturation) everywhere, consistent names/dimensions.
- **Composability:** Components export widgets; dashboards assemble automatically.
- **Actionability:** Every alarm links to a runbook and carries context (trace id, key attrs, environment).
- **Low upkeep:** Defaults come from Alignment; slices override via typed props.
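The contract could take a shape like the one below. This is a sketch under assumptions: the method names on `MonitoringContract`, the `Alarm` and `Widget` fields, and the `QueueComponent` example are all hypothetical, and ticket creation is left out for brevity.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alarm:
    name: str
    metric: str
    threshold: float
    severity: str
    runbook_url: str  # every alarm must link to a runbook


@dataclass(frozen=True)
class Widget:
    title: str
    metric: str


class MonitoringContract(ABC):
    """Hypothetical shape of the contract; method names are assumptions."""

    @abstractmethod
    def metrics(self) -> List[str]: ...

    @abstractmethod
    def alarms(self) -> List[Alarm]: ...

    @abstractmethod
    def widgets(self) -> List[Widget]: ...


class QueueComponent(MonitoringContract):
    def __init__(self, name: str, max_depth: float = 1000.0) -> None:
        self.name = name
        self.max_depth = max_depth  # default can be overridden per slice

    def metrics(self) -> List[str]:
        return [f"{self.name}.depth", f"{self.name}.age_seconds"]

    def alarms(self) -> List[Alarm]:
        return [Alarm(f"{self.name}-backlog", f"{self.name}.depth", self.max_depth,
                      "sev2", f"docs/runbook.md#{self.name}-backlog")]

    def widgets(self) -> List[Widget]:
        return [Widget(f"{self.name} depth", f"{self.name}.depth")]


def build_dashboard(components: List[MonitoringContract]) -> List[Widget]:
    """Compose a service dashboard from whatever widgets the components export."""
    return [w for c in components for w in c.widgets()]
```

A component that omits any of the three methods cannot be instantiated, which gives the PR hook a cheap first check.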
### SLOs/SLAs
Define per slice in `spec.md` (latency, availability, freshness, throughput). Track error budgets and wire alerts accordingly.
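Error-budget tracking reduces to simple arithmetic. With a 99% availability target over 10,000 requests, 100 failures are allowed, so 50 failures leave half the budget. The sketch below assumes a request-based availability SLO; the function name and signature are illustrative.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left in the window (negative once overspent).

    slo_target is the availability objective, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total <= 0:
        return 1.0  # no traffic, nothing spent
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures
```

An alert policy could then page when the remaining fraction drops below a chosen floor, or when it is falling faster than the window allows.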
### Enforcement & Tooling
- **Hooks:** On PR, verify components implement `MonitoringContract` and required widgets/alarms exist.
- **AutoQA:** Check metric names/dimensions, thresholds vs SLOs, and presence of runbook links.
### Runbook Contract
Each alarm links to a minimal runbook: **Symptoms → Likely Causes → Triage Steps → Rollback Toggle → Verification Graph → Owners**.
---
# 4. Documentation
## Living Documentation (Wiki-as-Code)
### Steering Doc
**What:** The “how we build” playbook—naming, slice size, boundaries, testing, observability, security, performance, rollout.
**Where:** Centralized in the Alignment library under `/platform/alignment/steering/`; versioned and PR-driven.
**How shared:** Imported as a dependency or referenced via submodule/template.
### Spec Doc (per slice)
Includes:
- **Problem / Goals / Non-Goals**
- **Contracts & Scenarios** (I/O, errors, invariants)
- **Sequences** (happy/sad paths)
- **Schemas** (data, events)
- **SLOs & Observability** (metrics/logs/traces/alerts)
- **Rollout/Rollback** (flags, migrations)
- **Risks & Tests** (golden paths, load/capacity)
Maintained in `/docs/spec.md`, edited with code; kept in sync by hooks/agents.
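A hook could verify that a spec still contains the sections listed above. The sketch assumes each section appears as a Markdown heading that starts with its title; the names `REQUIRED_SECTIONS` and `missing_sections` are hypothetical.

```python
from typing import List

# Section titles taken from the list above; matching on headings is an assumption.
REQUIRED_SECTIONS = (
    "Problem", "Contracts & Scenarios", "Sequences", "Schemas",
    "SLOs & Observability", "Rollout/Rollback", "Risks & Tests",
)


def missing_sections(spec_text: str) -> List[str]:
    """Return required sections that have no Markdown heading starting with their title."""
    headings = [
        line.lstrip("#").strip()
        for line in spec_text.splitlines()
        if line.startswith("#")
    ]
    return [
        section for section in REQUIRED_SECTIONS
        if not any(h.startswith(section) for h in headings)
    ]
```

An empty result means the spec has every required section; anything else can be reported in the AutoQA output.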
### ADRs
Use an ADR template (e.g., Michael Nygard/MADR). Each decision records context, options, decision, consequences. Accepted ADRs are immutable; supersede with new ADRs when direction changes.
---
## Tech-Chooser Cheat Sheet
- **IaC:** Terraform / Pulumi / CDK (choose one; standardize modules & contracts).
- **Runtime:** Containers (Kubernetes/Nomad) and/or FaaS (Lambda/Cloud Functions/Azure Functions).
- **DI & Web:** Spring/.NET/NestJS/Micronaut/FastAPI/Express—pick per language, keep patterns.
- **Data:** Postgres/MySQL/SQL Server; Dynamo/Cosmos/Firestore/Mongo; Redis/Memcached.
- **Messaging:** Kafka/PubSub/Event Hubs/RabbitMQ/NATS.
- **Telemetry:** OpenTelemetry + Prometheus/Grafana; or Datadog/New Relic/Honeycomb; logs via ELK/OpenSearch.
- **Auth:** OIDC/OAuth2 provider (Auth0/Okta/Cognito/Keycloak).
- **Docs:** Markdown in-repo; ADRs in `/docs/adr`; diagrams as code (Mermaid/PlantUML).
- **AI:** Provider-pluggable (OpenAI/Anthropic/local); agents run locally and in CI.
---
## Operating Cadence
- **Every PR:** AutoQA report (non-blocking), doc sync, contract checks, SLO drift note.
- **Every Slice:** Owns its spec, SLOs, alarms, dashboards, and runbook.
- **Every Quarter:** Alignment library release (kept small and incremental), with migration notes; teams adopt via version bump.
- **Every Incident:** Post-incident ADR or runbook patch; add missing telemetry first, code second.