LionAGI Architecture — orchestration & fleet patterns
This document outlines a compact 1–2 page architecture for using LionAGI as an orchestration layer for multi-step AI workflows (including QE/fleet use cases). It focuses on components, data flows, scaling considerations, and guardrails.
Goals
- Use LionAGI to orchestrate typed, auditable workflows that combine LLM planning with deterministic tool actions.
- Support dispatching work to a fleet of runners (CI agents, device agents, FleetDM-managed hosts).
- Maintain observability, replayability, and safety for automated or semi-automated remediation.
High-level components
- Model Provider Layer
  - Providers: OpenAI, Anthropic, Perplexity, Ollama, internal models
  - Responsibilities: LLM inference, routing each task to the best-suited model
- LionAGI Orchestration Layer (a wiring sketch follows this list)
  - Branches: workflow contexts and histories
  - Planners/ReAct controllers: decide actions, call tools, loop until the goal is met
  - Validators: Pydantic schemas and custom checks
  - Action logs: structured records of tool calls and agent reasoning
- Tool & Adapter Layer
  - CI/API adapters: GitHub Actions, GitLab, Jenkins
  - Device management adapters: FleetDM, MDM APIs, SSH, OTA services
  - Test harnesses: test runners, synthetic monitoring agents, fuzzers
  - Ticketing/issue adapters: GitHub Issues, Jira
- Storage & Retrieval
  - Artifact store: object storage (S3) for logs, screenshots, traces
  - Vector DB / RAG: Pinecone, Milvus, Weaviate for contextual retrieval
  - Metadata DB: a lightweight relational DB for run metadata and indexing
- Observability & Control Plane
  - Logging: ELK / Loki / structured JSON logs, exportable as DataFrames
  - Metrics & alerts: Prometheus + Alertmanager, SLO dashboards
  - Human-in-the-loop UI: approvals, manual triage, PR review
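The sketch below shows how the orchestration and tool layers meet in code: a Branch bound to one provider, with a single registered tool. It is a minimal sketch assuming lionagi's Branch/iModel surface (verify the names against your installed version, since the API has shifted between releases); dispatch_test and RUNNER_URL are hypothetical placeholders.

```python
# Minimal wiring sketch. Assumptions: lionagi exposes Branch and iModel as
# used below (check your installed version), and dispatch_test / RUNNER_URL
# are hypothetical placeholders for a real runner adapter.
from lionagi import Branch, iModel

RUNNER_URL = "https://runner.internal/run"  # hypothetical runner endpoint

async def dispatch_test(suite: str, target: str) -> dict:
    """Tool: ask a runner to execute `suite` on `target`, return run metadata."""
    # A real implementation would POST to RUNNER_URL and return its response.
    return {"run_id": "r-123", "suite": suite, "target": target, "status": "queued"}

branch = Branch(
    system="You plan and dispatch QE test runs. Prefer small, reversible steps.",
    chat_model=iModel(provider="openai", model="gpt-4o-mini"),
    tools=[dispatch_test],  # lionagi wraps plain callables as tools
)

# Inside an async context, a ReAct-style loop over the registered tools:
#   result = await branch.ReAct(instruct={"instruction": "Smoke-test build 42"})
```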
Data flow (simple sequence)
- A user or schedule triggers a workflow (goal) in LionAGI.
- The Branch planner asks an LLM to decompose the goal into typed steps (a Pydantic TestPlan; illustrative schemas follow this list).
- For each step, the planner chooses a target runner via the adapters (FleetDM query or CI tag) and dispatches it with a tool call.
- The runner executes the test, uploads artifacts to the object store, and posts the result to a callback endpoint (or exposes it for polling).
- The LionAGI action log records the tool call and response; the LLM reasons over the results and decides the next step (retry, escalate, file an issue).
- If issue creation is chosen, an adapter creates a ticket with a structured payload and links to the artifacts.
- A final structured summary (Pydantic TestResultSummary) is emitted and stored with the run metadata.
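For reference, here are plausible shapes for the typed payloads named above. The canonical TestPlan / TestResultSummary live in the starter code file, so every field below is an assumption made for illustration.

```python
# Illustrative shapes only: the canonical TestPlan / TestResultSummary live in
# the starter code file, so these field names are assumptions.
from typing import Literal
from pydantic import BaseModel, Field

class TestStep(BaseModel):
    name: str
    runner_tag: str = Field(description="CI label or FleetDM query selecting a runner")
    command: str
    timeout_s: int = 600

class TestPlan(BaseModel):
    goal: str
    steps: list[TestStep]

class TestResultSummary(BaseModel):
    plan_goal: str
    outcome: Literal["pass", "fail", "partial"]
    artifact_urls: list[str] = []   # object-store links keep the summary small
    ticket_url: str | None = None   # set when an issue was filed
```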
Simple ASCII diagram
User/Schedule
      |
      v
[ LionAGI Branch / Planner ] <--> [ Model Providers ]
      | calls tools
      v
[ Tool Adapters: CI, FleetDM, HTTP ] --> [ Runners / Devices / CI Workers ]
      |                                         |
      v                                         v
[ Artifact Store ] <--------------------- results & artifacts
      |
      v
[ Observability: Logs / Metrics / Tickets / Vector DB ]
Deployment & scaling notes
- Run LionAGI controller as a service (k8s deployment) with autoscaling based on queue depth of incoming workflows.
- Model providers are external; use locally hosted model servers (Ollama, vLLM, optionally exposed via MCP) where low-latency or on-prem inference is required.
- Runners (test agents) should be managed separately (FleetDM, k8s pods, VM Fleet) and expose a stable API to the tool adapters.
- Offload heavy artifact processing (video frames, large logs) to separate workers and reference via object URLs to keep the action logs small.
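A small sketch of the "reference, don't embed" rule for action logs, assuming an S3-compatible store reachable via boto3; the endpoint, bucket, and key layout are placeholders for your MinIO/S3 setup.

```python
# Push the heavy artifact to S3-compatible storage and record only its URL in
# the action log. Endpoint, bucket, and key layout are placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="http://minio.internal:9000")  # placeholder

def store_artifact(run_id: str, name: str, data: bytes) -> str:
    key = f"runs/{run_id}/{name}"
    s3.put_object(Bucket="qe-artifacts", Key=key, Body=data)
    return f"s3://qe-artifacts/{key}"  # log this URL, not the payload
```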
Security & safety
- Gate destructive tools behind an allow_changes boolean and require human approval for high-risk workflows (see the gate sketch after this list).
- Sign and verify callbacks from runners; use authentication tokens per adapter.
- Redact PII before storing artifacts or sending data to third-party LLM providers; use privacy-preserving embeddings if needed.
- Rate-limit LLM usage and enforce cost budgets at the model provider layer.
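One way the allow_changes gate can look in code. The decorator, context fields, and exception are hypothetical names, shown only to make the control flow concrete.

```python
# Hypothetical gate: a destructive tool refuses to act unless the workflow was
# started with allow_changes=True, and high-risk calls also require an
# approval token recorded by a human. Back `ctx` with real run metadata.
from functools import wraps

class ApprovalRequired(Exception):
    pass

def guarded(fn):
    @wraps(fn)
    def wrapper(*args, ctx: dict, **kwargs):
        if not ctx.get("allow_changes", False):
            raise ApprovalRequired(f"{fn.__name__} blocked: allow_changes is False")
        if ctx.get("risk") == "high" and not ctx.get("approval_token"):
            raise ApprovalRequired(f"{fn.__name__} blocked: missing approval token")
        return fn(*args, **kwargs)
    return wrapper

@guarded
def reboot_device(host_id: str) -> str:
    # the real tool would call the FleetDM/MDM adapter here
    return f"reboot requested for {host_id}"

# reboot_device("host-1", ctx={"allow_changes": True, "risk": "high",
#                              "approval_token": "tok-from-ui"})
```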
Observability & reproducibility
- Store Branch histories and action logs in structured JSON; support exporting to DataFrames for analysis.
- Keep mappings between Branch runs and external artifacts/tickets for traceability.
- Add retry logic and idempotency keys to tools to avoid duplicate side effects (see the sketch after this list).
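A minimal sketch of deriving idempotency keys, assuming the receiving runner or adapter treats a repeated key as a no-op; that contract is yours to enforce on the receiving side.

```python
# Deterministic idempotency key from the tool name, its arguments, and the
# run id, so a retried dispatch reuses the same key. The receiving
# adapter/runner must treat repeated keys as no-ops (an assumed contract).
import hashlib
import json

def idempotency_key(tool: str, args: dict, run_id: str) -> str:
    payload = json.dumps({"tool": tool, "args": args, "run": run_id}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:32]

# Example: attach as an Idempotency-Key header on the dispatch HTTP request.
```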
Guardrails & human-in-the-loop
- Include explicit review steps (“approval” tool) before PR merges or destructive remediation.
- Emit a natural-language runbook for any remediation action the agent proposes, and require a human confirmation token to proceed (a token-check sketch follows).
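A toy sketch of the confirmation-token handshake. issue_token/approve and the in-memory store are hypothetical stand-ins for your approval UI and metadata DB.

```python
# Toy confirmation-token flow: issue_token() is shown to the human next to the
# generated runbook; approve() is what the orchestrator calls before acting.
# The in-memory dict stands in for a real metadata DB.
import hmac
import secrets

_PENDING: dict[str, str] = {}  # runbook_id -> expected token

def issue_token(runbook_id: str) -> str:
    token = secrets.token_urlsafe(8)
    _PENDING[runbook_id] = token
    return token  # surface this in the triage/approval UI

def approve(runbook_id: str, supplied: str) -> bool:
    expected = _PENDING.get(runbook_id, "")
    return hmac.compare_digest(expected, supplied)  # constant-time comparison
```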
Recommended minimal stack for a PoC
- LionAGI controller (k8s or VM)
- One model provider (OpenAI or local Ollama) configured via provider adapter
- Simple HTTP runner (small test harness) reachable by a Tool adapter (a minimal runner sketch follows this list)
- S3-compatible artifact store (MinIO)
- Relational DB (Postgres) to index runs and metadata
- Observability: ELK or Loki + Grafana for dashboards
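To make the PoC concrete, here is a minimal sketch of the HTTP runner using FastAPI. The /run endpoint and payload shape are assumptions to be matched to your Tool adapter, and shell=True is tolerable only in a throwaway PoC.

```python
# PoC-only HTTP runner. Endpoint name and payload shape are assumptions to be
# matched to the Tool adapter; shell=True is acceptable only for a throwaway PoC.
import subprocess
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Dispatch(BaseModel):
    run_id: str
    command: str
    timeout_s: int = 600

@app.post("/run")
def run_test(d: Dispatch):
    proc = subprocess.run(
        d.command, shell=True, capture_output=True, text=True, timeout=d.timeout_s
    )
    # A real runner would upload stdout/artifacts to the object store and POST
    # the result to the orchestrator's callback URL instead of returning it.
    return {"run_id": d.run_id, "exit_code": proc.returncode, "stdout": proc.stdout[-2000:]}
```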
Next steps
- Create a concrete starter recipe and Pydantic schema for a test-plan workflow (see the starter code file).
- Build a thin Tool adapter for your chosen runner (HTTP webhook or polling).
- Implement allow_changes gates and a human-approval UI for production rollout.
Status: DRAFT — adapt to your infra and policies.