LionAGI Architecture — orchestration & fleet patterns

This document is a compact (1–2 page) outline of an architecture for using LionAGI as an orchestration layer for multi-step AI workflows (including QE/fleet use cases). It focuses on components, data flows, scaling considerations, and guardrails.

Goals

  • Use LionAGI to orchestrate typed, auditable workflows that combine LLM planning with deterministic tool actions.
  • Support dispatching work to a fleet of runners (CI agents, device agents, FleetDM-managed hosts).
  • Maintain observability, replayability, and safety for automated or semi-automated remediation.

High-level components

  • Model Provider Layer
    • Providers: OpenAI, Anthropic, Perplexity, Ollama, internal models
    • Responsibilities: LLM inference, routing to best model per task
  • LionAGI Orchestration Layer
    • Branches: workflow contexts and histories
    • Planners/ReAct controllers: decide actions, call tools, loop until goal
    • Validators: Pydantic schemas and custom checks
    • Action logs: structured records of tool calls and agent reasoning
  • Tool & Adapter Layer (a minimal adapter interface is sketched after this list)
    • CI/API adapters: GitHub Actions, GitLab, Jenkins
    • Device management adapters: FleetDM, MDM API, SSH, OTA services
    • Test harnesses: test runners, synthetic monitoring agents, fuzzers
    • Ticketing/Issue adapters: GitHub Issues, Jira
  • Storage & Retrieval
    • Artifact store: object storage (S3) for logs, screenshots, traces
    • Vector DB / RAG: Pinecone, Milvus, Weaviate for contextual retrieval
    • Metadata DB: lightweight relational DB for run metadata, indexing
  • Observability & Control Plane
    • Logging: ELK / Loki / structured logs (JSON), exportable DataFrames
    • Metrics & Alerts: Prometheus + Alertmanager, SLO dashboards
    • Human-in-the-loop UI: approvals, manual triage, PR review
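
The adapter layer is easiest to keep uniform: every backend (CI, FleetDM, SSH, plain HTTP) sits behind the same small surface that the planner's tool calls go through. A minimal sketch of that interface, using hypothetical RunnerAdapter / DispatchResult names rather than any LionAGI-provided type:

    # Hypothetical adapter interface -- illustrative names, not part of LionAGI.
    from dataclasses import dataclass, field
    from typing import Protocol


    @dataclass
    class DispatchResult:
        run_id: str        # identifier assigned by the runner backend
        status: str        # e.g. "queued", "running", "passed", "failed"
        artifact_urls: list[str] = field(default_factory=list)  # object-store URLs only


    class RunnerAdapter(Protocol):
        """Uniform surface for CI, FleetDM, SSH, and HTTP backends."""

        def dispatch(self, target: str, command: str, payload: dict) -> str:
            """Start work on a runner and return its external run id."""
            ...

        def fetch_result(self, run_id: str) -> DispatchResult:
            """Look up (or poll for) the result of a dispatched run."""
            ...

Concrete adapters (GitHub Actions, FleetDM, plain HTTP) then differ only in how dispatch and fetch_result are implemented; an HTTP variant is sketched under the deployment notes below.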

Data flow (simple sequence)

  1. User or schedule triggers a workflow (goal) in LionAGI.
  2. The Branch planner asks an LLM to decompose the goal into typed steps (a Pydantic TestPlan; schemas sketched after this list).
  3. For each step, planner chooses a target runner using adapters (FleetDM query or CI tag) and dispatches via a tool call.
  4. The runner executes the test, uploads artifacts to the object store, and posts the result to a callback endpoint (or exposes it for polling).
  5. LionAGI action log records the tool call and response; LLM reasons on results and decides next steps (retry, escalate, file issue).
  6. If issue creation is chosen, an adapter creates a ticket with a structured payload and links to artifacts.
  7. Final structured summary (Pydantic TestResultSummary) is emitted and stored with the run metadata.
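
A minimal sketch of the typed payloads named in steps 2 and 7, assuming Pydantic v2; the field names (target, command, artifact_urls, and so on) are illustrative, not a fixed schema:

    from pydantic import BaseModel, Field


    class TestStep(BaseModel):
        """One dispatchable unit of work chosen by the planner."""
        name: str
        target: str = Field(description="runner selector, e.g. a FleetDM label or CI tag")
        command: str
        timeout_s: int = 600


    class TestPlan(BaseModel):
        """Typed decomposition of the goal (step 2)."""
        goal: str
        steps: list[TestStep]


    class StepResult(BaseModel):
        step_name: str
        passed: bool
        artifact_urls: list[str] = []
        notes: str = ""


    class TestResultSummary(BaseModel):
        """Final structured summary emitted and stored with run metadata (step 7)."""
        goal: str
        results: list[StepResult]
        overall_passed: bool

Because the planner's output must validate against TestPlan, malformed LLM output fails fast instead of propagating into tool calls.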

Simple ASCII diagram

User/Schedule
      |
      v
[ LionAGI Branch / Planner ] <--> [ Model Providers ]
      |
      |  calls tools
      v
[ Tool Adapters: CI, FleetDM, HTTP ] <--> [ Runners / Devices / CI Workers ]
      |                                         |
      v                                         v  results & artifacts
[ Artifact Store ] <----------------------------+
      |
      v
[ Observability: Logs / Metrics / Tickets / Vector DB ]

Deployment & scaling notes

  • Run the LionAGI controller as a service (k8s deployment) with autoscaling based on the queue depth of incoming workflows.
  • Model providers are external; use local model servers (Ollama, vLLM) where low-latency or on-prem inference is required.
  • Runners (test agents) should be managed separately (FleetDM, k8s pods, VM fleets) and expose a stable API to the tool adapters (see the HTTP adapter sketch after this list).
  • Offload heavy artifact processing (video frames, large logs) to separate workers and reference them via object URLs to keep the action logs small.
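
As a concrete illustration of the last two points, a thin HTTP adapter sketch that talks to a hypothetical /jobs API on a runner and records only status and object URLs; the endpoint paths and bearer-token auth are assumptions about your runner, not a prescribed interface:

    # Sketch of an HTTP runner adapter (hypothetical /jobs endpoint, bearer-token auth).
    import httpx


    class HttpRunnerAdapter:
        def __init__(self, base_url: str, token: str):
            self._client = httpx.Client(
                base_url=base_url,
                headers={"Authorization": f"Bearer {token}"},
                timeout=30.0,
            )

        def dispatch(self, target: str, command: str, payload: dict) -> str:
            resp = self._client.post(
                "/jobs", json={"target": target, "command": command, **payload}
            )
            resp.raise_for_status()
            return resp.json()["job_id"]

        def fetch_result(self, job_id: str) -> dict:
            resp = self._client.get(f"/jobs/{job_id}")
            resp.raise_for_status()
            body = resp.json()
            # Keep action logs small: record status and artifact URLs,
            # never inline large logs or screenshots.
            return {"status": body["status"], "artifact_urls": body.get("artifact_urls", [])}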

Security & safety

  • Gate destructive tools behind an allow_changes flag and require human approval for high-risk workflows.
  • Sign and verify callbacks from runners (see the HMAC sketch after this list); use per-adapter authentication tokens.
  • Redact PII before storing artifacts or sending them to third-party LLM providers. Use privacy-preserving embeddings if needed.
  • Rate-limit LLM usage and enforce cost budgets at the model provider layer.
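
For the callback-signing point, a minimal verification sketch assuming the runner signs the raw request body with a shared secret and sends the hex digest in a hypothetical X-Signature header:

    import hashlib
    import hmac


    def verify_callback(secret: bytes, raw_body: bytes, signature_header: str) -> bool:
        """Accept a runner callback only if its body was signed with our shared secret."""
        expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
        # compare_digest avoids leaking information through timing differences
        return hmac.compare_digest(expected, signature_header)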

Observability & reproducibility

  • Store Branch histories and action logs in structured JSON; support exporting to DataFrames for analysis.
  • Keep mappings between Branch runs and external artifacts/tickets for traceability.
  • Add retry logic and idempotency keys to tool calls to avoid duplicate side effects (key derivation sketched after this list).
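
One way to derive such a key, assuming a side effect is identified by the Branch run id, the step name, and a canonicalised payload; the in-memory set stands in for a uniqueness check against the metadata DB:

    import hashlib
    import json


    def idempotency_key(run_id: str, step_name: str, payload: dict) -> str:
        """Stable key for one logical side effect, no matter how often it is retried."""
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(f"{run_id}:{step_name}:{canonical}".encode()).hexdigest()


    _applied: set[str] = set()  # stand-in for a unique-key column in the metadata DB

    def execute_once(run_id: str, step_name: str, payload: dict, action) -> None:
        key = idempotency_key(run_id, step_name, payload)
        if key in _applied:
            return  # retry of an already-applied side effect; skip it
        action(payload)
        _applied.add(key)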

Guardrails & human-in-the-loop

  • Include explicit review steps (“approval” tool) before PR merges or destructive remediation.
  • Emit a natural-language runbook for any remediation action the agent proposes, and require a human confirmation token to proceed (the gate is sketched below).
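
A sketch of the allow_changes gate combined with the confirmation-token check, assuming the approval UI issues the expected token out of band; the names and exception type are illustrative:

    import hmac


    class ApprovalRequired(Exception):
        """Raised when a destructive step needs human sign-off before it may run."""


    def gate_destructive_action(
        allow_changes: bool,
        expected_token: str | None,
        provided_token: str | None,
    ) -> None:
        """Block destructive tool calls unless changes are allowed and a human-issued
        confirmation token matches."""
        if not allow_changes:
            raise ApprovalRequired("allow_changes is False; dry-run only")
        if not expected_token or not provided_token:
            raise ApprovalRequired("missing human confirmation token")
        if not hmac.compare_digest(expected_token, provided_token):
            raise ApprovalRequired("confirmation token does not match")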

Minimal starter stack

  • LionAGI controller (k8s or VM)
  • One model provider (OpenAI or local Ollama) configured via provider adapter
  • Simple HTTP runner (small test harness) reachable by a Tool adapter
  • S3-compatible artifact store (MinIO)
  • Relational DB (Postgres) to index runs and metadata
  • Observability: ELK or Loki + Grafana for dashboards

Next steps

  • Create a concrete starter recipe and Pydantic schema for a test-plan workflow (see starter code file).
  • Build a thin Tool adapter for your chosen runner (HTTP webhook or polling).
  • Implement allow_changes gates and a human-approval UI for production rollout.

Status: DRAFT — adapt to your infra and policies.