agentsec-eval added to PyPI
Summary of the AI Agent Architecture and Evaluation Report
(≈ 4,000 words, written in Markdown for easy reading)
1. Introduction
The article under review details a next-generation AI-agent framework that blends large-language-model (LLM) reasoning, secure remote execution, and formal assertion handling into a single, modular system. The central innovation is the "LLM-as-Judge" concept: the LLM is treated as an adjudicative component that validates and interprets agent actions against a set of pre-defined assertions and policy rules.
Alongside this, the report presents a security-first evaluation, named AgentSec, which benchmarks the framework against realistic attack scenarios and operational use cases. The architecture relies heavily on YAML configuration, JSON/Markdown assertions, and a lightweight SSHExecutor that bridges LLM decision-making to remote command execution.
This summary unpacks the main technical ideas, evaluates the experimental results, and discusses the implications for secure AI-driven automation.
2. Background & Motivation
| Topic | Key Points | Why It Matters |
|-------|------------|----------------|
| AI-Agent Automation | Modern workflows (e.g., DevOps, IT support, data engineering) increasingly rely on autonomous agents to carry out tasks. | Reduces manual toil but raises questions about safety, compliance, and trust. |
| LLMs as Decision Makers | LLMs can generate complex code, compose documents, and plan actions. | However, they lack deterministic guarantees and can hallucinate. |
| Formal Assertions & Policies | Assertion languages (JSON Schema, Markdown) enable formal verification of agent actions. | Provide a safety net that enforces business rules and regulatory constraints. |
| SSH & Remote Execution | Secure shell remains the lingua franca for remote command execution on Unix-like systems. | Integrating LLM output with SSH commands requires rigorous input validation to prevent injection attacks. |
| Security Evaluation | AgentSec introduces a methodology for systematically assessing agent security. | Helps practitioners gauge real-world risk before deployment. |
The article positions its architecture as a response to the gap between powerful LLM capabilities and the need for rigorous, auditable agent behavior.
3. Core Components of the Architecture
The system is a composition of seven tightly coupled modules, each responsible for a distinct phase in the agent lifecycle.
- LLM-as-Judge: interprets and validates assertions.
- AgentSec: evaluates the system under adversarial conditions.
- YAML AgentAdapter: translates user-defined workflows into internal messages.
- Agent Observation: collects runtime telemetry.
- JudgeRouter: dispatches tasks to specialized judges (LLM, rule-engine, external API).
- Authorization & SSHExecutor: controls access and executes commands on target hosts.
- CommandPolicy: enforces fine-grained constraints on remote actions.
These components cooperate through JSON-over-HTTP messages that carry command, metadata, and assertion payloads. The flow can be visualized as:
    User → YAML AgentAdapter → JudgeRouter → LLM-as-Judge
                  │                 │              │
             Observation ── AgentSec ── Authorization/SSHExecutor
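A message of the kind described above might look like the following. This is only a sketch: the three top-level parts (`command`, `metadata`, `assertions`) follow the payload parts named in the text, while the exact schema, the `build_message` helper, and the `id`, `agent_id`, and `step_id` keys are assumptions made for illustration.

```python
import json
import uuid


def build_message(command: str, assertions: list, agent_id: str, step_id: str) -> str:
    """Serialize one inter-module message as JSON (hypothetical schema)."""
    message = {
        "id": str(uuid.uuid4()),  # assumed: a unique message identifier
        "command": command,       # what the agent wants to run
        "metadata": {"agent_id": agent_id, "step_id": step_id},
        "assertions": assertions,  # checks the judge must evaluate
    }
    return json.dumps(message)


raw = build_message(
    command="create_user",
    assertions=[{"kind": "json", "required": ["username", "groups"]}],
    agent_id="a1",
    step_id="create_user",
)
decoded = json.loads(raw)  # what a receiving module would see
```

Keeping the assertions inside the same envelope as the command means a judge can evaluate a request without a second lookup.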
4. The "LLM-as-Judge" Concept
4.1 What Is a Judge?
In software verification, a judge is a component that examines claims against a set of rules. Traditionally, these are static checkers (e.g., type checkers, theorem provers). In this framework, the LLM itself serves as the judge:
- Input: A request (e.g., "Create a user on host X") and a set of assertions (e.g., "User must belong to group admin").
- Processing: The LLM generates an explanation and a decision (approve, reject, modify).
- Output: A JSON object indicating accept/reject, why, and any modifications.
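Because the judge's reply is free text produced by an LLM, the consumer has to parse it defensively. The sketch below shows one way to do that; the key names (`decision`, `reason`, `modifications`) and the fail-closed behaviour are assumptions, since the report only says the output is a JSON object with an accept/reject flag, a justification, and optional modifications.

```python
import json
from dataclasses import dataclass, field

ALLOWED_DECISIONS = {"approve", "reject", "modify"}


@dataclass
class Verdict:
    decision: str
    reason: str
    modifications: dict = field(default_factory=dict)


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's JSON reply, rejecting anything that is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Verdict("reject", "judge reply was not valid JSON")
    decision = data.get("decision") if isinstance(data, dict) else None
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        return Verdict("reject", "judge reply had no recognised decision")
    mods = data.get("modifications")
    return Verdict(
        decision=decision,
        reason=str(data.get("reason", "")),
        modifications=mods if isinstance(mods, dict) else {},
    )


ok = parse_verdict('{"decision": "approve", "reason": "user is in group admin"}')
bad = parse_verdict("Sure! I approve this request.")  # prose, not JSON
```

Treating an unparseable reply as a rejection keeps a chatty or confused model from being read as approval.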
4.2 Advantages
| Benefit | Explanation |
|---------|-------------|
| Adaptability | The LLM can handle evolving policies without code changes. |
| Human-readable Feedback | The justification is often in natural language, aiding auditors. |
| Multi-modal Reasoning | Can combine textual rules with structured JSON schemas. |
4.3 Challenges
- Hallucination: LLMs may fabricate policy violations.
- Trustworthiness: Need to validate LLM output against deterministic checks.
- Latency: Large models can add significant round-trip time.
4.4 Mitigation Strategies
- Dual-Layer Validation: Use a deterministic rule engine as a fallback for high-risk decisions.
- Confidence Thresholds: Reject LLM decisions whose confidence falls below a specified threshold.
- Prompt Engineering: Embed policy snippets directly in the prompt to constrain the LLM's output.
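The first two strategies can be combined in a few lines. The sketch below is one possible reading, not the report's implementation: the threshold value, the `risk` field, and the `stub_llm` / `stub_rules` callables are all hypothetical stand-ins.

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed value; the report does not give one


def decide(request: dict, llm_judge, rule_engine) -> str:
    """Return 'approve' or 'reject' for a request.

    The LLM verdict is discarded when its confidence is too low, and
    high-risk requests additionally need approval from the rule engine.
    """
    decision, confidence = llm_judge(request)
    if confidence < CONFIDENCE_THRESHOLD:
        return "reject"
    if request.get("risk") == "high" and rule_engine(request) != "approve":
        return "reject"
    return decision


def stub_llm(request: dict):
    # Stand-in for the real judge: always approves, with a configurable confidence.
    return "approve", request.get("llm_confidence", 0.95)


def stub_rules(request: dict) -> str:
    # Stand-in for a deterministic rule engine with a one-entry allowlist.
    return "approve" if request.get("action") == "create_user" else "reject"


high_risk_ok = decide({"action": "create_user", "risk": "high"}, stub_llm, stub_rules)
high_risk_blocked = decide({"action": "delete_user", "risk": "high"}, stub_llm, stub_rules)
low_confidence = decide({"action": "create_user", "llm_confidence": 0.5}, stub_llm, stub_rules)
```

In this arrangement the LLM can only narrow what the rule engine permits for high-risk actions; it can never widen it.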
5. AgentSec: Security Evaluation Methodology
AgentSec is a structured test harness that probes the system from multiple attack vectors. The evaluation is broken into 10 key findings, each addressing a different threat category.
| Finding # | Threat Category | Summary |
|-----------|-----------------|---------|
| 1 | Command Injection | LLM-generated commands were sanitized, but complex shell syntax could bypass basic regex filters. |
| 2 | Privilege Escalation | SSHExecutor used sudo only when explicitly allowed; no accidental elevation observed. |
| 3 | Data Leakage | JSON telemetry was encrypted in transit; however, local logs still contained raw command outputs. |
| 4 | Model Misuse | A malicious user could feed the LLM fabricated assertions to bypass policy checks. |
| 5 | Denial-of-Service | Repeated high-volume prompts caused LLM throttling; no crash, but latency increased. |
| 6 | Credential Theft | SSH keys were stored in HSM; no compromise detected. |
| 7 | Policy Drift | Automated comparison of current policy schemas with legacy policies detected 3 drift incidents. |
| 8 | Unintended Automation | Agents sometimes executed optional actions (e.g., backup) without explicit user consent. |
| 9 | Logging Misconfiguration | Audit logs were not rotated automatically, leading to potential information retention beyond compliance windows. |
| 10 | API Abuse | External API calls were rate-limited, preventing brute-force exploitation. |
5.1 Experimental Setup
- Hardware: 4 × RTX 3090 GPUs for LLM inference; 2 × Dell PowerEdge R740 for orchestration.
- Software Stack: Python 3.10, FastAPI, OpenAI API (gpt-4-32k), Paramiko for SSH.
- Attack Simulations: Crafted via a combination of rule-based fuzzers and human-crafted exploit vectors.
5.2 Key Takeaways
- The primary risk remains in incomplete sanitization of shell commands.
- Authorization layers (policy, role-based access) effectively mitigate privilege escalation.
- Logging and audit trails need better lifecycle management to meet regulatory standards.
6. YAML AgentAdapter: Declarative Workflow Engine
The YAML AgentAdapter is a lightweight parser that turns human-friendly YAML definitions into structured commands the system can execute.
6.1 YAML Syntax
    steps:
      - name: gather_system_info
        type: agent_query
        query: |
          import platform, socket
          return {
            "os": platform.system(),
            "hostname": socket.gethostname()
          }
      - name: create_user
        type: agent_action
        action: create_user
        params:
          username: "{{steps.gather_system_info.hostname}}_admin"
          groups: ["admin", "sudo"]
      - name: verify_user
        type: agent_assert
        assertion:
          kind: json
          schema: |
            {
              "type": "object",
              "properties": {
                "username": {"type": "string"},
                "groups": {"type": "array"}
              },
              "required": ["username", "groups"]
            }
6.2 How It Works
- Parsing: The adapter uses `pyyaml` to read the file.
- Templating: Jinja2-like interpolation allows step results to feed subsequent steps.
- Message Generation: Each step is transformed into a JSON payload with metadata (step ID, type, dependencies).
- Dispatch: The payload is sent to JudgeRouter for evaluation.
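The templating and message-generation steps can be sketched as follows. This is an illustrative approximation using only the standard library: it starts from the dict that a YAML parser would produce, supports only `{{steps.<name>.<field>}}` placeholders, and the payload keys (`step_id`, `depends_on`, `params`) and helper names are assumptions rather than the adapter's real schema.

```python
import json
import re

# Matches placeholders such as {{steps.gather_system_info.hostname}}
PLACEHOLDER = re.compile(r"\{\{\s*steps\.(\w+)\.(\w+)\s*\}\}")


def render(value, results: dict):
    """Recursively substitute earlier step results into strings, lists, and dicts."""
    if isinstance(value, str):
        return PLACEHOLDER.sub(lambda m: str(results[m.group(1)][m.group(2)]), value)
    if isinstance(value, list):
        return [render(item, results) for item in value]
    if isinstance(value, dict):
        return {key: render(item, results) for key, item in value.items()}
    return value


def to_payload(step: dict, results: dict) -> str:
    """Build the JSON payload for one step, recording the steps it depends on."""
    params = step.get("params", {})
    referenced = {m.group(1) for m in PLACEHOLDER.finditer(json.dumps(params))}
    return json.dumps({
        "step_id": step["name"],
        "type": step["type"],
        "depends_on": sorted(referenced),
        "params": render(params, results),
    })


# This dict mirrors the create_user step from the YAML example after parsing.
step = {
    "name": "create_user",
    "type": "agent_action",
    "params": {
        "username": "{{steps.gather_system_info.hostname}}_admin",
        "groups": ["admin", "sudo"],
    },
}
results = {"gather_system_info": {"os": "Linux", "hostname": "web01"}}
payload = json.loads(to_payload(step, results))
```

Deriving `depends_on` from the placeholders themselves keeps the dependency metadata from drifting out of sync with the template.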
6.3 Benefits
- Human-readable: Teams can design workflows without writing code.
- Version-controlled: YAML files are Git-friendly.
- Extensible: Custom step types can be added via plugin modules.
6.4 Limitations
- Schema Drift: If underlying data structures change, YAML steps may break.
- Error Propagation: Failure in an early step can cascade; the adapter currently retries only for idempotent actions.
7. Agent Observation: Runtime Telemetry
The Agent Observation module collects in-flight data and exposes it via an API and a Prometheus exporter.
7.1 Telemetry Fields
| Field | Type | Description |
|-------|------|-------------|
| agent_id | string | Unique identifier for the agent instance. |
| step_id | string | Current step being executed. |
| status | enum | queued, running, success, failure. |
| latency_ms | integer | Time taken for the step. |
| assertion_result | boolean | Pass/fail of the latest assertion. |
| sensitive_payload | masked | Any data that matched a sensitive regex. |
7.2 Exporter Configuration
    prometheus:
      enabled: true
      port: 9090
      scrape_interval: 15s
The module exposes metrics like:
agent_step_latency_seconds{agent_id="a1",step_id="create_user"} 0.42
agent_assertion_pass_total{agent_id="a1"} 42
7.3 Privacy Considerations
- Masking: All telemetry is passed through a masking filter that replaces values matching patterns like `(?i)(password|token)` with `***`.
- Encryption: The telemetry channel uses TLS 1.3 with mutual authentication.
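A masking filter along these lines could be written as below. The sketch assumes the pattern is applied to field names and that nested records are masked recursively; the report does not specify either detail, and the sample event is invented.

```python
import re

SENSITIVE = re.compile(r"(?i)(password|token)")


def mask(record):
    """Return a copy of a telemetry record with sensitive values replaced by '***'."""
    if isinstance(record, dict):
        return {
            key: "***" if SENSITIVE.search(str(key)) else mask(value)
            for key, value in record.items()
        }
    if isinstance(record, list):
        return [mask(item) for item in record]
    return record


event = {
    "agent_id": "a1",
    "step_id": "create_user",
    "env": {"DB_PASSWORD": "hunter2", "region": "eu-west-1"},
    "api_token": "abc123",
}
masked = mask(event)
```

Matching on key names will miss secrets embedded in free-text values such as command output, which is consistent with the data-leakage finding about raw logs.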
8. JudgeRouter: Dispatcher & Orchestrator
The JudgeRouter is the linchpin that selects the appropriate judge for each request.
8.1 Decision Logic
| Criterion | Judge |
|-----------|-------|
| Type of request (agent_action vs. agent_assert) | LLM-as-Judge |
| Policy severity (high vs. low) | Deterministic rule-engine |
| External API call | API Judge (proxy to OpenAI or internal LLM) |
| Resource constraints | Queue manager (to prevent overload) |
8.2 Routing Table
    {
      "agent_action": "llm_judge",
      "agent_assert": "llm_judge",
      "external_api": "api_judge",
      "high_severity": "rule_engine"
    }
8.3 Failure Recovery
- Timeouts: If a judge exceeds `timeout_ms`, the request is marked `failure` and logged.
- Fallback: On judge failure, the Router retries with a simpler judge (e.g., rule engine).
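Routing and fallback can be sketched together. The version below makes several assumptions that are not in the report: high severity takes precedence over request type, each judge callable enforces its own `timeout_ms` and raises `TimeoutError` when it is exceeded, and the rule engine is the single fallback.

```python
ROUTING_TABLE = {
    "agent_action": "llm_judge",
    "agent_assert": "llm_judge",
    "external_api": "api_judge",
    "high_severity": "rule_engine",
}
FALLBACK_JUDGE = "rule_engine"  # assumed: the "simpler judge" used on failure


def route(request: dict, judges: dict) -> dict:
    """Pick a judge from the routing table and fall back once if it fails."""
    key = "high_severity" if request.get("severity") == "high" else request["type"]
    primary = ROUTING_TABLE[key]
    candidates = [primary] if primary == FALLBACK_JUDGE else [primary, FALLBACK_JUDGE]
    error = None
    for name in candidates:
        try:
            return {"judge": name, "status": "success", "verdict": judges[name](request)}
        except Exception as exc:  # includes TimeoutError raised by a slow judge
            error = repr(exc)
    return {"judge": candidates[-1], "status": "failure", "error": error}


def slow_llm(request: dict) -> str:
    raise TimeoutError("judge exceeded timeout_ms")  # simulated timeout


def rules(request: dict) -> str:
    return "reject"  # stand-in deterministic verdict


judges = {"llm_judge": slow_llm, "api_judge": rules, "rule_engine": rules}
result = route({"type": "agent_action"}, judges)
```

Here the timed-out LLM judge is replaced by the rule engine, so the request still receives a deterministic verdict instead of hanging.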
9. Authorization & SSHExecutor: Secure Remote Actions
9.1 Authorization Layer
Authorization is role-based and enforced in two stages:
- Policy Check: A JSON Policy document specifies allowed actions per role.
- Token Validation: Each request carries a JWT signed by a central Identity Provider (IdP).
Example Policy
    {
      "role": "operator",
      "allow": [
        {"action": "create_user", "resource": "host/*"},
        {"action": "execute_script", "resource": "host/*", "script": ".*\\.sh"}
      ],
      "deny": [
        {"action": "delete_user", "resource": "host/*"}
      ]
    }
The authorization engine returns allow/deny along with a confidence score.
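The policy-check stage for the example policy could be evaluated roughly as follows. The sketch assumes deny rules override allow rules, that `resource` is a glob, and that `script` is a regular expression that must match the whole script name; none of these semantics are stated in the report. The confidence score is left out because the report does not say how it is computed.

```python
import re
from fnmatch import fnmatchcase
from typing import Optional

POLICY = {
    "role": "operator",
    "allow": [
        {"action": "create_user", "resource": "host/*"},
        {"action": "execute_script", "resource": "host/*", "script": r".*\.sh"},
    ],
    "deny": [
        {"action": "delete_user", "resource": "host/*"},
    ],
}


def _matches(rule: dict, action: str, resource: str, script: Optional[str]) -> bool:
    """True when a single allow/deny rule applies to the request."""
    if rule["action"] != action or not fnmatchcase(resource, rule["resource"]):
        return False
    if "script" in rule:
        return script is not None and re.fullmatch(rule["script"], script) is not None
    return True


def is_allowed(policy: dict, role: str, action: str, resource: str,
               script: Optional[str] = None) -> bool:
    """Deny by default; an explicit deny wins over any allow."""
    if policy["role"] != role:
        return False
    if any(_matches(r, action, resource, script) for r in policy.get("deny", [])):
        return False
    return any(_matches(r, action, resource, script) for r in policy.get("allow", []))


can_create = is_allowed(POLICY, "operator", "create_user", "host/web01")
can_delete = is_allowed(POLICY, "operator", "delete_user", "host/web01")
```

Using `re.fullmatch` rather than `re.search` keeps a name like `deploy.sh; rm -rf /` from slipping past the `.sh` pattern.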
9.2 SSHExecutor Implementation
| Component | Description |
|-----------|-------------|
| SSHClient | Uses Paramiko for SSH connections. |
| CommandSanitizer | Performs whitelisting of shell commands based on the policy. |
| ExecutionContext | Keeps track of cwd, user, and environment variables. |
| ResultParser | Converts stdout/stderr into structured JSON. |
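    import paramiko


    class SSHExecutor:
        def execute(self, host, command, env=None):
            # _sanitize is not shown here; it is assumed to raise on disallowed commands.
            self._sanitize(command)
            env = env or {}  # avoid a TypeError when no environment is passed
            with paramiko.SSHClient() as client:
                # AutoAddPolicy trusts unknown host keys on first contact.
                client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
                # With username=None, Paramiko falls back to the local username.
                client.connect(host, username=env.get("user"))
                stdin, stdout, stderr = client.exec_command(command)
                return {
                    "stdout": stdout.read().decode(),
                    "stderr": stderr.read().decode(),
                    "exit_code": stdout.channel.recv_exit_status()
                }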
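The executor above calls a sanitizer that the report does not show. One way such a step might work, given the finding that regex filters can be bypassed by complex shell syntax, is to reject shell metacharacters outright and check the program name against an allowlist. The allowlist contents and the function name below are hypothetical.

```python
import shlex

ALLOWED_PROGRAMS = {"useradd", "userdel", "id", "uname"}  # hypothetical allowlist
SHELL_METACHARACTERS = set(";&|<>$`\\\n(){}*?!~#")


def sanitize(command: str) -> list:
    """Return the command as an argv list, or raise ValueError if it is not allowed."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        raise ValueError("shell metacharacters are not allowed")
    argv = shlex.split(command)  # raises ValueError on unbalanced quotes
    if not argv or argv[0] not in ALLOWED_PROGRAMS:
        raise ValueError("program is not on the allowlist")
    return argv


argv = sanitize("useradd -m web01_admin")
# Re-quote every argument before handing the string to a remote shell.
safe_command = shlex.join(argv)
```

Rejecting metacharacters is stricter than pattern matching: chained commands, substitutions, and redirections are refused before any allowlist lookup happens.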
9.3 Security Controls
- Strict Path Binding: The executor resolves paths relative to a sandbox directory.
- Rate Limiting: Each host is limited to `10 req/min` to mitigate brute-force attempts.
- Audit Logging: Every command executed is logged with a hash of the input for replayability.
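The per-host limit of 10 requests per minute could be enforced with a sliding window, as sketched below. The class name, the in-memory storage, and the injectable clock are assumptions for illustration; the report does not describe how the limit is implemented.

```python
import time
from collections import defaultdict, deque


class HostRateLimiter:
    """Allow at most `limit` requests per host within a sliding `window` (seconds)."""

    def __init__(self, limit: int = 10, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)

    def allow(self, host: str) -> bool:
        now = self.clock()
        hits = self._hits[host]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


fake_now = [0.0]  # a controllable clock for the demonstration
limiter = HostRateLimiter(clock=lambda: fake_now[0])
first_ten = [limiter.allow("web01") for _ in range(10)]
eleventh = limiter.allow("web01")      # over the limit within the same minute
fake_now[0] = 60.0
after_window = limiter.allow("web01")  # the earlier requests have expired
```

Each host has its own window, so a burst against one target does not consume the budget of another.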
10. CommandPolicy: Fine-grained Control
CommandPolicy is a JSON-based language that describes what commands can run, when they can run, and under what circumstances they may be modified.
10.1 Policy Syntax
    {
      "policy_name": "user_management",
      "rules": [
        {
          "action": "create_user",
          "condition": "role == 'operator'",
          "script_pattern": "^useradd",
          "max_retries": 3,
          "notify": "email"
        },
        {
          "action": "delete_user",
          "condition": "role == 'admin'",
          "script_pattern": "^userdel",
          "max_retries": 1,
          "notify": "slack"
        }
      ]
    }
10.2 Enforcement Flow
- Policy Load: At startup, policies are parsed and cached.
- Request Validation: Each command is matched against policy rules.
- Execution Decision: If matched, the command proceeds; otherwise it is blocked.
- Post-Execution Hook: If `notify` is defined, a webhook is triggered with the execution outcome.
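The request-validation and execution-decision steps might look like the sketch below, applied to the example policy. It assumes conditions are limited to the form `role == '<name>'` and parses them with a regular expression instead of evaluating them as code; the `find_rule` helper is hypothetical.

```python
import re

POLICY = {
    "policy_name": "user_management",
    "rules": [
        {"action": "create_user", "condition": "role == 'operator'",
         "script_pattern": "^useradd", "max_retries": 3, "notify": "email"},
        {"action": "delete_user", "condition": "role == 'admin'",
         "script_pattern": "^userdel", "max_retries": 1, "notify": "slack"},
    ],
}

# Only simple equality conditions are supported in this sketch.
CONDITION = re.compile(r"^role == '(\w+)'$")


def find_rule(policy: dict, action: str, role: str, command: str):
    """Return the first rule that permits the command, or None to block it."""
    for rule in policy["rules"]:
        if rule["action"] != action:
            continue
        condition = CONDITION.match(rule["condition"])
        if condition is None or condition.group(1) != role:
            continue
        if re.search(rule["script_pattern"], command):
            return rule
    return None


permitted = find_rule(POLICY, "create_user", "operator", "useradd -m web01_admin")
blocked = find_rule(POLICY, "delete_user", "operator", "userdel bob")
```

The returned rule carries `max_retries` and `notify`, so the caller can apply them after execution. Note that a pattern such as `^useradd` only anchors the start of the command, so it does not by itself rule out trailing shell syntax.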
10.3 Dynamic Policy Updates
Policies can be updated via a REST endpoint that supports CRDT (Conflict-free Replicated Data Type) updates, ensuring eventual consistency across distributed orchestrators.
11. 10 Findings (Deduped): In-Depth Analysis
Below we expand on the 10 findings from AgentSec, adding context, mitigation steps, and open questions.
| # | Finding | Context | Mitigation | Open Question |
|---|---------|---------|------------|---------------|
| 1 | Command Injection | The LLM produced `rm -rf /` when given a malformed assertion. | Sanitization using a whitelist of safe shell constructs. | Could a more sophisticated LLM prompt prevent injection entirely? |
| 2 | Privilege Escalation | A sudo escalation attempt was caught by the policy. | Least privilege policy, explicit sudo approvals. | What about privilege creep in long-running sessions? |
| 3 | Data Leakage | Raw stdout from commands was stored in plaintext logs. | Encrypt logs at rest; enforce log masking. | Is there a need for data retention policies? |
| 4 | Model Misuse | Attackers fed fabricated assertions to bypass policy. | Model watermarking and confidence thresholds. | How to detect fabricated assertions automatically? |
| 5 | Denial-of-Service | Overloading LLM with prompts caused throttling. | Rate limiting on the API gateway. | Is a local inference engine feasible? |
| 6 | Credential Theft | HSM protected SSH keys, but the key retrieval process was unencrypted. | Use Hardware-Backed Security and end-to-end encryption. | Should key usage be token-based instead of static? |
| 7 | Policy Drift | New policy added by an admin unintentionally conflicted with existing rules. | Policy conflict detection tool. | How to automatically resolve conflicts? |
| 8 | Unintended Automation | Agents executed optional backup commands without user consent. | Require explicit confirmation tokens. | Can we make optional actions part of the assertion? |
| 9 | Logging Misconfiguration | Logs were not rotated, violating GDPR. | Implement log rotation and archival policies. | How to enforce rotation across distributed nodes? |
| 10 | API Abuse | External API calls were limited to 10 req/min. | Rate limiting, token revocation on abuse. | Is dynamic throttling needed per user? |
12. Implications & Future Directions
12.1 Operational Impact
| Impact | Description |
|--------|-------------|
| Automation Speed | LLM-based decision making reduces manual intervention by up to 70% in typical DevOps pipelines. |
| Compliance | Formal assertions and audit logs provide evidence for regulatory compliance (SOC 2, ISO 27001). |
| Security Posture | Granular policy enforcement lowers attack surface; however, the LLM layer introduces new risk vectors that must be monitored. |
12.2 Research Opportunities
- LLM Verification: Formal methods to prove that an LLM's output satisfies a given policy.
- Policy Languages: Extending JSON Schema to encode temporal constraints and probabilistic guarantees.
- Hybrid Judges: Combining symbolic reasoning with LLM inference to create a probabilistic verifier.
- Federated Deployment: Scaling the architecture across multi-cloud environments with consistent policy enforcement.
- Explainability Enhancements: Augmenting LLM justifications with trace diagrams that link assertions to source code changes.
12.3 Open Source Ecosystem
The authors have released AgentSec and related components under the MIT license, encouraging community contributions. The repo includes:
- `agent_adapter.yaml`: sample workflow definitions.
- `policy_manager.py`: policy CRUD API.
- `ssh_executor.py`: core executor with sandboxing.
- `test_agentsec.py`: full suite of attack simulations.

Community feedback has already led to a plugin architecture for custom judges (e.g., a formal theorem prover plug-in).
13. Conclusion
The article presents a holistic, security-first approach to building AI-powered autonomous agents. By treating the LLM as an adjudicative judge, the system gains flexibility and human-readable explanations. Yet, the evaluation reveals that traditional security controls (policy enforcement, sanitization, audit logging) remain essential.
Key takeaways:
- LLM-as-Judge is a promising paradigm but must be paired with deterministic checks for high-stakes operations.
- AgentSec demonstrates that systematic, adversarial testing can surface subtle vulnerabilities before deployment.
- YAML AgentAdapter and CommandPolicy provide a declarative, composable workflow layer that eases operational adoption.
For practitioners, the framework offers a blueprint to operationalize AI automation while maintaining compliance and security. For researchers, it opens new questions about verifiable machine reasoning and policy-driven AI control.
Future work should focus on formal verification of LLM outputs, dynamic policy conflict resolution, and efficient on-prem LLM inference to reduce latency and dependency on external APIs.
Date: 25 May 2026