Exploratory Results

These are early exploratory findings from a small test set (10 tests across 3 APIs). Results should be interpreted directionally, not as statistically significant. We're sharing them for transparency as we continue to expand the test suite.

Key Observations

  • MAPI and MAPI Skills both achieved 100% API success rate across all tests
  • MAPI Skills uses only 9.3% of OpenAPI's tokens (90.7% reduction)
  • Both MAPI formats achieved 100% capability matching accuracy
  • These tests execute real API calls—not just LLM comprehension checks
  • OpenAPI GitHub tests were rate-limited due to large spec size (~108K tokens per request)

Methodology

We built an end-to-end test harness that goes beyond LLM comprehension testing. Each test executes the full agent workflow: understanding a natural language request, constructing an HTTP request, and executing it against a live API. This measures real-world usability, not just spec readability.

Test Design

Each test case consists of:

  • A natural language request (e.g., "Count how many tokens are in 'Hello, world!'")
  • An expected capability (e.g., messages.count_tokens)
  • A response validator that checks the actual API response
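In TypeScript (the harness language), a test case of this shape might look like the following sketch; the `TestCase` name and field names here are illustrative, not the harness's actual types:

```typescript
// Sketch of a single E2E test case, based on the three parts listed above.
interface TestCase {
  request: string;                          // natural language request
  expectedCapability: string;               // e.g. "messages.count_tokens"
  validate: (response: unknown) => boolean; // checks the live API response
}

const countTokensTest: TestCase = {
  request: "Count how many tokens are in 'Hello, world!'",
  expectedCapability: "messages.count_tokens",
  // Passes only if the response carries a numeric input_tokens field.
  validate: (response) =>
    typeof (response as { input_tokens?: number }).input_tokens === "number",
};
```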

The test compares three specification formats: OpenAPI, MAPI (full spec), and MAPI Skills (progressive loading). For each format, we measure:

  • API Success Rate — did the constructed request return a valid response?
  • Capability Accuracy — did the LLM identify the correct capability?
  • Token Usage — total input + output tokens consumed
  • Latency — end-to-end time including API execution
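The four per-format measurements above can be captured in a small record type; this is an illustrative sketch, not the harness's real schema:

```typescript
// Hypothetical shape of one per-format measurement row.
interface FormatResult {
  format: "openapi" | "mapi" | "mapi-skill";
  apiSuccess: boolean;        // did the request return a valid response?
  capabilityCorrect: boolean; // did the LLM pick the right capability?
  tokens: number;             // total input + output tokens
  latencyMs: number;          // end-to-end, including API execution
}

// API success rate over a set of results (e.g. 2 of 3 → 0.667).
function successRate(results: FormatResult[]): number {
  const ok = results.filter((r) => r.apiSuccess).length;
  return ok / results.length;
}
```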

E2E Test Flow

Each test follows this flow:

  1. Phase 1: Intent Matching — LLM identifies which capability handles the request
  2. Phase 2: Request Construction — LLM builds the HTTP request (method, path, body)
  3. Phase 3: API Execution — Request is sent to the live API
  4. Phase 4: Response Validation — Response is checked against expected structure
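The four phases above can be sketched as a single async runner. `matchIntent`, `buildRequest`, and `execute` here are hypothetical stand-ins for the harness's LLM and HTTP plumbing, injected so the flow itself is visible:

```typescript
// Illustrative sketch of the four-phase E2E flow; not the harness's real API.
interface E2ETest {
  request: string;
  expectedCapability: string;
  validate: (response: unknown) => boolean;
}

interface Phases {
  matchIntent: (request: string) => Promise<string>;                       // Phase 1
  buildRequest: (capability: string, request: string) => Promise<unknown>; // Phase 2
  execute: (req: unknown) => Promise<unknown>;                             // Phase 3
}

async function runTest(test: E2ETest, phases: Phases): Promise<boolean> {
  const capability = await phases.matchIntent(test.request);
  if (capability !== test.expectedCapability) return false; // wrong capability
  const req = await phases.buildRequest(capability, test.request);
  const response = await phases.execute(req);
  return test.validate(response); // Phase 4: structural validation
}
```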

Why E2E Testing Matters

Comprehension-only tests ("can the LLM pick the right capability?") don't prove an agent can actually use the API. E2E tests verify the full chain: understanding → construction → execution → validation.

Test Infrastructure

  • Model: Claude Haiku 4.5 (claude-haiku-4-5-20251001)
  • Test harness: TypeScript/Node.js
  • Rate limit handling: Exponential backoff with up to 5 retries; rate-limited tests are skipped, not failed
  • Test scope: Read-only operations only (no mutations)
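The retry policy above (exponential backoff, up to 5 retries, then skip rather than fail) can be sketched as follows; the names and delay constants are illustrative:

```typescript
// Sketch of the rate-limit policy: retry with exponential backoff, and if
// the budget is exhausted, mark the test skipped instead of failed.
type Outcome<T> = { status: "ok"; value: T } | { status: "skipped" };

async function withBackoff<T>(
  fn: () => Promise<T>,
  isRateLimit: (err: unknown) => boolean,
  maxRetries = 5,
  baseDelayMs = 1000,
): Promise<Outcome<T>> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return { status: "ok", value: await fn() };
    } catch (err) {
      if (!isRateLimit(err)) throw err;         // real errors still fail
      if (attempt === maxRetries) break;        // retries exhausted: skip
      const delay = baseDelayMs * 2 ** attempt; // 1s, 2s, 4s, 8s, 16s
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  return { status: "skipped" };                 // rate-limited, not failed
}
```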

APIs Tested

| API | E2E Tests | Test Types |
|---|---|---|
| Anthropic Messages | 3 | Token counting, message creation |
| GitHub REST | 4 | Repository info, issues, pull requests |
| Google Cloud Billing | 3 | Billing accounts, services listing |

Results by API

Anthropic Messages API

Tests: Token counting, message creation with Claude.

| Metric | OpenAPI | MAPI | Skill |
|---|---|---|---|
| API Success | 67% (2/3) | 100% (3/3) | 100% (3/3) |
| Capability Accuracy | 67% | 100% | 100% |

Observation: OpenAPI failed one test because the LLM selected a model name (claude-3-5-sonnet-2024) that doesn't exist in the API. MAPI and Skill specs include explicit model name guidance, avoiding this error.

GitHub REST API

Tests: Repository details, listing issues, listing pull requests.

| Metric | OpenAPI | MAPI | Skill |
|---|---|---|---|
| API Success | N/A (rate limited) | 100% (4/4) | 100% (4/4) |
| Capability Accuracy | N/A | 100% | 100% |

Observation: All 4 OpenAPI tests hit rate limits and were skipped. The large OpenAPI spec (~108K tokens per request) consumes rate limit budget faster than MAPI (~10K tokens) or Skill (~1K tokens). This demonstrates a practical limitation of large specs.

Google Cloud Billing API

Tests: Listing billing accounts, account details, listing services.

| Metric | OpenAPI | MAPI | Skill |
|---|---|---|---|
| API Success | 100% (3/3) | 100% (3/3) | 100% (3/3) |
| Capability Accuracy | 100% | 100% | 100% |

Observation: All three formats achieved 100% success on GCP Billing tests. OpenAPI performed well here because the spec is moderate in size.

Summary

| Metric | OpenAPI | MAPI | Skill |
|---|---|---|---|
| API Success Rate | 83% (5/6)* | 100% (10/10) | 100% (10/10) |
| Capability Accuracy | 50% | 100% | 100% |
| Total Tokens | 108,530 | 102,851 | 10,065 |
| % of OpenAPI Tokens | 100% | 94.8% | 9.3% |
| Avg Latency | 2,020ms | 1,890ms | 1,710ms |

* OpenAPI ran 6 of 10 tests; 4 GitHub tests were skipped due to rate limits (not counted as failures).

MAPI Skills: Progressive Loading

MAPI Skills uses a 2-phase approach that dramatically reduces token usage:

  1. Phase 1: Load the lightweight Skill.md index (~1-3KB) with capability IDs and intent keywords
  2. Phase 2: Match intent to capability, then load only the required capability file (~1KB)
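A minimal sketch of that two-phase load, assuming hypothetical `loadFile` and `matchIntent` helpers and the `Skill.md` / per-capability file layout described above:

```typescript
// Progressive loading sketch: only the lightweight index plus one matched
// capability file enter the prompt, instead of the full spec.
async function loadForRequest(
  request: string,
  loadFile: (path: string) => Promise<string>,
  matchIntent: (index: string, request: string) => Promise<string>,
): Promise<string> {
  // Phase 1: load the lightweight Skill.md index (~1-3KB)
  const index = await loadFile("Skill.md");
  // Phase 2: resolve intent, then load only that capability's file (~1KB)
  const capabilityId = await matchIntent(index, request);
  const capability = await loadFile(`capabilities/${capabilityId}.md`);
  return index + "\n" + capability; // total context stays small
}
```

The file path scheme (`capabilities/<id>.md`) is an assumption for illustration; the point is that context size is bounded by the index plus a single capability file.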

Token Efficiency Results

| Format | Total Tokens | % of OpenAPI | Reduction |
|---|---|---|---|
| OpenAPI | 108,530 | 100% | (baseline) |
| MAPI (Full) | 102,851 | 94.8% | 5.2% |
| MAPI Skill | 10,065 | 9.3% | 90.7% |

Observation: MAPI Skills achieved 90.7% token reduction compared to OpenAPI while maintaining 100% accuracy across all E2E tests. The progressive loading approach loads only what's needed for each request.

When to Use Skills

MAPI Skills is optimal for APIs where token efficiency matters—especially when running many requests or working within rate limits. The 10x token reduction means 10x more API calls within the same budget.
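A quick back-of-envelope check of that budget claim, using the measured per-suite totals from the summary table and a hypothetical 1M-token budget:

```typescript
// Measured per-suite token totals (from the summary table above).
const tokensPerSuite = { openapi: 108_530, mapi: 102_851, skill: 10_065 };

// How many full suite runs fit in a hypothetical 1M-token budget?
const budget = 1_000_000;
const runs = (tokensPerSuite: number) => Math.floor(budget / tokensPerSuite);

// Yields 9 full-suite runs for OpenAPI vs 99 for Skill (about 11x),
// consistent with the ~10.8x per-suite token ratio (108,530 / 10,065).
```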

Challenges Encountered

1. Rate Limiting with Large Specs

OpenAPI's large spec sizes (~108K tokens for GitHub) consume rate limit budgets quickly. All 4 GitHub OpenAPI tests hit rate limits and were skipped, while MAPI and Skill completed successfully. This is a real-world limitation of large specifications.

2. Model Name Hallucination

OpenAPI tests failed when the LLM invented model names not in the actual API (e.g., claude-3-5-sonnet-2024). MAPI specs with explicit model guidance avoided this issue.

3. Asymmetric Test Coverage

Rate limiting caused OpenAPI to run fewer tests (6/10) than MAPI and Skill (10/10). This makes direct comparison difficult—OpenAPI's "83% success" is from a smaller, potentially easier subset.

4. LLM Variance

Results vary between runs due to LLM non-determinism. A test that passes in one run may fail in another. These results represent a single run and should be interpreted directionally.

Limitations

  • Small sample size: Only 10 tests across 3 APIs—results are directional, not statistically significant
  • Single run: No averaging across multiple runs to account for LLM variance
  • Asymmetric coverage: OpenAPI completed 6/10 tests; MAPI/Skill completed 10/10
  • Single model tested: Results are specific to Claude Haiku 4.5; other models may differ
  • Read-only tests only: No mutation operations tested (creates, updates, deletes)
  • Hand-written specs: MAPI/Skill specs were written for this test; quality may vary

Reproducing These Results

The E2E test harness and all specifications are available in the MAPI repository:

```shell
# Clone the repository
git clone https://github.com/jeffrschneider/markdownapi.git
cd markdownapi/tests/harness

# Install dependencies
npm install

# Set your API keys in .env.local (project root), with entries like:
#   ANTHROPIC_API_KEY=your-anthropic-key
#   GITHUB_TOKEN=your-github-token
#   GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcp-credentials.json

# Build and run E2E tests
npm run build
node dist/e2e-runner.js
```

Rate Limits

OpenAPI tests may hit rate limits due to large spec sizes. The test harness implements exponential backoff and skips rate-limited tests rather than failing them.