These are early exploratory findings from a small test set (10 tests across 3 APIs). Results should be interpreted directionally, not as statistically significant. We're sharing them for transparency as we continue to expand the test suite.
Key Observations
- MAPI and MAPI Skills both achieved 100% API success rate across all tests
- MAPI Skills uses only 9.3% of OpenAPI's tokens (90.7% reduction)
- Both MAPI formats achieved 100% capability matching accuracy
- These tests execute real API calls—not just LLM comprehension checks
- OpenAPI GitHub tests were rate-limited due to large spec size (~108K tokens per request)
Methodology
We built an end-to-end test harness that goes beyond LLM comprehension testing. Each test executes the full agent workflow: understanding a natural language request, constructing an HTTP request, and executing it against a live API. This measures real-world usability, not just spec readability.
Test Design
Each test case consists of:
- A natural language request (e.g., "Count how many tokens are in 'Hello, world!'")
- An expected capability (e.g., messages.count_tokens)
- A response validator that checks the actual API response
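A test case can be sketched as a small TypeScript structure. The type and field names below are illustrative, not the harness's actual types:

```typescript
// Illustrative shape of one E2E test case (hypothetical names,
// not the harness's real types).
interface TestCase {
  request: string;                          // natural language request
  expectedCapability: string;               // e.g. "messages.count_tokens"
  validate: (response: unknown) => boolean; // checks the live API response
}

const countTokensTest: TestCase = {
  request: "Count how many tokens are in 'Hello, world!'",
  expectedCapability: "messages.count_tokens",
  validate: (response) =>
    typeof (response as { input_tokens?: number }).input_tokens === "number",
};
```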
The test compares three specification formats: OpenAPI, MAPI (full spec), and MAPI Skills (progressive loading). For each format, we measure:
- API Success Rate — did the constructed request return a valid response?
- Capability Accuracy — did the LLM identify the correct capability?
- Token Usage — total input + output tokens consumed
- Latency — end-to-end time including API execution
E2E Test Flow
Each test follows this flow:
- Phase 1: Intent Matching — LLM identifies which capability handles the request
- Phase 2: Request Construction — LLM builds the HTTP request (method, path, body)
- Phase 3: API Execution — Request is sent to the live API
- Phase 4: Response Validation — Response is checked against expected structure
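The four phases above can be sketched as a single pipeline, where each phase's output feeds the next. The function names and signatures here are assumptions for illustration, not the harness's real API:

```typescript
// Sketch of the four-phase E2E flow (names are illustrative).
type HttpRequest = { method: string; path: string; body?: unknown };

async function runE2ETest(
  request: string,
  matchIntent: (req: string) => Promise<string>,                    // Phase 1: LLM picks a capability
  buildRequest: (cap: string, req: string) => Promise<HttpRequest>, // Phase 2: LLM builds the call
  execute: (http: HttpRequest) => Promise<unknown>,                 // Phase 3: hit the live API
  validate: (response: unknown) => boolean                          // Phase 4: structural check
): Promise<{ capability: string; passed: boolean }> {
  const capability = await matchIntent(request);
  const http = await buildRequest(capability, request);
  const response = await execute(http);
  return { capability, passed: validate(response) };
}
```

Because validation runs on the live API's response, a test only passes when every phase in the chain succeeds.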
Comprehension-only tests ("can the LLM pick the right capability?") don't prove an agent can actually use the API. E2E tests verify the full chain: understanding → construction → execution → validation.
Test Infrastructure
- Model: Claude Haiku 4.5 (claude-haiku-4-5-20251001)
- Test harness: TypeScript/Node.js
- Rate limit handling: Exponential backoff with up to 5 retries; rate-limited tests are skipped, not failed
- Test scope: Read-only operations only (no mutations)
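The retry policy above can be sketched as follows. This is a minimal illustration of exponential backoff with up to 5 retries and a "skipped" (not failed) outcome on exhaustion; the delays and the 429 check are assumptions, not the harness's exact implementation:

```typescript
// Minimal backoff sketch: retry rate-limited calls with doubling
// delays; after the retry budget is spent, report "skipped" rather
// than a failure. Other errors are rethrown unchanged.
async function withBackoff<T>(
  call: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 1000
): Promise<{ status: "ok"; value: T } | { status: "skipped" }> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return { status: "ok", value: await call() };
    } catch (err) {
      const rateLimited = (err as { status?: number }).status === 429;
      if (!rateLimited) throw err;             // real failures still fail
      if (attempt === maxRetries) return { status: "skipped" };
      // Wait 1s, 2s, 4s, 8s, 16s before the next attempt.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  return { status: "skipped" }; // unreachable; satisfies the type checker
}
```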
APIs Tested
| API | E2E Tests | Test Types |
|---|---|---|
| Anthropic Messages | 3 | Token counting, message creation |
| GitHub REST | 4 | Repository info, issues, pull requests |
| Google Cloud Billing | 3 | Billing accounts, services listing |
Results by API
Anthropic Messages API
Tests: Token counting, message creation with Claude.
| Metric | OpenAPI | MAPI | Skill |
|---|---|---|---|
| API Success | 67% (2/3) | 100% (3/3) | 100% (3/3) |
| Capability Accuracy | 67% | 100% | 100% |
Observation: OpenAPI failed one test because the LLM selected a model name (claude-3-5-sonnet-2024) that doesn't exist in the API. MAPI and Skill specs include explicit model name guidance, avoiding this error.
GitHub REST API
Tests: Repository details, listing issues, listing pull requests.
| Metric | OpenAPI | MAPI | Skill |
|---|---|---|---|
| API Success | N/A (rate limited) | 100% (4/4) | 100% (4/4) |
| Capability Accuracy | N/A | 100% | 100% |
Observation: All 4 OpenAPI tests hit rate limits and were skipped. The large OpenAPI spec (~108K tokens per request) consumes rate limit budget faster than MAPI (~10K tokens) or Skill (~1K tokens). This demonstrates a practical limitation of large specs.
Google Cloud Billing API
Tests: Listing billing accounts, account details, listing services.
| Metric | OpenAPI | MAPI | Skill |
|---|---|---|---|
| API Success | 100% (3/3) | 100% (3/3) | 100% (3/3) |
| Capability Accuracy | 100% | 100% | 100% |
Observation: All three formats achieved 100% success on GCP Billing tests. OpenAPI performed well here because the spec is moderate in size.
Summary
| Metric | OpenAPI | MAPI | Skill |
|---|---|---|---|
| API Success Rate | 83% (5/6)* | 100% (10/10) | 100% (10/10) |
| Capability Accuracy | 83% (5/6)* | 100% | 100% |
| Total Tokens | 108,530 | 102,851 | 10,065 |
| % of OpenAPI Tokens | 100% | 94.8% | 9.3% |
| Avg Latency | 2,020ms | 1,890ms | 1,710ms |
* OpenAPI ran 6 of 10 tests; 4 GitHub tests were skipped due to rate limits (not counted as failures).
MAPI Skills: Progressive Loading
MAPI Skills uses a 2-phase approach that dramatically reduces token usage:
- Phase 1: Load the lightweight Skill.md index (~1-3KB) with capability IDs and intent keywords
- Phase 2: Match intent to capability, then load only the required capability file (~1KB)
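The two phases above can be illustrated with a small matcher. In the real harness the intent matching is done by the LLM; this keyword-based version (with hypothetical types) only illustrates the data flow of loading an index first and resolving a single capability ID second:

```typescript
// Hypothetical shape of the lightweight Skill.md index: capability
// IDs plus intent keywords (not the MAPI spec's actual schema).
interface SkillIndex {
  capabilities: { id: string; keywords: string[] }[];
}

// Phase 2 sketch: match the request's words against index keywords
// and return only the matching capability's ID, so that just that
// one capability file needs to be loaded.
function matchCapability(index: SkillIndex, request: string): string | null {
  const words = request.toLowerCase().split(/\W+/);
  for (const cap of index.capabilities) {
    if (cap.keywords.some((k) => words.includes(k))) return cap.id;
  }
  return null; // no match: the agent would fail or fall back
}
```

The token savings come from never loading capability files that the matched intent doesn't need.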
Token Efficiency Results
| Format | Total Tokens | % of OpenAPI | Reduction |
|---|---|---|---|
| OpenAPI | 108,530 | 100% | — |
| MAPI (Full) | 102,851 | 94.8% | 5.2% |
| MAPI Skill | 10,065 | 9.3% | 90.7% |
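The percentage columns follow directly from the token counts; for instance, 10,065 / 108,530 ≈ 9.3%, leaving a 90.7% reduction. A one-liner reproduces the table's rounding:

```typescript
// Recompute "% of OpenAPI" from raw token counts, rounded to one
// decimal place as in the table above.
function pctOfOpenAPI(tokens: number, openApiTokens = 108_530): number {
  return Math.round((tokens / openApiTokens) * 1000) / 10;
}
```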
Observation: MAPI Skills achieved 90.7% token reduction compared to OpenAPI while maintaining 100% accuracy across all E2E tests. The progressive loading approach loads only what's needed for each request.
MAPI Skills is the strongest fit when token efficiency matters—especially when running many requests or working under rate limits. The roughly 10x token reduction translates to roughly 10x as many API calls within the same token budget.
Challenges Encountered
1. Rate Limiting with Large Specs
OpenAPI's large spec sizes (~108K tokens for GitHub) consume rate limit budgets quickly. All 4 GitHub OpenAPI tests hit rate limits and were skipped, while MAPI and Skill completed successfully. This is a real-world limitation of large specifications.
2. Model Name Hallucination
OpenAPI tests failed when the LLM invented model names not in the actual API (e.g., claude-3-5-sonnet-2024). MAPI specs with explicit model guidance avoided this issue.
3. Asymmetric Test Coverage
Rate limiting caused OpenAPI to run fewer tests (6/10) than MAPI and Skill (10/10). This makes direct comparison difficult—OpenAPI's "83% success" is from a smaller, potentially easier subset.
4. LLM Variance
Results vary between runs due to LLM non-determinism. A test that passes in one run may fail in another. These results represent a single run and should be interpreted directionally.
Limitations
- Small sample size: Only 10 tests across 3 APIs—results are directional, not statistically significant
- Single run: No averaging across multiple runs to account for LLM variance
- Asymmetric coverage: OpenAPI completed 6/10 tests; MAPI/Skill completed 10/10
- Single model tested: Results are specific to Claude Haiku 4.5; other models may differ
- Read-only tests only: No mutation operations tested (creates, updates, deletes)
- Hand-written specs: MAPI/Skill specs were written for this test; quality may vary
Reproducing These Results
The E2E test harness and all specifications are available in the MAPI repository:
```shell
# Clone the repository
git clone https://github.com/jeffrschneider/markdownapi.git
cd markdownapi/tests/harness

# Install dependencies
npm install

# Set your API keys in .env.local (project root)
ANTHROPIC_API_KEY=your-anthropic-key
GITHUB_TOKEN=your-github-token
GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcp-credentials.json

# Build and run E2E tests
npm run build
node dist/e2e-runner.js
```
OpenAPI tests may hit rate limits due to large spec sizes. The test harness implements exponential backoff and skips rate-limited tests rather than failing them.