AUDIO PROGRAMMING AI SCOREBOARD

Audio Gauntlet V2

which AI models can actually build JUCE plugins, Tone.js instruments, AudioWorklet processors, and audio CLI tools?

10 models tested · 101 checks per run · 10 tasks · last run: 2026-04-10
#1 OVERALL
Claude 4 Sonnet
Anthropic · 89/101 checks passed
creator score: 83.7
#   Model            Provider   Checks  Latency  Cost     Score  Note
1   Claude 4 Sonnet  Anthropic  89/101  24.2s    $0.0420  83.7   best accuracy
2   Llama 3.3 70B    Groq       71/101  4.8s     $0.0038  82.6   fastest
3   GPT-4o           OpenAI     84/101  19.8s    $0.0480  80.9
4   Gemini 2.5 Pro   Google     80/101  17.4s    $0.0350  80.7
5   DeepSeek V3      DeepSeek   75/101  21.0s    $0.0062  80.2   best value
6   Claude 4 Haiku   Anthropic  66/101  6.2s     $0.0085  79.5
7   Qwen 2.5 72B     Alibaba    69/101  18.6s    $0.0110  76.9
8   GPT-4o Mini      OpenAI     62/101  7.1s     $0.0045  76.4
9   Grok 3 Mini      xAI        63/101  11.2s    $0.0090  74.9
10  Mistral Large    Mistral    65/101  16.8s    $0.0220  74.3

METRIC BREAKDOWN — TOP 5

Model            score  accuracy  speed  stability  cost  creative  efficiency
Claude 4 Sonnet  83.7   92        68     95         58    93        72
Llama 3.3 70B    82.6   76        95     82         94    71        88
GPT-4o           80.9   88        72     91         55    86        70
Gemini 2.5 Pro   80.7   85        76     88         62    82        75
DeepSeek V3      80.2   80        70     85         91    74        82

TEST SUITE — 10 TASKS

every model runs the same 10 tasks across 4 categories. 101 automated checks evaluate correctness, completeness, and creative quality.

Plugin Dev: JUCE C++ plugin development, DSP, real-time audio

JUCE Drive Plugin (12 checks)
Build a JUCE AudioProcessor with APVTS, processBlock, soft clipping, and smoothed parameters.

JUCE FDN Reverb (10 checks)
Implement a feedback delay network reverb with size, decay, and damping controls.

DSP Zipper Fix (10 checks)
Diagnose and fix zipper noise in a delay feedback automation loop.

Web Audio: Tone.js, AudioWorklet, browser-based audio

Tone.js Synth Sequence (9 checks)
Build a Tone.js polysynth with a sequencer pattern, effects chain, and transport controls.

AudioWorklet Processor (8 checks)
Create a custom AudioWorkletProcessor for real-time gain with parameter automation.

Tone.js Drum Machine (10 checks)
Build a browser drum machine with Tone.js Players, step sequencer, and tempo control.

Audio Tools: CLI tools, batch processing, file analysis

WAV Batch Normalizer (11 checks)
Node.js CLI that scans, analyzes peak/RMS, and normalizes WAV files to 48 kHz mono.

Spectrum Analyzer (9 checks)
Real-time FFT spectrum analyzer with Web Audio API and canvas visualization.

Plugin UI: plugin interfaces, controls, meters, accessibility

Plugin UI (HTML) (12 checks)
Single-file HTML plugin interface with knobs, meters, bypass, and keyboard focus.

Plugin UI (React) (10 checks)
React component for a plugin control surface with parameter state and real-time meter.

METHODOLOGY

scoring

creator score = (0.35 × accuracy) + (0.15 × speed) + (0.20 × stability) + (0.10 × cost) + (0.10 × creative) + (0.10 × efficiency)

accuracy measures check pass rate. speed scores latency against per-task targets. stability penalizes refusals, placeholders, and truncation. creative fit rewards domain-appropriate choices.
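the weighted sum above can be checked against the metric breakdown for the top 5. this is a minimal Python sketch, not the suite's own code: the function name `creator_score` and the `TOP_5` table layout are made up for illustration, and rounding half-up to one decimal is an assumption (the board does not state its rounding rule).

```python
# Weights in whole percent, taken from the creator score formula above.
WEIGHTS = {
    "accuracy": 35,
    "speed": 15,
    "stability": 20,
    "cost": 10,
    "creative": 10,
    "efficiency": 10,
}

def creator_score(metrics: dict) -> float:
    """Weighted sum of sub-scores, rounded half-up to one decimal.

    Integer arithmetic avoids floating-point surprises at .x5 boundaries.
    The half-up rounding rule is an assumption, not stated on the board.
    """
    hundredths = sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)
    tenths = (hundredths + 5) // 10  # round half-up to one decimal place
    return tenths / 10

# Sub-scores copied from the METRIC BREAKDOWN section.
TOP_5 = {
    "Claude 4 Sonnet": {"accuracy": 92, "speed": 68, "stability": 95,
                        "cost": 58, "creative": 93, "efficiency": 72},
    "Llama 3.3 70B":   {"accuracy": 76, "speed": 95, "stability": 82,
                        "cost": 94, "creative": 71, "efficiency": 88},
    "GPT-4o":          {"accuracy": 88, "speed": 72, "stability": 91,
                        "cost": 55, "creative": 86, "efficiency": 70},
    "Gemini 2.5 Pro":  {"accuracy": 85, "speed": 76, "stability": 88,
                        "cost": 62, "creative": 82, "efficiency": 75},
    "DeepSeek V3":     {"accuracy": 80, "speed": 70, "stability": 85,
                        "cost": 91, "creative": 74, "efficiency": 82},
}

for model, metrics in TOP_5.items():
    print(f"{model}: {creator_score(metrics)}")
```

under that rounding assumption, the five results match the published scores (83.7, 82.6, 80.9, 80.7, 80.2).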

evaluation

all checks are regex pattern matches against the raw response text. no human scoring, no LLM-as-judge. this keeps scoring reproducible for a given response and free from evaluator bias. we check for specific API usage, safety patterns, and structural completeness.
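to make the regex approach concrete, here is a small Python sketch of what checks for the AudioWorklet Processor task could look like. the patterns, the check names, and the `run_checks` helper are hypothetical and are not the suite's actual checks; only the Web Audio identifiers they look for (`AudioWorkletProcessor`, `registerProcessor`, `parameterDescriptors`, `process`) are real API names.

```python
import re

# Hypothetical checks: each maps a readable name to a regex that is
# searched for anywhere in the model's raw response text.
AUDIOWORKLET_CHECKS = {
    "extends AudioWorkletProcessor": r"class\s+\w+\s+extends\s+AudioWorkletProcessor",
    "registers the processor": r"registerProcessor\s*\(",
    "declares parameterDescriptors": r"static\s+get\s+parameterDescriptors",
    "implements process()": r"\bprocess\s*\(\s*\w+\s*,\s*\w+\s*,\s*\w+\s*\)",
    "keeps the node alive": r"return\s+true",
}

def run_checks(response: str, checks: dict) -> dict:
    """Return {check name: passed?} for one raw response."""
    return {
        name: re.search(pattern, response) is not None
        for name, pattern in checks.items()
    }

# A stand-in response, written for this example only.
SAMPLE = '''
class GainProcessor extends AudioWorkletProcessor {
  static get parameterDescriptors() {
    return [{ name: "gain", defaultValue: 1, minValue: 0, maxValue: 2 }];
  }
  process(inputs, outputs, parameters) {
    // per-sample gain omitted in this sample
    return true;
  }
}
registerProcessor("gain-processor", GainProcessor);
'''

results = run_checks(SAMPLE, AUDIOWORKLET_CHECKS)
passed = sum(results.values())
print(f"{passed}/{len(results)} checks passed")
```

a check like this confirms that a pattern is present in the text. it does not compile or run the generated code, so a pass means the expected structure appears in the response, not that the plugin works.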

fairness

every model gets the same system prompt and user prompt. temperature is fixed at 0.7. max tokens is set per task. cost is calculated from provider pricing at time of run. we run each model once; variance analysis is coming in V3.
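a cost figure derived from provider pricing can be computed from token counts, roughly as in the Python sketch below. everything here is an assumption for illustration: the `run_cost` helper, the per-million-token pricing model, and the prices and token counts are hypothetical and do not correspond to any provider or to the numbers on this board.

```python
def run_cost(input_tokens: int, output_tokens: int,
             usd_per_million_in: float, usd_per_million_out: float) -> float:
    """Cost in USD for one request, given per-million-token prices."""
    return (input_tokens * usd_per_million_in
            + output_tokens * usd_per_million_out) / 1_000_000

# Hypothetical request: 1,200 prompt tokens and 2,500 completion tokens,
# priced at a made-up $3 per million input and $15 per million output tokens.
cost = run_cost(1_200, 2_500, usd_per_million_in=3.0, usd_per_million_out=15.0)
print(f"${cost:.4f}")
```

because prices change over time, a figure like this only holds for the pricing snapshot used on the run date.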

want your model on this board? witchaudiostudios@gmail.com

the benchmark suite is open source. run it yourself at internal/audio-benchmark-core