AUDIO PROGRAMMING AI SCOREBOARD

Audio Gauntlet V2

which AI models can actually build JUCE plugins, Tone.js instruments, AudioWorklet processors, and audio CLI tools?

10 models tested · 101 checks per run · 10 tasks · last run: 2026-04-10

#1 OVERALL

Claude 4 Sonnet

Anthropic — 89/101 checks passed

83.7

creator score

#	Model	Provider	Checks	Latency	Cost	Score
1	Claude 4 Sonnet (best accuracy)	Anthropic	89/101	24.2s	$0.0420	83.7
2	Llama 3.3 70B (fastest)	Groq	71/101	4.8s	$0.0038	82.6
3	GPT-4o	OpenAI	84/101	19.8s	$0.0480	80.9
4	Gemini 2.5 Pro	Google	80/101	17.4s	$0.0350	80.7
5	DeepSeek V3 (best value)	DeepSeek	75/101	21.0s	$0.0062	80.2
6	Claude 4 Haiku	Anthropic	66/101	6.2s	$0.0085	79.5
7	Qwen 2.5 72B	Alibaba	69/101	18.6s	$0.0110	76.9
8	GPT-4o Mini	OpenAI	62/101	7.1s	$0.0045	76.4
9	Grok 3 Mini	xAI	63/101	11.2s	$0.0090	74.9
10	Mistral Large	Mistral	65/101	16.8s	$0.0220	74.3

METRIC BREAKDOWN — TOP 5

Model	Score	Accuracy	Speed	Stability	Cost	Creative	Efficiency
Claude 4 Sonnet	83.7	92	68	95	58	93	72
Llama 3.3 70B	82.6	76	95	82	94	71	88
GPT-4o	80.9	88	72	91	55	86	70
Gemini 2.5 Pro	80.7	85	76	88	62	82	75
DeepSeek V3	80.2	80	70	85	91	74	82

TEST SUITE — 10 TASKS

every model runs the same 10 tasks across 4 categories. 101 automated checks evaluate correctness, completeness, and creative quality.

Plugin Dev: JUCE C++ plugin development, DSP, real-time audio

JUCE Drive Plugin

Build a JUCE AudioProcessor with APVTS, processBlock, soft clipping, and smoothed parameters.

12 checks

JUCE FDN Reverb

Implement a feedback delay network reverb with size, decay, and damping controls.

10 checks

DSP Zipper Fix

Diagnose and fix zipper noise in a delay feedback automation loop.

10 checks

Web Audio: Tone.js, AudioWorklet, browser-based audio

Tone.js Synth Sequence

Build a Tone.js polysynth with a sequencer pattern, effects chain, and transport controls.

9 checks

AudioWorklet Processor

Create a custom AudioWorkletProcessor for real-time gain with parameter automation.

8 checks

Tone.js Drum Machine

Build a browser drum machine with Tone.js Players, step sequencer, and tempo control.

10 checks

Audio Tools: CLI tools, batch processing, file analysis

WAV Batch Normalizer

Node.js CLI that scans, analyzes peak/RMS, and normalizes WAV files to 48kHz mono.

11 checks

Spectrum Analyzer

Real-time FFT spectrum analyzer with Web Audio API and canvas visualization.

9 checks

Plugin UI: plugin interfaces, controls, meters, accessibility

Plugin UI (HTML)

Single-file HTML plugin interface with knobs, meters, bypass, and keyboard focus.

12 checks

Plugin UI (React)

React component for a plugin control surface with parameter state and real-time meter.

10 checks

METHODOLOGY

scoring

creator score = (0.35 × Accuracy) + (0.15 × Speed) + (0.20 × Stability) + (0.10 × Cost) + (0.10 × Creative) + (0.10 × Efficiency). accuracy measures check pass rate. speed scores latency against per-task targets. stability penalizes refusals, placeholders, and truncation. creative fit rewards domain-appropriate choices.
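the weighted sum above can be sketched in a few lines. this is a minimal illustration, not the suite's own code: the function name `creator_score` and the dictionary layout are assumptions, and the metric values are copied from the top-5 breakdown on this page rather than derived from raw results.

```python
# Weights copied from the creator score formula; they sum to 1.0.
WEIGHTS = {
    "accuracy": 0.35,
    "speed": 0.15,
    "stability": 0.20,
    "cost": 0.10,
    "creative": 0.10,
    "efficiency": 0.10,
}


def creator_score(metrics: dict[str, float]) -> float:
    """Return the unrounded weighted sum of the six 0-100 metric scores.

    `creator_score` is a hypothetical helper name used for illustration.
    """
    missing = WEIGHTS.keys() - metrics.keys()
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)


# Metric values as published in the top-5 breakdown.
claude_4_sonnet = {
    "accuracy": 92, "speed": 68, "stability": 95,
    "cost": 58, "creative": 93, "efficiency": 72,
}
llama_33_70b = {
    "accuracy": 76, "speed": 95, "stability": 82,
    "cost": 94, "creative": 71, "efficiency": 88,
}

for name, metrics in [("Claude 4 Sonnet", claude_4_sonnet),
                      ("Llama 3.3 70B", llama_33_70b)]:
    print(f"{name}: {creator_score(metrics):.2f}")
```

for Claude 4 Sonnet the terms are 32.2 + 10.2 + 19.0 + 5.8 + 9.3 + 7.2 = 83.7, which matches the leaderboard. Llama 3.3 70B works out to 82.55, which agrees with the published 82.6 after rounding to one decimal place.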

evaluation

all checks are regex pattern matches against the raw response text. no human scoring, no LLM-as-judge. this keeps scoring deterministic for a given response and free from evaluator bias. we check for specific API usage, safety patterns, and structural completeness.
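a regex check of this kind might look like the sketch below. the check names and patterns are hypothetical, written for the AudioWorklet Processor task as an illustration; they are not the suite's actual checks. the identifiers they look for (`AudioWorkletProcessor`, `parameterDescriptors`, `process`, `registerProcessor`) are real Web Audio API names.

```python
import re

# Hypothetical checks for illustration only; the real suite's patterns
# are not shown on this page.
CHECKS = {
    "extends AudioWorkletProcessor":
        r"class\s+\w+\s+extends\s+AudioWorkletProcessor\b",
    "declares parameterDescriptors":
        r"static\s+get\s+parameterDescriptors\s*\(",
    "implements process(inputs, outputs, parameters)":
        r"\bprocess\s*\(\s*\w+\s*,\s*\w+\s*,\s*\w+\s*\)",
    "calls registerProcessor":
        r"\bregisterProcessor\s*\(",
}


def run_checks(response: str) -> dict[str, bool]:
    """Match every pattern against the raw response text."""
    return {
        name: re.search(pattern, response) is not None
        for name, pattern in CHECKS.items()
    }


# A stand-in for a model response (JavaScript source as plain text).
SAMPLE_RESPONSE = """
class GainProcessor extends AudioWorkletProcessor {
  static get parameterDescriptors() {
    return [{ name: "gain", defaultValue: 1, minValue: 0, maxValue: 2 }];
  }
  process(inputs, outputs, parameters) {
    return true;
  }
}
registerProcessor("gain-processor", GainProcessor);
"""

results = run_checks(SAMPLE_RESPONSE)
passed = sum(results.values())
print(f"{passed}/{len(results)} checks passed")
```

because the checks only inspect text, the same response always gets the same result, but a pattern match says nothing about whether the code compiles or runs.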

fairness

every model gets the same system prompt and user prompt. temperature is fixed at 0.7. max tokens is set per task. cost is calculated from provider pricing at time of run. we run each model once, so scores carry no variance estimate; variance analysis is planned for V3.
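the cost calculation can be sketched as below, assuming the provider bills per million input and output tokens. that billing model, the function name `request_cost_usd`, and every number in the example are placeholders for illustration, not prices or token counts from the actual run.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Cost of one request under per-million-token pricing (an assumption)."""
    total = (input_tokens * usd_per_million_input
             + output_tokens * usd_per_million_output)
    return total / 1_000_000


# Placeholder values only: 1,000 prompt tokens and 2,000 completion tokens
# at made-up prices of $1.00 and $2.00 per million tokens.
example_cost = request_cost_usd(1_000, 2_000, 1.00, 2.00)
print(f"${example_cost:.4f}")
```

since prices are read at the time of the run, the cost column is only comparable across models within the same run.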

want your model on this board? witchaudiostudios@gmail.com

the benchmark suite is open source. run it yourself at internal/audio-benchmark-core