Working title: name TBD next session
MIDI in. Diatonic-button-accordion sheet music + tablature out.
Drop a MIDI (one of the pre-baked stems from the research folder, or your own). Get a preview, a staff render, and a button-tab placeholder. Audio ingestion + LLM mapping land later; for now this exercises the render half of the pipeline.
Paste your DeepSeek API key (sk-...) to enable Stage 6 mapping. Stored in localStorage only: never committed to the repo, never sent anywhere except api.deepseek.com.
Want production-grade? Replace this with a Cloudflare Worker holding the key; page-side input is the dev shortcut.
In scope: norteño · cumbia · conjunto · Tex-Mex · vallenato (diatonic-button-accordion-led repertoire from the Mexican / Latin-American tradition). Andy's actual playing world.
Out of scope: everything else. Rock with accordion (System of a Down, Modest Mouse, Arcade Fire): won't work, accordion isn't the lead. Zydeco / Cajun: different tradition, different button conventions. Polish / Italian / Eastern European folk polka: different repertoire. Generic Western pop: accordion isn't there. Trying to be a universal MIDI-to-tab tool dilutes the model and the dataset; staying narrow is the value proposition.
Tuning the basic-pitch params, picking the source-separation model, building the button-map dataset, choosing the quantization grid: every decision optimizes for norteño / cumbia. Accordion-led, ~120-150 BPM, polka-ish or cumbia-ish rhythmic feel, GCF tuning as the canonical instrument.
Each stage of the pipeline is a swappable pedal. The chain order is the build sequence; individual pedals get tuned, swapped, or A/B-tested without touching the rest. This isn't an architectural diagram; it's a working metaphor for how the project gets built.
[Source audio]
→ [2: Normalize → 22kHz mono WAV]
→ [3: Source-sep htdemucs → "other" stem] (audio → audio)
   └─ (optional) [3.5: UVR Mel-Band Roformer cleanup pass] (audio → audio)
→ [4: Transcribe basic-pitch → raw MIDI] (audio → MIDI)
   └─ (optional) [4.5: Quantize to detected beat grid] (MIDI → MIDI)
→ [5: Configure (tuning + tier + hand assignment)]
→ [6: Map (LLM + button-map dataset)]
→ [7: Render (VexFlow staff + tab strip SVG)]
→ [8: Export (PDF / MusicXML / ABC)]
Stages 3.5 (UVR audio cleanup, before transcribe) and 4.5 (MIDI quantize, after transcribe) are optional pedals: turn them on / off per song. Default ON for norteño / cumbia where they help; default OFF for clean studio recordings where they may hurt. The decimal label tells you which domain the pedal operates in: 3.5 is in the audio domain (htdemucs and basic-pitch sandwich it), 4.5 is in the MIDI domain (basic-pitch produces it, configure consumes it).
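The pedal metaphor can be sketched as a list of stage objects where optional pedals are simply bypassed when switched off. This is a minimal illustration, not the project's code: the `Pedal` class, `run_chain`, `build_chain`, and the string-tagging stand-ins for the real stages are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pedal:
    label: str                    # stage label from the chain, e.g. "3.5"
    name: str
    run: Callable[[str], str]     # takes an artifact, returns the next artifact
    optional: bool = False
    enabled: bool = True


def run_chain(pedals: List[Pedal], source: str) -> str:
    """Run pedals in chain order; a disabled optional pedal is bypassed."""
    artifact = source
    for pedal in pedals:
        if pedal.optional and not pedal.enabled:
            continue
        artifact = pedal.run(artifact)
    return artifact


def tag(step: str) -> Callable[[str], str]:
    """Stand-in for a real stage: just records that the step ran."""
    return lambda artifact: f"{artifact}>{step}"


def build_chain(clean_studio_recording: bool) -> List[Pedal]:
    """Optional pedals default ON, and OFF for clean studio recordings."""
    optional_on = not clean_studio_recording
    return [
        Pedal("2", "normalize", tag("normalize")),
        Pedal("3", "source-sep", tag("htdemucs")),
        Pedal("3.5", "uvr-cleanup", tag("uvr"), optional=True, enabled=optional_on),
        Pedal("4", "transcribe", tag("basic-pitch")),
        Pedal("4.5", "quantize", tag("quantize"), optional=True, enabled=optional_on),
    ]


print(run_chain(build_chain(clean_studio_recording=False), "audio"))
```

Swapping a pedal means replacing one list entry; nothing else in the chain has to know.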
Andy texted from a stalled Google AI Mode session: he was looking for sheet music for Gallo de Pelea (commonly transcribed in Mi/E). Google's AI Mode kicked him to YouTube tutorials. No usable sheet music returned.
Two product decisions emerged. First, output isn't tab-only; it's tiered, and the user picks the friction level. Second, Spanish-language tutorial dominance is the real underlying barrier for English-dominant self-learners; written notation outputs are language-neutral, and that's a feature.
Listener finding 2026-05-05 while auditioning the Forge alpha: "musicians ever say fuck it to measures?" Yes, they do, and it's the right move for tab tiers. Guitar tab, accordion tab, blues / folk lead-sheet traditions all routinely drop measures and time signatures, presenting the music as a sequence of presses with phrase breaks for orientation. Pickup notes / anacrusis require no special handling because there's no downbeat for them to fall before.
This is also what makes tab forgiving of messy transcription. basic-pitch's per-onset jitter and demucs's stem bleed introduce timing imperfections that look bad on a measured staff but are invisible in tab: the tab reader cares about button order, not which 16th-of-which-beat. Our pipeline (basic-pitch → quantize → tab) plays well into this: even when the transcription is rough, the tab output reads cleanly because it doesn't claim measure-perfection.
Locked decision: tab tiers (Harmony, Full) render measureless with a phrase break every 16 notes for visual orientation. Universal tier (standard staff) keeps measures and time signatures; that's where formal notation rules apply. Pickups in the Universal tier still need special handling (anacrusis bar shorter than the time signature implies); pickups in tab tiers do not.
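A minimal sketch of the measureless layout, assuming each press has already been rendered to a short text token; the `phrase_lines` helper and the token format are hypothetical, and only the 16-note break comes from the decision above.

```python
from typing import List, Sequence


def phrase_lines(presses: Sequence[str], per_phrase: int = 16) -> List[str]:
    """Group a flat sequence of button presses into phrases of fixed length.

    No measures or time signature are involved: the break is purely visual.
    """
    if per_phrase < 1:
        raise ValueError("per_phrase must be at least 1")
    return [
        " ".join(presses[i:i + per_phrase])
        for i in range(0, len(presses), per_phrase)
    ]


# 35 placeholder presses produce two full phrases and one short tail.
for line in phrase_lines([str(n) for n in range(1, 36)]):
    print(line)
```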
User picks per song or per session. The system always computes the underlying button mapping; the rendering layer chooses what to show.
| Tier | Output | Audience | Build |
|---|---|---|---|
| Universal | Standard treble + bass clef notation, no tab | Players who want notation transferable across instruments. Andy's stated preference. Highest learning curve, highest long-term reward. | Easy: existing tools (MuseScore, VexFlow) already do this. Ship first. |
| Harmony tab | Treble standard + left-hand bass clef mapped to one of 12 button presses | Players who can read treble but get stuck decoding bass-clef chord stacks into button presses. The original Andy insight. | Medium: LLM mapping work, but only for one hand. The actual differentiator. |
| Full tab | Both hands as button + bellows tab, "guitar hero" style | Lowest-friction first-play. Locks the player into one instrument. Power-user / fastest on-ramp. | Hard: bellows-direction planning across phrases is the hardest sub-problem. Ship last. |
A diatonic button accordion is not a piano. Each button plays one note when the bellows are pushed and a different note when they are pulled.
So a given pitch may be reachable on row-2 button-3 push and also on row-1 button-5 pull. The player has to plan bellows direction across whole phrases; you cannot reverse the bellows on every sixteenth note. This is a constraint-satisfaction problem keyed on the instrument's tuning, the available buttons, and the bellows-continuity rule.
LLMs are good at this kind of constrained-mapping work when you give them the button-layout table as fixed reference. The dataset is the moat; the LLM is the thin wrapper.
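To make the constraint concrete, here is a small dynamic-programming sketch that picks one button option per note so the bellows reverse as rarely as possible. It rests on loud assumptions: `BUTTON_MAP` is a made-up fragment, not a real GCF chart, and "fewest reversals" is a simplification of bellows continuity (it ignores air supply and fingering distance). It illustrates the shape of the problem, not the project's mapper.

```python
from typing import Dict, List, Tuple

Option = Tuple[int, int, str]  # (row, button, "push" or "pull")

# HYPOTHETICAL layout fragment for illustration only; not a real GCF chart.
BUTTON_MAP: Dict[int, List[Option]] = {
    67: [(2, 3, "push"), (1, 5, "pull")],
    69: [(2, 3, "pull")],
    71: [(2, 4, "push"), (1, 6, "pull")],
    72: [(2, 4, "pull"), (3, 3, "push")],
}


def plan_bellows(pitches: List[int]) -> List[Option]:
    """Choose one option per pitch, minimising the number of bellows reversals."""
    if not pitches:
        return []
    layers = []
    for pitch in pitches:
        if pitch not in BUTTON_MAP:
            raise KeyError(f"pitch {pitch} is not playable on this layout")
        layers.append(BUTTON_MAP[pitch])

    # best[i][k] = (reversals so far, index of the chosen previous option)
    best = [[(0, -1) for _ in layers[0]]]
    for i in range(1, len(layers)):
        row = []
        for opt in layers[i]:
            candidates = [
                (best[i - 1][k][0] + int(prev[2] != opt[2]), k)
                for k, prev in enumerate(layers[i - 1])
            ]
            row.append(min(candidates))
        best.append(row)

    # Backtrack from the cheapest final option.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    plan = []
    for i in range(len(layers) - 1, -1, -1):
        plan.append(layers[i][k])
        k = best[i][k][1]
    return plan[::-1]


def count_reversals(plan: List[Option]) -> int:
    """Number of adjacent presses whose bellows direction differs."""
    return sum(a[2] != b[2] for a, b in zip(plan, plan[1:]))


plan = plan_bellows([67, 69, 71, 72])
print(plan, count_reversals(plan))
```

A deterministic planner like this could also serve as a check on LLM output: if the model's plan has many more reversals than the minimum, that is a signal to retry.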
@tonejs/midi in browser, mido in Python.
Primary persona: beginner-to-intermediate diatonic-button-accordion players, particularly self-taught conjunto / norteño / Tex-Mex players in the U.S. and diaspora. The Hohner Panther in GCF is the canonical entry-level instrument; if we serve Andy, we serve the modal user.
The Spanish-tutorial barrier. Andy's other observation in the convo: "a ton of tutorial vids are in Spanish." This is a real friction point for English-dominant self-learners. Written notation outputs are language-neutral, and that's a feature, not a bug. A Spanish UI mode is a low-effort stretch goal that doubles the addressable audience in the other direction (Spanish-dominant players also lack a high-quality MIDI-to-notation tool tuned to their repertoire).
Original spec treated audio-in as Phase-N+ with legal review. Then Spotify's basic-pitch surfaced: a free, open-source audio-to-MIDI transcriber that runs entirely in the browser via TensorFlow.js. Audio never leaves the user's machine. Spotify (a major label-adjacent company) shipped it openly and didn't get sued. The "audio-in is risky" framing was wrong. Same legal posture as a DAW that imports audio: user supplies the source, tool transforms.
Real-world MIDI hunt experience confirmed why this matters: Karaoplay's La Chona MIDI had an audible flat note, MuseScore charged Pro for export, free archives are sparse for niche repertoire. Asking users to find a clean MIDI is asking them to do work that doesn't scale. Audio is what they actually have.
yt-dlp for power users. The audio comes back into our tool client-side.
The build strategy: each stage gets CLI-tested first, then ported to the page once it works. Statuses below are live as of 2026-05-04.
Status legend: Placeholder · CLI-tested · Wired-up · Live
User-supplied audio. Source from anywhere: YouTube via Cobalt (one-click) or yt-dlp (power users). Spotify-Premium download. Personal recording. Owned MP3.
Why off-tool: YouTube ToS forbids ripping on our domain. Cobalt + yt-dlp let other people own that legal risk while our tool stays clean.
yt-dlp -f 18 \
  --extractor-args "youtube:player_client=android,ios,tv_embedded" \
  -o "in.%(ext)s" "<youtube-url>"
Future on page: file drop-zone (mp3/mp4/m4a/wav/ogg) + Cobalt deep-link button + yt-dlp instructions snippet.
Convert whatever format we got (mp4 from YouTube, m4a from Spotify, etc.) into a known-good WAV: 22.05 kHz, mono, 16-bit PCM. This is what basic-pitch expects natively, and it cuts file size for the client.
ffmpeg -y -i in.mp4 -vn -acodec pcm_s16le -ar 22050 -ac 1 out.wav
Future on page: bundled imageio-ffmpeg binary (Python) on the worker side, or ffmpeg.wasm client-side. Browser-side is preferred (audio stays local).
This is the differentiator. Almost any YouTube video should work because we isolate the accordion stem from the full mix before transcribing. Without this stage, basic-pitch returns chord-salad MIDI for full-band recordings (vocals + drums + bass + accordion all mashed together; confirmed against EZ Band La Chona, output was unusable).
Tool: Demucs (Meta, open source). Runs the htdemucs model: separates a mix into vocals · drums · bass · other. The "other" stem is mostly accordion + bajo sexto for conjunto recordings. Feed THAT to basic-pitch.
Don't over-clean: the rhythm scaffold is a feature. Listener finding 2026-05-04: htdemucs "other" stem includes bajo sexto + rhythm guitar alongside accordion, and that's useful. A solo player practicing along to the chart needs harmonic context; they're not going to play the accordion melody against silence. The MIDI captures the rhythm-string parts as a backing scaffold, the player adapts and takes the voice they're filling. Goal of source separation is "minus vocals + drums" (so the player can hear themselves over the practice track), NOT "isolate accordion lead in pristine isolation." Stage 3.5 (UVR cleanup) should be tuned with the same restraint: strip residual vocals and percussion, keep the rhythm-string scaffolding.
python -m demucs --two-stems=other -o stems/ in.wav
# → stems/htdemucs/in/other.wav (the accordion-ish stem)
CLI-tested 2026-05-04 against EZ Band La Chona: 5:56 runtime on the 3:23 audio. Isolated stem fed back through basic-pitch produced 1114 note-on events vs 1743 from the full mix (36% fewer events) with the pitch floor up from MIDI 28 to 40 (drums and bass no longer captured). First listener verdict: melody drifted on repeated phrases. Currently exploring stricter basic-pitch params (Stage 4) and the heavier htdemucs_ft model (4-stem, fine-tuned, ~4× slower) before considering alternatives.
Source-separation landscape (verified 2026-05-04):
| Project | Status (2024-26) | Browser | License | Pick |
|---|---|---|---|---|
| Demucs / htdemucs | upstream archived Jan 2025; fork at adefossez/demucs live | yes: Mixxx GSoC 2025 ONNX export shipping | MIT (code + weights) | ✅ default |
| UVR + MDX-Net / Mel-Band Roformer | very active | no (Python only) | mixed; many weights non-commercial | ⚠️ great offline, license-audit per model |
| BS-Roformer / Mel-Band Roformer | active monthly weight drops | not verified in browser | MIT code / NC weights | ⚠️ best SDR but no clean commercial+browser path yet |
| Spleeter | dormant (dep bumps only, no new models since 2019) | partial, unofficial | MIT | ❌ outdated |
| Open-Unmix | low activity | stale TF.js | MIT code / NC for high-perf weights | ❌ outdated |
| LALAL.AI / Moises / AudioShake | active commercial | server-side only | proprietary, per-stem billing | ⚠️ fallback if client-side stalls |
Community-report-only (not formally benchmarked): two-stage pipeline of htdemucs → UVR Mel-Band Roformer "instrumental" model on the "other" stem reportedly preserves reed/wind timbres better for non-Western instruments. May test as Stage 3.5 if htdemucs_ft alone falls short.
Future on page: Mixxx GSoC 2025 produced a clean ONNX export of htdemucs (October 2025); that's the browser upgrade path. Run via onnxruntime-web with WebGPU/WASM. For now: server-side via Cloudflare Pages Function or Render. Demucs is CPU-bound and RAM-heavy on long files.
Audio-to-audio cleanup. Community two-stage trick: feed the htdemucs "other" stem into a UVR Mel-Band Roformer instrumental model for a second separation pass. Reportedly preserves reed/wind timbres better than htdemucs alone: accordion lead lines come out sharper, bajo sexto bleed gets reduced.
Status placeholder until we test against EZ Band La Chona. The licensing path needs auditing: many UVR community-trained weights are CC-BY-NC, not commercial. If the quality bump is real, may pay for that constraint with separate hosting.
python -m audio_separator stems/htdemucs/.../other.wav \
  --model_filename "Kim_MelBandRoformer_FT.ckpt" \
  --output_dir stems-roformer/
Future on page: off-by-default toggle. Add to chain only when norteño / cumbia material has heavy bajo sexto bleed in the htdemucs stem.
Run basic-pitch (Spotify, open source) on the isolated accordion stem. Polyphonic note detection, runs on TensorFlow.js or ONNX runtime. Audio never leaves the machine.
CLI tested against EZ Band La Chona full mix this session: 14.4KB output, 1743 note-on events over 201 seconds, MIDI 28-89 pitch range. The technical pipeline works; quality was poor because we hadn't separated stems yet (see Stage 3).
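python -c "
import os
import basic_pitch
from basic_pitch.inference import predict_and_save

# Path to the ONNX weights inside the installed basic_pitch package.
onnx = os.path.join(os.path.dirname(basic_pitch.__file__),
                    'saved_models', 'icassp_2022', 'nmp.onnx')
# Positional flags, in order: save_midi, sonify_midi, save_model_outputs, save_notes.
predict_and_save(['stems/htdemucs/in/other.wav'], '.', True, False, False, False, onnx)
"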
Tunable inference params matter on accordion: onset_threshold, frame_threshold, minimum_note_length, multiple_pitch_bends=False, min/max_frequency. First test with default (loose) params produced melody drift on repeated phrases: the model picks up bellows-pressure micro-pitch shifts and reed-chorus harmonics, transcribing identical phrases differently each repeat. Stricter thresholds (0.7-0.85 onset, 0.5-0.6 frame, 200ms min length, 130-1500 Hz band) drop note count 40-90% but cure the drift. Optimal params per repertoire is an open tuning question.
Alt to explore: Melodia (MTG / Essentia), a monophonic melody-tracking algorithm, decades old but still robust for single-instrument lead lines. If basic-pitch's polyphonic detection keeps producing drift on accordion melody, Melodia on a vocal-stem-isolated track would be a cleaner fallback for the right-hand melody specifically (left-hand harmony still needs polyphonic for chord work). Test pending; basic-pitch parameter tuning comes first.
Future on page: @spotify/basic-pitch (TS port, browser TF.js). Drop-in for the CLI command, runs entirely in the page. Param sliders exposed as advanced UI for users who want to tune for their instrument.
Our own pedal, not in any other audio-to-MIDI tool we've seen. basic-pitch outputs notes with sample-accurate timing, which means jitter on every onset. Identical phrases get rendered slightly differently each repeat: the sound of "drift" the user heard on first audition.
Fix: detect tempo + beat grid from the source audio (librosa.beat.beat_track), then snap every MIDI note's onset and offset to the nearest sub-beat grid tick. Drop notes shorter than ~80ms (transcription noise). Merge same-pitch notes within a 30ms gap. Works because cumbia and norteño have stable tempo and clean onsets, so beat tracking is reliable on this repertoire (genre-scope dependency: would NOT work on rubato art-music).
python scripts/quantize.py audio.wav input.mid output.mid --subdiv 4
# subdiv=4 → 16th-note grid at detected tempo
# subdiv=8 → real 16ths when tracker grabbed half-tempo (common on cumbia)
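The snap / drop / merge logic can be sketched in plain Python. This is not the contents of scripts/quantize.py, which is not shown here. It assumes a constant tempo (`bpm`) and a known `first_beat` time standing in for the beat tracker's output, and it assumes an order of operations (merge, then drop, then snap) that the real script may not share. The `Note` type and the `quantize` signature are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Note(NamedTuple):
    onset: float   # seconds
    offset: float  # seconds
    pitch: int     # MIDI note number


def quantize(
    notes: List[Note],
    bpm: float,
    subdiv: int = 4,          # grid ticks per detected beat
    first_beat: float = 0.0,  # time of the first beat, in seconds
    min_len: float = 0.080,   # drop notes shorter than this
    merge_gap: float = 0.030, # merge same-pitch notes closer than this
) -> List[Note]:
    tick = 60.0 / bpm / subdiv

    # 1. Merge same-pitch notes separated by a tiny gap.
    merged: List[Note] = []
    last_by_pitch: Dict[int, int] = {}
    for note in sorted(notes):
        idx = last_by_pitch.get(note.pitch)
        if idx is not None and note.onset - merged[idx].offset <= merge_gap:
            prev = merged[idx]
            merged[idx] = prev._replace(offset=max(prev.offset, note.offset))
        else:
            last_by_pitch[note.pitch] = len(merged)
            merged.append(note)

    # 2. Drop notes too short to be real (transcription noise).
    kept = [n for n in merged if n.offset - n.onset >= min_len]

    # 3. Snap onsets and offsets to the nearest grid tick.
    def snap(t: float) -> float:
        return first_beat + round((t - first_beat) / tick) * tick

    out = []
    for n in kept:
        onset = snap(n.onset)
        offset = max(snap(n.offset), onset + tick)  # keep at least one tick
        out.append(Note(onset, offset, n.pitch))
    return out


raw = [
    Note(0.01, 0.24, 67),
    Note(0.26, 0.30, 69),   # 40 ms: dropped
    Note(0.51, 0.60, 71),
    Note(0.62, 0.74, 71),   # 20 ms gap to the previous 71: merged
]
print(quantize(raw, bpm=120, subdiv=4))
```

At 120 BPM with subdiv=4 the tick is 0.125 s, so the jittered onsets at 0.01 and 0.51 land on 0.0 and 0.5.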
CLI-tested 2026-05-04 against EZ Band La Chona stems. htdemucs stem (medium params): 645 → 527 notes (subdiv=4) / 532 (subdiv=8, real 16ths) / 528 (subdiv=16, real 32nds). htdemucs_ft stem: 692 → 563 (subdiv=4) / 391 (subdiv=2 = quarter grid) / 571 (subdiv=8) / 571 (subdiv=16). Note count past 16ths plateaus because the 80ms drop-short floor caps how short a note can be regardless of grid resolution.
Listener finding (2026-05-04): real-32nd grid (subdiv=16) sounds more natural than real-16th on the EZ Band La Chona reference. The tab strip itself is largely rhythmless (button-press order matters more than exact timing), but 32nd quantization captures the order of pickup notes / anacrusis / fast runs that 16th grid loses by snapping them onto the wrong beat. For the audio playback / Universal-tier staff render, 32nds win. Per-tier defaults: Tab tiers (Harmony / Full) → subdiv=16 (32nds, preserves pickup order). Universal tier (standard staff) → subdiv=8 (real 16ths, cleaner staff readability) with subdiv=16 available as advanced option for ornament-heavy material.
Future on page: grid-subdiv slider (8th / 16th / 32nd / triplet) defaulted by tier, tempo override box (skip auto-detect; librosa often grabs half-tempo on cumbia), drop-short and merge-window sliders. Live A/B preview against the un-quantized output.
User picks: tuning (GCF default, as on Andy's Panther), rows (3-row default, as on the Panther), output tier (Universal / Harmony tab / Full tab), and hand-track assignment (auto-split by octave or manual marking).
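One simple reading of "auto-split by octave" is a fixed pitch threshold. The sketch below assumes that reading; the `split_hands` name and the default split at MIDI 60 (middle C) are hypothetical choices for illustration, and manual marking would override the result.

```python
from typing import List, Tuple


def split_hands(pitches: List[int], split_point: int = 60) -> Tuple[List[int], List[int]]:
    """Return (right_hand, left_hand); pitches at or above split_point go right."""
    right = [p for p in pitches if p >= split_point]
    left = [p for p in pitches if p < split_point]
    return right, left


right, left = split_hands([43, 48, 67, 60, 72, 55])
print(right, left)
```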
Future on page: form panel with selects + radio buttons. State persists across runs via localStorage so the user doesn't reconfigure every song.
The LLM API hop, provider-agnostic by design. Send {notes, tuning, rows, tier} + the GCF button-map dataset. Model returns: right-hand button + bellows assignments, left-hand 12-button assignments, notation as ABC or MusicXML.
Validator pass rejects buttons that don't exist on the user's instrument; a deterministic post-processor repairs physically-infeasible bellows reversals.
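A validator of this kind might look like the sketch below. Everything instrument-specific is an assumption: `ROW_SIZES` is a placeholder to be replaced by the real button-map dataset, the `Press` shape is hypothetical, and the 0.25 s threshold for a "too fast" reversal is an arbitrary illustrative value. The second function only flags suspicious reversals; the repair step itself is not sketched.

```python
from typing import List, NamedTuple


class Press(NamedTuple):
    time: float    # onset in seconds
    row: int
    button: int
    bellows: str   # "push" or "pull"


# HYPOTHETICAL instrument description; load from the button-map dataset instead.
ROW_SIZES = {1: 10, 2: 11, 3: 10}


def validate(presses: List[Press]) -> List[str]:
    """Return a list of problems; an empty list means the mapping is playable."""
    errors = []
    for i, p in enumerate(presses):
        if p.row not in ROW_SIZES:
            errors.append(f"press {i}: row {p.row} does not exist")
        elif not 1 <= p.button <= ROW_SIZES[p.row]:
            errors.append(f"press {i}: row {p.row} has no button {p.button}")
        if p.bellows not in ("push", "pull"):
            errors.append(f"press {i}: unknown bellows direction {p.bellows!r}")
    return errors


def rapid_reversals(presses: List[Press], min_gap: float = 0.25) -> List[int]:
    """Indices where the bellows flip sooner than min_gap seconds after the previous press."""
    return [
        i
        for i in range(1, len(presses))
        if presses[i].bellows != presses[i - 1].bellows
        and presses[i].time - presses[i - 1].time < min_gap
    ]


mapping = [
    Press(0.0, 2, 3, "push"),
    Press(0.1, 1, 5, "pull"),   # reversal after only 0.1 s
    Press(1.0, 4, 1, "pull"),   # row 4 does not exist
    Press(1.5, 3, 11, "push"),  # row 3 has only 10 buttons
]
print(validate(mapping))
print(rapid_reversals(mapping))
```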
Provider default: DeepSeek (V3 / Coder). Auto-prompt-caching, ~10x cheaper than Claude for this kind of structured constraint-mapping at parity quality. Premium fallback: Claude (Sonnet or Haiku) when DeepSeek's output validation fails or for higher-stakes Full-tab work where bellows planning is gnarly. Both are OpenAI-or-Anthropic-shaped APIs; the worker is a thin switch.
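The "thin switch" can be sketched as a loop over providers that falls through to the next one when a call fails or its output does not validate. The provider callables here are stubs, not real API clients; `map_with_fallback` and the payload and result shapes are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Callable[[dict], dict]        # request payload -> parsed mapping
Validator = Callable[[dict], List[str]]  # mapping -> list of problems


def map_with_fallback(
    payload: dict,
    providers: Sequence[Tuple[str, Provider]],
    validate: Validator,
) -> Tuple[str, dict]:
    """Try providers in order; return the first result that passes validation."""
    failures = []
    for name, call in providers:
        try:
            result = call(payload)
        except Exception as exc:  # network error, bad JSON, rate limit, ...
            failures.append(f"{name}: {exc}")
            continue
        problems = validate(result)
        if not problems:
            return name, result
        failures.append(f"{name}: {'; '.join(problems)}")
    raise RuntimeError("all providers failed: " + " | ".join(failures))


# Stub providers standing in for real API calls.
def cheap_stub(payload: dict) -> dict:
    return {"presses": []}  # empty mapping, rejected by the validator below


def premium_stub(payload: dict) -> dict:
    return {"presses": [{"row": 2, "button": 3, "bellows": "push"}]}


def non_empty(result: dict) -> List[str]:
    return [] if result.get("presses") else ["no presses returned"]


used, mapping = map_with_fallback(
    {"notes": [67], "tuning": "GCF", "rows": 3, "tier": "full"},
    [("deepseek", cheap_stub), ("claude", premium_stub)],
    non_empty,
)
print(used, mapping)
```

Keeping validation inside the loop means a provider that answers quickly but wrongly costs one extra call rather than a bad chart.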
Future on page: single fetch() to a Cloudflare Worker that holds the API keys (multi-provider). Prompt-cache the button-map (static per tuning): DeepSeek caches automatically, Anthropic with cache_control. Blocked on Phase 1 of the phasing plan: the GCF button-map dataset (~86 entries) needs to exist first.
VexFlow paints the standard staff (treble + bass clef). Custom SVG paints the tablature strip below: numbered buttons + push/pull arrows for the right hand, 1-of-12 button index for the left hand. Layout matches the user's selected tier.
Future on page: <div id="staff"> for VexFlow, <div id="tab"> for the tab-strip SVG. Side-by-side print layout for paper output.
Browser print-to-PDF (zero work, looks fine on letter / A4). MusicXML download for users who want to load into MuseScore for further editing. ABC notation copy-button for embeds.
Future on page: three buttons. Print is just window.print() with print-stylesheet. MusicXML is Blob + download attribute. ABC is clipboard.
| Layer | Pick | Why |
|---|---|---|
| Frontend rendering | VexFlow + custom tab-strip SVG | Well-maintained, the tab strip is custom anyway. |
| MIDI parsing | @tonejs/midi | ESM, no build step, clean API. |
| LLM (default) | DeepSeek API (V3 / Coder) | Auto-prompt-caching, ~10x cheaper than Claude for structured constraint-mapping. Cost matters at hobby-project scale. |
| LLM (fallback) | Claude API (Sonnet / Haiku) | Premium tier for Full-tab bellows planning where DeepSeek validation fails. Worker switches providers on retry. |
| API key gate | Cloudflare Worker | Already DLz infra. Keeps the key off the page. |
| PDF export | Browser print first | Zero work. Upgrade to pdf-lib only if quality is bad. |
Product quality is gated on accurate button-layout data per tuning. Manual data entry, but small in scale:
A weekend of careful data entry from manufacturer charts (Hohner, Gabbanelli) and community resources. Without this dataset, hallucination risk is too high to ship. Phase 1 = GCF only. Other tunings added one at a time as audience demand surfaces.
Phasing follows the tier order: ship the easy tier first, prove the loop, then layer the differentiator. Audio-in is now in-scope from Phase 2 thanks to basic-pitch.
| Phase | Tier | Deliverable | Estimate |
|---|---|---|---|
| 1 | - | GCF Panther button-map dataset (right + left hand) | 1 sitting |
| 2 | Ingestion | Browser drop-zone: accept .mp3 / .wav (basic-pitch TF.js) AND .mid (@tonejs/midi). Show note-list extracted. | 2 sittings |
| 3 | - | Cobalt link button + yt-dlp instructions snippet on the page. No ripping on our domain. | 1 sitting |
| 4 | Universal | VexFlow standard-staff rendering | 1 sitting |
| 5 | Universal | PDF export, ship as v0.1 | 1 sitting |
| 6 | Harmony tab | Claude prompt: bass-clef chord → 1-of-12 button | 1 sitting |
| 7 | Harmony tab | Render harmony-tab strip below treble staff | 1 sitting |
| 8 | Full tab | Right-hand button + bellows assignment (LLM + validator) | 2 sittings |
| 9 | Full tab | Bellows-continuity post-processor | 1 sitting |
| 10 | - | Add FBbEb / EAD tunings as demand surfaces | 1 sitting per tuning |
| 11 | - | Spanish UI mode | 1 sitting |
Andy's foundational learning song is La Chona (Los Tucanes de Tijuana, 1995). "My Twinkle Twinkle Little Star," in his words. He specifically prefers the EZ Band cover for our reference test (recent, but his pick). That audio file is the Phase 2 acceptance test: EZ Band La Chona (audio) → basic-pitch (MIDI) → Universal-tier render (standard staff). If the output reads as recognizably "La Chona" to Andy, Phase 2 is acceptance-tested.
The engineering spec lives at accordion-tab-spec.md in the repo root. Same content, less HTML.