
Helicopter View of Audio ML

00:00 - 00:00 | Thursday 30th October 2025
Beginner

Audio machine learning can seem overwhelming: so many model types, representations, and tasks - and no clear map. This session provides a structured overview to help you make sense of it all and build intuition for how different parts fit together.

We begin by looking at what models are actually used for. Tasks such as classification, transcription, generation, and transformation shape not just the training targets, but the flow of data between modalities - audio-to-text, text-to-audio, audio-to-audio, and so on. These task-modality pairs define the shape of the problem and, by extension, influence what types of models are suitable.
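To make the framing concrete, here is a toy Python sketch; the task names and modality pairs are illustrative examples chosen for this writeup, not a taxonomy from the session:

    # A toy map from example tasks to their (input, output) modalities.
    # The groupings are illustrative, not a fixed taxonomy.
    TASK_MODALITIES = {
        "classification":    ("audio", "label"),
        "transcription":     ("audio", "text"),
        "text_to_speech":    ("text",  "audio"),
        "source_separation": ("audio", "audio"),
    }

    # The (input, output) pair is the "shape of the problem" the session
    # refers to: it narrows which model families are suitable.
    for task, (src, dst) in TASK_MODALITIES.items():
        print(f"{task}: {src} -> {dst}")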

This framing also introduces one of the central trade-offs in audio ML: how much context does a task need, and how is that context managed? Some tasks rely only on short-term input; others require memory, recurrence, or attention mechanisms to track long-term structure.
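As a minimal numpy sketch of that trade-off (purely illustrative, not session material): a fixed local window caps how far back a frame can see, while self-attention lets every frame attend to the whole sequence:

    import numpy as np

    # Toy sequence of 8 frames with 4 features each; values are arbitrary.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 4))

    # Short-term context: each output frame only sees a fixed local window.
    local = np.stack([x[max(0, t - 2):t + 1].mean(axis=0) for t in range(len(x))])

    # Long-term context: single-head self-attention lets every frame weigh
    # every other frame, so the context length is no longer hard-coded.
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    attended = weights @ x

    print(local.shape, attended.shape)  # both (8, 4)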

Once the task and modality framing is in place, we examine how audio is represented inside models - waveforms, spectrograms, tokens - and what these formats enable or constrain.
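A rough, self-contained numpy illustration of those three formats; the test tone, FFT sizes, and crude per-frame "tokens" are assumptions made for demonstration only:

    import numpy as np

    # Waveform: one second of a 440 Hz tone as a stand-in signal.
    sr = 16000
    t = np.arange(sr) / sr
    wave = np.sin(2 * np.pi * 440 * t)

    # Spectrogram: magnitude STFT built from windowed, overlapping frames.
    n_fft, hop = 512, 256
    frames = np.stack([wave[i:i + n_fft] * np.hanning(n_fft)
                       for i in range(0, len(wave) - n_fft, hop)])
    spec = np.abs(np.fft.rfft(frames, axis=1))

    # "Tokens": a crude discretization, here just each frame's loudest bin.
    tokens = spec.argmax(axis=1)

    print(wave.shape, spec.shape, tokens.shape)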

Only then do we turn to the model architectures themselves: CNNs, RNNs, Transformers, diffusion models, and hybrids. Each comes with its own strengths, structure, and computational properties.
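For a flavour of one such family in code, here is a minimal PyTorch convolutional classifier over raw waveforms; the layer sizes and the ten-class output are arbitrary choices for this sketch, not a model from the session:

    import torch
    import torch.nn as nn

    # A minimal CNN over raw audio: stacked strided convolutions extract
    # local features, then pooling collapses the time axis for a classifier.
    model = nn.Sequential(
        nn.Conv1d(1, 16, kernel_size=9, stride=4),
        nn.ReLU(),
        nn.Conv1d(16, 32, kernel_size=9, stride=4),
        nn.ReLU(),
        nn.AdaptiveAvgPool1d(1),   # collapse the time axis
        nn.Flatten(),
        nn.Linear(32, 10),         # ten hypothetical classes
    )

    x = torch.randn(1, 1, 16000)   # one second of fake audio at 16 kHz
    print(model(x).shape)          # torch.Size([1, 10])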

Finally, we return to the system-level view: how models can be composed into larger chains or graphs. Some systems pass data between models at runtime; others embed models inside larger models and train them jointly. These structures open up powerful design options - modularity, reuse, and flexible transfer across tasks and domains.
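A toy sketch of the runtime-chaining case; the stage functions below are stand-ins invented for illustration, not real components:

    # Each "model" is just a function here, standing in for a real component.
    def denoise(audio):          # audio-to-audio stage
        return [s * 0.9 for s in audio]

    def transcribe(audio):       # audio-to-text stage
        return f"<transcript of {len(audio)} samples>"

    def pipeline(data, stages):
        # Data flows model-to-model at runtime; each stage is swappable,
        # which is what makes chained systems modular and reusable.
        for stage in stages:
            data = stage(data)
        return data

    print(pipeline([0.1, -0.2, 0.3], [denoise, transcribe]))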

The focus is conceptual: a clean overview that clarifies the terrain rather than diving into implementation. A starting point for navigating the audio ML space with purpose.

Martin Swanholm

CTO

Hindenburg Systems

Martin is a software developer and DSP engineer with over 30 years of experience, currently focusing on practical, real-world applications of machine learning in audio. His work emphasizes getting the most out of available hardware and compute resources, ensuring solutions are efficient and accessible to a wide range of users. He is developing tools for audio restoration, such as phase-coherent frequency-domain models and multi-task learning models that improve speech offline or interactively in real time.

Martin’s journey in digital audio began in the 1990s, and over the years he’s worked on everything from basic signal processing to full multimedia systems. His approach is rooted in pragmatism: using techniques that work, whether simple or advanced, to solve real problems.

Martin excels at breaking down complex concepts into clear, actionable steps, which makes his presentations especially valuable for beginners looking to understand audio processing with machine learning. He’s committed to showing how practical, tried-and-true methods can yield strong results without cutting-edge hardware or specialist expertise, keeping his sessions approachable for all skill levels.
