Python Templates for Neural Image Classification and Spectral Audio Processing
Lightning Hydra Template Extended and Neural Spectral Modeling Template
This presentation introduces two open-source research frameworks for neural image classification and spectral audio processing: (1) the Lightning Hydra Template Extended (LHTE) and (2) the Neural Spectral Modeling Template (NSMT). The LHTE extends the widely used PyTorch Lightning + Hydra template with state-of-the-art architectures (CNNs, ConvNeXt, EfficientNet, Vision Transformers) and expanded dataset support, adding CIFAR-10, CIFAR-100, and a new generalized Variable Image Multi-Head (VIMH) format. VIMH accommodates extremely large image/channel dimensions and multi-head tasks, and supports both classification and regression from a single shared backbone. The LHTE also provides reproducible benchmark experiments and systematic workflows for rapid model comparison.
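The shared-backbone, multi-head idea mentioned above can be sketched as follows. This is a minimal illustration in plain PyTorch, not the actual LHTE code; the layer sizes, head names, and head count are hypothetical:

```python
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    """Illustrative shared backbone feeding two task heads:
    one classification head and one regression head."""

    def __init__(self, num_classes: int = 10, num_params: int = 4):
        super().__init__()
        # Shared feature extractor (a tiny CNN stand-in for a real backbone)
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Per-task heads operating on the same shared features
        self.cls_head = nn.Linear(16, num_classes)   # class logits
        self.reg_head = nn.Linear(16, num_params)    # continuous targets

    def forward(self, x: torch.Tensor):
        z = self.backbone(x)
        return self.cls_head(z), self.reg_head(z)

model = MultiHeadNet()
logits, params = model(torch.randn(2, 1, 64, 64))  # batch of 2 single-channel images
```

In this pattern, one forward pass through the backbone serves both tasks, and each head contributes its own loss term during training.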
Built upon the LHTE, the NSMT specializes in spectral audio modeling, where stacked spectrograms and other 2D audio representations serve as image-like inputs. By leveraging the perceptual inductive priors of human hearing, the NSMT avoids the computational expense of end-to-end waveform modeling while maintaining high accuracy. Applications include synthesizer parameter estimation (tested on sawtooth oscillators and Moog VCFs with ADSR envelopes), instrument recognition, and real-time effect control. The NSMT emphasizes small, efficient architectures, extended spectral representations, auxiliary conditioning inputs, and enhanced VIMH support for audio-specific datasets.
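The "stacked spectrograms as image-like inputs" idea can be illustrated with a short NumPy sketch. This is hypothetical preprocessing, not the NSMT's actual pipeline; the frame sizes and the choice of channels (magnitude and log-magnitude) are assumptions for illustration:

```python
import numpy as np

def stft_magnitude(x: np.ndarray, n_fft: int = 512, hop: int = 128) -> np.ndarray:
    """Magnitude STFT via windowed frames, returned as (freq, time)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1)).T

# One second of a 220 Hz sawtooth, the kind of oscillator signal
# mentioned above as a parameter-estimation test case
sr = 16000
t = np.arange(sr) / sr
saw = 2.0 * ((220.0 * t) % 1.0) - 1.0

mag = stft_magnitude(saw)
log_mag = np.log1p(mag)

# Stack two spectral views into a multi-channel "image":
# shape (channels, freq_bins, time_frames), ready for a 2D CNN
image = np.stack([mag, log_mag])
```

Once audio is in this form, the same image-classification backbones and VIMH multi-head machinery apply unchanged.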
Together, the LHTE and NSMT form robust, reproducible platforms for advancing machine learning research at the intersection of vision and audio. Code, datasets, and other resources are available online for immediate adoption.

Julius Smith
Professor Emeritus
Stanford University
Julius O. Smith is a research engineer, educator, and musician devoted primarily to developing new technologies for music and audio signal processing. He received the B.S.E.E. degree from Rice University in 1975 (Control, Circuits, and Communication), and the M.S. and Ph.D. degrees in E.E. from Stanford University in 1978 and 1983, respectively. For his M.S. in E.E., he focused largely on statistical signal processing. His Ph.D. research was devoted to improved methods for digital filter design and system identification applied to music and audio systems, particularly the violin. From 1975 to 1977 he worked in the Signal Processing Department at ESL, Sunnyvale, CA, on systems for digital communications. From 1982 to 1986 he was with the Adaptive Systems Department at Systems Control Technology, Palo Alto, CA, where he worked in the areas of adaptive filtering and spectral estimation. From 1986 to 1991 he was employed at NeXT Computer, Inc., responsible for sound, music, and signal processing software for the NeXT computer workstation. After NeXT, he became a Professor at the Center for Computer Research in Music and Acoustics (CCRMA) at Stanford, with a courtesy appointment in EE, teaching courses and pursuing/supervising research related to signal processing techniques applied to music and audio systems. At varying part-time levels, he was a founding consultant for Staccato Systems, Shazam Inc., and moForte Inc. He is presently a Professor Emeritus of Music and, by courtesy, Electrical Engineering at Stanford, and a perennial consultant for moForte Inc. and a few others. For more information, see https://ccrma.stanford.edu/~jos/.