Although Coordinate-MLP-based implicit neural representations have excelled in representing radiance fields, 3D shapes, and images, their application to audio signals remains underexplored. To fill this gap, we investigate existing implicit neural representations, from which we extract 3 types of positional encoding and 16 commonly used activation functions. Through combinatorial design, we establish the first benchmark for Coordinate-MLPs in audio signal representations. Our benchmark reveals that Coordinate-MLPs require complex hyperparameter tuning and frequency-dependent initialization, limiting their robustness. To address these issues, we propose Fourier-ASR, a novel framework based on the Fourier series theorem and the Kolmogorov-Arnold representation theorem. Fourier-ASR introduces Fourier Kolmogorov-Arnold Networks (Fourier-KAN), which leverage periodicity and strong nonlinearity to represent audio signals, eliminating the need for additional positional encoding. Furthermore, a Frequency-adaptive Learning Strategy (FaLS) is proposed to enhance the convergence of Fourier-KAN by capturing high-frequency components and preventing overfitting of low-frequency signals. Extensive experiments conducted on natural speech and music datasets reveal that: (1) well-designed positional encoding and activation functions in Coordinate-MLPs can effectively improve audio representation quality; and (2) Fourier-ASR can robustly represent complex audio signals without extensive hyperparameter tuning. Looking ahead, the continuity and infinite resolution of implicit audio representations make our research highly promising for tasks such as audio compression, synthesis, and generation.
In contrast to local feature-based representations, Neural Amplitude Fields, a specific instance of implicit neural representations for audio signals, take time coordinates as input and regress the corresponding amplitudes, thereby encoding the audio signal in the weights of a neural network. This parameterized representation is continuously differentiable and decoupled from the sampling resolution, so audio signals can be processed precisely at any resolution. It holds potential for applications in audio denoising, synthesis, generation, and related fields.
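As a rough illustration of the idea, the sketch below maps time coordinates to amplitudes with a small, untrained coordinate network and queries the same weights on two different time grids. The layer sizes, the sine activation, the `omega0` value, and the helper names `init_mlp` and `amplitude_field` are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Create random (untrained) weights for a small fully connected network."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def amplitude_field(params, t, omega0=30.0):
    """Map time coordinates t (shape [N]) to amplitudes (shape [N])."""
    h = t.reshape(-1, 1)
    for w, b in params[:-1]:
        h = np.sin(omega0 * (h @ w + b))  # sine activation (assumed choice)
    w, b = params[-1]
    return (h @ w + b).ravel()            # linear output layer

rng = np.random.default_rng(0)
params = init_mlp([1, 64, 64, 1], rng)

# The signal lives in `params`, so it can be sampled on any time grid.
t_8k = np.linspace(-1.0, 1.0, 8000)
t_48k = np.linspace(-1.0, 1.0, 48000)
y_8k = amplitude_field(params, t_8k)
y_48k = amplitude_field(params, t_48k)
```

In practice the weights would be fitted by regressing the network output onto the recorded samples; the point here is only that one set of weights can be evaluated at an arbitrary sampling density.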
Our benchmark draws on existing implicit neural representations, from which we extract 3 types of positional encoding and 16 commonly used activation functions. Combining them yields the first benchmark for Coordinate-MLPs in audio signal representation.
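To make the ingredients of such a combination concrete, here is a minimal sketch of one positional encoding and two activation functions. It assumes RFF denotes random Fourier features with Gaussian-sampled frequencies, and it uses common textbook forms for the Gaussian and sine activations. The scale `sigma=10`, the widths `a` and `omega0`, and all function names are hypothetical choices for illustration.

```python
import numpy as np

def rff_encode(t, freqs):
    """Random Fourier features: [cos(2*pi*b*t), sin(2*pi*b*t)] for each frequency b."""
    proj = 2.0 * np.pi * t.reshape(-1, 1) * freqs.reshape(1, -1)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

def gaussian_act(x, a=1.0):
    """Gaussian-type activation (assumed form): exp(-x^2 / (2 a^2))."""
    return np.exp(-0.5 * (x / a) ** 2)

def sine_act(x, omega0=30.0):
    """Sine-type activation (assumed form): sin(omega0 * x)."""
    return np.sin(omega0 * x)

rng = np.random.default_rng(0)
freqs = rng.normal(0.0, 10.0, size=16)  # sigma=10 is a hypothetical scale
t = np.linspace(-1.0, 1.0, 5)
feats = rff_encode(t, freqs)            # shape [5, 32]: 16 cosines then 16 sines
```

A benchmark entry then amounts to feeding `feats` into an MLP whose hidden layers use one of the activation functions.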
Conclusions:
RFF is better suited to Gaussian-type activation functions (~3 dB ↑ in SNR). Conversely, NeFF employs Fourier mappings, which are more compatible with Sine-type activation functions (~9 dB ↑ in SNR).
[Figure: benchmark reconstruction results, 36 panels, SNR from -7.66 dB to 42.39 dB]
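The SNR figures quoted in this benchmark can be read against the standard definition of signal-to-noise ratio in decibels. The sketch below assumes that definition (signal energy over reconstruction-error energy); the page does not state the exact formula used, and `snr_db` is a hypothetical helper name.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB between a reference signal and its reconstruction."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    noise_power = np.sum((reference - estimate) ** 2)
    if noise_power == 0.0:
        return float("inf")  # perfect reconstruction
    return 10.0 * np.log10(signal_power / noise_power)

# Synthetic 440 Hz tone standing in for an audio clip.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2.0 * np.pi * 440.0 * t)

snr_silent = snr_db(clean, np.zeros_like(clean))  # all-zero output: 0 dB
snr_scaled = snr_db(clean, 0.9 * clean)           # 10% amplitude error: about 20 dB
```

Under this definition, a 10 dB gain corresponds to a tenfold reduction in reconstruction-error energy, and a 3 dB gain to roughly a halving.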
To avoid spectral bias from positional encoding and complex parameter tuning of activation functions, we propose a novel audio signal representation framework, Fourier-ASR, based on the Fourier series theorem and the Kolmogorov-Arnold theorem. Fourier-ASR includes Fourier Kolmogorov-Arnold Networks (Fourier-KAN) and a Frequency-adaptive Learning Strategy (FaLS). Due to the periodicity and strong nonlinearity of Fourier basis functions, Fourier-ASR can effectively represent audio signals and provide enhanced interpretability.
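A minimal sketch of what a Fourier-KAN layer could look like follows. It assumes the commonly used formulation in which every input-output edge is a learnable truncated Fourier series. The page does not spell out the parameterization, so the class name `FourierKANLayer`, the `num_freqs` parameter, the initialization scale, and the layer sizes are all hypothetical. FaLS is not shown.

```python
import numpy as np

class FourierKANLayer:
    """Each input-output edge is a learnable truncated Fourier series (assumed form)."""

    def __init__(self, in_dim, out_dim, num_freqs, rng):
        scale = 1.0 / np.sqrt(in_dim * num_freqs)  # hypothetical init scale
        # coeffs[0] holds cosine weights, coeffs[1] holds sine weights.
        self.coeffs = rng.normal(0.0, scale, size=(2, out_dim, in_dim, num_freqs))
        self.bias = np.zeros(out_dim)
        self.k = np.arange(1, num_freqs + 1)       # integer harmonics 1..K

    def __call__(self, x):
        # x: [batch, in_dim] -> angles: [batch, in_dim, num_freqs]
        angles = x[:, :, None] * self.k
        cos_part = np.einsum("bik,oik->bo", np.cos(angles), self.coeffs[0])
        sin_part = np.einsum("bik,oik->bo", np.sin(angles), self.coeffs[1])
        return cos_part + sin_part + self.bias

rng = np.random.default_rng(0)
net = [FourierKANLayer(1, 8, 16, rng), FourierKANLayer(8, 1, 16, rng)]

# Raw time coordinates go straight in; no separate positional encoding.
t = np.linspace(-np.pi, np.pi, 1000).reshape(-1, 1)
h = t
for layer in net:
    h = layer(h)
amplitude = h.ravel()
```

Because the basis functions are sines and cosines of the input, each layer is periodic and already mixes multiple frequencies, which is one way to read the claim that no additional positional encoding is needed.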
Conclusions:
RFF+Gaussian and NeFF+Sine significantly enhance the ability of Coordinate-MLPs to represent audio signals. On the GTZAN dataset, they improve the SNR by 10.15 dB ↑ and 12.28 dB ↑, respectively. On the CSTR VCTK dataset, the SNR improvements are 10.40 dB ↑ and 14.04 dB ↑, respectively.
[Figure: reconstruction SNR for six audio clips, three panels per clip]
Bach: 20.85 dB, 42.39 dB, 33.14 dB
Counting: 12.14 dB, 33.58 dB, 20.10 dB
Blues (GTZAN): 11.80 dB, 22.02 dB, 13.80 dB
Classical (GTZAN): 10.76 dB, 25.95 dB, 15.05 dB
NorthernIrish (VCTK): 16.19 dB, 19.59 dB, 17.12 dB
NewZealand (VCTK): 13.32 dB, 16.87 dB, 15.79 dB
@article{Li2025nerf,
title={Representing Sounds as Neural Amplitude Fields: A Benchmark of Coordinate-MLPs and a Fourier Kolmogorov-Arnold Framework},
volume={39},
number={23},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
publisher={AAAI},
author={Li, Linfei and Zhang, Lin and Wang, Zhong and Zhang, Fengyi and Li, Zelin and Shen, Ying},
year={2025},
pages={24458–24466}
}