Representing Sounds as Neural Amplitude Fields: A Benchmark of Coordinate-MLPs and A Fourier Kolmogorov-Arnold Framework

Abstract

Although Coordinate-MLP-based implicit neural representations have excelled in representing radiance fields, 3D shapes, and images, their application to audio signals remains underexplored. To fill this gap, we investigate existing implicit neural representations, from which we extract 3 types of positional encoding and 16 commonly used activation functions. Through combinatorial design, we establish the first benchmark for Coordinate-MLPs in audio signal representations. Our benchmark reveals that Coordinate-MLPs require complex hyperparameter tuning and frequency-dependent initialization, limiting their robustness. To address these issues, we propose Fourier-ASR, a novel framework based on the Fourier series theorem and the Kolmogorov-Arnold representation theorem. Fourier-ASR introduces Fourier Kolmogorov-Arnold Networks (Fourier-KAN), which leverage periodicity and strong nonlinearity to represent audio signals, eliminating the need for additional positional encoding. Furthermore, a Frequency-adaptive Learning Strategy (FaLS) is proposed to enhance the convergence of Fourier-KAN by capturing high-frequency components and preventing overfitting of low-frequency signals. Extensive experiments conducted on natural speech and music datasets reveal that: (1) well-designed positional encoding and activation functions in Coordinate-MLPs can effectively improve audio representation quality; and (2) Fourier-ASR can robustly represent complex audio signals without extensive hyperparameter tuning. Looking ahead, the continuity and infinite resolution of implicit audio representations make our research highly promising for tasks such as audio compression, synthesis, and generation.

What are Neural Amplitude Fields?

In contrast to representations built on local features, Neural Amplitude Fields are a specific instance of implicit neural representations for audio signals: the network takes a time coordinate as input and regresses the corresponding amplitude, so the audio signal is encoded in the weights of the network. This parameterized representation is continuously differentiable and decoupled from the sampling resolution, so the signal can be queried and processed at any resolution. It holds potential for applications in audio denoising, synthesis, generation, and other related fields.
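As a rough illustration of the idea above, the following sketch maps time coordinates to amplitudes with a small sine-activated coordinate-MLP and queries the same weights at two sampling rates. The layer sizes, the `omega0` frequency scale, and the SIREN-style initialization bounds are illustrative assumptions rather than settings taken from this page, and the network is left untrained; fitting it would mean minimizing the error between predicted and recorded amplitudes.

```python
import numpy as np

def init_sine_mlp(rng, hidden=64, depth=3, omega0=30.0):
    """Randomly initialise a small sine-activated coordinate-MLP (hypothetical sizes)."""
    sizes = [1] + [hidden] * depth + [1]
    params = []
    for i, (fan_in, fan_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        # SIREN-style uniform bounds: an assumption, not specified in the text.
        bound = 1.0 / fan_in if i == 0 else np.sqrt(6.0 / fan_in) / omega0
        W = rng.uniform(-bound, bound, size=(fan_in, fan_out))
        b = np.zeros(fan_out)
        params.append((W, b))
    return params

def amplitude_field(params, t, omega0=30.0):
    """Map time coordinates t (shape [N]) to predicted amplitudes (shape [N])."""
    h = np.asarray(t, dtype=np.float64).reshape(-1, 1)
    for W, b in params[:-1]:
        h = np.sin(omega0 * (h @ W + b))   # periodic activation on hidden layers
    W, b = params[-1]
    return (h @ W + b).ravel()             # linear output layer

rng = np.random.default_rng(0)
params = init_sine_mlp(rng)

# The same weights can be sampled on any time grid over the normalised clip [-1, 1].
t_16k = np.linspace(-1.0, 1.0, 16000)
t_48k = np.linspace(-1.0, 1.0, 48000)
y_16k = amplitude_field(params, t_16k)
y_48k = amplitude_field(params, t_48k)
```

Because the signal lives in `params` rather than in a fixed sample array, changing the output rate only changes the grid passed to `amplitude_field`.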

A Benchmark of Coordinate-MLPs in Audio Signal Representations

Although Coordinate-MLP-based implicit neural representations have excelled in representing radiance fields, 3D shapes, and images, their application to audio signals remains underexplored. To fill this gap, we investigate existing implicit neural representations, from which we extract 3 types of positional encoding and 16 commonly used activation functions. Through combinatorial design, we establish the first benchmark for Coordinate-MLPs in audio signal representations.

Quantitative Experiments

Conclusions:

Apart from activation functions with strong nonlinearity (Gaussian-type) or periodicity (Sine-type), most activation functions (Sigmoid, ReLU, Tanh, etc.) fail to capture the high-frequency content and local periodicity of audio signals.
Positional encodings significantly enhance the ability of Coordinate-MLPs to represent audio signals: by mapping the time coordinate into a higher-dimensional space, they improve the model's ability to capture high-frequency information. The gain is particularly notable for the Gaussian (11.02 dB ↑ in SNR) and Sine (18.96 dB ↑ in SNR) activation functions.
Among positional encodings, RFF introduces Gaussian randomness, which makes it better suited to Gaussian-type activation functions (~3 dB ↑ in SNR). Conversely, NeFF employs Fourier mappings, which are more compatible with Sine-type activation functions (~9 dB ↑ in SNR).
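To make the contrast between the two encodings concrete, here is a minimal sketch of both for a 1-D time coordinate. It assumes that RFF follows the usual random Fourier feature construction (frequencies drawn from a Gaussian) and that NeFF denotes a deterministic NeRF-style encoding with power-of-two frequencies; this page only states that NeFF "employs Fourier mappings", so the second function in particular is an assumption. The parameter names and default values (`num_features`, `sigma`, `num_octaves`) are hypothetical.

```python
import numpy as np

def rff_encoding(t, num_features=128, sigma=10.0, seed=0):
    """Random Fourier features: frequencies sampled from N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    B = rng.normal(0.0, sigma, size=num_features)    # random frequencies
    proj = 2.0 * np.pi * np.outer(t, B)              # shape [N, num_features]
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

def neff_encoding(t, num_octaves=10):
    """Deterministic Fourier features with frequencies pi * 2^k (assumed NeRF-style)."""
    freqs = (2.0 ** np.arange(num_octaves)) * np.pi  # fixed, octave-spaced frequencies
    proj = np.outer(t, freqs)                        # shape [N, num_octaves]
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=1)

t = np.linspace(-1.0, 1.0, 8)
rff = rff_encoding(t)    # shape (8, 256)
neff = neff_encoding(t)  # shape (8, 20)
```

Either output would replace the raw time coordinate as the input to the Coordinate-MLP. In this sketch the RFF bandwidth is controlled by `sigma`, while the NeFF bandwidth is controlled by the number of octaves.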

Qualitative Experiments on "Bach" Audio

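Panels: the ground truth and, for each activation function, reconstructions without positional encoding (w/o Pos. Enc.), with RFF, and with NeFF. SNR per panel:

Sine: 13.36 dB (w/o Pos. Enc.), 39.02 dB (w/ RFF Pos. Enc.), 42.39 dB (w/ NeFF Pos. Enc.)
IncodeSine: 15.98 dB (w/o Pos. Enc.), 38.10 dB (w/ RFF Pos. Enc.), 41.40 dB (w/ NeFF Pos. Enc.)
Gaussian: 6.35 dB (w/o Pos. Enc.), 20.85 dB (w/ RFF Pos. Enc.), 19.68 dB (w/ NeFF Pos. Enc.)
Laplacian: 12.04 dB (w/o Pos. Enc.), 15.57 dB (w/ RFF Pos. Enc.), 15.26 dB (w/ NeFF Pos. Enc.)
SuperGaussian: 6.38 dB (w/o Pos. Enc.), 20.86 dB (w/ RFF Pos. Enc.), 19.69 dB (w/ NeFF Pos. Enc.)
ReLU: 0.00 dB (w/o Pos. Enc.), 15.62 dB (w/ RFF Pos. Enc.), 22.29 dB (w/ NeFF Pos. Enc.)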
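The SNR figures quoted for each reconstruction can be reproduced with a few lines of code once a definition is fixed. The sketch below uses the common definition, signal power over reconstruction-error power in decibels; this page does not spell out the exact formula behind the reported numbers, so treat that choice and the function name `snr_db` as assumptions.

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between a reference signal and its reconstruction."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    noise_power = np.sum((reference - estimate) ** 2)  # energy of the residual
    return 10.0 * np.log10(signal_power / noise_power)

# Toy check on a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
reference = np.sin(2.0 * np.pi * 440.0 * t)
snr_scaled = snr_db(reference, 0.9 * reference)           # residual is 10% of the signal
snr_silent = snr_db(reference, np.zeros_like(reference))  # all-zero reconstruction
```

Under this definition a reconstruction whose residual has one tenth of the signal's amplitude scores about 20 dB, an all-zero output scores 0 dB, and negative values mean the residual carries more energy than the signal itself.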

Qualitative Experiments on "Counting" Audio

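Panels: the ground truth and, for each activation function, reconstructions without positional encoding (w/o Pos. Enc.), with RFF, and with NeFF. SNR per panel:

Sine: 7.96 dB (w/o Pos. Enc.), 13.06 dB (w/ RFF Pos. Enc.), 33.58 dB (w/ NeFF Pos. Enc.)
IncodeSine: 8.16 dB (w/o Pos. Enc.), 12.86 dB (w/ RFF Pos. Enc.), 32.24 dB (w/ NeFF Pos. Enc.)
Gaussian: 0.74 dB (w/o Pos. Enc.), 12.14 dB (w/ RFF Pos. Enc.), 9.20 dB (w/ NeFF Pos. Enc.)
Laplacian: 1.34 dB (w/o Pos. Enc.), 10.97 dB (w/ RFF Pos. Enc.), 8.67 dB (w/ NeFF Pos. Enc.)
SuperGaussian: 0.75 dB (w/o Pos. Enc.), 12.44 dB (w/ RFF Pos. Enc.), 9.20 dB (w/ NeFF Pos. Enc.)
ReLU: -7.66 dB (w/o Pos. Enc.), 4.93 dB (w/ RFF Pos. Enc.), 9.57 dB (w/ NeFF Pos. Enc.)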

Fourier-ASR: A Fourier Kolmogorov-Arnold Framework

To avoid the spectral bias introduced by positional encoding and the complex parameter tuning required by activation functions, we propose a novel audio signal representation framework, Fourier-ASR, based on the Fourier series theorem and the Kolmogorov-Arnold representation theorem. Fourier-ASR consists of Fourier Kolmogorov-Arnold Networks (Fourier-KAN) and a Frequency-adaptive Learning Strategy (FaLS). Owing to the periodicity and strong nonlinearity of Fourier basis functions, Fourier-ASR can effectively represent audio signals and provide enhanced interpretability.
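One way to picture a Fourier-KAN layer is as a layer in which every input-output edge carries its own learnable truncated Fourier series, phi(x) = sum over k of a_k cos(kx) + b_k sin(kx), with the outputs summing the edge functions. The sketch below implements that forward pass under this reading. The function names, initialization scale, layer widths, and number of harmonics are hypothetical, and the details of the actual Fourier-KAN architecture may differ. FaLS is not sketched because this page does not describe its mechanics.

```python
import numpy as np

def init_fourier_kan_layer(rng, in_dim, out_dim, num_freqs):
    """Random Fourier coefficients for every (output, input, harmonic) triple."""
    scale = 1.0 / np.sqrt(in_dim * num_freqs)  # illustrative scale, an assumption
    a = rng.normal(0.0, scale, size=(out_dim, in_dim, num_freqs))  # cosine coefficients
    b = rng.normal(0.0, scale, size=(out_dim, in_dim, num_freqs))  # sine coefficients
    bias = np.zeros(out_dim)
    return a, b, bias

def fourier_kan_layer(params, x):
    """Apply the layer to x of shape [N, in_dim]; returns shape [N, out_dim]."""
    a, b, bias = params
    k = np.arange(1, a.shape[-1] + 1)          # integer harmonics 1..K
    angles = x[:, :, None] * k                 # shape [N, in_dim, K]
    y = np.einsum("nik,oik->no", np.cos(angles), a)
    y += np.einsum("nik,oik->no", np.sin(angles), b)
    return y + bias

rng = np.random.default_rng(0)
layer1 = init_fourier_kan_layer(rng, in_dim=1, out_dim=16, num_freqs=32)
layer2 = init_fourier_kan_layer(rng, in_dim=16, out_dim=1, num_freqs=32)

# Time coordinates go in directly, with no separate positional encoding step.
t = np.linspace(-np.pi, np.pi, 1000).reshape(-1, 1)
hidden = fourier_kan_layer(layer1, t)
amplitude = fourier_kan_layer(layer2, hidden).ravel()
```

With integer harmonics, each layer is 2π-periodic in its inputs by construction, which is the periodicity the paragraph above relies on. The sine and cosine basis functions play the role that a positional encoding plays in a Coordinate-MLP.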

Quantitative Experiments

Conclusions:

It is noteworthy that although Gaussian and Sine activation functions were introduced to mitigate the complex parameter adjustments and spectral bias associated with positional encoding, we found that positional encoding remains essential due to the high-frequency nature and local periodicity of audio signals. Consequently, we designed new nonlinear mappings, namely RFF+Gaussian and NeFF+Sine, to address these challenges.
The RFF+Gaussian and NeFF+Sine designs significantly enhance the ability of Coordinate-MLPs to represent audio signals. On the GTZAN dataset, they improve the SNR by 10.15 dB ↑ and 12.28 dB ↑, respectively. On the CSTR VCTK dataset, the SNR improvements are 10.40 dB ↑ and 14.04 dB ↑, respectively.
Owing to the periodic nature of Fourier basis functions and the Frequency-adaptive Learning Strategy (FaLS), our proposed Fourier-ASR(KAN) significantly outperforms Sine(MLP) (~6 dB ↑) and B-Spline(KAN) (~18 dB ↑). However, because existing optimization strategies are not perfectly adapted to KANs, Fourier-ASR(KAN) performs slightly worse than the locally periodic NeFF+Sine(MLP). Nonetheless, Fourier-ASR(KAN) does not require positional encoding, thereby avoiding the cumbersome hyperparameter tuning that positional encoding entails.