HeAR Distillation Datalake
To continue an accidental foray into vision models (and adjacent architectures), I distilled Google’s HeAR model (https://huggingface.co/google/hear-pytorch) into a ViT-s-sized student (https://huggingface.co/matthewagi/HeAR-s). It retains 95.3% of the teacher’s capacity while being 5x smaller, with 10x faster inference on CPU. I trained the model on 25.6 million 2 s clips, which translates to ~2.45 billion tokens. I provide a full repo with data streaming, orchestration, training, and evaluation (https://github.com/Matthew-agi/hear-distillation-datalake/tree/main). These scripts are currently unoptimized and I intend to develop them further. I also deployed some novel technical attributes in this distillation training. These are released now because we should not let perfect be the enemy of good.
What’s in the box?
- datalake/run_lake.py: adaptive orchestration of streamer + trainer.
- stream_laion_audio_clips.py: LAION-Audio streaming + shard writer.
- distill_hear_vit_s_canon2d.py: student distillation training script.
- evaluate_distilled_hear.py: downstream benchmark/evaluation on HF datasets.
- benchmark_student_vs_hear.py: throughput benchmark for the student (no projection head) vs full HeAR.
- datalake/README.md: focused runner documentation.
- Release scaffolding: .env, .env.example, requirements.txt, .gitignore.
Technical Changes (Canon2D)
One of the major changes from a vanilla ViT-s was the introduction of Canon layers: lightweight convolution layers that act as local token mixers. You can read more about them here: https://arxiv.org/pdf/2512.17351. Overall, Canon layers improved training.
The difference looks less stark at only 4k steps, but there is a clear separation between ViT-s with (green) and without (blue) Canon layers. A large benefit of Canon layers was the ability to remove the position encoding: for our final HeAR-s model (red), we use no position encoding beyond what the local convolution provides. When I reference Canon layers, I am referencing the 2D ABCD variant. For ViT models, I found that 2D was required for full expression, though even a naive 1D implementation improved the model over baseline.
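Conceptually, a Canon layer is a small convolution with a residual path that lets each token mix in its immediate neighbors, which also injects relative-position information. Below is a minimal 1D sketch in pure Python; the actual model uses a learned, per-channel 2D variant over the patch grid, and canon_1d with its fixed weights is illustrative only:

```python
# Minimal 1D "Canon"-style local mixing: residual + causal convolution
# over neighboring tokens. Illustrative sketch, not the repo's code.

def canon_1d(tokens, weights):
    """tokens:  list of d-dim vectors (lists of floats)
    weights: K floats; weights[k] scales the token k steps to the left."""
    K = len(weights)
    out = []
    for i, tok in enumerate(tokens):
        mixed = list(tok)  # residual path: start from the token itself
        for k in range(K):
            j = i - k
            if j < 0:
                continue  # no neighbor that far left
            for d in range(len(tok)):
                mixed[d] += weights[k] * tokens[j][d]
        out.append(mixed)
    return out
```

Because each output position depends on a fixed local window, tokens can infer their relative ordering without an explicit position encoding.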
In the original HeAR paper (https://arxiv.org/abs/2403.02522), the authors mention that extending clips beyond the training window leads to deteriorating results, i.e. validation tests perform worse on a 7 s clip than on the 2 s clips used in training. I could not find specific numbers in the paper, so I benchmarked on some of the datasets myself. The clips range from 1 to 10 s; I use “Full Clip” to label the runs that use the full audio length:
FSD50K + FluSense (overall):

| Dataset | Metric | Distill | Distill (Full Clip) | Teacher (Full Clip) |
|---|---|---|---|---|
| FSD50K + FluSense | mAP | 0.624 | 0.595 | 0.594 |

FSD50K (per task):

| Task | Metric | Distill | Distill (Full Clip) | Teacher (Full Clip) |
|---|---|---|---|---|
| Breathing | AP | 0.285 | 0.278 | 0.278 |
| Cough | AP | 0.665 | 0.475 | 0.477 |
| Laughter | AP | 0.603 | 0.551 | 0.550 |
| Sneeze | AP | 0.647 | 0.559 | 0.515 |
| Speech | AP | 0.700 | 0.663 | 0.663 |

FluSense (per task):

| Task | Metric | Distill | Distill (Full Clip) | Teacher (Full Clip) |
|---|---|---|---|---|
| Breathing | AP | 0.366 | 0.306 | 0.307 |
| Cough | AP | 0.909 | 0.913 | 0.913 |
| Gasp | AP | 0.521 | 0.541 | 0.542 |
| Sneeze | AP | 0.756 | 0.733 | 0.734 |
| Sniffle | AP | 0.816 | 0.836 | 0.836 |
| Speech | AP | 0.911 | 0.932 | 0.899 |
| Throat-Clearing | AP | 0.310 | 0.421 | 0.417 |
As we can see, the distilled model performs at 95.3% of the full model’s capacity on 2 s clips. On full-length clips from the selected datasets, it performs at parity with the teacher, staying within roughly 95% of its own 2 s scores. The teacher decays further, to roughly 90.3% of its 2 s scores on full-length clips: nearly 2x the decay of HeAR-s. Canon layers were the key to preserving model capacity at extended audio lengths.
EMA-Batch Driven LR Scheduler
The EMA-batch-driven learning rate scheduler was only minimally tested but performed adequately; there are many changes I wish to make in the future. Additionally, our model was undertrained: the loss curve was still decreasing when we stopped training. The two colors in the curve indicate where the run was halted due to file-processing issues. It was resumed after the code fix was implemented, and from that point on it used the EMA learning rate scheduler.
To reiterate, we used 25.6 million 2 s clips (~2.45 billion tokens), streamed from https://huggingface.co/datasets/laion/LAION-Audio-300M. We had access to over 10x more data, and I hope to revisit this distillation/continued training with a more efficient training script.
Our batch size was fixed at 128, chosen from a gradient-noise estimate of the optimal batch size. I naively set the learning rate to decay as a linear relation with an EMA of that optimal batch estimate, which was tracked after the restart of our script (teal).
The plot above shows the EMA of our optimal batch estimate; below is our learning rate. We updated it every 50 steps.
As we can see, the update was likely too fast. The initial decline (after the patch) “catches up” with the increase in optimal batch size, which had changed considerably since the start of training. In general, we track the fast increase in optimal batch size and then decrease more gradually. Using a linear relationship between batch size and learning rate targets a constant noise level per step (https://openreview.net/pdf?id=B1Yy1BxCZ).
Generally, modern training uses a sqrt(batch) relationship with the learning rate to keep the variance per step constant (https://ar5iv.labs.arxiv.org/html/1404.5997). Both batch/learning-rate dynamics are addressed in OpenAI’s critical-batch-size paper, where the training regime affects the scaling; the ideal exponent lies between sqrt(B) and B (https://arxiv.org/pdf/1812.06162). The optimal relationship between our scheduler and batch size scaling will be addressed in future posts. Alternatively, during the initial LR decrease, we could have increased the batch size instead; our GPU (an A6000) had the capacity, and batch-size scheduling may have improved training (https://arxiv.org/pdf/2510.14717).
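The linear rule described above can be sketched in a few lines. All names and constants here are hypothetical, and the repo’s implementation differs in detail: the gradient-noise estimate of the optimal batch is smoothed with an EMA, and the learning rate scales with the ratio of the actual batch size to that smoothed estimate.

```python
# Sketch of an EMA-driven linear batch/LR rule (hypothetical names).
# As the smoothed optimal-batch estimate grows past the fixed actual
# batch (128), the learning rate decays proportionally.

def ema_update(ema, value, decay=0.99):
    """One exponential-moving-average step."""
    return decay * ema + (1.0 - decay) * value

def scheduled_lr(base_lr, batch_size, ema_opt_batch):
    """Linear batch-LR relation: targets constant noise per step."""
    return base_lr * (batch_size / ema_opt_batch)
```

In a training loop this would be applied every 50 steps: update the EMA with the latest gradient-noise estimate, then recompute the learning rate from it.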
Another misstep was the lack of a warmup. In the future, I am going to implement an edge-of-stability warmup, where we estimate the largest LR that satisfies the curvature constraints of model training (https://proceedings.neurips.cc/paper_files/paper/2024/file/555479a201da27c97aaeed842d16ca49-Paper-Conference.pdf). This is not a requirement, but it would minimize the hand-scheduling of a standard warmup-stable-decay learning rate schedule.
Putting the above together, the learning rate scheduling would benefit from continued work. I had already run mini trials sweeping the learning rate across two orders of magnitude, so the initial learning rate was appropriate for training; however, there was no warmup and the decay was sub-optimal (though effective). We did not train long enough for end-of-training considerations to meaningfully impact our linear learning-rate/batch-size relationship. I have a slate of experiments planned to algorithmically estimate and optimize an LR scheduler given the above. I am happy with the results of this distillation even if the scheduler can be optimized further.
The Datalake
When I went to download our dataset, I had a few options. One was to scrape audio from some of Google’s provided audio datasets (whose metadata pointed to YouTube videos). The other was to find a dataset with a minimum of 5 million audio clips of at least 2 s each. I was lucky enough to come across laion/LAION-Audio-300M after spec’ing a data-scraping rig for the YouTube-based dataset.
I mentioned that we used 25.6 million clips. That is a mild misnomer: I actually used fewer source clips but cut the longer ones into 2 s chunks. Used this way, the effective capacity of the 300M dataset is much larger than 300M examples, even within HeAR’s limitations. If a clip’s length was not divisible by 2 s, the final chunk either overlapped the previous one (when more than 1 s remained) or was discarded (when less than 1 s remained). Fully downloading this dataset would have surpassed the capacity of our instance, so instead we needed to stream.
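That chunking rule can be sketched as follows; chunk_starts is a hypothetical helper, not the repo’s actual code:

```python
# Cut a clip into 2 s chunks. If the remainder is more than 1 s, take
# the last 2 s of the clip as an overlapping final chunk; otherwise
# drop the remainder. Illustrative sketch only.

def chunk_starts(clip_len_s, chunk_s=2.0, min_tail_s=1.0):
    """Return start offsets (seconds) of the chunks for one clip."""
    starts = []
    t = 0.0
    while t + chunk_s <= clip_len_s:
        starts.append(t)
        t += chunk_s
    tail = clip_len_s - t
    if tail > min_tail_s and clip_len_s >= chunk_s:
        starts.append(clip_len_s - chunk_s)  # overlapping final chunk
    return starts
```

For example, a 7.3 s clip yields chunks at 0, 2, and 4 s plus an overlapping final chunk, while a 4.5 s clip simply drops its 0.5 s remainder.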
The data mixing is an artifact of my initial plan (a full download; how quaint). There is also no filtering beyond discarding clips that could not be reconciled into 2 s chunks. LAION provides fairly high-quality datasets. Google, in its HeAR report, does discuss dataset construction, which we do not match; this is an area where my current process can be improved.
The datalake is governed by a minimum and maximum number of workers; workers above the minimum have a propensity to turn on and shut off frequently. We can specify how much disk space to allocate to the datalake: we stream until the reservoir is full, then stop all workers until it drains to a 50% depletion point. Our validation set is drawn from this lake and retained throughout training. The dataset appears to exhibit sequential distribution drift.
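A minimal sketch of that fill/drain policy, with hypothetical names (the actual orchestrator in datalake/run_lake.py manages real worker processes and on-disk shards):

```python
# Reservoir policy sketch: fill to capacity, pause streaming, resume
# once the trainer has consumed the lake down to the refill threshold.

class Reservoir:
    def __init__(self, capacity_bytes, refill_frac=0.5):
        self.capacity = capacity_bytes
        self.refill_at = capacity_bytes * refill_frac
        self.used = 0
        self.streaming = True  # are workers allowed to write?

    def add_shard(self, nbytes):
        """A streaming worker wrote a shard to disk."""
        self.used += nbytes
        if self.used >= self.capacity:
            self.streaming = False  # reservoir full: stop all workers

    def consume_shard(self, nbytes):
        """The trainer consumed (and deleted) a shard."""
        self.used = max(0, self.used - nbytes)
        if self.used <= self.refill_at:
            self.streaming = True  # hit depletion point: resume
```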
Our validation set experienced oscillations later in training. We also see a discontinuity at the point training was resumed that is only apparent in the validation loss. This is likely due to sequential drift: on resumption we pulled a new validation set, which was better aligned with the then-current data distribution. The later noise and oscillations may be driven by the LR or by the data mixture. My expectation is that the oscillations in both optimal batch size (earlier) and validation loss are artifacts of the data mixture: the peaks and valleys of the validation loss align with neither our validation frequency nor our learning rate update frequency, which implies a data artifact.
Despite this data drift, our training loss declined smoothly throughout training and seemed stable. Our validation tests (taken from the evaluations in the HeAR paper) were run manually and periodically throughout training to gauge progress, and each test improved on the last.
The datalake is unrefined. It makes use of a large, high-quality dataset provided by LAION, to whom I am indebted. In the future, I will iterate toward better preprocessing within the orchestrator and better mixing; the data is an area that can be improved in future runs.
Distillation
The distillation follows current best practices with a mix of losses: mean squared error, a relational loss, and a contrastive loss. The ratios were minimally optimized to a stable formulation. This is another aspect of training I will eventually address when I have the GPU capacity.
It is likely that the loss dynamics can be improved. Every clip was embedded by both the teacher and the student. The student used a final projection layer to map its embedding dimension (384) to the teacher’s (512), and we minimized the combined losses to improve the student. The final model performed best without the projection to 512 dimensions.
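For intuition, here is a pure-Python sketch of two of the three loss terms, pointwise MSE and a pairwise relational loss over a batch; the contrastive term is omitted for brevity, and the weights, names, and exact formulation in the repo differ:

```python
import math

# Sketch of a combined distillation objective over batch embeddings.
# students/teachers are lists of embedding vectors (after the student's
# 384 -> 512 projection). Hypothetical helper, not the repo's code.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def relational_loss(students, teachers):
    """Match the student's pairwise similarity structure to the teacher's."""
    n, total = len(students), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (cosine(students[i], students[j])
                      - cosine(teachers[i], teachers[j])) ** 2
    return total / max(1, n * (n - 1) // 2)

def distill_loss(students, teachers, w_mse=1.0, w_rel=1.0):
    point = sum(mse(s, t) for s, t in zip(students, teachers)) / len(students)
    return w_mse * point + w_rel * relational_loss(students, teachers)
```

The pointwise term pulls each student embedding toward its teacher target, while the relational term preserves the geometry between examples even if absolute positions differ.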
Evaluating our Model
I ended up using two primary evaluations (FSD50K and FluSense), which were conveniently uploaded to Hugging Face. Here are the results from the original paper:
FSD50K + FluSense (overall):

| Metric | TRILL | FRILL | BigSSL-CAP12 | HeAR | CLAP (48k) |
|---|---|---|---|---|---|
| mAP | 0.494 | 0.516 | 0.613 | 0.658 | 0.691 |

FSD50K (per-task AP):

| Task | TRILL | FRILL | BigSSL-CAP12 | HeAR | CLAP (48k) |
|---|---|---|---|---|---|
| Breathing | 0.301 | 0.336 | 0.365 | 0.434 | 0.467 |
| Cough | 0.450 | 0.452 | 0.658 | 0.621 | 0.751 |
| Laughter | 0.438 | 0.425 | 0.673 | 0.680 | 0.715 |
| Respiratory sounds | 0.539 | 0.535 | 0.629 | 0.670 | 0.702 |
| Sneeze | 0.361 | 0.448 | 0.570 | 0.650 | 0.912 |
| Speech | 0.430 | 0.418 | 0.567 | 0.534 | 0.599 |

FluSense (per-task AP):

| Task | TRILL | FRILL | BigSSL-CAP12 | HeAR | CLAP (48k) |
|---|---|---|---|---|---|
| Breathing | 0.147 | 0.233 | 0.357 | 0.336 | 0.371 |
| Cough | 0.903 | 0.892 | 0.954 | 0.974 | 0.963 |
| Gasp | 0.466 | 0.587 | 0.653 | 0.608 | 0.701 |
| Sneeze | 0.648 | 0.661 | 0.810 | 0.788 | 0.825 |
| Sniffle | 0.718 | 0.667 | 0.720 | 0.852 | 0.841 |
| Speech | 0.949 | 0.949 | 0.983 | 0.972 | 0.973 |
| Throat-Clearing | 0.070 | 0.099 | 0.035 | 0.436 | 0.169 |
Respiratory sounds did not discriminate among the models and was a vaguely specified cohort of labels, so it was left off my evaluations. Otherwise, I tried to stay as true as possible to the original paper. Here are our HeAR-s results (the same as before):
FSD50K + FluSense (overall):

| Dataset | Metric | Distill | Distill (Full Clip) | Teacher (Full Clip) |
|---|---|---|---|---|
| FSD50K + FluSense | mAP | 0.624 | 0.595 | 0.594 |

FSD50K (per task):

| Task | Metric | Distill | Distill (Full Clip) | Teacher (Full Clip) |
|---|---|---|---|---|
| Breathing | AP | 0.285 | 0.278 | 0.278 |
| Cough | AP | 0.665 | 0.475 | 0.477 |
| Laughter | AP | 0.603 | 0.551 | 0.550 |
| Sneeze | AP | 0.647 | 0.559 | 0.515 |
| Speech | AP | 0.700 | 0.663 | 0.663 |

FluSense (per task):

| Task | Metric | Distill | Distill (Full Clip) | Teacher (Full Clip) |
|---|---|---|---|---|
| Breathing | AP | 0.366 | 0.306 | 0.307 |
| Cough | AP | 0.909 | 0.913 | 0.913 |
| Gasp | AP | 0.521 | 0.541 | 0.542 |
| Sneeze | AP | 0.756 | 0.733 | 0.734 |
| Sniffle | AP | 0.816 | 0.836 | 0.836 |
| Speech | AP | 0.911 | 0.932 | 0.899 |
| Throat-Clearing | AP | 0.310 | 0.421 | 0.417 |
Again, we retain 95.3% of the capacity of the full HeAR model. We even beat CLAP in the Throat-Clearing category. Our model would place 3rd on these evaluations, showing much better generalization than the other “small” models while beating BigSSL-CAP12 (~30x larger than HeAR-s). Our model also generalizes to longer audio better than HeAR (as discussed).
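For reference, the per-task AP numbers follow the standard average-precision definition (mean of precision at each true-positive rank); a minimal sketch, though the repo’s evaluation code may differ in tie handling:

```python
# Standard average precision over a single label's binary ground truth.

def average_precision(labels, scores):
    """labels: 0/1 ground truth; scores: classifier scores (higher = positive)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive
    return sum(precisions) / len(precisions) if precisions else 0.0
```

mAP is then simply the mean of these per-task AP values.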
I look forward to further evaluations of HeAR-s, and here I include some from my (soon to be deployed) demo. I used two datasets of heart and lung recordings, with per-dataset preprocessing and an ensemble of probes to optimize the model for each task. I will go into the specifics when the demo is released. The datasets are cited below:
| Diagnosis | Metric | Value (Std) | Dataset |
|---|---|---|---|
| Healthy (Lungs) | F1 | 0.536 (0.076) | ICBHI 2017 Respiratory Sound Database |
| COPD | F1 | 0.888 (0.058) | ICBHI 2017 Respiratory Sound Database |
| URTI | F1 | 0.179 (0.127) | ICBHI 2017 Respiratory Sound Database |
| Bronchiectasis | F1 | 0.529 (0.211) | ICBHI 2017 Respiratory Sound Database |
| Pneumonia | F1 | 0.722 (0.208) | ICBHI 2017 Respiratory Sound Database |
| Bronchiolitis | F1 | 0.622 (0.166) | ICBHI 2017 Respiratory Sound Database |
| Heart Murmur | F1 | 0.582 (0.017) | The CirCor DigiScope Phonocardiogram Dataset |
In order to finish the Google competition, I will be hosting a webpage where the model can be downloaded and used on edge devices to directly measure audio and output determinations. This is going to be a research demo only.
Downloads
Model: https://huggingface.co/matthewagi/HeAR-s
GitHub: https://github.com/Matthew-agi/hear-distillation-datalake