
By Rommel Sharma · LinkedIn

Project Overview

This project builds an offline acoustic wildlife identification system for Brazil’s Pantanal—supporting field biologists, rangers, and citizen scientists who need reliable species detections without cloud connectivity. From short audio clips, the application estimates presence across 234 taxa (birds, amphibians, mammals, reptiles, and insects), including overlapping calls in noisy, far-field soundscapes. The core challenge is ecological, not cosmetic: most training examples are clean, close-range recordings, while real deployments face distant microphones, wind, insects, and multiple species calling at once. The pipeline combines efficient on-device models, teacher-guided learning from unlabeled soundscapes, and ensemble inference tuned for robust ranking scores in the UI—turning competition-grade research into a practical tool for biodiversity monitoring and conservation workflows.

Current champion model-ensemble (June 2026)

v7 ONNX 4-member ensemble — Private LB 0.89774 / Public 0.89535 (+0.035 private vs the former PyTorch champion). This was an inference-only breakthrough: no new CNN training. The stack combines three exported EfficientNet checkpoints with a live native Perch v2 member via ONNX Runtime on CPU, prob-space blending, and calibrated post-processing.

Gap to the 0.90 target: +0.00226 private. Inference-only tuning has plateaued; next lever is Phase E training.

Read the Kaggle writeup →

Mel spectrogram example of a species vocalization

Source Dataset

Dataset and taxonomy breakdown

Data comes from the Kaggle BirdCLEF+ 2026 competition.

Architectural Evolution

From Phase 1 onward the CNN front-end was fixed: 128-mel log-spectrograms, 5 s windows @ 32 kHz, per-sample z-score, feeding an EfficientNet backbone with fp32-guarded GEM frequency pooling and an SED attention head. Training followed supervised → Noisy Student iter1 → Noisy Student iter2 on focal + labeled-soundscape streams.
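As a rough illustration of the framing and normalisation described above, the sketch below cuts a waveform into 5 s windows at 32 kHz and z-scores one example using its own statistics. It is a minimal stdlib-only sketch, not the project's code: the function names, the zero-padding of the last window, and the epsilon are assumptions, and the mel filterbank itself is omitted.

```python
import math

SAMPLE_RATE = 32_000                            # Hz, from the text
WINDOW_SECONDS = 5                              # from the text
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS   # 160,000 samples per window


def split_into_windows(waveform, window=WINDOW_SAMPLES):
    """Cut a 1-D sequence into fixed-length windows, zero-padding the last one (assumption)."""
    windows = []
    for start in range(0, len(waveform), window):
        chunk = list(waveform[start:start + window])
        chunk.extend([0.0] * (window - len(chunk)))  # pad the tail window
        windows.append(chunk)
    return windows


def zscore(values, eps=1e-6):
    """Normalise one example to zero mean and unit variance using its own statistics."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in values]
```

In the real pipeline the z-score would presumably be applied to the flattened log-mel values of each window rather than to raw samples; the arithmetic is the same either way.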

Supervised training results

Noisy Student iteration 1 training results

Noisy Student iteration 2 training results

Model Card

Verified clean runs and scored inference submissions. Metric: macro-averaged ROC-AUC over 234 species on 5-second windows. Erroneous exports (e.g. the original Phase 3 packaging mistake) and rejected harness runs (e.g. the failed pseudo-label training experiment) are excluded.
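For readers unfamiliar with the metric, here is a minimal sketch of macro-averaged ROC-AUC in plain Python. It computes each class's AUC as the fraction of positive/negative pairs ranked correctly, then averages over classes. Skipping classes that lack a positive or a negative example is an assumption about how undefined classes are handled, and the function names are illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes present
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auc(label_rows, score_rows):
    """Average per-class AUC; rows are 5 s windows, columns are species."""
    n_classes = len(label_rows[0])
    aucs = []
    for c in range(n_classes):
        auc = roc_auc([row[c] for row in label_rows],
                      [row[c] for row in score_rows])
        if auc is not None:  # assumption: undefined classes are skipped
            aucs.append(auc)
    return sum(aucs) / len(aucs)
```

Because every class contributes equally, a rare species with a poor AUC costs as much as a common one, which is why per-class ranking quality matters more than overall accuracy here.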

Leaderboard ranking (Private LB)

| Rank | Run | Private | Public |
|---|---|---|---|
| 1 | v7 ONNX 4-member (B0+Perch ⊕ B0 no-Perch ⊕ V2-S+Perch ⊕ native Perch) | 0.89774 | 0.89535 |
| 1b | v7 ONNX confirmatory (tuned genus mirroring & Perch gating) | 0.89348 | 0.89177 |
| 1c | v7 ONNX D1 clean resubmit | 0.89004 | 0.88300 |
| 2 | PyTorch single-model (B0+Perch + TTA) | 0.86259 | 0.84613 |
| 3 | Phase 2 (B0 + Perch distill, single CNN) | 0.84846 | 0.82698 |
| 4 | Phase 1 (B0, no Perch) | 0.831 | 0.799 |
| 5 | 2-member prob-blend (B0+Perch + B0 no-Perch) | 0.81465 | 0.80270 |
| 6 | Phase 4 (V2-S + Perch) | 0.803 | 0.797 |
| 7 | Phase 3 clean rerun (V2-S, no Perch) | 0.78679 | 0.78683 |
| 8 | Phase 0 baseline (minimal submit) | 0.499 | 0.555 |

Training validation (final self-trained checkpoint)

| Run | Focal | Site-22 | Greedy | LSS | Site-22 taxon (Amph / Aves / Mamm) |
|---|---|---|---|---|---|
| Phase 1 (B0, no Perch) | 0.943 | 0.683 | 0.767 | 0.725 | 0.757 / 0.604 / 0.700 |
| Phase 2 (B0 + Perch) | 0.952 | 0.731 | 0.745 | 0.738 | 0.740 / 0.704 / 0.756 |
| Phase 3 clean rerun (V2-S, no Perch) | | 0.652 | 0.767 | 0.709 | |
| Phase 4 (V2-S + Perch) | 0.958 | 0.654 | 0.824 | 0.739 | 0.735 / 0.590 / 0.431 |

Evaluation layers

| Layer | Protocol | Use for |
|---|---|---|
| Training notebook | Site-22 macro-AUC, no TTA, per-phase peaks | Training progress, export decisions |
| Site-22 harness | 954 windows, BN recal, prob-space TTA, real post-proc | Ensemble gate, blend A/B, promotion decisions |
| Kaggle LB | Hidden test set, CPU-only submit | Authoritative final score |

Site-22 is the held-out monitoring site. Harness and LB can disagree on ensembles — the V2-S+Perch member failed standalone Site-22 (0.645) yet helped the 4-member ONNX stack reach 0.898.

Inference Stack (Champion)

The 0.89774 score came from inference engineering, not new CNN training. Pre-exported ONNX weights from earlier clean runs are combined at submit time:

| Member | Source run | Role |
|---|---|---|
| B0 + Perch (anchor CNN) | Phase 2 training run | Primary distilled model |
| B0 without Perch | Phase 1 training run | Decorrelated second opinion (different training recipe) |
| V2-S + Perch | Phase 4 training run | Backbone-family diversity |
| Native Perch v2 | Google Perch v2 (external) | Live acoustic classifier mapped to 234 competition classes |

Pipeline at score time:

  1. Decode once per soundscape; build log-mel for CNN members, raw waveform for Perch
  2. ONNX Runtime 1.27 on CPU (offline wheels — Kaggle submit has no internet)
  3. Prob-space TTA — ±0.5 s time shift, 3 variants, averaged per member
  4. Prob-blend the three CNN members; apply Perch rank gate on combined output
  5. Post-processing at champion settings: genus-level score mirroring, temporal continuity smoothing, and moderate Perch gating on the blended CNN output
  6. Budget guard — ~8.3 files/min measured; ~72 min extrapolated for 600 hidden test files (within 90 min)
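Steps 3 and 4 of the pipeline above can be sketched roughly as follows. This is an illustration under stated assumptions, not the submitted code: `predict` stands in for one member's forward pass, the time shift is implemented as a circular roll, and the blend uses equal weights unless told otherwise. The Perch rank gate and the post-processing in step 5 are not shown.

```python
def shift(samples, offset):
    """Circularly shift a sequence by `offset` samples (assumption: wrap-around, not padding)."""
    n = len(samples)
    if n == 0:
        return []
    k = offset % n
    return list(samples[n - k:]) + list(samples[:n - k])


def tta_probs(predict, samples, sample_rate=32_000, shift_seconds=0.5):
    """Average one member's probabilities over three views: unshifted, +0.5 s, -0.5 s."""
    step = int(shift_seconds * sample_rate)
    views = [shift(samples, off) for off in (0, step, -step)]
    outputs = [predict(v) for v in views]   # each output: per-class probabilities
    return [sum(col) / len(outputs) for col in zip(*outputs)]


def prob_blend(member_probs, weights=None):
    """Weighted mean of per-class probabilities across members (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(member_probs)
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, col)) / total
        for col in zip(*member_probs)
    ]
```

Averaging happens on probabilities at both levels (across views, then across members), so the blended output stays on the scale the downstream post-processing expects.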

PyTorch is retained only for audio I/O (decode + mel); all forward passes run through ORT. Pure PyTorch cannot fit 4 models + TTA in the CPU budget — ONNX was load-bearing, not optional.
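The budget figures quoted in the pipeline can be rechecked with a few lines of arithmetic. The helper names below are illustrative; the inputs (8.3 files/min, 600 files, 90 min) are the ones stated above.

```python
def estimated_minutes(n_files, files_per_minute):
    """Extrapolated wall-clock time for a full submission run."""
    return n_files / files_per_minute


def within_budget(n_files, files_per_minute, limit_minutes=90.0):
    """True if the extrapolated runtime fits inside the time limit."""
    return estimated_minutes(n_files, files_per_minute) <= limit_minutes


def min_throughput(n_files, limit_minutes=90.0):
    """Slowest files-per-minute rate that still finishes in time."""
    return n_files / limit_minutes
```

With the stated numbers, `estimated_minutes(600, 8.3)` comes to roughly 72.3 minutes, and `min_throughput(600)` is about 6.7 files/min, so throughput could drop by around a fifth before the run would exceed the limit.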

MLOps & Reproducibility

Why MLOps mattered here

This was a months-long solo research effort spanning dozens of training runs, ensemble experiments, and Kaggle submissions. Without disciplined experiment tracking, it is easy to ship the wrong model snapshot, confuse a local validation win with a real leaderboard gain, or repeat an expensive GPU run with no clear hypothesis. MLOps — in the practical sense of machine learning operations — turned ad-hoc notebook work into a repeatable, auditable process: every run has a stated goal, measured outcome, promotion decision, and a single place to see what the current champion is.

The v3 export bug is the clearest example: the model looked weaker than it was because the submission packaged an early checkpoint. Formal checkpoint-selection rules and a run registry prevent that class of silent failure from reaching production or competition submits.
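One way such a checkpoint-selection rule could look in code is sketched below. Everything here is hypothetical: the `Checkpoint` fields, the stage names, and the rule itself (only checkpoints from the final training stage are eligible, and the best held-out score among them wins) are assumptions used for illustration, not the project's actual registry.

```python
from dataclasses import dataclass

# Training stages in order, as named in the text; the identifiers are hypothetical.
STAGE_ORDER = ["supervised", "noisy_student_iter1", "noisy_student_iter2"]


@dataclass(frozen=True)
class Checkpoint:
    stage: str          # which training stage produced it
    epoch: int
    site22_auc: float   # held-out validation score
    path: str           # hypothetical file name


def select_export(checkpoints, final_stage=STAGE_ORDER[-1]):
    """Pick the checkpoint to package: final stage only, best held-out AUC within it."""
    candidates = [c for c in checkpoints if c.stage == final_stage]
    if not candidates:
        # Refuse to ship an intermediate snapshot rather than fall back silently.
        raise ValueError(f"no checkpoint from final stage {final_stage!r}")
    return max(candidates, key=lambda c: (c.site22_auc, c.epoch))
```

Raising an error when no final-stage checkpoint exists is the point of the design: a supervised-only snapshot can never be packaged by accident.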

Best practices incorporated

Outcome

The champion ONNX ensemble (Private LB 0.89774) is the direct result of this discipline: only verified, correctly packaged models entered the four-member stack; rejected candidates (such as the failed pseudo-label training run) never reached submission. The remaining gap to 0.90 is now a focused training question, not an operational or tracking problem.

Key Lessons

The two largest early wins were debugging wins, not tuning wins: fp16 overflow in GEM pooling (pow(p≈3) under AMP) and BatchNorm statistics mismatch at inference both collapsed AUC to ≈0.5. Fixing them unlocked the Phase 0 → Phase 1 jump (+0.332 private LB) alongside the full stack rebuild — better backbone, z-score 128-mel front-end, soundscape self-training, and SED head.

  1. Trust Site-22, not focal AUC or the public leaderboard

    Focal AUC (~0.94) and greedy coverage were optimistic; Site-22 (held-out monitoring site) was the honest generalisation proxy. The harness extends this with submission-parity TTA and post-processing.

  2. The export path is as load-bearing as the model

    v3's regression was an export bug: the submission packaged an early supervised-only snapshot instead of the final self-trained model after two rounds of soundscape learning. A deterministic checkpoint-selection policy fixed this — always ship the fully trained model, never an intermediate snapshot. Several runs peaked mid-training but still exported a later, weaker checkpoint until export policy was aligned with validation peaks.

  3. Prob blend, not rank blend, for post-processed ensembles

    v6 scored flat because rank-uniform scores were fed into probability-tuned post-processing (site/hour priors, score boosting, taxonomy smoothing). Switching to probability-space blending recovered the loss.

  4. Native Perch > distillation-only; harness ≠ LB for ensembles

    Running Perch live at inference diversifies error vs B0+Perch distill alone — the largest conceptual shift since the prob-blend fix. V2-S+Perch failed Site-22 standalone yet helped the 4-member ONNX stack reach 0.898; treat harness as a coarse gate, not an LB oracle.

  5. B0 + ns_jft pretraining beats V2-S on this task

    The 2×2 ablation showed pretraining provenance (+0.044 private) matters more than architecture depth. Perch distillation adds a consistent ~+0.017 lift regardless of backbone.

  6. Single-model ceiling; ensemble + inference stack close the gap

    Best single CNN: 0.848. PyTorch 1-member + TTA: 0.863. ONNX 4-member: 0.898. The remaining +0.002 to 0.90 likely needs the next training cycle (Phase E: soundscape Perch embeddings, then retraining with best-checkpoint export), not more inference tuning.
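The 2×2 ablation in lesson 5 can be rechecked from the private leaderboard figures in the model card. The sketch below averages the backbone effect over both Perch settings and the Perch effect over both backbones; the dictionary layout and variable names are illustrative.

```python
# Private LB scores from the model card: (backbone, perch setting) -> score
private_lb = {
    ("B0", "no_perch"): 0.831,      # Phase 1
    ("B0", "perch"): 0.84846,       # Phase 2
    ("V2-S", "no_perch"): 0.78679,  # Phase 3 clean rerun
    ("V2-S", "perch"): 0.803,       # Phase 4
}

# Effect of backbone/pretraining: B0 minus V2-S, averaged over Perch settings.
backbone_effect = sum(
    private_lb[("B0", s)] - private_lb[("V2-S", s)] for s in ("no_perch", "perch")
) / 2

# Effect of Perch distillation: with minus without, averaged over backbones.
perch_effect = sum(
    private_lb[(b, "perch")] - private_lb[(b, "no_perch")] for b in ("B0", "V2-S")
) / 2
```

The backbone effect comes out at about 0.044 to 0.045 and the Perch effect at about 0.016 to 0.017, in line with the figures quoted in the lesson.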

Key Concepts

Plain-English glossary for terms used above.

GEM pooling & fp16 overflow

Generalized mean pooling raises feature values to a power (~3), averages, then takes the root — between average and max pooling. With p≈3, values like 50³ = 125,000 overflow fp16 under AMP. Fix: run GEM/head in fp32 while the rest uses mixed precision.
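The overflow is easy to reproduce without a GPU. The sketch below uses Python's `struct` module, whose `"e"` format code packs IEEE 754 half precision, to test whether a value fits in fp16, and computes GEM in ordinary Python floats. The helper names are illustrative. Note one difference from real mixed-precision training: `struct` raises an error on overflow, whereas fp16 arithmetic on hardware would silently produce `inf`.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(x):
    """True if x can be stored as a finite half-precision float."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def gem_pool(values, p=3.0, eps=1e-6):
    """Generalized mean (mean(x^p))^(1/p), computed in full-precision Python floats."""
    clamped = [max(v, eps) for v in values]  # keep the power well-defined for zeros
    return (sum(v ** p for v in clamped) / len(clamped)) ** (1.0 / p)
```

For example, `fits_in_fp16(30.0 ** 3)` holds because 27,000 is below `FP16_MAX`, while `fits_in_fp16(50.0 ** 3)` does not. In full precision, `gem_pool([0.0, 50.0])` lands between the mean (25) and the max (50), which is the "between average and max pooling" behaviour described above.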

BatchNorm recalibration

BatchNorm running statistics learned on augmented training audio do not match clean inference inputs, which collapses predictions. Recalibrate on clean soundscape-like audio before every eval/submit pass.

Analogy: calibrating a scale while wearing a backpack, then weighing without it.
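A toy single-feature version of the mismatch is sketched below. The numbers are invented purely to show the effect, and the functions are a stdlib stand-in for a real BatchNorm layer with no learned scale or shift.

```python
def batch_stats(values):
    """Mean and (population) variance of one feature over a batch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def bn_apply(values, mean, var, eps=1e-5):
    """Inference-mode BatchNorm for a single feature, using fixed statistics."""
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Invented toy data: augmentation shifts and spreads the feature distribution.
augmented_train = [4.0, 6.0, 8.0, 10.0]
clean_inference = [0.0, 1.0, 2.0, 3.0]

# Stale statistics from augmented training data push clean inputs far from zero.
stale = bn_apply(clean_inference, *batch_stats(augmented_train))

# Statistics re-estimated on clean audio restore a centred distribution.
recal = bn_apply(clean_inference, *batch_stats(clean_inference))
```

With stale statistics every normalised value is strongly negative, so downstream layers see inputs they never encountered during training. After recalibration the values are centred on zero again.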

Noisy Student self-training

Pseudo-label unlabeled soundscapes with the current model, then retrain on those labels with noise augmentation. Exposes the model to the far-field, multi-species domain it will be tested on.
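The pseudo-labelling step could be sketched as below. This is a generic illustration rather than the project's recipe: `teacher` stands in for the current model, the 0.5 confidence threshold is an arbitrary placeholder, and keeping the teacher's probabilities as soft labels is an assumption.

```python
def pseudo_label(probabilities, threshold=0.5):
    """Keep the classes the teacher is confident about, as {class_index: soft_label}."""
    return {i: p for i, p in enumerate(probabilities) if p >= threshold}


def build_pseudo_set(teacher, windows, threshold=0.5):
    """Run the teacher over unlabeled windows; keep those with at least one confident class."""
    dataset = []
    for window in windows:
        labels = pseudo_label(teacher(window), threshold)
        if labels:
            dataset.append((window, labels))
    return dataset
```

The student is then retrained on the resulting pairs with noise augmentation, and the loop can be repeated with the student as the new teacher.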

Perch distillation vs native Perch

Distillation: precomputed Perch embeddings guide CNN training; teacher not run at inference. Native: Perch runs live as a fourth ensemble member at inference — diversifies errors beyond what distillation alone captures.

Prob blend vs rank blend

Prob blend averages sigmoid probabilities, so the result stays compatible with probability-space post-processing. Rank blend converts scores to uniform ranks per class, which breaks priors and smoothing tuned for probabilities. The v6 regression came from mixing these.
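A small numeric sketch makes the difference concrete. The scores below are invented: two members score one class over four windows, and only the last window contains a confident detection. The tie-breaking rule in `to_uniform_ranks` is a simplification.

```python
def to_uniform_ranks(scores):
    """Map scores to evenly spaced ranks in (0, 1]; ties broken by position (simplification)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank / len(scores)
    return ranks


def blend(columns):
    """Element-wise mean across members."""
    return [sum(col) / len(col) for col in zip(*columns)]


# One class, four windows, two members (invented numbers).
member_a = [0.01, 0.02, 0.03, 0.95]
member_b = [0.02, 0.01, 0.04, 0.90]

prob_blended = blend([member_a, member_b])
rank_blended = blend([to_uniform_ranks(member_a), to_uniform_ranks(member_b)])
```

The third window has a blended probability of about 0.035 but a blended rank of 0.75. Any prior, boost, or smoothing step that interprets its input as a probability would treat that window as a likely detection, which is the kind of mismatch described above.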

SED head & frame-max loss

The SED attention head weights each time frame; frame-max auxiliary loss supervises the loudest frame so brief calls (e.g. one 0.4 s vocalization in 5 s) are not drowned by global average pooling.
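The effect of attention pooling on a brief call can be seen with a toy clip of ten frames, only one of which contains the vocalization. The frame probabilities and attention logits below are invented, and the functions are a minimal stand-in for an SED head rather than the trained model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_probs, attention_logits):
    """Clip-level score as an attention-weighted sum of frame probabilities."""
    weights = softmax(attention_logits)
    return sum(w * p for w, p in zip(weights, frame_probs))


def mean_pool(frame_probs):
    """Clip-level score as a plain average over frames."""
    return sum(frame_probs) / len(frame_probs)


# Ten frames; the call occupies frame 4 only (invented numbers).
frame_probs = [0.02] * 10
frame_probs[4] = 0.9
attention_logits = [0.0] * 10
attention_logits[4] = 4.0

clip_mean = mean_pool(frame_probs)                        # about 0.11
clip_attn = attention_pool(frame_probs, attention_logits) # about 0.78
frame_peak = max(frame_probs)                             # what a frame-max loss would supervise
```

Mean pooling dilutes the single active frame to roughly 0.11, while attention pooling keeps the clip score near 0.78. A frame-max term adds a training signal directly on the peak frame.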

What's Next

The inference-only sprint is closed: D1, D2, and D2+D3 confirmatory Kaggle submits did not beat the champion. Phase E is the active training path:

Target: close the remaining +0.00226 private gap to 0.90 macro-AUC.


Let's Connect

I enjoy discussing deep learning for environmental conservation, acoustic monitoring pipelines, and the engineering discipline behind competition-scale ML systems. Questions about ensemble design, ONNX inference, or MLOps gating are welcome.

Connect on LinkedIn