Beyond the Last-Frame
Temporal deep learning for the automated grading of retinal inflammation in fluorescein angiography.
Reader's Guide
This is the graduation portfolio for the project Temporal Deep Learning for Automated Grading of Retinal Inflammation in Fluorescein Angiography, carried out at the Idiap Research Institute with the MedAI group.
The chapters are meant to be read as a whole story, not a loose pile of artifacts. Early chapters set context; Analysis explains what was investigated and why; Advice shows how recommendations were refined after feedback; Design and Realisation record how those choices were made concrete in software. Paths to notebooks, PDFs, scripts and diagrams are given explicitly — so you can open the evidence right next to the narrative.
Abstract
The group's pipeline treated each examination as a single static frame.
Clinically, diagnosis depends on how fluorescence evolves over time. This portfolio summarises internship work on giving the fluorescein-angiography pipeline a genuine temporal representation — rather than collapsing a minutes-long examination into one image.
A controlled, benchmark-driven path from raw frames to a temporal model.
On the public Aptos-style benchmark, labels and temporal signal were verified, timing was recovered from on-image text where metadata was missing, and an LDA-based Temporal Separability Benchmark compared binning configurations before any expensive GRU training.
Downstream experiments use a frozen RETFound-family backbone, temporal binning, and a two-layer GRU. A separate PCA study showed no benefit from compressing frozen embeddings before the recurrent core.
A working pipeline and a defensible method — not a finished clinical product.
The narrative and artefacts here document analysis, advice, design and a working evaluation script. Reported metrics and recommendations reflect the state at the time of writing.
Implementation of full time-embedding tracks, HOJG-scale validation, and integration with production workflows remain for subsequent project phases — and are described honestly as such.
The Problem
The Idiap Research Institute, in Martigny, Switzerland, is a non-profit specialising in artificial intelligence, machine learning and computer vision. Within it, the MedAI group develops algorithmic methods for medical imaging in close collaboration with clinical partners such as the Hôpital Ophtalmique Jules-Gonin in Lausanne.
Fluorescein angiography tracks blood flow and vascular leakage in the retina over several minutes. But the existing deep-learning pipelines relied on single-frame Vision Transformer backbones — discarding the temporal evolution of the dye, information that is clinically indispensable.
Current fluorescein angiography classification pipelines insufficiently utilise the temporal information present in FA examinations — limiting diagnostic performance, robustness and interpretability.
The chapters that follow are organised around answering five research questions:
- Q1: Does the available benchmark data (labels, class balance, frame-level dynamics, timing metadata) support temporal modelling, and what limitations follow from how time is recorded?
- Q2: When metadata lacks reliable timing, can we recover usable elapsed time and decide which frames are safe to use?
- Q3: How should we group frames in time (binning, aggregation), and can a lightweight probe thin out options before expensive training?
- Q4: Does unsupervised dimensionality reduction (PCA) of frozen embeddings improve sequence classification, or should embeddings stay at backbone dimension?
- Q5: Does model performance transfer from Aptos to the external hospital (HOJG) dataset?
Engineer a temporal data representation and implement a sequential deep-learning architecture able to process variable-length FA examinations — improving diagnostic performance, robustness and interpretability.
Clinical & Technical
Patterns defined by behaviour over time — not by any single moment.
An FA examination is closer to a short film than a photograph: a dye is injected and tracked as it moves through the retinal vasculature over ten to fifteen minutes. Clinicians separate hyperfluorescence patterns by how they behave.
Leakage
Pooling
Staining
Window defect
A frozen RETFound-Green backbone turns each frame into meaning.
Training deep networks from scratch on specialised medical imagery overfits badly. The MedAI group instead leans on foundation models — RETFound and RETFound-Green, a compact ViT-Small pre-trained via a novel token-reconstruction task.
Passed through the backbone, each FA frame becomes a highly compressed feature vector encapsulating its medical semantics. The technical mandate of this project: sequence those vectors chronologically to represent the transit of the dye over time.
MedNet and a shared GPU grid — the constraints behind every choice.
MedNet is the group's in-house, PyTorch-based deep-learning framework: a shared, standardised structure for training, evaluation and experiment outputs, so one researcher's work can be inspected and reproduced by another. The second constraint is the shared GPU grid — compute is finite, and a wasteful experiment is a cost to the whole group. This gate reappears in every later chapter.
One public benchmark, one private clinical set.
The project draws on two distinct datasets. The difference between them is not only clinical — it decides what each one is allowed to be used for.
| Dataset | Clinical focus | Role in this project |
|---|---|---|
| HOJG | Uveitis-related retinal inflammation — 543 patients, 1,042 eyes, ASUWOG grading. | Main clinical target dataset for grading inflammatory signs. |
| 3rd Aptos | Broader angiographic abnormalities and fluorescence patterns. | Public dataset for broad FA analysis and temporal exploration. |
Why those two datasets cannot be handled the same way.
The public / private split was treated as a hard boundary, not an accident of availability.
Identifiable clinical patient data from the Hôpital Ophtalmique Jules-Gonin, governed by Swiss law and the GDPR. It cannot leave Idiap's infrastructure or be redistributed — so it is pointed to, never shipped.
A public benchmark — used for open exploration and anything that touches external tools or services, carrying none of the disclosure risk of patient imagery, and reproducible for a future paper.
The single-frame baseline, from the thesis of Roberto Pulvirenti, reached a ROC-AUC of 0.82 using only the last frame of each examination. A genuinely strong score.
This project did not start from a blank page. “Temporal” is not competing against a weak straw-man — it has to earn its place against an already-capable single-frame model. Any temporal approach must justify its added complexity and compute by improving on that 0.82, under the same constraints.
Professional Skills & Manage and Control
Adapting both content and form to the audience.
The project sat between technical researchers and clinical stakeholders. Bi-weekly calls with the Jules-Gonin hospital needed machine-learning ideas in plain, accessible language; discussions with medical-AI researchers needed precise methodological justification.
Asking for feedback after presentations exposed a clear lesson: being complete is not the same as creating the right emphasis. I began stating the central takeaway earlier and cutting detail from the first slides — visible by comparing an early weekly deck with a later one, and brought to a head in the institute-wide TAM talk.
Proactive alignment over working in isolation.
Rather than working alone, I regularly shared intermediate findings and discussed uncertainties before major implementation decisions. The clearest example: I had built an LSTM pipeline to process frame sequences — and treating the feedback on it as an opportunity, not a setback, became the model for how I worked.
The feedback was that a GRU architecture would be more efficient than the LSTM for our requirements. I researched GRUs, validated the advantages of the switch, and refactored the pipeline: identifying the bottleneck, opening a dialogue, and proactively pivoting to a better approach.
A structured, traceable, quality-oriented process.
Manage & Control here was not just a schedule — it was keeping the software and research process structured, traceable and reliable enough to support valid conclusions, on top of the company’s established standards.
A logbook as a control instrument
Branch-based GitLab workflow
Stepwise validation
Replanning against evidence
Analysis
Was the last-frame shortcut clinically wrong — and could the data support a different choice?
The official goal was to improve automated grading of retinal inflammation. Underneath it sat a more basic issue: the existing pipeline treated each examination as a single static image — the last frame. Before writing any temporal model code, that assumption had to be tested.
A window defect appears early and stays the same size; leakage starts small and progressively blurs; staining accumulates brightness without expanding. A model that sees only the last frame is not missing some information — it is missing the entire basis of the clinical diagnosis for several pathology types. That is a structural flaw worth fixing.
Five stakeholders, translated into the quality-attribute gates every later chapter answers to.
| Stakeholder | What they need | Quality attribute |
|---|---|---|
| HOJG clinicians | Decisions they can trust and interpret. | Clinical validity · interpretability |
| MedAI researchers | Beat the 0.82 baseline; fit the existing stack. | Performance · compatibility · reproducibility |
| Shared-grid users | Experiments that don't monopolise compute. | Scalability · compute efficiency |
| Future researchers | Code they can reuse and extend. | Maintainability · reproducibility |
| Patients (HOJG) | Their data handled lawfully and safely. | Security · privacy |
Does the data even support temporal modelling?
The 3rd Aptos dataset — 1,921 examinations across 16 metadata fields — was worked through systematically. Labels were informative, but class imbalance was non-trivial: a decision that drove class-weighted training and a commitment to per-class metrics rather than averages.
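The class-weighting idea can be sketched as follows. This is a minimal illustration, assuming the common "balanced" inverse-frequency heuristic (weight = n_samples / (n_classes × class count)); the portfolio does not state which weighting formula was used, and the label list below is made up, not the Aptos distribution.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: weight}, where weight = n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical, imbalanced label list (illustration only).
labels = ["leakage"] * 6 + ["staining"] * 3 + ["pooling"] * 1
weights = inverse_frequency_weights(labels)

for cls, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{cls:10s} weight = {weight:.3f}")
```

Rare classes receive larger weights, so their errors count more in the loss. Reporting per-class metrics alongside this keeps a weak minority class from hiding behind an average.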
Comparing first and last frames within examinations confirmed genuine visual progression across the sequence. The temporal signal was there — the question became how to extract it.
A preprocessing footnote that became a core quality gate.
Aptos has no structured per-frame timing, so elapsed time had to be reconstructed from timestamp text burned into each image. The first OCR approach was not robust enough — so the methods were benchmarked explicitly against 150 manually-checked frames.
86 / 150 frames
76 / 150 frames
144 / 150 frames
The six remaining failures turned out to be Indocyanine Green Angiography frames — not FA frames at all. On genuine FA frames, Gemini reached 144 / 144 — a perfect score — and the discovery doubled as a way to discard non-FA frames automatically.
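A benchmark of this kind can be sketched in a few lines. The example below is an assumption-laden illustration: the timestamp format (`M:SS` with an optional fractional part), the tolerance, and the function names `parse_elapsed_seconds` and `benchmark` are all hypothetical, since the portfolio does not document the burned-in text format or the scoring rule.

```python
import re

# Assumed format: minutes:seconds with an optional fraction, e.g. "1:23.4".
_TIME = re.compile(r"(\d{1,2}):([0-5]\d)(?:[.,](\d{1,2}))?")


def parse_elapsed_seconds(ocr_text):
    """Return elapsed seconds parsed from OCR text, or None if nothing matches."""
    match = _TIME.search(ocr_text)
    if match is None:
        return None
    minutes, seconds, fraction = match.groups()
    total = int(minutes) * 60 + int(seconds)
    if fraction:
        total += int(fraction) / (10 ** len(fraction))
    return total


def benchmark(ocr_outputs, manual_labels, tolerance=0.5):
    """Count frames whose parsed time is within `tolerance` seconds of the manual label."""
    correct = 0
    for frame_id, truth in manual_labels.items():
        parsed = parse_elapsed_seconds(ocr_outputs.get(frame_id, ""))
        if parsed is not None and abs(parsed - truth) <= tolerance:
            correct += 1
    return correct, len(manual_labels)


# Made-up OCR outputs and hand-checked labels for three frames.
ocr_outputs = {"f1": "FA 0:45.2", "f2": "12:03", "f3": "no text found"}
manual_labels = {"f1": 45.2, "f2": 723.0, "f3": 30.0}
score = benchmark(ocr_outputs, manual_labels)
print(f"{score[0]} / {score[1]} frames")
```

Returning `None` for unparseable text, rather than a default value, keeps failed reads visible so they can be counted or filtered out.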
A genuine enrichment of the Aptos dataset that the whole MedAI group can reuse.
The timeline showed huge variability: some examinations run up to 90 minutes, though most last 14 minutes or less, and frames cluster near the beginning and end of an exam while the middle is sparsely sampled. With the when established, the next question was the what: were there distinguishable phases?
Calibrated to the data, not copied from a textbook.
A one-dimensional Linear Discriminant Analysis fitted on frozen embeddings became the project’s reusable measurement instrument — read through three metrics: balanced accuracy, distribution overlap, and the Fisher score. It tested whether a clinical phase convention actually transferred to Aptos.
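Two of the three metrics can be illustrated on values that have already been projected onto the LDA axis. This is a simplified two-class sketch under stated assumptions: it uses the textbook Fisher criterion and a midpoint threshold for balanced accuracy, and the sample values are invented. The project's own probe, its multi-phase setup, and its overlap measure may be defined differently.

```python
from statistics import mean, pvariance


def fisher_score(a, b):
    """Two-class Fisher criterion on 1-D values: squared mean gap over summed variances."""
    return (mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b))


def balanced_accuracy(a, b):
    """Mean per-class recall when splitting at the midpoint between the class means."""
    threshold = (mean(a) + mean(b)) / 2
    low, high = (a, b) if mean(a) < mean(b) else (b, a)
    recall_low = sum(x < threshold for x in low) / len(low)
    recall_high = sum(x >= threshold for x in high) / len(high)
    return (recall_low + recall_high) / 2


# Invented 1-D projections for two temporal phases.
early = [-2.1, -1.8, -1.5, -0.2]
late = [0.9, 1.4, 1.7, 2.2]
late_shifted = [x - 1.5 for x in late]  # same spread, closer to the early phase

print("well separated:", fisher_score(early, late), balanced_accuracy(early, late))
print("closer phases: ", fisher_score(early, late_shifted), balanced_accuracy(early, late_shifted))
```

The Fisher score drops as the phase means move together while the spread stays fixed, which is the behaviour that makes it useful for comparing candidate phase boundaries.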
Thinning the option space cheaply, before expensive runs.
The Temporal Separability Benchmark used that same LDA probe to compare backbone families and frame-picking strategies — showing temporal structure was present, and that finer binning gave diminishing returns. RETFound-Green was kept: efficient, robust, and continuous with the group’s prior work.
A separate controlled study asked whether PCA on frozen embeddings would help. It did not improve macro-ROC-AUC or AP. Negative — but decision-relevant: it removed an attractive, non-beneficial branch and kept the implementation path simpler.
Where the compute goes — and where the model goes wrong.
A question-driven MedNet analysis showed activation storage dominates parameter storage, that scaling is driven by batch size and temporal depth, and why out-of-memory errors persist even with gradient checkpointing. The bottleneck is not a large model — it is a per-frame backbone multiplying work in one step.
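A back-of-the-envelope estimate shows why activations, not weights, dominate. Every number below is an assumption for illustration: roughly 22 million parameters for a ViT-Small, 16×16 patches on a 224×224 input, 12 layers, float32 values, and a hypothetical factor of ten stored tensors per layer. It models the case where gradients flow through the backbone and is not a measurement of MedNet.

```python
def parameter_bytes(n_params, bytes_per_value=4):
    """Memory for the weights themselves (float32 by default)."""
    return n_params * bytes_per_value


def activation_bytes(batch, frames, tokens, dim, layers,
                     tensors_per_layer=10, bytes_per_value=4):
    """Rough activation memory: each frame is its own backbone pass,
    so the frame count multiplies the effective batch size."""
    return batch * frames * tokens * dim * layers * tensors_per_layer * bytes_per_value


# Assumed figures, for illustration only.
VIT_SMALL_PARAMS = 22_000_000       # approximate ViT-Small size
TOKENS = (224 // 16) ** 2 + 1       # 196 patches + 1 class token

weights = parameter_bytes(VIT_SMALL_PARAMS)
single = activation_bytes(batch=8, frames=1, tokens=TOKENS, dim=384, layers=12)
temporal = activation_bytes(batch=8, frames=12, tokens=TOKENS, dim=384, layers=12)

print(f"weights:            {weights / 1e6:8.0f} MB")
print(f"single-frame batch: {single / 1e6:8.0f} MB")
print(f"12-frame sequences: {temporal / 1e6:8.0f} MB")
```

Under these assumptions the weight footprint stays constant while activation memory grows linearly with both batch size and sequence length.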
An interactive error-analysis viewer separated actionable mistakes from out-of-reach ones. The model struggled with very small pathologies — traced to FA frames being resized from 334×334 to 224×224, discarding exactly the subtle cues that matter. That finding became a piece of advice.
An honest boundary on what the internship could reach.
Running anything against HOJG required the codebase to be mirrored into a separate, access-controlled environment. The privacy safeguards that protect patients are exactly what made the data hard to reach — and that process outlasted the project window. Q5 remains open. The whole method was deliberately built on Aptos as a transferable, re-runnable benchmark so the MedAI team can carry out that external validation.
Weak assumptions, exposed and corrected — converted into more defensible downstream decisions.
Advice
Three gates that constrained every recommendation.
The MedAI group needed a temporal pipeline that runs reproducibly on existing GPU infrastructure, integrates cleanly with the RETFound family already in use, and stays interpretable enough to discuss with clinical partners.
Compute cost & scalability
Infrastructure compatibility
Interpretability
Stating criteria, then failing to apply them.
The first advisory document compared three options — but treated the existing binning-plus-LSTM approach as the natural reference point, and named the project criteria without ever using them in the comparison. Stating requirements and then dropping them weakens the whole exercise.
Oscar's feedback was that the advisory focused too narrowly on optimising binning parameters and had not sufficiently considered alternative temporal-encoding strategies from recent literature. The recommendation: broaden the scope before committing to a design direction. This produced a dedicated literature review and a revised advisory report.
Not “which paper sounds newest” — which strategy survives.
| Strategy class | Compute & scalability | Interpretability | Verdict |
|---|---|---|---|
| Temporal binning | Passes — minimal preprocessing overhead. | Moderate — abstracts away elapsed time. | Track 1 — baseline |
| Explicit time-embedding | Strong — minimal added training cost. | Best — aligns with a continuous, clinical notion of time. | Track 2 — primary |
| Heavier temporal objectives | Fails — substantial complexity, longer cycles. | Fails — much harder to explain to clinical partners. | Deferred |
A two-track path — start small, end big.
Track 1 keeps temporal binning as a controlled baseline — it already satisfies the three gates and matches what was empirically exercised in depth. Track 2 implements a time-embedding strategy — the only option scoring acceptably on all three criteria at once.
The MedAI group’s feedback was unanimous — research is iterative, so Track 1 was picked first. Their other point: I had interpreted the interpretability gate as closeness to real-world practice, when it meant the model showing where it drew its information from — tools like Grad-CAM. A genuine lesson in confirming requirements rather than assuming them.
Conservative by design — extend, don’t replace.
Two recommendations: cache frozen ViT features so the temporal component trains on stored embeddings rather than recomputing the backbone every epoch; and extend MedNet so temporal batches become an explicit contract rather than ad-hoc local modifications. Switching the whole stack to MONAI would imply re-validation and integration work — extending MedNet is the realistic path.
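The caching recommendation can be sketched as a small disk-backed lookup. This is a hypothetical illustration, not MedNet code: `EmbeddingCache`, the pickle storage format, and the stub encoder are assumptions, and frame identifiers are assumed to be safe to use as file names.

```python
import pickle
import tempfile
from pathlib import Path


class EmbeddingCache:
    """Store one embedding per frame on disk so a frozen backbone runs once per frame."""

    def __init__(self, cache_dir, encoder):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.encoder = encoder  # callable: frame_id -> list of floats

    def get(self, frame_id):
        path = self.cache_dir / f"{frame_id}.pkl"
        if path.exists():
            with path.open("rb") as handle:
                return pickle.load(handle)
        embedding = self.encoder(frame_id)  # the expensive call, made only on a miss
        with path.open("wb") as handle:
            pickle.dump(embedding, handle)
        return embedding


calls = []


def stub_encoder(frame_id):
    """Stand-in for the frozen backbone; records how often it is invoked."""
    calls.append(frame_id)
    return [0.0] * 384


with tempfile.TemporaryDirectory() as tmp:
    cache = EmbeddingCache(tmp, stub_encoder)
    first = cache.get("exam01_frame003")
    second = cache.get("exam01_frame003")  # served from disk, no encoder call

print("encoder calls:", len(calls))
```

Because the backbone is frozen, its output for a given frame never changes, which is what makes caching safe here. It would not be safe if the backbone were fine-tuned.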
Advising beyond the MedAI group.
Presenting the work at the institute-wide TAM turned into advising other groups on their temporal problems — temporality, it turned out, is not unique to MedAI. The audience gave feedback too: one suggestion was to change how ViT tokens are pooled.
Test AUC rose from 0.85 to 0.86 after changing ViT token pooling from average to max, a piece of TAM feedback applied a week later.
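The difference between the two pooling modes is easy to see on a toy example. The sketch below uses invented token values and plain Python lists; it only illustrates the mechanics and does not claim to explain the measured gain.

```python
def mean_pool(tokens):
    """Average each feature dimension over all tokens."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def max_pool(tokens):
    """Keep the largest value of each feature dimension across tokens."""
    return [max(column) for column in zip(*tokens)]


# Four made-up tokens with three features each.
# Only one token responds strongly on feature 1.
tokens = [
    [0.1, 0.0, 0.2],
    [0.1, 0.9, 0.2],
    [0.1, 0.0, 0.2],
    [0.1, 0.1, 0.2],
]

print("mean:", mean_pool(tokens))
print("max: ", max_pool(tokens))
```

A response confined to a few tokens is diluted by averaging but survives max pooling, which is one plausible reason the change could help with small, localised findings.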
A trade-off, raised honestly with the pipeline owner.
The error analysis suggested that downscaling inputs loses subtle pathological cues. I raised this directly with Roberto, who owns the preprocessing pipeline. His reply confirmed the concern was real — but it is always a trade-off: gains in performance against losses in memory and inference latency. Weighed against the efficiency requirement, the idea was set aside in favour of higher-return work.
Design
Extend the existing pipeline — don’t replace it.
The RETFound-Green backbone and the existing linear head were kept unchanged, with a temporal block inserted between them. The constraint was deliberate: if performance changes, it should be attributable to temporal modelling — not to a swapped backbone or classifier.
RETFound-Green backbone
Each frame is encoded into a 384-dimensional embedding.
Temporal stack
Bin-then-select, then a two-layer GRU summarises the progression into a single vector.
Classification head
The existing linear head maps that vector to five HyperF-Type classes.
A strict boundary between public development and clinical use.
The core sequence logic and GRU model were built dataset-agnostic, so the public Aptos dataset could serve as a sandbox with no risk of leaking sensitive information. The external OCR service was applied only to public data — when the model is eventually run on HOJG it will use the hospital’s own metadata, so no patient information ever touches an outside server.
Trade-offs made explicit.
A recurrent model was preferred over a Transformer — sequence lengths after binning are short, and the recurrent option fit the available data and compute better. The design review specified an LSTM; implementation switched to a GRU without changing the tensor contract.
Feeding more frames helped — but full sequences can reach 200 frames. The resolution: keep the data-driven phases, pick four frames per phase, and feed an ordered sequence of twelve embeddings to the GRU.
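The bin-then-select step can be sketched as follows. The phase boundaries (60 s and 300 s) and the even-spacing selection rule are hypothetical placeholders; the project's data-driven boundaries and its actual frame-picking strategy are not specified here.

```python
import bisect


def assign_phase(seconds, boundaries):
    """Index of the phase containing `seconds`; boundaries are phase edges in seconds."""
    return bisect.bisect_right(boundaries, seconds)


def pick_evenly(items, k):
    """Pick up to k items spread evenly across the list, preserving order."""
    if len(items) <= k:
        return list(items)
    if k == 1:
        return [items[len(items) // 2]]
    step = (len(items) - 1) / (k - 1)
    return [items[round(i * step)] for i in range(k)]


def bin_then_select(frames, boundaries, per_phase=4):
    """frames: list of (elapsed_seconds, frame_id). Returns frame ids in time order."""
    phases = [[] for _ in range(len(boundaries) + 1)]
    for seconds, frame_id in sorted(frames):
        phases[assign_phase(seconds, boundaries)].append(frame_id)
    selected = []
    for phase in phases:
        selected.extend(pick_evenly(phase, per_phase))
    return selected


# Sixty made-up frames, one every ten seconds.
frames = [(i * 10, f"frame_{i:03d}") for i in range(60)]
selected = bin_then_select(frames, boundaries=[60, 300])  # hypothetical boundaries
print(len(selected), selected)
```

With three phases and four frames each, the output is an ordered sequence of twelve frames. A phase with fewer than four frames contributes all it has, which is why downstream code still needs to handle variable lengths.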
Aligned with evaluation from the start.
All configurations use the same fixed train / validation / test split, with a held-out test set of 211 examinations. The primary metrics match the baseline work — ROC-AUC, to validate ranking of disease severity across all thresholds, and F1, to ground the evaluation in clinical utility and penalise majority-class defaulting. Blindly borrowing metrics is a fast track to misleading results — so each was chosen for a reason.
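A small example shows how F1 penalises majority-class defaulting. It assumes macro averaging (an unweighted mean of per-class F1), which the text does not specify, and uses an invented two-class label set.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented labels: eight majority-class cases and two minority-class cases.
y_true = [0] * 8 + [1] * 2
always_majority = [0] * 10  # a model that never predicts the minority class

print("accuracy:", accuracy(y_true, always_majority))
print("macro F1:", macro_f1(y_true, always_majority))
```

The degenerate model looks acceptable on accuracy but scores poorly on macro F1, because the minority class contributes an F1 of zero.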
Architecture in service of maintainable research software.
| Characteristic | How the design supports it |
|---|---|
| Reproducibility | Fixed split file; backbone frozen; embeddings computed once and reused. |
| Scalability | Cost scales with phase count, not with raw frames per examination. |
| Interpretability | Bin structure and recurrent hidden states are easier to discuss than full fine-tuned backbones. |
| Compatibility | Backbone and head interfaces unchanged; the temporal block is a drop-in layer. |
| Debuggability | A prototype script prints tensor shapes through the pipeline before integration. |
| Security & privacy | Public / private dataset segregation; external OCR isolated to public data. |
Two layers, because two kinds of bug fail in different places.
A self-contained prototype validated end-to-end tensor flow and sequence reshaping — and caught an early reshaping mismatch before integration, reducing debugging cost later.
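The kind of shape walk such a prototype performs can be illustrated without any tensor library. The sketch below tracks shapes as plain tuples; the function names and the hidden size of 256 are hypothetical, while the 384-dimensional embedding and the five output classes come from the design above.

```python
def flatten_frames(shape):
    """(batch, frames, C, H, W) -> (batch * frames, C, H, W) for the per-frame backbone."""
    batch, frames, *image = shape
    return (batch * frames, *image)


def backbone(shape, embed_dim=384):
    """Each image becomes one embedding vector."""
    n_images, _channels, _height, _width = shape
    return (n_images, embed_dim)


def unflatten(shape, batch):
    """(batch * frames, dim) -> (batch, frames, dim); fails loudly on a mismatch."""
    n_images, dim = shape
    if n_images % batch:
        raise ValueError(f"{n_images} embeddings cannot be split into {batch} exams")
    return (batch, n_images // batch, dim)


def gru_last_hidden(shape, hidden):
    """The recurrent block reduces the frame axis to one summary vector."""
    batch, _frames, _dim = shape
    return (batch, hidden)


def head(shape, n_classes=5):
    return (shape[0], n_classes)


def trace(batch=2, frames=12, hidden=256):  # hidden size is a placeholder
    shapes = [(batch, frames, 3, 224, 224)]
    shapes.append(flatten_frames(shapes[-1]))
    shapes.append(backbone(shapes[-1]))
    shapes.append(unflatten(shapes[-1], batch))
    shapes.append(gru_last_hidden(shapes[-1], hidden))
    shapes.append(head(shapes[-1]))
    return shapes


for step in trace():
    print(step)
```

The flatten and unflatten pair is where a reshaping mismatch is most likely, so the explicit check in `unflatten` turns a silent error into an immediate failure.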
The test strategy splits in two: a model layer (test_vitgru) covering freeze / unfreeze policy and a full forward pass, and a data layer (test_seq_angioreport) covering frame-selection boundaries, the split parser, and variable-length padding — where a silent error would corrupt the GRU signal without raising an exception.
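A data-layer check for variable-length padding might look like the sketch below. `pad_sequences` is a hypothetical helper written for this illustration, not the function under test in `test_seq_angioreport`, and the tiny embedding dimension is chosen only for readability.

```python
def pad_sequences(sequences, dim):
    """Right-pad each sequence of embeddings with zero vectors up to the longest length.

    Returns the padded sequences and the original lengths, so a recurrent
    model can be told which positions are real and which are padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, lengths = [], []
    for seq in sequences:
        lengths.append(len(seq))
        padding = [[0.0] * dim for _ in range(max_len - len(seq))]
        padded.append([list(vec) for vec in seq] + padding)
    return padded, lengths


# Two made-up examinations with 3 and 1 frames, embedding dimension 2.
exam_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
exam_b = [[0.7, 0.8]]
padded, lengths = pad_sequences([exam_a, exam_b], dim=2)

# Checks of the kind a data-layer unit test would make.
assert lengths == [3, 1]
assert all(len(seq) == 3 for seq in padded)
assert padded[1][1:] == [[0.0, 0.0], [0.0, 0.0]]  # padding is all zeros
assert padded[0] == exam_a                        # real frames are untouched
```

Returning the lengths alongside the padded data matters: in PyTorch they are typically passed to `torch.nn.utils.rnn.pack_padded_sequence` so the GRU skips padded positions instead of treating zeros as frames.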
Mean-pooling frames within a bin risks averaging away exactly the temporal variation the model is meant to learn. The recommendation: use a single representative frame per bin instead, and evaluate both strategies systematically rather than assuming one is always better.
Realisation
Working code is not automatically good professional software.
run_downstream.py
Proved the temporal idea end-to-end: precomputed embeddings arranged as short sequences, a GRU, classification from the final hidden state. Fast to modify — but configuration, data prep, model and training were mixed in one file.
MedNet temporal pipeline
A full image-to-prediction pipeline inside the MedNet project structure: image folders grouped by exam, sequences loaded and preprocessed, a ViT backbone per frame, a GRU, and class probabilities for the whole examination.
Responsibilities separated across the project structure.
The final pipeline is reusable because each concern has a home — so a future experiment can change one layer without disturbing the rest.
Grouping & sequence construction
The ViT-GRU model
Training & auditable output
Consistent with the Idiap codebase.
Validated before any expensive training job.
The two-layer unit-test strategy designed in the previous phase was implemented as planned, both modules living in the MedNet library. They confirm the model-level and data-level components behave correctly in isolation — and continue to serve as a regression guard during refactoring. Structural validation traces the complete flow, from split JSON to evaluation output.
The review feedback was to use a GRU instead of the specified LSTM. It offers comparable representational power with fewer parameters and lower computational cost, which is relevant when the sequence after binning is short and experiments run on a shared cluster. The change was incorporated before implementation.
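The parameter saving can be checked with simple arithmetic. The sketch below counts parameters using the layout PyTorch uses for `nn.GRU` and `nn.LSTM` (input-to-hidden weights, hidden-to-hidden weights, and two bias vectors per layer, with 3 gate blocks for a GRU and 4 for an LSTM). The hidden size of 256 is a hypothetical value; the 384-dimensional input and two layers come from the design.

```python
def recurrent_params(input_size, hidden_size, num_layers, gates):
    """Parameter count for a stacked, unidirectional recurrent network.

    gates = 3 for a GRU, 4 for an LSTM.
    """
    total = 0
    for layer in range(num_layers):
        in_features = input_size if layer == 0 else hidden_size
        total += gates * hidden_size * in_features   # input-to-hidden weights
        total += gates * hidden_size * hidden_size   # hidden-to-hidden weights
        total += 2 * gates * hidden_size             # two bias vectors
    return total


INPUT_SIZE = 384   # backbone embedding dimension
HIDDEN_SIZE = 256  # hypothetical hidden size
LAYERS = 2

gru = recurrent_params(INPUT_SIZE, HIDDEN_SIZE, LAYERS, gates=3)
lstm = recurrent_params(INPUT_SIZE, HIDDEN_SIZE, LAYERS, gates=4)

print(f"GRU:  {gru:,} parameters")
print(f"LSTM: {lstm:,} parameters")
print(f"ratio: {gru / lstm:.2f}")
```

For matching sizes the GRU always has three quarters of the LSTM's parameters, independent of the hidden size chosen.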
Results & Status
Reported metrics reflect the state at the time of writing.
0.82: ROC-AUC, last frame only (Pulvirenti)
0.86: Test AUC, after the average → max pooling change
The single-frame baseline set the bar at 0.82. Changing ViT token pooling from average to max lifted the temporal model’s test AUC from 0.85 to 0.86. These are real, measured results — but they are a snapshot, not a finished clinical claim, and should be read as such.
What this internship produced — and what it deliberately leaves to the next phase.
Produced:
- A complete analysis, advice and design chain, gated against explicit quality attributes.
- A working, MedNet-integrated temporal evaluation pipeline.
- A two-layer automated unit-test suite in the MedNet library.
- 32,844 validated frame timestamps, reusable by the whole MedAI group.
- A reproducible LDA phase-calibration probe and Temporal Separability Benchmark.
Left to the next phase:
- Full implementation of the explicit time-embedding track (Advice, Track 2).
- HOJG-scale external validation (research question Q5).
- Integration with production clinical workflows.
- Grad-CAM-style interpretability, as clarified by the supervisors.
The honest framing is the point. The artefacts here document analysis, advice, design and a working evaluation script — they do not claim a finished clinical product. Reporting the state accurately, and naming what remains, is itself part of the professional contribution.
Portfolio Map
Orientation
Professional Skills & Manage and Control
Process logbook
Daily logbook and weekly reflections — the personal control instrument.
Highlight evidence (Fig. 1–15)
Annotated communication, collaboration, GitLab and replanning evidence.
Week 01 — progress slides
Earliest weekly presentation — baseline for the communication growth story.
Week 02 — progress slides
Weekly MedAI progress presentation.
Week 03–04 — progress slides
Weekly MedAI progress presentation.
Week 05–06 — progress slides
Later presentation — clearer upfront emphasis, the communication lesson applied.
Week 07–09 — progress slides
Weekly MedAI progress presentation.
Week 10–12 — progress slides
Final weekly progress presentations.
Analysis
Aptos data exploration
Systematic check of labels, class balance and frame-level temporal change.
Data exploration (supporting)
Supporting exploratory notebook for the benchmark data.
OCR benchmark analysis
EasyOCR vs Tesseract vs Gemini 2.5 Flash on 150 hand-checked frames.
Timeline distribution analysis
Where stamped frames sit on the FA timeline; sampling density per exam.
Timestamp metadata quality
Exam-duration variability and timestamp reliability after extraction.
Probing temporal separability (TSB)
The LDA-probe benchmark ranking backbones and binning strategies.
TSB study proposal
The benchmark plan sent to André for sign-off before running it.
PCA embedding / GRU study
Controlled study: does PCA on frozen embeddings help? (Decision-relevant no.)
MedNet framework analysis
Where GPU memory goes; why OOM persists even with gradient checkpointing.
Literature — temporal encoding
Literature review of temporal-encoding strategies (triggered by Oscar's feedback).
Studied concepts notebook
Handwritten/sketched notes building genuine understanding of the methods.
LSTM bin-count evaluation
Early evaluation of bin counts for the recurrent model.
Model error analysis (interactive)
Interactive exam viewer — ground truth vs prediction, frame by frame.
analyse_timestamps.py
Timestamp analysis script used after OCR extraction.
Figure 1 — Notion paper database
The literature trail: titles, dates, keywords kept organised in Notion.
Figure 2 — paper notes
Reading notes written in own words, unfamiliar terms pinned down.
Figure 3 — paper notes
A second example of the same dense-abstract note-taking habit.
Figure 5 — LD1 histogram (manual boundaries)
Cluster geometry under paper-derived phase boundaries — Early and Mid overlap.
Advice
Advisory — week 04 options matrix
The first advisory document — and the one Oscar's feedback corrected.
Advisory — temporal encoding (revised)
The revised two-track recommendation, gated against three project criteria.
TAM presentation — slides & notes
Cross-disciplinary dissemination of the decision logic to all of Idiap.
MedNet framework advice — slides & notes
Advice on caching, temporal batches and MedNet vs MONAI.
Advice on information loss (Mattermost)
The Mattermost exchange with Roberto on downscaling trade-offs.
Design
Temporal pipeline architecture
Backbone → temporal stack → unchanged classification head.
Design decisions & rationale
Full trade-off reasoning behind the architecture and quality characteristics.
prototype-temporal-pipeline-shapes.py
Self-contained prototype validating end-to-end tensor flow before integration.
Temporal pipeline design (interactive)
Interactive design walkthrough of the temporal pipeline.
Realisation & appendices
Temporal pipeline — implementation
Thorough explanation of the final MedNet-integrated pipeline.
run_downstream.py
The first exploratory downstream script — the GRU temporal classifier.
hyperftype split — README
Appendix note: where the fixed train/val/test split lives on Idiap storage.
hyperftype split — sample JSON
A representative sample of the split-definition JSON structure.
Where something cannot be redistributed — for example the exact train / validation / test ID list used on the cluster — the appendices point to where the file lives on Idiap storage and what it contains, rather than shipping it.
Thank you for taking the time.
This portfolio documents a real research contribution — built carefully, reviewed honestly, and made reusable for the team that continues it.
