Hao Lin henliveira

Hao Lin

I'm an MS student at Nanjing University, working on multimodal large language models: what makes them perceive the world coherently across vision, language, and audio, and what makes them fail. Most of my open-source work is small, sharp libraries that fall out of my own research when I find myself cleaning up the same problem twice.

Research focus

Efficient multimodal inference. Visual tokens dominate the cost of MLLM forward passes; the same is increasingly true for video and audio. I'm interested in training-free methods that get most of the way to a clean fine-tune.
Fine-grained perception. General VQA accuracy has saturated; the questions that remain are the ones where you have to actually look — counting, materials, spatial relations under occlusion. Benchmarks here matter more than ever.
Data curation for audio-visual models. Web video is mostly noise. The pipeline you use to clean it is part of the model, whether you treat it that way or not.

Selected open-source work

vlm-token-pruner — training-free visual-token reduction for MLLMs (FastV, VisionZip, ToMe, spatial pooling).
fg-percept-bench — a 6.4k-item benchmark of fine-grained visual perception (counting, colour, material, shape, spatial, state).
av-curator — a modular audio-visual data curation pipeline for messy web video.
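As a rough illustration of the simplest technique named under vlm-token-pruner, spatial pooling, here is a minimal NumPy sketch. It is not the library's API: the function name `pool_visual_tokens`, its parameters, and the assumption that visual tokens arrive as a row-major square grid are all hypothetical, chosen only to show the idea of reducing visual tokens without any training.

```python
import numpy as np


def pool_visual_tokens(tokens: np.ndarray, grid: int, window: int = 2) -> np.ndarray:
    """Average-pool a row-major (grid*grid, dim) token array over window x window patches.

    Hypothetical helper for illustration; assumes a square token grid.
    """
    n, dim = tokens.shape
    if n != grid * grid:
        raise ValueError(f"expected {grid * grid} tokens, got {n}")
    if grid % window != 0:
        raise ValueError("grid size must be divisible by window")

    out = grid // window
    # Split rows and columns into (block index, offset within block).
    blocks = tokens.reshape(out, window, out, window, dim)
    # Average over the two within-block axes, leaving one token per block.
    pooled = blocks.mean(axis=(1, 3))
    return pooled.reshape(out * out, dim)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual_tokens = rng.standard_normal((24 * 24, 1024))  # assumed 24x24 grid
    reduced = pool_visual_tokens(visual_tokens, grid=24, window=2)
    print(visual_tokens.shape, "->", reduced.shape)
```

With a 2x2 window the token count drops by a factor of four (576 to 144 in the assumed 24x24 case). Attention-based methods such as FastV choose which tokens to keep rather than averaging neighbours, so this sketch covers only the pooling baseline.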

Stack

PyTorch, Hugging Face Transformers, OpenCLIP, Whisper, ffmpeg, and Lance/Parquet for large-scale manifests. Most experiments run on a single A100 box; longer training runs go to a small SLURM cluster.

Nanjing, China · usually reachable evenings UTC+8 · open issues are the best way to start a conversation
