I'm an MS student at Nanjing University, working on multimodal large language models: what lets them perceive the world coherently across vision, language, and audio, and what makes them fail. Most of my open-source work is small, sharp libraries that fall out of my own research when I find myself cleaning up the same problem twice.
- Efficient multimodal inference. Visual tokens dominate the cost of MLLM forward passes; the same is increasingly true for video and audio. I'm interested in training-free methods that get most of the way to a clean fine-tune.
- Fine-grained perception. General VQA accuracy has saturated; the questions that remain are the ones where you have to actually look — counting, materials, spatial relations under occlusion. Benchmarks here matter more than ever.
- Data curation for audio-visual models. Web video is mostly noise. The pipeline you use to clean it is part of the model, whether you treat it that way or not.
- vlm-token-pruner: training-free visual-token reduction for MLLMs (FastV, VisionZip, ToMe, spatial pooling).
- fg-percept-bench: a 6.4k-item benchmark of fine-grained visual perception (counting, colour, material, shape, spatial, state).
- av-curator: a modular audio-visual data curation pipeline for messy web video.
PyTorch, HuggingFace Transformers, OpenCLIP, Whisper, ffmpeg, and Lance/Parquet for large-scale manifests. Most experiments run on a single A100 box; longer training runs go to a small SLURM cluster.
Nanjing, China · usually reachable evenings UTC+8 · open issues are the best way to start a conversation
