  • Nanjing University
  • Nanjing, China

henliveira/README.md

Hao Lin

I'm an MS student at Nanjing University, working on multimodal large language models: what makes them perceive the world coherently across vision, language, and audio, and what makes them fail. Most of my open-source work is small, sharp libraries that fall out of my own research when I find myself cleaning up the same problem a second time.


Research focus

  • Efficient multimodal inference. Visual tokens dominate the cost of MLLM forward passes; the same is increasingly true for video and audio. I'm interested in training-free methods that get most of the way to a clean fine-tune.
  • Fine-grained perception. General VQA accuracy has saturated; the questions that remain are the ones where you have to actually look — counting, materials, spatial relations under occlusion. Benchmarks here matter more than ever.
  • Data curation for audio-visual models. Web video is mostly noise. The pipeline you use to clean it is part of the model, whether you treat it that way or not.

Selected open-source work

  • vlm-token-pruner — training-free visual-token reduction for MLLMs (FastV, VisionZip, ToMe, spatial pooling).
  • fg-percept-bench — a 6.4k-item benchmark of fine-grained visual perception (counting, colour, material, shape, spatial, state).
  • av-curator — a modular audio-visual data curation pipeline for messy web video.
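As a rough illustration of what "training-free visual-token reduction" can mean, here is a minimal sketch of score-based top-k selection, loosely in the spirit of attention-ranked pruning. It is not the API of vlm-token-pruner; the function name `select_visual_tokens`, the `keep_ratio` parameter, and the assumption that each visual token already has a scalar attention score are all hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def select_visual_tokens(
    attention_scores: Sequence[float],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return indices of the visual tokens to keep, in their original order.

    Assumption: attention_scores[i] is a precomputed scalar importance for
    visual token i (for example, the mean attention it receives at some layer).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_tokens = len(attention_scores)
    if n_tokens == 0:
        return []
    # Truncate towards zero, but always keep at least one token.
    n_keep = max(1, int(n_tokens * keep_ratio))
    # Rank token indices by score, highest first.
    ranked = sorted(range(n_tokens), key=lambda i: attention_scores[i], reverse=True)
    # Restore the original order so the surviving tokens keep their layout.
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.7, 0.05, 0.2]
    print(select_visual_tokens(scores, keep_ratio=0.5))
```

No weights are updated here: the model is left untouched and only the token sequence passed to later layers gets shorter, which is the sense in which such methods are training-free.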

Stack

PyTorch, HuggingFace Transformers, OpenCLIP, Whisper, ffmpeg, Lance/Parquet for large-scale manifests. Most experiments run on a single A100 machine; longer training runs go to a small SLURM cluster.


Nanjing, China · usually reachable evenings UTC+8 · open issues are the best way to start a conversation

Pinned

  1. av-curator Public

    Audio-visual data curation pipeline — scene cuts, silence trim, dedup, CLIP/Whisper filtering for messy web video.

    Python 230

  2. fg-percept-bench Public

    FG-PerceptBench: a fine-grained visual perception benchmark for multimodal LLMs (counting, colour, material, shape, spatial, state).

    Python 1

  3. vlm-token-pruner Public

    Training-free visual token pruning for multimodal LLMs — FastV, VisionZip, ToMe, spatial pooling.

    Python 1

  4. OpenSenseNova/SenseNova-Skills Public

    Modular SenseNova skills for building AI-powered office assistants and productivity workflows

    Python 3.1k 224

  5. deeplethe/forkd Public

    Fork() for AI agent microVMs. Spawn 100 children in ~100ms from a warm parent; BRANCH a live VM in ~150ms. KVM-isolated, snapshot CoW.

    Rust 1.2k 86

  6. openmemind/memind Public

    Self-evolving cognitive memory and context engine for AI agents in Java. Empowering 24/7 proactive agents like OpenClaw with understanding and SOTA performance.

    Java 895 84