I'm a CS master's student at Huazhong University of Science & Technology in Wuhan, China. Most of my time goes into figuring out why large multimodal models do what they do, and into building small, focused tools that make those answers easier to find.
I work mostly at the boundary between vision-language models and speech-language models: how visual and acoustic information actually flows through these systems, where alignment breaks, and how we measure that without fooling ourselves.
A little more about me
I came into ML from a fairly traditional CS undergrad — operating systems, compilers, that sort of thing — and ended up in multimodal research mostly because I kept getting frustrated that "interpretability" tooling assumed a single text decoder. The bits that interest me happen between the vision encoder, the connector, and the language model, and a lot of standard tools don't go there cleanly.
I like small, sharp libraries over framework-y monorepos. I think benchmarks are underrated. I think papers should publish their decoding configs. I think I should probably commit more often during exam season, but I rarely do.
If something here is useful to you, feel free to open an issue — happy to talk shop.
Wuhan, China · he/him · usually online in the evenings (UTC+8)
