  • Huazhong University of Science and Technology
  • Wuhan, China

edmicho/README.md

Hi, I'm Zihao 👋

I'm a CS master's student at Huazhong University of Science and Technology in Wuhan, China. Most of my time goes into figuring out why large multimodal models do what they do, and into building small, focused tools that make that easier to find out.

I work mostly at the boundary between vision-language models and speech-language models: how visual and acoustic information actually flows through these systems, where alignment breaks, and how we measure that without fooling ourselves.

What I'm up to

  • 🔭 Probing internals of multimodal LLMs (LLaVA / Qwen-VL / InternVL) — attention, hidden states, modality alignment
  • 🎧 Putting together a reproducible evaluation suite for end-to-end speech LLMs (Qwen-Audio, SALMONN, LLaMA-Omni, …)
  • 📚 Reading a lot about long-context multimodal reasoning and audio-visual grounding
  • ☕ Trying to survive the second year of grad school on coffee and bubble tea

Research interests

Multimodal LLMs · Speech LLMs · Mechanistic Interpretability · Benchmark Design · Modality Alignment

Stack

Python · PyTorch · Hugging Face · CUDA · NumPy · Linux · Bash · LaTeX · Git · Jupyter


Things I'm working on in the open

A little more about me

I came into ML from a fairly traditional CS undergrad — operating systems, compilers, that sort of thing — and ended up in multimodal research mostly because I kept getting frustrated that "interpretability" tooling assumed a single text decoder. The bits that interest me happen between the vision encoder, the connector, and the language model, and a lot of standard tools don't go there cleanly.

I like small, sharp libraries over framework-y monorepos. I think benchmarks are underrated. I think papers should publish their decoding configs. I think I should probably commit more often during exam season, but I rarely do.

If something here is useful to you, feel free to open an issue — happy to talk shop.

Wuhan, China · he/him · usually online evenings UTC+8

Pinned

  1. edmicho · Public

    Profile README

  2. paper-pilot · Public

    Forked from Xueyang-Song/paper-pilot

    TypeScript

  3. speech-llm-bench · Public

    A reproducible evaluation suite for speech-conditioned LLMs — ASR, spoken QA, audio captioning, and more.

    Python

  4. ModelEngine-Group/nexent · Public

    Nexent is a zero-code platform for auto-generating production-grade AI agents using Harness Engineering principles — unified tools, skills, memory, and orchestration with built-in constraints, feed…

    Python · 4.8k stars · 619 forks

  5. fim-ai/fim-one · Public

    Open-source agent platform for Global × China enterprises — wire every system through one agent core. Self-hosted, any LLM.

    Python · 1.2k stars · 133 forks

  6. Agentshire/Agentshire · Public

    OpenClaw / QClaw plugin that visualizes AI agents as 3D NPCs in a game town — with social simulation, a map editor, and a character workshop.

    TypeScript · 1.1k stars · 164 forks