Input: Audio -> Output: Intent
The standard two-stage conversion, "audio to text" and then "text to intent", is too error-prone. It is easier to train directly for a specific intent from any sound, gesture, or combination of signals. Just make sure the signal takes as few steps as possible to produce, and that there is a statistical distinction metric: how unique is the combination compared to daily background noise or observations?
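One way to make the "statistical distinction metric" concrete is sketched below. It assumes each signal combination has already been reduced to a fixed-length list of numeric features, and it scores a candidate by its mean absolute z-score against recorded background samples. The function names, the feature representation, and the threshold of 3.0 are illustrative assumptions, not part of the original idea.

```python
import statistics

def distinctiveness(candidate, background):
    """Mean absolute z-score of the candidate's features against background samples.

    candidate:  list of floats (hypothetical feature vector for one signal combination)
    background: list of feature vectors recorded from everyday noise/observations
    """
    total = 0.0
    for i, value in enumerate(candidate):
        column = [sample[i] for sample in background]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        # Guard against constant background features (std == 0).
        total += abs(value - mean) / (std if std > 0 else 1e-9)
    return total / len(candidate)

def is_distinct(candidate, background, threshold=3.0):
    """Assumed rule: accept a trigger only if it stands well clear of background."""
    return distinctiveness(candidate, background) >= threshold

# Toy background observations with two features each (made-up numbers).
background = [
    [0.10, 0.20],
    [0.12, 0.18],
    [0.08, 0.22],
    [0.11, 0.21],
]

print(is_distinct([0.90, 0.80], background))  # far from background
print(is_distinct([0.10, 0.20], background))  # looks like background
```

A signal that scores low here would fire on ordinary noise, so it is a poor choice of trigger however easy it is to produce.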
Convert intent to command.
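The intent-to-command step can be as simple as a lookup table. The sketch below assumes intents arrive as plain string labels. The intent names and handlers are hypothetical placeholders.

```python
from typing import Callable, Dict

def lights_on() -> str:
    # Placeholder command; a real handler would call a device or service.
    return "lights: on"

def lights_off() -> str:
    return "lights: off"

# Hypothetical mapping from recognized intent labels to commands.
COMMANDS: Dict[str, Callable[[], str]] = {
    "lights_on": lights_on,
    "lights_off": lights_off,
}

def run_intent(intent: str) -> str:
    """Look up the command for an intent; unknown intents are ignored."""
    handler = COMMANDS.get(intent)
    if handler is None:
        return "ignored: unknown intent"
    return handler()

print(run_intent("lights_on"))
print(run_intent("double_clap"))
```

Ignoring unknown intents keeps a misrecognized signal from triggering an arbitrary command.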