We had this ability before 0.3.1, when we were shelling out to mlx_lm as a system CLI. That approach doesn't scale, and since we already have model inference in the py server and a modelfile abstraction, we should just use those.
Currently the model runner is not generic: it is hardcoded to run memory models.
Instead, move the default system prompt into the default modelfile, and pass the model parameters from the modelfile to the py server for inference.
Make the REPL generic so it can run any model. Memory models have relays, but setting relay_count to 1 works for other models (we could even add relay_count to the modelfile itself later).
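A minimal sketch of how this could fit together. All names here are hypothetical (the modelfile fields, `inference_request`, and the payload shape are illustrative, not the project's actual API): the modelfile carries the default system prompt and inference parameters, and relay_count defaults to 1 so non-memory models work unchanged.

```python
# Hypothetical sketch -- field names and helper are illustrative only.
# The modelfile holds the default system prompt and model parameters;
# the REPL forwards them to the py server for inference.

DEFAULT_MODELFILE = {
    "model": "example/model",  # hypothetical model identifier
    "system_prompt": "You are a helpful assistant.",
    "parameters": {"temperature": 0.7, "max_tokens": 512},
    # relay_count could live in the modelfile itself later;
    # memory models use >1, every other model runs with 1.
    "relay_count": 1,
}

def inference_request(modelfile: dict, prompt: str) -> dict:
    """Build the payload the REPL would send to the py server."""
    return {
        "model": modelfile["model"],
        "system_prompt": modelfile["system_prompt"],
        "prompt": prompt,
        **modelfile["parameters"],
        # Default to a single relay for non-memory models.
        "relay_count": modelfile.get("relay_count", 1),
    }
```

With this shape, the REPL never needs to know whether it is running a memory model; the relay behavior is entirely driven by the modelfile.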