GSOC PROJECT: Develop an OpenVINO-Domain Specialized Coder Model with SFT/GRPO/RAG #34299
Replies: 2 comments
Hi @Shi-pra-19 , Thank you for reaching out and sharing such a detailed and well-thought-out proposal! Regarding your proposed implementation and questions, here are my thoughts and recommendations:
Model Choice: Qwen 2.5 7B Coder is currently the state-of-the-art for this size, and DeepSeek-Coder-V2-Lite is also a fantastic choice. However, I would suggest dropping CodeLlama 7B from your list, as its architecture and performance are quite outdated compared to Qwen and DeepSeek.
An additional lightweight interface for users to interact with the trained model is also an important part of this project, so I recommend including it within the timeframe. (Of course, the core goal is still to train an excellent OpenVINO coder model; if you run short on time, you can treat the demo as optional.)
Using GitHub or email for future discussions and sharing proposal drafts is fine. My email is tao1.zhou@intel.com. Looking forward to your next step!
Hi @7taozhou7, Thank you for the detailed feedback and technical clarifications. I'll proceed with Qwen 2.5 7B Coder as the base model. For the training strategy, I will adopt QLoRA + GRPO (via Hugging Face TRL), using Unsloth for efficient fine-tuning.
Also, thank you for the correction regarding ONNX; I will use optimum-intel instead. For the user interface, I'll develop a terminal-based TUI as part of the deployment stage, while ensuring model quality remains the primary milestone. I'll prepare a more detailed technical proposal draft soon and share it via email or here for feedback. Looking forward to the next steps, and thanks again for the guidance! I'm very excited to move forward with this!
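As a back-of-the-envelope illustration of why LoRA/QLoRA shrinks the trainable footprint so sharply (a sketch with assumed layer dimensions, not taken from Qwen's actual config):

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter on a d_in x d_out weight:
    two low-rank factors, A (d_in x rank) and B (rank x d_out)."""
    return d_in * rank + rank * d_out

# Assumed example dimensions for a single projection matrix;
# illustrative only, not any specific model's config.
d = 4096
full = d * d                                   # full fine-tuning updates every weight
lora = lora_trainable_params(d, d, rank=16)    # LoRA updates only the two factors
print(full, lora, full / lora)                 # here: 128x fewer trainable parameters
```

QLoRA then quantizes the frozen base weights (typically to 4-bit), so only these small adapter factors are kept in higher precision for the optimizer.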
Hello @yinquan251, @7taozhou7, and the OpenVINO community!
I hope this message finds you well. My name is Shipra, and I am currently a third-year student at IIT Madras.
A bit about me: I currently work as a Quantitative Research Consultant at WorldQuant, and prior to that I gained experience in LLM evaluation and dataset curation for code-generation models at Remotasks. I've fine-tuned large language models, including a Qwen2.5-Math model (https://github.com/Shi-pra-19/Qwen_2.5_Fine_Tuning), and I hold the Kaggle Competitions Expert rank with two bronze medals. I have also deployed a RAG pipeline indexing course content (including Discourse forum discussions).
This GSoC project caught my attention because it aligns closely with my background. I do have some questions regarding the implementation:
Dataset curation: I believe curating a high-quality dataset will be a crucial part of the project, so I am considering a combination of resources:
- The latest OpenVINO documentation, repositories, GenAI API references, tutorials, and notebooks
- Stack Overflow and other discussion forums (focusing on the OpenVINO 2.0 API)
- Specific migration examples from older API versions to the current one
- Relevant commits and issues scraped via the GitHub GraphQL API
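To keep records from these heterogeneous sources consistent, one possible schema for a supervised fine-tuning pair could look like the following (field names and the example content are illustrative, not a fixed format):

```python
from dataclasses import dataclass, field

@dataclass
class SFTRecord:
    """One instruction-response pair in the curated dataset (illustrative schema)."""
    instruction: str           # task prompt, e.g. a migration request
    response: str              # target code using the OpenVINO 2.0 API
    source: str                # provenance: "docs", "stackoverflow", "github", ...
    api_version: str = "2.0"   # API version the response targets
    tags: list = field(default_factory=list)

# Example: a migration pair from the legacy IECore API to the 2.0 ov.Core API.
record = SFTRecord(
    instruction="Migrate this inference snippet from the legacy IECore API "
                "to the OpenVINO 2.0 ov.Core API.",
    response="import openvino as ov\n"
             "core = ov.Core()\n"
             "model = core.read_model('model.xml')\n"
             "compiled = core.compile_model(model, 'CPU')",
    source="docs",
    tags=["migration"],
)
```

Tracking `source` and `api_version` per record also makes it easy to rebalance the mix (e.g. up-weight migration pairs) and to filter out samples that still use the legacy API.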
Model selection: I am thinking of using LLMs such as Qwen 2.5 7B Coder, DeepSeek 7B Instruct Coder, or CodeLlama 7B Instruct.
GRPO design: Positively rewarding signals such as compilation success, execution correctness, latency/performance, and code quality/structure, while favoring correct usage of newer APIs.
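A composite reward along these lines could be sketched as a weighted sum of per-signal scores. The weights and the signal interface below are placeholders to be tuned during experiments, not a final design:

```python
def grpo_reward(compiles: bool, tests_passed: float, latency_ms: float,
                uses_new_api: bool, latency_budget_ms: float = 100.0) -> float:
    """Combine per-sample reward signals into one scalar for GRPO.

    tests_passed is the fraction of unit tests the generated code passes.
    The weights below are illustrative starting points only.
    """
    if not compiles:
        return -1.0  # hard penalty: nothing else matters if it won't compile
    reward = 0.3                                                     # base credit for compiling
    reward += 0.4 * tests_passed                                     # execution correctness
    reward += 0.2 * max(0.0, 1.0 - latency_ms / latency_budget_ms)   # speed vs. budget
    reward += 0.1 * (1.0 if uses_new_api else 0.0)                   # prefer OpenVINO 2.0 API
    return reward
```

Since GRPO normalizes rewards within each sampled group, the relative ordering of completions matters more than the absolute scale; the gap between compiling and non-compiling completions is what the -1.0 penalty is for.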
Inference and deployment: My plan is to export the model to ONNX, then optimize it with OpenVINO, applying NNCF for quantization/compression. I will also select a suitable precision per device and monitor performance via OpenVINO's benchmark_app.
As part of the project, I aim to provide an additional lightweight interface for users to interact with the trained model. This could be a terminal-based TUI or Streamlit demo.
Would you recommend including this within the time frame, or should it be considered an optional demo feature?
Prerequisite Contribution:
I have contributed 10+ PRs implementing NumPy operations in Keras for the OpenVINO backend, including:
keras-team/keras#22078
keras-team/keras#22025
A few questions:
- Would you recommend any additional considerations for deployment optimization?
- Is the above model size appropriate, or should we consider larger models (e.g., Qwen 2.5 14B Coder, CodeLlama 13B)?
- Can I use LoRA, QLoRA, or libraries like Unsloth to speed up training and reduce memory usage?
- Are there other resources I should consider for dataset curation and the RAG knowledge base?
- Which platform do you prefer for future discussion and for sharing demos and proposal drafts? Email, Discord, or another medium?
I am excited to discuss the implementation further and explore how I can contribute.
Thank you very much for your time!