Description
Hello KBLaM team,
First of all, thank you for making this project open and accessible — it's a really exciting approach to structured knowledge injection.
I'm a student working on a domain-specific use case involving a large food and nutrition database. It contains thousands of entities covering products, nutritional values, food groups, and other attributes such as:
- Macronutrients (e.g. ENERC, FAT, CHO, PROT)
- Micronutrients (e.g. VITC, CA, FE)
- Numeric values (e.g. 12.2 g sugar, 0.6 g saturated fat, 61.0 g carbohydrates)
- Categorization tags (e.g. "cereal products", "vegetables")
Training Setup
I trained the model for 2,000 steps using the Meta-Llama-3-8B-Instruct model. The food database was transformed into the expected KBLaM format like this:
{"name": "White pepper", "property": "FOOD_GROUP", "value": "Seasoning"}
{"name": "White pepper", "property": "CA", "value": "265.0"}
{"name": "Beef soup", "property": "THIA", "value": "0.022"}Observed Issues
Observed Issues
Despite the training, I'm encountering several challenges:
- Poor retrieval quality for health-related queries: e.g. "Which foods are good for diabetics?" often retrieves items with high sugar or refined carbs.
- Abbreviations like ENERC, FIBT, FASAT are not well understood. (I plan to map these to full names in the next training run; a sketch of the mapping is below this list.)
- Generated outputs are incoherent, sometimes repeating the user's prompt or hallucinating answers.
- Numeric values (e.g. 12.6 g) appear particularly problematic for the model to use effectively.
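For the abbreviation issue above, this is the kind of expansion I intend to apply before the next run. The code-to-name mapping (and the units) reflects my assumption that the dataset uses INFOODS/EuroFIR-style nutrient tagnames; please correct me if KBLaM expects a different property vocabulary.

```python
# Assumed INFOODS/EuroFIR-style tagnames from my dataset; units are my assumption, not KBLaM's.
NUTRIENT_NAMES = {
    "ENERC": "energy",
    "FAT": "total fat (g)",
    "CHO": "carbohydrates (g)",
    "PROT": "protein (g)",
    "VITC": "vitamin C (mg)",
    "CA": "calcium (mg)",
    "FE": "iron (mg)",
    "THIA": "thiamine (mg)",
    "FIBT": "total dietary fibre (g)",
    "FASAT": "saturated fatty acids (g)",
}

def expand_property(triple: dict) -> dict:
    """Replace the abbreviated property code with its full name, if known."""
    prop = triple["property"]
    return {**triple, "property": NUTRIENT_NAMES.get(prop, prop)}

# e.g. {"name": "White pepper", "property": "CA", "value": "265.0"}
#   -> {"name": "White pepper", "property": "calcium (mg)", "value": "265.0"}
```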
Request for Guidance
Could you please advise on best practices for integrating a numeric-heavy, structured KB like this into KBLaM?
Specifically:
- Handling Numeric Data: Any suggestions for how to better encode or structure numerical values so that the model can reason over them effectively? (A sketch of the verbalization I'm currently considering follows this list.)
- Downstream Fine-Tuning: Should I augment the training with open-ended QA examples (e.g. "What foods are good for a Mediterranean diet?")?
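To make the question concrete, this is the direction I'm considering for both points: verbalizing numeric triples into short natural-language sentences, and deriving simple QA pairs from them for augmentation. It is only a sketch of my own preprocessing idea, not something taken from the KBLaM codebase; the phrasing templates and the "per 100 g" basis are assumptions from my dataset.

```python
# Sketch of the preprocessing I'm considering (my own idea, not KBLaM's pipeline).
import re
from typing import Dict, Tuple

def verbalize(triple: Dict[str, str]) -> str:
    """Turn a triple into a short sentence the model may handle better than a bare number."""
    name, prop, value = triple["name"], triple["property"], triple["value"]
    if prop == "FOOD_GROUP":
        return f"{name} belongs to the food group '{value}'."
    # Split a property like "calcium (mg)" into nutrient and unit, if a unit is present.
    m = re.match(r"^(.*?)\s*\((\w+)\)$", prop)
    if m:
        nutrient, unit = m.group(1), m.group(2)
        return f"{name} contains {value} {unit} of {nutrient} per 100 g."
    return f"{name} contains {value} {prop} per 100 g."

def make_qa_pair(triple: Dict[str, str]) -> Tuple[str, str]:
    """Derive a simple open-ended QA example from a triple for training augmentation."""
    question = f"How much {triple['property']} does {triple['name']} contain?"
    return question, verbalize(triple)

triple = {"name": "White pepper", "property": "calcium (mg)", "value": "265.0"}
print(verbalize(triple))        # White pepper contains 265.0 mg of calcium per 100 g.
print(make_qa_pair(triple)[0])  # How much calcium (mg) does White pepper contain?
```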
I understand that numerical information may be difficult for the current compression method, as noted in the paper, but any insights or advice would be greatly appreciated.
Thank you again for your great work and for your time!