Commit 167022f

rebase dev with main (#280)
* 🔖 v1.4.6
* 🔖 Update metadata after release (#278)
  Co-authored-by: HamedBabaei <26560419+HamedBabaei@users.noreply.github.com>
* ✨ add custom AutoLLM support for special cases
* ✏️ improve dependencies
* 🔖 v1.4.7
* 🐛 fix mistral common dependency
* 🔖 Update metadata after release (#279)
  Co-authored-by: HamedBabaei <26560419+HamedBabaei@users.noreply.github.com>
* 🔖 v1.4.7
* 📝
* 📝
* 📝
---------
Co-authored-by: HamedBabaei <26560419+HamedBabaei@users.noreply.github.com>
1 parent eb81b9d commit 167022f

10 files changed

Lines changed: 2033 additions & 1797 deletions

CHANGELOG.md

Lines changed: 11 additions & 0 deletions
@@ -1,5 +1,16 @@
 ## Changelog

+### v1.4.7 (October 1, 2025)
+- add custom LLM based learner
+- add Falcon-H and Mistral-Small custom AutoLLMs.
+- Add custom LLM documentation.
+- Minor bug fix and improvements in documentation and code.
+
+### v1.4.6 (September 22, 2025)
+- add type annotation to metrics
+- add minor fix to retriever taxonomy discovery
+- add count metrics in evaluation.
+
 ### v1.4.5 (September 16, 2025)
 - add batch retriever feature to `AutoRetrieverLearner`

CITATION.cff

Lines changed: 1 addition & 1 deletion
@@ -31,5 +31,5 @@ keywords:
 - Large Language Models
 - Text-to-ontology
 license: MIT
-version: 1.4.5
+version: 1.4.7
 date-released: '2025'

docs/source/learners/llm.rst

Lines changed: 150 additions & 0 deletions
@@ -135,5 +135,155 @@ The OntoLearner package also offers a streamlined ``LearnerPipeline`` class that
     # Print all returned outputs (include predictions)
     print(outputs)

+
+Custom AutoLLM
+-----------------
+
+OntoLearner provides a default ``AutoLLM`` wrapper that handles popular model families (Mistral, Llama, Qwen, etc.) through HuggingFace or external providers. In some cases, however, you may want to integrate a model family that is not natively supported (e.g., Falcon, DeepSeek, or a proprietary LLM).
+
+For this, you can extend the ``AutoLLM`` class and implement the required
+``load`` and ``generate`` methods. The basic requirements are:
+
+1. Inherit from ``AutoLLM``.
+2. Implement ``load(model_id)`` if your model needs different loading logic (for example, `mistralai/Mistral-Small-3.2-24B-Instruct-2506 <https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506>`_ uses a different type of loading).
+3. Implement ``generate(inputs, max_new_tokens)`` to encode prompts, perform generation, decode outputs, and map them to labels.
+
+
+.. tab:: Falcon-H
+
+    The following example shows how to build a Falcon integration:
+
+    ::
+
+        from ontolearner import AutoLLM
+        from typing import List
+        import torch
+
+        class FalconLLM(AutoLLM):
+
+            def generate(self, inputs: List[str], max_new_tokens: int = 50) -> List[str]:
+                # Tokenize the whole batch, padding prompts to a common length
+                encoded_inputs = self.tokenizer(
+                    inputs,
+                    return_tensors="pt",
+                    padding=True,
+                    truncation=True
+                ).to(self.model.device)
+
+                input_ids = encoded_inputs["input_ids"]
+                input_length = input_ids.shape[1]
+
+                outputs = self.model.generate(
+                    input_ids,
+                    attention_mask=encoded_inputs["attention_mask"],  # ignore padded positions
+                    max_new_tokens=max_new_tokens,
+                    pad_token_id=self.tokenizer.eos_token_id
+                )
+
+                # Keep only the newly generated tokens (drop the prompt)
+                generated_tokens = outputs[:, input_length:]
+                decoded_outputs = [
+                    self.tokenizer.decode(g, skip_special_tokens=True).strip()
+                    for g in generated_tokens
+                ]
+
+                return self.label_mapper.predict(decoded_outputs)
+
+.. tab:: Mistral-Small
+
+    For Mistral, you can integrate the official ``mistral-common`` tokenizer and chat completion interface:
+
+    ::
+
+        from ontolearner import AutoLLM
+        from typing import List
+        import torch
+
+        class MistralLLM(AutoLLM):
+
+            def load(self, model_id: str) -> None:
+                from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
+                from transformers import Mistral3ForConditionalGeneration
+
+                self.tokenizer = MistralTokenizer.from_hf_hub(model_id)
+
+                device_map = "cpu" if self.device == "cpu" else "balanced"
+                self.model = Mistral3ForConditionalGeneration.from_pretrained(
+                    model_id,
+                    device_map=device_map,
+                    torch_dtype=torch.bfloat16,
+                    token=self.token
+                )
+
+                # Fall back to the EOS token when the tokenizer defines no pad token
+                if not hasattr(self.tokenizer, "pad_token_id") or self.tokenizer.pad_token_id is None:
+                    self.tokenizer.pad_token_id = self.model.generation_config.eos_token_id
+
+                self.label_mapper.fit()
+
+            def generate(self, inputs: List[str], max_new_tokens: int = 50) -> List[str]:
+                from mistral_common.protocol.instruct.request import ChatCompletionRequest
+
+                tokenized_list = []
+                for prompt in inputs:
+                    messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
+                    tokenized = self.tokenizer.encode_chat_completion(ChatCompletionRequest(messages=messages))
+                    tokenized_list.append(tokenized.tokens)
+
+                # Pad inputs and create attention masks
+                max_len = max(len(tokens) for tokens in tokenized_list)
+                input_ids, attention_masks = [], []
+                for tokens in tokenized_list:
+                    pad_length = max_len - len(tokens)
+                    input_ids.append(tokens + [self.tokenizer.pad_token_id] * pad_length)
+                    attention_masks.append([1] * len(tokens) + [0] * pad_length)
+
+                input_ids = torch.tensor(input_ids).to(self.model.device)
+                attention_masks = torch.tensor(attention_masks).to(self.model.device)
+
+                outputs = self.model.generate(
+                    input_ids=input_ids,
+                    attention_mask=attention_masks,
+                    eos_token_id=self.model.generation_config.eos_token_id,
+                    pad_token_id=self.tokenizer.pad_token_id,
+                    max_new_tokens=max_new_tokens,
+                )
+
+                decoded_outputs = []
+                for tokens in outputs:
+                    # Every prompt was padded to max_len, so generation starts at max_len
+                    output_text = self.tokenizer.decode(tokens[max_len:])
+                    decoded_outputs.append(output_text)
+
+                return self.label_mapper.predict(decoded_outputs)
+
+
+Once your custom class is defined, you can pass it into ``AutoLLMLearner``:
+
+.. code-block:: python
+
+    from ontolearner import AutoLLMLearner, LabelMapper, StandardizedPrompting
+
+    falcon_learner = AutoLLMLearner(
+        prompting=StandardizedPrompting,
+        label_mapper=LabelMapper(),
+        llm=FalconLLM,   # 👈 plug in custom Falcon
+        token="...",
+        device="cuda"
+    )
+
+    falcon_learner.llm.load(model_id="tiiuae/Falcon-H1-1.5B-Deep-Instruct")
+
+    # Train and evaluate
+    falcon_learner.fit(train_data, task="term-typing")
+    predictions = falcon_learner.predict(test_data, task="term-typing")
+
+    print(predictions)
+
+OntoLearner ships specialized classes for the following models:
+
+- For `mistralai/Mistral-Small-3.2-24B-Instruct-2506 <https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506>`_, use ``MistralLLM`` instead of ``AutoLLM``.
+- For the Falcon-H series of LLMs (e.g., `tiiuae/Falcon-H1-1.5B-Deep-Instruct <https://huggingface.co/tiiuae/Falcon-H1-1.5B-Deep-Instruct>`_), use ``FalconLLM`` instead of ``AutoLLM``.
+
+.. note::
+
+    You can implement as many custom AutoLLM classes as needed (e.g., for proprietary APIs, local models, or new HF releases). As long as they subclass ``AutoLLM`` and implement ``load`` and ``generate``, they will work seamlessly with ``AutoLLMLearner``.
+
+
 .. hint::
     See `Learning Tasks <https://ontolearner.readthedocs.io/learning_tasks/llms4ol.html>`_ for possible tasks within Learners.
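
The Mistral-Small example in the diff above pads token lists by hand, builds attention masks, and then cuts the prompt off the generated sequences. The sketch below isolates that logic in plain Python so it can be checked without a model. The helper names ``pad_batch`` and ``strip_prompt``, the token ids, and the simulated model output are all hypothetical and are not part of OntoLearner or ``mistral-common``.

```python
from typing import List, Tuple


def pad_batch(
    token_lists: List[List[int]], pad_id: int
) -> Tuple[List[List[int]], List[List[int]], int]:
    """Right-pad token lists to a common length and build attention masks."""
    max_len = max(len(tokens) for tokens in token_lists)
    input_ids, attention_masks = [], []
    for tokens in token_lists:
        pad_length = max_len - len(tokens)
        input_ids.append(tokens + [pad_id] * pad_length)
        # 1 marks a real token, 0 marks padding
        attention_masks.append([1] * len(tokens) + [0] * pad_length)
    return input_ids, attention_masks, max_len


def strip_prompt(output_rows: List[List[int]], prompt_len: int) -> List[List[int]]:
    """Drop the padded prompt so only newly generated tokens remain."""
    return [row[prompt_len:] for row in output_rows]


# Hypothetical token ids; 0 stands in for the pad token.
prompts = [[11, 12, 13], [21]]
input_ids, masks, max_len = pad_batch(prompts, pad_id=0)

# Simulate a model that appends two new tokens (99, 98) to every padded row.
fake_outputs = [row + [99, 98] for row in input_ids]
generated = strip_prompt(fake_outputs, max_len)
```

Because every row is padded to ``max_len`` before generation, the new tokens start at index ``max_len`` in each output row. Slicing at a prompt's unpadded length instead would carry its pad tokens into the decoded text.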
