
Commit 7a47a8b

30 more models added with existing providers (#3)
* 30 more models added with existing providers
* Mypy pytest fixes
* version updated
1 parent bdb7dbf commit 7a47a8b

5 files changed: +100 additions, -17 deletions


CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -6,6 +6,19 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 
+## [1.0.1] - 2025-01-07
+
+### Added
+- **New Providers:**
+  - **Databricks (5 models):** dbrx-instruct, dbrx-base, dolly-v2-12b, dolly-v2-7b, dolly-v2-3b
+  - **Voyage AI (6 models):** voyage-2, voyage-large-2, voyage-code-2, voyage-finance-2, voyage-law-2, voyage-multilingual-2
+- **30+ new models added across existing providers**
+
+### Enhanced
+- **Provider-Specific Approximations:** Added optimized tokenization approximations for Databricks and Voyage AI models.
+- **Model Detection:** Enhanced provider detection to support Databricks and Voyage AI models.
+- **Cost Estimation:** Added pricing information for all new models.
+
 ## [1.0.0] - 2025-01-06
 
 ### Added
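The per-provider model counts claimed in the changelog entry above can be cross-checked against the model tables this commit adds to toksum/core.py. A small sketch, with the lists copied from the diff in this commit:

```python
# Model lists as added in this commit (see DATABRICKS_MODELS and
# VOYAGE_MODELS in toksum/core.py).
DATABRICKS_MODELS = [
    "dbrx-instruct", "dbrx-base",
    "dolly-v2-12b", "dolly-v2-7b", "dolly-v2-3b",
]
VOYAGE_MODELS = [
    "voyage-2", "voyage-large-2", "voyage-code-2",
    "voyage-finance-2", "voyage-law-2", "voyage-multilingual-2",
]

print(len(DATABRICKS_MODELS))  # 5, matching "Databricks (5 models)"
print(len(VOYAGE_MODELS))      # 6, matching "Voyage AI (6 models)"
```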

README.md

Lines changed: 12 additions & 4 deletions
@@ -1,6 +1,6 @@
 # toksum
 
-A comprehensive Python library for counting tokens across 300+ Large Language Models (LLMs) from 32+ providers.
+A comprehensive Python library for counting tokens across 300+ Large Language Models (LLMs) from 34+ providers.
 
 [![PyPI version](https://badge.fury.io/py/toksum.svg)](https://badge.fury.io/py/toksum)
 [![Python Support](https://img.shields.io/pypi/pyversions/toksum.svg)](https://pypi.org/project/toksum/)
@@ -9,8 +9,8 @@ A comprehensive Python library for counting tokens across 300+ Large Language Mo
 ## Features
 
 
-- **🎯 Production Ready v1.0.0**: Comprehensive support for 300+ models across 32+ providers including OpenAI, Anthropic, Google, Meta, Mistral, Microsoft, Amazon, Nvidia, IBM, Salesforce, BigCode, and many more
-- **Comprehensive Multi-LLM Support**: Count tokens for 279 models across 32 providers including OpenAI, Anthropic, Google, Meta, Mistral, Microsoft, Amazon, Nvidia, IBM, Salesforce, BigCode, and many more
+- **🎯 Production Ready v1.0.1**: Comprehensive support for 300+ models across 34+ providers including OpenAI, Anthropic, Google, Meta, Mistral, Microsoft, Amazon, Nvidia, IBM, Salesforce, BigCode, Databricks, Voyage AI, and many more
+- **Comprehensive Multi-LLM Support**: Count tokens for 300+ models across 34 providers including OpenAI, Anthropic, Google, Meta, Mistral, Microsoft, Amazon, Nvidia, IBM, Salesforce, BigCode, Databricks, Voyage AI, and many more
 - **Accurate Tokenization**: Uses official tokenizers (tiktoken for OpenAI) and optimized approximations for all other providers
 - **Chat Message Support**: Count tokens in chat/conversation format with proper message overhead calculation
 - **Cost Estimation**: Estimate API costs based on token counts and current pricing
@@ -174,8 +174,16 @@ A comprehensive Python library for counting tokens across 300+ Large Language Mo
 - Multi-language code generation and understanding
 - Trained on diverse programming languages
 
+### Databricks Models (5 models)
+- **NEW: Databricks Models** (dbrx-instruct, dbrx-base, dolly-v2-12b, dolly-v2-7b, dolly-v2-3b)
+- High-quality instruction-following and base models
 
-**Total: 300+ models across 32+ providers**
+### Voyage AI Models (6 models)
+- **NEW: Voyage AI Models** (voyage-2, voyage-large-2, voyage-code-2, voyage-finance-2, voyage-law-2, voyage-multilingual-2)
+- State-of-the-art embedding models for various domains
+
+
+**Total: 300+ models across 34+ providers**
 
 ## Installation
 

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "toksum"
-version = "1.0.0"
+version = "1.0.1"
 description = "A comprehensive Python library for counting tokens across 300+ LLM models from 32+ providers including OpenAI, Anthropic, Google, Meta, Mistral, and more"
 readme = "README.md"
 requires-python = ">=3.8"

tests/test_toksum.py

Lines changed: 22 additions & 6 deletions
@@ -686,11 +686,10 @@ def test_eleutherai_models(self):
             assert tokens > 0
 
     def test_mosaicml_models(self):
-        """Test MosaicML/Databricks models."""
+        """Test MosaicML models."""
         mosaicml_models = [
             "mpt-7b", "mpt-7b-chat", "mpt-7b-instruct",
             "mpt-30b", "mpt-30b-chat", "mpt-30b-instruct",
-            "dbrx", "dbrx-instruct"
         ]
 
         for model in mosaicml_models:
@@ -702,6 +701,22 @@ def test_mosaicml_models(self):
             assert isinstance(tokens, int)
             assert tokens > 0
 
+    def test_databricks_models(self):
+        """Test Databricks models."""
+        databricks_models = [
+            "dbrx", "dbrx-instruct", "dbrx-base",
+            "dolly-v2-12b", "dolly-v2-7b", "dolly-v2-3b",
+        ]
+
+        for model in databricks_models:
+            counter = TokenCounter(model)
+            assert counter.provider == "databricks"
+
+            # Test basic token counting
+            tokens = counter.count("Hello, world!")
+            assert isinstance(tokens, int)
+            assert tokens > 0
+
     def test_replit_models(self):
         """Test Replit code models."""
         replit_models = ["replit-code-v1-3b", "replit-code-v1.5-3b", "replit-code-v2-3b"]
@@ -935,7 +950,8 @@ def test_provider_counts(self):
             "stability": 7,
             "tii": 6,
             "eleutherai": 12,
-            "mosaicml": 8,
+            "mosaicml": 6,  # Updated: Removed dbrx and dbrx-instruct
+            "databricks": 6,  # Updated: Added dbrx
             "replit": 3,
             "minimax": 5,
             "aleph_alpha": 4,
@@ -962,9 +978,9 @@ def test_provider_list(self):
             "openai", "anthropic", "google", "meta", "mistral",
             "cohere", "perplexity", "huggingface", "ai21", "together",
             "xai", "alibaba", "baidu", "huawei", "yandex", "stability",
-            "tii", "eleutherai", "mosaicml", "replit", "minimax",
+            "tii", "eleutherai", "mosaicml", "databricks", "replit", "minimax",
             "aleph_alpha", "deepseek", "tsinghua", "rwkv", "community",
-            "microsoft", "amazon", "nvidia", "ibm", "salesforce", "bigcode"
+            "microsoft", "amazon", "nvidia", "ibm", "salesforce", "bigcode", "voyage"  # Added voyage
         }
         actual_providers = set(models.keys())
         assert actual_providers == expected_providers
@@ -2896,4 +2912,4 @@ def test_consistency_across_model_variants(self):
 
 
 if __name__ == "__main__":
-    pytest.main([__file__])
+    pytest.main([__file__])

toksum/core.py

Lines changed: 52 additions & 6 deletions
@@ -276,16 +276,14 @@
     "pythia-12b": "pythia",  # NEW
 }
 
-# MosaicML/Databricks Models (using approximation)
+# MosaicML Models (using approximation)
 MOSAICML_MODELS = {
     "mpt-7b": "mpt",  # NEW
     "mpt-7b-chat": "mpt",  # NEW
     "mpt-7b-instruct": "mpt",  # NEW
     "mpt-30b": "mpt",  # NEW
     "mpt-30b-chat": "mpt",  # NEW
     "mpt-30b-instruct": "mpt",  # NEW
-    "dbrx": "dbrx",  # NEW
-    "dbrx-instruct": "dbrx",  # NEW
 }
 
 # Replit Models (using approximation)
@@ -569,6 +567,26 @@
     "text-similarity-davinci-001": "r50k_base",  # ADDED
 }
 
+# Databricks Models
+DATABRICKS_MODELS = {
+    "dbrx": "databricks",  # ADDED
+    "dbrx-instruct": "databricks",
+    "dbrx-base": "databricks",
+    "dolly-v2-12b": "databricks",
+    "dolly-v2-7b": "databricks",
+    "dolly-v2-3b": "databricks",
+}
+
+# Voyage AI Models
+VOYAGE_MODELS = {
+    "voyage-2": "voyage",
+    "voyage-large-2": "voyage",
+    "voyage-code-2": "voyage",
+    "voyage-finance-2": "voyage",
+    "voyage-law-2": "voyage",
+    "voyage-multilingual-2": "voyage",
+}
+
 
 class TokenCounter:
     """
@@ -600,6 +618,8 @@ def _detect_provider(self) -> str:
         openai_legacy_models_lower = {k.lower(): v for k, v in OPENAI_LEGACY_MODELS.items()}
         openai_o1_models_lower = {k.lower(): v for k, v in OPENAI_O1_MODELS.items()}
         openai_vision_models_lower = {k.lower(): v for k, v in OPENAI_VISION_MODELS.items()}
+        databricks_models_lower = {k.lower(): v for k, v in DATABRICKS_MODELS.items()}
+        voyage_models_lower = {k.lower(): v for k, v in VOYAGE_MODELS.items()}
         anthropic_models_lower = {k.lower(): v for k, v in ANTHROPIC_MODELS.items()}
         anthropic_legacy_models_lower = {k.lower(): v for k, v in ANTHROPIC_LEGACY_MODELS.items()}
         anthropic_haiku_models_lower = {k.lower(): v for k, v in ANTHROPIC_HAIKU_MODELS.items()}
@@ -652,7 +672,12 @@ def _detect_provider(self) -> str:
         mistral_instruct_models_lower = {k.lower(): v for k, v in MISTRAL_INSTRUCT_MODELS.items()}
         openai_embedding_models_lower = {k.lower(): v for k, v in OPENAI_EMBEDDING_MODELS.items()}
 
-        if (self.model in openai_models_lower or self.model in openai_legacy_models_lower or
+        # Prioritize Databricks models as they are more specific
+        if self.model in databricks_models_lower:
+            return "databricks"
+        elif self.model in voyage_models_lower:
+            return "voyage"
+        elif (self.model in openai_models_lower or self.model in openai_legacy_models_lower or
                 self.model in openai_o1_models_lower or self.model in openai_vision_models_lower or
                 self.model in openai_gpt4_turbo_models_lower or self.model in openai_embedding_models_lower):
             return "openai"
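The hunk above puts the Databricks and Voyage lookups ahead of the OpenAI branch, so the new tables win any overlap. A minimal standalone sketch of that priority logic, using toy model tables rather than the full set defined in core.py:

```python
# Toy model tables for illustration; the real core.py defines many more.
DATABRICKS_MODELS = {"dbrx": "databricks", "dbrx-instruct": "databricks"}
VOYAGE_MODELS = {"voyage-2": "voyage", "voyage-code-2": "voyage"}
OPENAI_MODELS = {"gpt-4": "cl100k_base"}


def detect_provider(model: str) -> str:
    """Case-insensitive lookup; Databricks and Voyage are checked first."""
    m = model.lower()
    if m in {k.lower() for k in DATABRICKS_MODELS}:
        return "databricks"
    if m in {k.lower() for k in VOYAGE_MODELS}:
        return "voyage"
    if m in {k.lower() for k in OPENAI_MODELS}:
        return "openai"
    raise ValueError(f"Unsupported model: {model}")


print(detect_provider("DBRX-Instruct"))  # databricks
print(detect_provider("voyage-2"))       # voyage
```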
@@ -725,7 +750,7 @@ def _detect_provider(self) -> str:
         elif self.model in bigcode_models_lower:
             return "bigcode"
         else:
-            supported = (list(OPENAI_MODELS.keys()) + list(OPENAI_LEGACY_MODELS.keys()) + list(OPENAI_O1_MODELS.keys()) +
+            supported = (list(DATABRICKS_MODELS.keys()) + list(VOYAGE_MODELS.keys()) + list(OPENAI_MODELS.keys()) + list(OPENAI_LEGACY_MODELS.keys()) + list(OPENAI_O1_MODELS.keys()) +
                          list(OPENAI_VISION_MODELS.keys()) + list(ANTHROPIC_MODELS.keys()) + list(ANTHROPIC_LEGACY_MODELS.keys()) +
                          list(ANTHROPIC_HAIKU_MODELS.keys()) + list(ANTHROPIC_COMPUTER_USE_MODELS.keys()) +
                          list(ANTHROPIC_CLAUDE_21_MODELS.keys()) + list(ANTHROPIC_INSTANT_2_MODELS.keys()) +
@@ -975,6 +1000,14 @@ def _approximate_tokens(self, text: str) -> int:
             # BigCode StarCoder models
             base_tokens = char_count / 3.4
             adjustment = (whitespace_count + punctuation_count) * 0.2
+        elif self.provider == "databricks":
+            # Databricks models
+            base_tokens = char_count / 4.0
+            adjustment = (whitespace_count + punctuation_count) * 0.25
+        elif self.provider == "voyage":
+            # Voyage AI models
+            base_tokens = char_count / 3.8
+            adjustment = (whitespace_count + punctuation_count) * 0.25
         else:
             # Default approximation
             base_tokens = char_count / 4
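The new branches follow the character-ratio heuristic used throughout `_approximate_tokens`: divide the character count by a per-provider ratio, then add a small adjustment for whitespace and punctuation. A self-contained sketch of just the two new branches; the counting helpers are assumptions here, since core.py computes `char_count`, `whitespace_count`, and `punctuation_count` earlier in the method:

```python
import string


def approximate_tokens(text: str, provider: str) -> int:
    """Character-ratio token estimate for the providers added in 1.0.1."""
    char_count = len(text)
    whitespace_count = sum(1 for c in text if c.isspace())
    punctuation_count = sum(1 for c in text if c in string.punctuation)

    if provider == "databricks":
        base_tokens = char_count / 4.0   # roughly 4 characters per token
        adjustment = (whitespace_count + punctuation_count) * 0.25
    elif provider == "voyage":
        base_tokens = char_count / 3.8   # slightly denser tokenization
        adjustment = (whitespace_count + punctuation_count) * 0.25
    else:
        base_tokens = char_count / 4     # default approximation
        adjustment = 0
    return max(1, int(base_tokens + adjustment))


# "Hello, world!" has 13 chars, 1 space, 2 punctuation marks.
print(approximate_tokens("Hello, world!", "databricks"))  # 4
```

The 0.25 adjustment nudges the estimate up for texts heavy in whitespace and punctuation, which tend to split into more tokens than raw character count suggests.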
@@ -1086,6 +1119,8 @@ def get_supported_models() -> Dict[str, List[str]]:
         "openai": (list(OPENAI_MODELS.keys()) + list(OPENAI_LEGACY_MODELS.keys()) +
                    list(OPENAI_O1_MODELS.keys()) + list(OPENAI_VISION_MODELS.keys()) +
                    list(OPENAI_GPT4_TURBO_MODELS.keys()) + list(OPENAI_EMBEDDING_MODELS.keys())),
+        "databricks": list(DATABRICKS_MODELS.keys()),
+        "voyage": list(VOYAGE_MODELS.keys()),
         "anthropic": (list(ANTHROPIC_MODELS.keys()) + list(ANTHROPIC_LEGACY_MODELS.keys()) +
                       list(ANTHROPIC_HAIKU_MODELS.keys()) + list(ANTHROPIC_COMPUTER_USE_MODELS.keys()) +
                       list(ANTHROPIC_CLAUDE_21_MODELS.keys()) + list(ANTHROPIC_INSTANT_2_MODELS.keys()) +
@@ -1109,7 +1144,7 @@ def get_supported_models() -> Dict[str, List[str]]:
         "stability": list(STABILITY_MODELS.keys()),
         "tii": list(TII_MODELS.keys()),
         "eleutherai": list(ELEUTHERAI_MODELS.keys()),
-        "mosaicml": list(MOSAICML_MODELS.keys()),
+        "mosaicml": list(MOSAICML_MODELS.keys()),  # Only MPT models remain here
         "replit": list(REPLIT_MODELS.keys()),
         "minimax": list(MINIMAX_MODELS.keys()),
         "aleph_alpha": list(ALEPH_ALPHA_MODELS.keys()),
@@ -1145,6 +1180,17 @@ def estimate_cost(token_count: int, model: str, input_tokens: bool = True) -> float:
     pricing = {
         "gpt-4": {"input": 0.03, "output": 0.06},
         "gpt-4-32k": {"input": 0.06, "output": 0.12},
+        "dbrx-instruct": {"input": 0.001, "output": 0.002},
+        "dbrx-base": {"input": 0.001, "output": 0.002},
+        "dolly-v2-12b": {"input": 0.001, "output": 0.002},
+        "dolly-v2-7b": {"input": 0.001, "output": 0.002},
+        "dolly-v2-3b": {"input": 0.001, "output": 0.002},
+        "voyage-2": {"input": 0.0001, "output": 0.0001},
+        "voyage-large-2": {"input": 0.0001, "output": 0.0001},
+        "voyage-code-2": {"input": 0.0001, "output": 0.0001},
+        "voyage-finance-2": {"input": 0.0001, "output": 0.0001},
+        "voyage-law-2": {"input": 0.0001, "output": 0.0001},
+        "voyage-multilingual-2": {"input": 0.0001, "output": 0.0001},
         "gpt-4-turbo": {"input": 0.01, "output": 0.03},
         "gpt-4-turbo-2024-04-09": {"input": 0.01, "output": 0.03},
         "gpt-4o": {"input": 0.005, "output": 0.015},
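A sketch of how the new pricing entries plug into `estimate_cost`. This mirrors the signature shown in the hunk header above but is restricted to a copied subset of the table; the per-1,000-token unit is an assumption inferred from the GPT-4 entries, which match the widely published $0.03/1K input rate:

```python
# Subset of the pricing table added in this commit; values read as
# USD per 1,000 tokens (assumption, consistent with the GPT-4 rates).
PRICING = {
    "dbrx-instruct": {"input": 0.001, "output": 0.002},
    "voyage-2": {"input": 0.0001, "output": 0.0001},
    "gpt-4": {"input": 0.03, "output": 0.06},
}


def estimate_cost(token_count: int, model: str, input_tokens: bool = True) -> float:
    """Mirror of the core.py signature, restricted to the subset above."""
    rates = PRICING[model]
    rate = rates["input"] if input_tokens else rates["output"]
    return (token_count / 1000.0) * rate


print(round(estimate_cost(10_000, "dbrx-instruct"), 6))                       # 0.01
print(round(estimate_cost(10_000, "dbrx-instruct", input_tokens=False), 6))  # 0.02
```

Note that the Voyage entries price input and output identically, which is consistent with their being embedding models: there is no generated output to bill separately.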
