- Installation
- Overview
- Tensor & Automatic Differentiation
- Layers & Modules
- Activation Functions
- Sequential Model & Building Models
- Custom Model Building
- Loss Functions & Metrics
- Optimizers
- Device & DataType Management
- Training Examples
- Transformer Architecture Implementation
- Saving and Loading Models
- Conclusion & Future Extensions
Before installing the required Python packages, please ensure your system meets the following prerequisites:
- **Python 3.12.x installation:**
  - Reason: The project environment and its dependencies (including `numpy==1.26.4`, which is compatible with Python 3.12) are specified to work, including CUDA support, with Python 3.12.x (e.g., 3.12.0, 3.12.1, 3.12.2). Using other Python versions (such as 3.11 or 3.13) may lead to compatibility issues.
  - Action: Ensure you have a Python 3.12.x interpreter installed and that you are using it for this project (e.g., in your virtual environment).
  - Verification: Check your active Python version by running `python --version` or `python3 --version` in your terminal.
- **NVIDIA CUDA Toolkit 12.x installation (optional, but recommended for GPU support):**
  - Reason: The project depends on `cupy-cuda12x==13.3.0`. The `cuda12x` suffix indicates that this build of CuPy is specifically built for, and requires, CUDA Toolkit version 12.x (e.g., 12.0, 12.1, 12.2, 12.3, 12.4).
  - Action: You must have a compatible NVIDIA driver and the CUDA Toolkit 12.x installed on your system.
  - Verification: Check your installed CUDA version by running `nvcc --version` in your terminal (if `nvcc` is in your PATH).
Summary of dependencies driving these requirements:
- Project configuration (implicit) and `numpy==1.26.4` -> require Python 3.12.x
- `cupy-cuda12x==13.3.0` -> requires CUDA 12.x
Please install or configure CUDA and Python according to these requirements before proceeding with the package installation steps (e.g., pip install -r requirements.txt).
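As a quick optional check before installing, the following sketch (standard library only, with an optional `cupy` import) verifies that the interpreter is 3.12.x and reports whether GPU acceleration will be available:

```python
import sys

# Verify the interpreter is Python 3.12.x
assert sys.version_info[:2] == (3, 12), f"Python 3.12.x required, found {sys.version.split()[0]}"

# CuPy (and therefore CUDA 12.x) is optional; without it the framework runs on CPU with NumPy
try:
    import cupy
    print("CuPy", cupy.__version__, "detected - GPU acceleration available")
except ImportError:
    print("CuPy not installed - running on CPU only")
```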
To use the framework, create a project folder, download the Cobra source files into it, and import the modules from there into your own files. A pip-installable package may be provided in the future.
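For example, assuming the module filenames used throughout this guide (`tensor.py`, `dense.py`, `activations.py`, `sequential.py`, and so on) sit next to your own script, imports work as in this small sketch:

```python
# your_script.py, placed alongside the downloaded framework source files
import numpy as np
from tensor import Tensor
from dense import Dense
from activations import ReLU

x = Tensor(np.random.randn(1, 3).astype(np.float32))
hidden = ReLU()(Dense(3, 4)(x))
print(hidden.data.shape)  # expected: (1, 4)
```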
This framework is a minimal yet powerful neural network library implemented using NumPy with optional support for CuPy (for GPU acceleration). It features:
- Automatic differentiation: Tensors track operations to enable gradient backpropagation with proper dtype handling
- Layer modularity: Layers such as Dense, Conv2D, and pooling layers can be composed easily
- Activation functions: A variety of activations (ReLU, Sigmoid, Tanh, etc.) are provided
- Loss functions and metrics: Compute losses like MSE, cross-entropy, and accuracy
- Optimizers: Standard algorithms like SGD (with momentum) and Adam are available
- Device and dtype management: Seamlessly move data between CPU and GPU and control numerical precision
- Model serialization: Save and load state via `state_dict()` and `load_state_dict()`
- Transformer architecture: Built-in components for creating transformer models
- Custom model building: Flexible API for creating complex model architectures
- Purpose: Acts as the core data structure for the framework. It wraps a NumPy (or CuPy) array, stores metadata (e.g., device, dtype), and tracks operations for automatic differentiation.
- Key Properties:
  - `data`: The underlying array.
  - `grad`: Gradient of the tensor (initialized to zeros when `requires_grad=True`).
  - `requires_grad`: Flag indicating whether the tensor requires gradient computation.
  - `device`: Either `'cpu'` or `'cuda'`. When set to `'cuda'`, operations use CuPy.
  - `dtype`: Data type of the tensor (e.g., `np.float32`).
  - `xp`: Reference to the appropriate numerical library (NumPy or CuPy).
- Core Methods:
  - Autograd:
    - `backward()`: Traverses the computation graph in reverse to propagate gradients.
    - `zero_grad()`: Resets gradients to zero.
    - `no_grad()`: Context manager to disable gradient tracking.
  - Operations: Arithmetic (`+`, `-`, `*`, `/`), matrix multiplication (`@`), element-wise operations, reshaping, slicing, and more.
  - Device Management: `to(device)` moves the tensor between CPU and GPU.
  - Type Casting: `astype(dtype)` returns a new tensor with the specified data type.
  - Special Operations:
    - `gather()`: Select values along specified dimensions using indices.
    - `where()`: Element-wise conditional selection.
    - `pad2d()`: Add padding to spatial dimensions.
    - `one_hot()`: Convert indices to one-hot encoding.
# Create a tensor with gradient tracking
a = Tensor([1, 2, 3], device='cpu', dtype=np.float32, requires_grad=True)
b = Tensor([4, 5, 6], requires_grad=True)
# Perform operations
c = a + b # Addition
d = a * b # Element-wise multiplication
e = a @ b.reshape(3, 1) # Matrix multiplication
f = c.mean() # Reduction operation
# Compute gradients
f.backward()
print("Gradient of a:", a.grad.data)
print("Gradient of b:", b.grad.data)
# Device management
if has_cupy:  # Check if CuPy is available
    a_gpu = a.to('cuda')
    print("Device:", a_gpu.device)
- Broadcasting: Tensor operations support NumPy-style broadcasting with proper gradient handling.
- Unbroadcasting: During backpropagation, gradients are properly unbroadcast using the `unbroadcast_grad()` utility.
- Type Consistency: Operations maintain and enforce dtype consistency, especially during gradient calculations.
- Numerically Stable Operations: Stable implementations of operations like `log()`, `exp()`, and `softmax()`.
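A small sketch of broadcasting during the backward pass, using only the `Tensor` arithmetic and `mean()` reduction shown above (the exact gradient values depend on the framework's reduction semantics, but the unbroadcast shape is the point of interest):

```python
import numpy as np
from tensor import Tensor

# A (2, 3) matrix plus a (3,) bias: the bias is broadcast across rows
x = Tensor(np.ones((2, 3), dtype=np.float32), requires_grad=True)
b = Tensor(np.array([1.0, 2.0, 3.0], dtype=np.float32), requires_grad=True)

loss = (x + b).mean()
loss.backward()

# The broadcast gradient is reduced (unbroadcast) back to the bias's original shape
print(b.grad.data.shape)  # expected: (3,)
```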
Layers inherit from Base_Layer and provide a standardized interface with forward(), state_dict(), and load_state_dict() methods.
- Description: Implements a fully connected (linear) layer.
- Parameters:
  - `input_size`: Number of input features.
  - `output_size`: Number of output features.
  - `name`: Optional layer name.
  - `initialization`: Weight initialization method (default: `'xavier'`).
  - `device`: Computing device (`'cpu'` or `'cuda'`).
  - `dtype`: Data type for parameters (default: `np.float32`).
- Attributes:
  - `weights`: Weight matrix as a Tensor.
  - `bias`: Bias vector as a Tensor.
- Methods:
  - `set_device(device)`: Moves layer parameters to the specified device.
  - `forward(x)`: Computes the layer output (`x @ weights + bias`).
  - `state_dict()`: Returns a dictionary with layer parameters.
  - `load_state_dict(state_dict)`: Loads parameters from a dictionary.
- Example:
  dense = Dense(128, 10, initialization='xavier', device='cpu', dtype=np.float32)
  output = dense(input_tensor)  # Equivalent to dense.forward(input_tensor)
- Description: 2D convolution layer using stride-tricks for efficient window extraction.
- Parameters:
  - `in_channels`: Number of input channels.
  - `out_channels`: Number of output channels.
  - `kernel_size`: Size of the convolutional kernel.
  - `stride`: Convolution stride (default: 1).
  - `padding`: Zero-padding size (default: 0).
  - `device`: Computing device (default: `'cpu'`).
  - `dtype`: Data type for parameters (default: `np.float32`).
- Implementation Details:
- Uses Xavier initialization scaled by kernel dimensions.
- Efficient implementation with `as_strided` for window extraction.
- Reshapes data for batch matrix multiplication between windows and kernels.
- Example:
  conv = Conv2D(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
  conv_out = conv(image_tensor)  # Input shape: [batch_size, in_channels, height, width]
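For reference, the output spatial size follows the usual convolution arithmetic; the snippet below is a quick sanity check using the standard formula (not framework-specific code):

```python
def conv2d_output_size(size, kernel_size, stride=1, padding=0):
    """Output height/width along one spatial dimension of a 2D convolution."""
    return (size + 2 * padding - kernel_size) // stride + 1

# kernel_size=3, stride=1, padding=1 preserves the spatial size
print(conv2d_output_size(28, kernel_size=3, stride=1, padding=1))  # 28
```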
- Description: Performs max pooling over 2D spatial dimensions.
- Parameters:
  - `kernel_size`: Size of the pooling window (default: 2).
  - `stride`: Pooling stride (default: 2).
- Implementation Details:
- Uses stride tricks to extract windows without data duplication.
- Non-differentiable operation; gradients flow through the maximum values.
- Example:
  pool = MaxPool2D(kernel_size=2, stride=2)
  pooled = pool(conv_out)  # Reduces spatial dimensions by a factor of stride
- Description: Flattens multi-dimensional input to two dimensions (batch × features).
- Implementation Details:
- Preserves the batch dimension.
- Stores original shape for potential backward operations.
- Example:
  flatten = Flatten()
  flat = flatten(pooled)  # Output shape: [batch_size, flattened_features]
All activation functions inherit from the abstract Activation class and ensure device/dtype consistency.
The abstract Activation class provides a common interface:
class Activation(Base_Layer):
    def __init__(self, device='cpu'):
        super().__init__()
        self.device = device

    def set_gpu(self):
        self.device = 'cuda'

    def set_cpu(self):
        self.device = 'cpu'

    @abstractmethod
    def forward(self, inputs):
        # Implemented by subclasses
        pass
- Tanh: Applies the hyperbolic tangent function with range [-1, 1].
  tanh = Tanh(device='cpu')
  activated = tanh(linear_output)
- ReLU: Implements the rectified linear unit: f(x) = max(0, x).
  relu = ReLU(device='cpu')
  activated = relu(linear_output)
- Sigmoid: Applies the logistic function with range (0, 1).
  sigmoid = Sigmoid(device='cpu')
  activated = sigmoid(linear_output)
- Softmax: Normalizes outputs to a probability distribution along a specified axis.
  softmax = Softmax(axis=-1, device='cpu')  # Typically applied to the last dimension
  probabilities = softmax(logits)
- LeakyReLU: Allows small negative values instead of zeroing them out completely.
  leaky_relu = LeakyReLU(alpha=0.01, device='cpu')  # alpha controls the leak slope
  activated = leaky_relu(linear_output)
- ELU (Exponential Linear Unit): Provides a smoother activation that allows negative values.
  elu = ELU(alpha=1.0, device='cpu')
  activated = elu(linear_output)
Each activation class implements the standard forward(), state_dict(), and load_state_dict() methods.
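Because all activations share the abstract `Activation` interface shown above, adding a new one follows the same pattern. Below is a hedged sketch of a hypothetical `Square` activation; it assumes `Activation` is importable from the `activations` module, that instances are callable like the built-ins, and it reuses the `linear_output` placeholder from the bullets above. It is an illustration, not part of the library:

```python
from activations import Activation  # assumed to be importable alongside the built-in activations

class Square(Activation):
    """Hypothetical element-wise square activation: f(x) = x * x."""
    def forward(self, inputs):
        # Built from Tensor arithmetic, so gradients are tracked automatically
        return inputs * inputs

square = Square(device='cpu')
activated = square(linear_output)  # callable like the built-in activations above
```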
The Sequential class enables stacking layers in a linear chain, automatically managing data flow, parameter collection, and device settings.
- Simple Layer Stacking: Layers are executed in the order they appear in the list.
- Parameter Management: Collects trainable parameters from all contained layers.
- Device Handling: Moves all layer parameters to specified device.
- Gradient Control: Provides methods to control gradient tracking and reset gradients.
- Serialization: Supports state saving and loading.
# Create a simple feedforward neural network
model = Sequential([
Dense(784, 256, initialization='xavier'),
ReLU(),
Dense(256, 128),
ReLU(),
Dense(128, 10)
], device='cpu')
# Forward pass
predictions = model(input_tensor)
# Get trainable parameters for optimizer
optimizer = Adam(model.parameters, lr=0.001)
# Reset gradients
model.zero_grad()
# Move model to GPU if available
if has_cupy:
    model.set_device('cuda')
# Disable gradient tracking
with model.no_grad():
    validation_predictions = model(validation_tensor)
# Save model state
state = model.state_dict()
with open('model_state.pkl', 'wb') as f:
    pickle.dump(state, f)
# Load model state
with open('model_state.pkl', 'rb') as f:
    state = pickle.load(f)
model.load_state_dict(state)
For more complex architectures, the framework provides a BaseModel class for creating custom models with advanced features like skip connections, shared weights, and custom forward passes.
The BaseModel class serves as the foundation for custom models:
- Core Features:
  - Automatic Module Registration: When you assign layers, tensors, or other models as attributes, they're automatically registered in the internal `_modules` dictionary.
  - Recursive Parameter Collection: The `parameters` property traverses all submodules to collect trainable parameters.
  - Device & Dtype Management: Methods to move parameters between devices and change data types.
  - Serialization: Support for saving and loading model state.
To create a custom model:
- Subclass `BaseModel`
- Define your layers in `__init__`
- Implement the `forward` method
class SimpleCNN(BaseModel):
    def __init__(self, in_channels=1, num_classes=10):
        super().__init__()
        # Define layers (automatically registered)
        self.conv1 = Conv2D(in_channels, 16, kernel_size=3, padding=1)
        self.relu = ReLU()
        self.pool = MaxPool2D(kernel_size=2, stride=2)
        self.conv2 = Conv2D(16, 32, kernel_size=3, padding=1)
        self.flatten = Flatten()
        self.fc = Dense(32 * 7 * 7, num_classes)  # Adjust size based on input

    def forward(self, x):
        # Define computation flow
        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.flatten(x)
        x = self.fc(x)
        return x

class ResidualBlock(BaseModel):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = Conv2D(channels, channels, kernel_size=3, padding=1)
        self.relu = ReLU()
        self.conv2 = Conv2D(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        # Store input for skip connection
        residual = x
        # Regular convolution path
        out = self.conv1(x)
        out = self.relu(out)
        out = self.conv2(out)
        # Add skip connection
        out = out + residual
        out = self.relu(out)
        return out

class ResNet(BaseModel):
    def __init__(self, in_channels=3, num_blocks=3, num_classes=10):
        super().__init__()
        self.conv1 = Conv2D(in_channels, 64, kernel_size=7, stride=2, padding=3)
        self.relu = ReLU()
        self.pool = MaxPool2D(kernel_size=3, stride=2)
        # Create residual blocks
        self.res_blocks = []
        for i in range(num_blocks):
            self.res_blocks.append(ResidualBlock(64))
        self.flatten = Flatten()
        self.fc = Dense(64 * 7 * 7, num_classes)  # Adjust size based on input

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        # Pass through residual blocks
        for block in self.res_blocks:
            x = block(x)
        x = self.flatten(x)
        x = self.fc(x)
        return x
The BaseModel class automatically keeps track of parameters and submodules:
- Parameter Collection:
  model = SimpleCNN()
  all_trainable_params = model.parameters  # List of all trainable Tensors
  # Use with an optimizer
  optimizer = Adam(model.parameters, lr=0.001)
- Gradient Zeroing:
  model.zero_grad()  # Clears gradients from all parameters
- No Gradient Context:
  with model.no_grad():
      # Operations within this block don't track gradients
      predictions = model(test_data)
BaseModel provides comprehensive device and data type management:
- Moving to a Different Device:
  model = SimpleCNN()
  # Run on GPU if available
  if has_cupy:
      model.set_device('cuda')
  # Move back to CPU
  model.set_device('cpu')
- Changing the Data Type:
  # Default is usually np.float32
  model.set_dtype(np.float32)
  # Switch to double precision
  model.set_dtype(np.float64)
The BaseModel class supports saving and loading model state:
- Saving Model State:
  model = SimpleCNN()
  # ... Train the model ...
  # Get the state dictionary
  state_dict = model.state_dict()
  # Save using the utility function
  save_model_parameters(model, 'cnn_model.pkl')
  # Or manually with pickle
  with open('model.pkl', 'wb') as f:
      pickle.dump(state_dict, f)
- Loading Model State:
  model = SimpleCNN()  # Create a model with the same architecture
  # Load using the utility function
  state_dict = load_state_dict_from_file('cnn_model.pkl')
  model.load_state_dict(state_dict)
  # Or manually with pickle
  with open('model.pkl', 'rb') as f:
      state_dict = pickle.load(f)
  model.load_state_dict(state_dict)
class SiameseNetwork(BaseModel):
    def __init__(self, in_channels=1):
        super().__init__()
        # Shared encoder - same weights for both inputs
        self.encoder = Sequential([
            Conv2D(in_channels, 64, kernel_size=3, padding=1),
            ReLU(),
            MaxPool2D(kernel_size=2, stride=2),
            Conv2D(64, 128, kernel_size=3, padding=1),
            ReLU(),
            MaxPool2D(kernel_size=2, stride=2),
            Flatten(),
            Dense(128 * 7 * 7, 128)  # Adjust size based on input
        ])
        # Final classification
        self.fc = Dense(128, 1)
        self.sigmoid = Sigmoid()

    def forward(self, x1, x2):
        # Pass both inputs through the same encoder (weight sharing)
        feat1 = self.encoder(x1)
        feat2 = self.encoder(x2)
        # Compute absolute difference
        distance = (feat1 - feat2).abs()
        # Final classification
        out = self.fc(distance)
        out = self.sigmoid(out)
        return out

    def __call__(self, x1, x2):
        return self.forward(x1, x2)
All loss functions inherit from the base Loss class, which ensures consistent interfaces and numerical stability.
Mean Squared Error loss for regression tasks:
mse_loss = MSELoss()
loss = mse_loss(prediction, target)  # Compute mean squared difference
Cross-entropy loss for multi-class classification:
ce_loss = CrossEntropy()
# target should be one-hot encoded
loss = ce_loss(logits, one_hot_target)
Combined softmax and cross-entropy for classification:
sce_loss = SoftmaxCrossEntropyLoss()
# Works with class indices or one-hot vectors
loss = sce_loss(logits, class_indices)
Binary Cross Entropy for binary classification:
bce_loss = BCELoss()
# Target should be in range [0, 1]
loss = bce_loss(predictions, binary_targets)
Computes classification accuracy:
accuracy_metric = Accuracy()
# Works with one-hot encoded targets or class indices
acc = accuracy_metric(predictions, targets)  # Returns a value between 0 and 1
The Accuracy metric intelligently handles different target formats:
- Class indices (1D array of integer class labels)
- One-hot encoding (2D array where each row has a single 1)
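A small sketch illustrating both target formats, assuming the `Accuracy` callable shown above accepts either form and compares argmax predictions against the targets:

```python
import numpy as np
from tensor import Tensor
from loss import Accuracy

accuracy = Accuracy()
logits = Tensor(np.array([[2.0, 0.1], [0.2, 1.5]], dtype=np.float32))

# 1) Class indices: a 1D array of integer labels
acc_from_indices = accuracy(logits, Tensor(np.array([0, 1])))

# 2) One-hot targets: a 2D array with a single 1 per row
acc_from_one_hot = accuracy(logits, Tensor(np.eye(2, dtype=np.float32)))

print(acc_from_indices, acc_from_one_hot)  # both should report 1.0 for this data
```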
The framework includes common optimization algorithms with learning rate scheduling.
Abstract base class with common features:
- Learning rate management
- Learning rate decay
- Device handling
Stochastic Gradient Descent with momentum:
optimizer = SGD(model.parameters,
lr=0.01, # Learning rate
momentum=0.9, # Momentum factor
decay=0.0001) # Learning rate decay
# In training loop
optimizer.step() # Update parameters based on gradients
optimizer.decay_lr()  # Optionally apply learning rate decay
Adaptive Moment Estimation optimizer:
optimizer = Adam(model.parameters,
lr=0.001, # Learning rate
beta1=0.9, # First moment decay rate
beta2=0.999, # Second moment decay rate
epsilon=1e-8, # Small constant for numerical stability
decay=0.0) # Learning rate decay
# In training loop
optimizer.step()  # Update parameters based on adaptive moments
The framework provides comprehensive device and data type management across all components.
- Checking CuPy Availability:
  if has_cupy:
      # GPU operations available
      device = 'cuda'
  else:
      device = 'cpu'
- Tensor Device Movement:
  cpu_tensor = Tensor([1, 2, 3], device='cpu')
  # Move to GPU if available
  if has_cupy:
      gpu_tensor = cpu_tensor.to('cuda')
      print(gpu_tensor.device)  # 'cuda'
      # Move back to CPU
      cpu_tensor_again = gpu_tensor.to('cpu')
- Layer Device Settings:
  dense = Dense(10, 5, device='cpu')
  # Move to GPU
  if has_cupy:
      dense.set_device('cuda')
- Model Device Management:
  model = Sequential([...])  # or a BaseModel subclass
  # Move the entire model
  if has_cupy:
      model.set_device('cuda')
- Specifying dtypes:
  # Create a tensor with a specific dtype
  tensor = Tensor([1.0, 2.0], dtype=np.float32)
  # Convert dtype
  double_tensor = tensor.astype(np.float64)
- Layer dtype Settings:
  # Create a layer with a specific dtype
  dense = Dense(10, 5, dtype=np.float32)
- Model dtype Management:
  model = SimpleCNN()  # BaseModel subclass
  # Change the dtype of all parameters
  model.set_dtype(np.float64)
- Parsing dtype Strings: The framework includes a `parse_dtype` utility for converting string representations:
  from dense import parse_dtype
  # Convert a string to a NumPy dtype
  dtype = parse_dtype("float32")  # Returns np.float32
  dtype = parse_dtype("<class 'numpy.float64'>")  # Returns np.float64
Here's a complete example training a simple network on a classification task:
import numpy as np
from tensor import Tensor, has_cupy
from dense import Dense
from activations import ReLU, Softmax
from sequential import Sequential
from loss import SoftmaxCrossEntropyLoss, Accuracy
from optimizer import Adam
# Set device
device = 'cuda' if has_cupy else 'cpu'
# Generate dummy data
X = np.random.randn(100, 784).astype(np.float32) # 100 samples of 784 features (e.g., MNIST flattened)
y = np.random.randint(0, 10, size=(100,)) # 10 classes
y_one_hot = np.eye(10)[y] # Convert to one-hot encoding
# Convert to Tensors
X_tensor = Tensor(X, device=device)
y_tensor = Tensor(y_one_hot, device=device)
# Create model
model = Sequential([
Dense(784, 128, device=device),
ReLU(),
Dense(128, 64, device=device),
ReLU(),
Dense(64, 10, device=device)
], device=device)
# Setup loss, optimizer, and metrics
loss_fn = SoftmaxCrossEntropyLoss()
optimizer = Adam(model.parameters, lr=0.001)
accuracy = Accuracy()
# Training loop
num_epochs = 50
batch_size = 32
for epoch in range(num_epochs):
    total_loss = 0
    total_acc = 0
    # Mini-batch training
    num_batches = len(X) // batch_size
    for i in range(num_batches):
        start_idx = i * batch_size
        end_idx = start_idx + batch_size
        # Get batch
        X_batch = X_tensor[start_idx:end_idx]
        y_batch = y_tensor[start_idx:end_idx]
        # Forward pass
        predictions = model(X_batch)
        loss = loss_fn(predictions, y_batch)
        acc = accuracy(predictions, y_batch)
        # Backward pass and optimization
        model.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.data
        total_acc += acc
    # Print epoch statistics
    avg_loss = total_loss / num_batches
    avg_acc = total_acc / num_batches
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}, Accuracy: {avg_acc:.4f}")
# Save the trained model
from manage import save_model_parameters
save_model_parameters(model, 'trained_mlp.pkl')
print("Training complete!")
The next example trains a custom CNN built with the BaseModel API on image-shaped data:
import numpy as np
from tensor import Tensor, has_cupy
from custom import BaseModel
from dense import Dense
from conv import Conv2D, MaxPool2D, Flatten
from activations import ReLU
from loss import SoftmaxCrossEntropyLoss, Accuracy
from optimizer import Adam
from manage import save_model_parameters
# Set device
device = 'cuda' if has_cupy else 'cpu'
# Create a custom CNN model
class ConvNet(BaseModel):
    def __init__(self):
        super().__init__()
        self.conv1 = Conv2D(1, 32, kernel_size=3, padding=1, device=device)
        self.relu = ReLU(device=device)
        self.pool = MaxPool2D(kernel_size=2, stride=2)
        self.conv2 = Conv2D(32, 64, kernel_size=3, padding=1, device=device)
        self.flatten = Flatten()
        self.fc1 = Dense(64 * 7 * 7, 128, device=device)  # For 28x28 input images
        self.fc2 = Dense(128, 10, device=device)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
# Generate dummy MNIST-like data
X = np.random.randn(200, 1, 28, 28).astype(np.float32)
y = np.random.randint(0, 10, size=(200,))
y_one_hot = np.eye(10)[y]
# Convert to tensors
X_tensor = Tensor(X, device=device)
y_tensor = Tensor(y_one_hot, device=device)
# Initialize model, loss, and optimizer
model = ConvNet()
loss_fn = SoftmaxCrossEntropyLoss()
optimizer = Adam(model.parameters, lr=0.001)
accuracy = Accuracy()
# Training loop
num_epochs = 10
batch_size = 32
for epoch in range(num_epochs):
    total_loss = 0
    total_acc = 0
    # Mini-batch training
    indices = np.random.permutation(len(X))
    num_batches = len(X) // batch_size
    for i in range(num_batches):
        batch_indices = indices[i * batch_size:(i + 1) * batch_size]
        # Get batch
        X_batch = X_tensor[batch_indices]
        y_batch = y_tensor[batch_indices]
        # Forward pass
        predictions = model(X_batch)
        loss = loss_fn(predictions, y_batch)
        acc = accuracy(predictions, y_batch)
        # Backward pass and optimization
        model.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.data
        total_acc += acc
    # Print epoch statistics
    avg_loss = total_loss / num_batches
    avg_acc = total_acc / num_batches
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}, Accuracy: {avg_acc:.4f}")
# Save the trained model
save_model_parameters(model, 'trained_cnn.pkl')
The framework includes components specifically designed for transformer models, including embedding layers, positional encoding, multi-head attention, and encoder-decoder architecture.
- Embedding Layer: Maps token indices to continuous vector representations.
  embedding = Embedding(vocab_size=10000, d_model=512)
  embedded = embedding(token_indices)  # token_indices shape: [batch_size, seq_len]
- Positional Encoding: Adds positional information to embeddings using sinusoidal encoding.
  positional_encoding = PositionalEncoding(d_model=512, max_seq_len=1000)
  embedded_with_positions = positional_encoding(embedded)
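For reference, the sinusoidal encoding mentioned above is conventionally defined as in the original Transformer paper (the framework's implementation is assumed to follow this standard formulation):

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$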
The MultiHeadAttention class implements the core attention mechanism:
attention = MultiHeadAttention(d_model=512, num_heads=8)
output = attention(query, key, value, mask=None)
Features:
- Separate projections for queries, keys, and values
- Scaled dot-product attention
- Support for attention masking
- Multi-head parallel attention
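For reference, the scaled dot-product attention listed above is conventionally computed per head as

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

where $d_k$ is the per-head key dimension (here $d_{\text{model}} / \text{num\_heads}$), and the optional mask is applied to the $QK^{\top}$ scores before the softmax.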
- Encoder Block: Processes input sequences through self-attention and feed-forward layers.
  encoder = Encoder(vocab_size=10000, d_model=512, num_heads=8)
  encoder_output = encoder(src_tokens)  # src_tokens: [batch_size, src_len]
- Decoder Block: Generates output sequences using both self-attention and cross-attention to the encoder output.
  decoder = Decoder(vocab_size=10000, d_model=512, num_heads=8)
  decoder_output = decoder(tgt_tokens, encoder_output)  # tgt_tokens: [batch_size, tgt_len]
The full Transformer class combines the encoder and decoder:
transformer = Transformer(
src_vocab_size=10000, # Source vocabulary size
tgt_vocab_size=10000, # Target vocabulary size
d_model=512, # Model dimension
num_heads=8 # Number of attention heads
)
# For sequence-to-sequence tasks
outputs = transformer(src_tokens, tgt_tokens)  # Shape: [batch_size, tgt_len, tgt_vocab_size]
import numpy as np
from tensor import Tensor
from custom import BaseModel
from loss import SoftmaxCrossEntropyLoss
from optimizer import Adam
# Create dummy token data
src_tokens = Tensor(np.random.randint(0, 1000, size=(32, 20))) # [batch_size, src_seq_len]
tgt_tokens = Tensor(np.random.randint(0, 1000, size=(32, 22))) # [batch_size, tgt_seq_len]
tgt_labels = Tensor(np.random.randint(0, 1000, size=(32, 22))) # [batch_size, tgt_seq_len]
# Create transformer model
transformer = Transformer(
src_vocab_size=1000,
tgt_vocab_size=1000,
d_model=256,
num_heads=4
)
# Training setup
loss_fn = SoftmaxCrossEntropyLoss()
optimizer = Adam(transformer.parameters, lr=0.0001)
# Single training step
logits = transformer(src_tokens, tgt_tokens)
loss = loss_fn(logits, tgt_labels)
transformer.zero_grad()
loss.backward()
optimizer.step()
print(f"Training loss: {loss.data}")The framework provides utility functions for saving and loading models.
from manage import save_model_parameters
# After training your model
model = SimpleCNN()
# ... Train the model ...
# Save to a file
save_model_parameters(model, 'my_model.pkl')
from manage import load_state_dict_from_file
# Create model with the same architecture
model = SimpleCNN()
# Load parameters
state_dict = load_state_dict_from_file('my_model.pkl')
model.load_state_dict(state_dict)
# Model is ready for inference
predictions = model(test_data)
The state_dict() method includes:
- All parameter values (weights and biases)
- Configuration information (layer shapes, device settings, dtypes)
- Metadata that helps with correct restoration
You can also manually handle serialization:
import pickle
# Save
state_dict = model.state_dict()
with open('model.pkl', 'wb') as f:
    pickle.dump(state_dict, f)
# Load
with open('model.pkl', 'rb') as f:
    state_dict = pickle.load(f)
model.load_state_dict(state_dict)
This framework provides a powerful yet flexible foundation for deep learning research and applications:
- Modularity: Easy to extend with new layers, activations, and models
- Automatic Differentiation: Robust gradient computation with proper type handling
- Device/Dtype Management: Seamless CPU/GPU switching and numerical precision control
- PyTorch/TensorFlow-like API: Familiar, clean interface for building models
- Transformer Support: First-class components for attention-based models
- Customizability: Hierarchical model building with `BaseModel`
- Recurrent Layers: Add LSTM and GRU implementations
- Dataset Handling: Create data loading and batching utilities
- Regularization: Add dropout, batch normalization, and weight regularization
- Learning Rate Schedules: Implement more sophisticated LR scheduling
- Distributed Training: Add support for multi-GPU and multi-node training
- Automatic Mixed Precision: Add support for mixed precision training
- Graph Visualization: Tools to visualize model architecture and computation graph
- Quantization: Support for reduced precision (int8, float16) operations
- Use CuPy when possible for GPU acceleration
- Consider batching strategy based on your hardware
- Profile your models to identify bottlenecks
- Use an appropriate dtype (typically `np.float32` for most applications)