Skip to content

Prakash-sa/AI-for-Interview

Repository files navigation

layout default
title AI/ML Fundamentals
permalink /ml-fundamentals/

AI Interview Preparation Guide

Table of Contents

  1. Machine Learning Fundamentals
  2. Deep Learning
  3. Natural Language Processing
  4. Computer Vision
  5. Reinforcement Learning
  6. Data Preprocessing & Feature Engineering
  7. Model Evaluation & Validation
  8. Optimization Techniques
  9. Practical Interview Questions

Machine Learning Fundamentals

Q1: What is Machine Learning and its types?

Answer: Machine Learning is a subset of Artificial Intelligence that enables systems to learn and improve from experience without being explicitly programmed.

Types:

  1. Supervised Learning: Learning from labeled data

    • Classification: Predicting discrete categories (Logistic Regression, Decision Trees, SVM, Random Forest)
    • Regression: Predicting continuous values (Linear Regression, Ridge, Lasso)
  2. Unsupervised Learning: Learning from unlabeled data

    • Clustering: K-means, DBSCAN, Hierarchical Clustering
    • Dimensionality Reduction: PCA, t-SNE, Autoencoders
  3. Semi-Supervised Learning: Learning from partially labeled data

    • Self-training, Co-training, Graph-based methods
  4. Reinforcement Learning: Learning through interaction and rewards

    • Q-Learning, Policy Gradient, Actor-Critic

Q2: What is the difference between supervised and unsupervised learning?

Answer:

Feature Supervised Unsupervised
Data Labeled data Unlabeled data
Goal Predict target variable Find patterns/structure
Examples Classification, Regression Clustering, Dimensionality Reduction
Performance Metric Accuracy, Precision, Recall, F1 Silhouette Score, Inertia
Computational Cost Lower (known targets) Higher (exploration needed)

Q3: What is overfitting and underfitting?

Answer:

Overfitting:

  • Model learns training data too well, including noise
  • Performs well on training data but poorly on test data
  • High training accuracy, low test accuracy
  • Causes: Too many parameters, complex model, insufficient training data

Solutions:

  • Use simpler models
  • Increase training data
  • Feature selection
  • Regularization (L1, L2)
  • Early stopping
  • Cross-validation
  • Dropout (for neural networks)

Underfitting:

  • Model is too simple to capture data patterns
  • Performs poorly on both training and test data
  • Causes: Insufficient model complexity, poor feature engineering

Solutions:

  • Use more complex models
  • Increase model capacity
  • Add relevant features
  • Reduce regularization

Q4: What is the bias-variance tradeoff?

Answer:

Bias: Error from incorrect assumptions in learning algorithm

  • High bias: Model is too simple (underfitting)
  • Low bias: Model captures complexity

Variance: Sensitivity to fluctuations in training data

  • High variance: Model is too complex (overfitting)
  • Low variance: Stable predictions

Total Error = Bias² + Variance + Irreducible Error

Tradeoff:

  • Decreasing bias increases variance
  • Decreasing variance increases bias
  • Goal: Find optimal balance for minimum total error

Q5: Explain Linear Regression

Answer:

Definition: Predicts continuous target variable using linear relationship

Formula: y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ

Key Concepts:

  • Ordinary Least Squares (OLS): Minimizes sum of squared residuals
  • Cost Function: MSE = (1/n)Σ(y_actual - y_predicted)²
  • Assumptions:
    • Linear relationship between features and target
    • Independence of observations
    • Homoscedasticity (constant variance of errors)
    • Normality of residuals
    • No multicollinearity

Advantages:

  • Simple and interpretable
  • Computationally efficient
  • Works well with linear relationships

Disadvantages:

  • Assumes linearity
  • Sensitive to outliers
  • Cannot capture non-linear patterns

Variants:

  • Ridge Regression (L2 regularization): λΣβ²
  • Lasso Regression (L1 regularization): λΣ|β|
  • Elastic Net: Combination of L1 and L2

Q6: What is Logistic Regression?

Answer:

Definition: Classification algorithm for binary (and multiclass) problems

Formula:

  • Probability: P(y=1) = 1 / (1 + e^(-z))
  • Where z = β₀ + β₁x₁ + β₂x₂ + ...

Key Concepts:

  • Uses sigmoid function to map output to [0,1]
  • Cost Function: Binary Cross-Entropy = -(1/n)Σ[y*log(ŷ) + (1-y)*log(1-ŷ)]
  • Optimization: Gradient Descent
  • Decision boundary: Typically at 0.5 probability

Advantages:

  • Interpretable probability outputs
  • Efficient for binary classification
  • Works well for linearly separable data

Disadvantages:

  • Assumes linear decision boundary
  • May underfit complex relationships
  • Sensitive to feature scaling

Extensions:

  • Multiclass Logistic Regression (Softmax)
  • One-vs-Rest approach

Q7: Explain Decision Trees

Answer:

Definition: Tree-based model that recursively splits data based on feature values

Key Concepts:

  • Node Types: Root (initial decision), Internal (decision points), Leaf (predictions)
  • Splitting Criteria:
    • Classification: Gini Impurity or Information Gain (Entropy)
    • Regression: Variance reduction

Gini Impurity: Gini = 1 - Σ(p_i)² where p_i is proportion of class i

Information Gain: IG = Entropy(parent) - Σ[(N_child/N_parent) × Entropy(child)]

Advantages:

  • Interpretable and visual
  • Handles non-linear relationships
  • Requires minimal feature engineering
  • Works with both categorical and numerical data

Disadvantages:

  • Prone to overfitting
  • Unstable (small data changes cause large tree changes)
  • Biased toward dominant classes

Regularization:

  • Max depth
  • Min samples split
  • Min samples leaf
  • Pruning

Q8: What is a Random Forest?

Answer:

Definition: Ensemble method combining multiple decision trees with bagging

How It Works:

  1. Create multiple bootstrap samples from training data
  2. Train decision tree on each sample
  3. For classification: Majority voting
  4. For regression: Average predictions

Key Hyperparameters:

  • n_estimators: Number of trees
  • max_depth: Maximum tree depth
  • min_samples_split: Minimum samples to split a node
  • max_features: Number of features to consider per split

Advantages:

  • Reduces overfitting (bagging + averaging)
  • Handles missing values
  • Provides feature importance
  • Works with large datasets
  • Parallel computation possible

Disadvantages:

  • Less interpretable than single trees
  • Computationally expensive
  • May be overkill for simple problems
  • Memory intensive

Feature Importance: Calculated as average decrease in impurity across all trees


Q9: Explain Support Vector Machine (SVM)

Answer:

Definition: Algorithm that finds optimal hyperplane maximizing margin between classes

Key Concepts:

  • Margin: Distance between hyperplane and closest points (support vectors)
  • Support Vectors: Critical data points that define decision boundary
  • Objective: Maximize margin while minimizing classification errors

Mathematical Formulation:

  • Decision boundary: w·x + b = 0
  • Optimization: Minimize (1/2)||w||² + C×Σξ_i
    • ξ_i: Slack variables (allow errors)
    • C: Regularization parameter

Kernel Tricks: Used for non-linear classification

  • Linear: K(x_i, x_j) = x_i·x_j
  • Polynomial: K(x_i, x_j) = (x_i·x_j + 1)^d
  • RBF (Radial Basis Function): K(x_i, x_j) = exp(-γ||x_i - x_j||²)
  • Sigmoid: K(x_i, x_j) = tanh(κ(x_i·x_j) + θ)

Advantages:

  • Effective in high dimensions
  • Memory efficient (uses support vectors)
  • Versatile kernel functions
  • Robust to outliers (with soft margin)

Disadvantages:

  • Slow training on large datasets
  • Not suitable for large number of features
  • Sensitive to feature scaling
  • Black-box model (less interpretable)

Q10: What is K-Nearest Neighbors (KNN)?

Answer:

Definition: Non-parametric algorithm classifying/predicting based on K nearest neighbors

How It Works:

  1. Calculate distance from query point to all training points
  2. Select K nearest points
  3. For classification: Majority class
  4. For regression: Average value

Distance Metrics:

  • Euclidean: √(Σ(x_i - y_i)²)
  • Manhattan: Σ|x_i - y_i|
  • Minkowski: (Σ|x_i - y_i|^p)^(1/p)
  • Cosine: 1 - (A·B)/(||A||||B||)

Choosing K:

  • K too small: Overfitting, sensitive to noise
  • K too large: Underfitting, ignores local patterns
  • Use cross-validation
  • Odd K for binary classification (avoid ties)
  • Rule of thumb: K = √n

Advantages:

  • Simple to understand and implement
  • No training phase (lazy learner)
  • Works with any distance metric
  • Naturally handles multiclass problems

Disadvantages:

  • Slow inference (calculates distance to all points)
  • Memory intensive
  • Sensitive to feature scaling
  • Curse of dimensionality
  • Biased toward majority class in imbalanced data

Deep Learning

Q11: What is a Neural Network?

Answer:

Definition: Computational model inspired by biological neural networks

Structure:

  • Input Layer: Receives raw data
  • Hidden Layers: Learn representations (can be multiple)
  • Output Layer: Makes predictions

Neuron (Perceptron):

Output = Activation(Σ(weight × input) + bias)

Activation Functions:

  • ReLU: f(x) = max(0, x) - Most popular for hidden layers
  • Sigmoid: f(x) = 1/(1 + e^(-x)) - Output range [0,1]
  • Tanh: f(x) = (e^x - e^(-x))/(e^x + e^(-x)) - Output range [-1,1]
  • Linear: f(x) = x - For regression
  • Softmax: For multiclass classification

Training Process:

  1. Forward propagation: Input → Hidden layers → Output
  2. Calculate loss (MSE, Cross-entropy, etc.)
  3. Backward propagation: Compute gradients
  4. Update weights using gradient descent

Loss Functions:

  • Regression: Mean Squared Error (MSE)
  • Binary Classification: Binary Cross-Entropy
  • Multiclass Classification: Categorical Cross-Entropy

Advantages:

  • Can learn complex non-linear relationships
  • Automatic feature extraction
  • State-of-the-art results on many tasks

Disadvantages:

  • Requires large amounts of data
  • Computationally expensive
  • Black-box (hard to interpret)
  • Prone to local minima

Q12: Explain Convolutional Neural Networks (CNN)

Answer:

Definition: Neural network designed for processing grid-like data (images, time series)

Key Components:

  1. Convolutional Layer: Applies filters to extract local features

    • Filter size: Typically 3×3 or 5×5
    • Stride: How much filter moves
    • Padding: Adding zeros around edges
  2. Pooling Layer: Reduces spatial dimensions

    • Max Pooling: Takes maximum value
    • Average Pooling: Takes average value
    • Typical size: 2×2
  3. Fully Connected Layer: Connects all neurons (classification)

  4. Activation Functions: ReLU for hidden layers, Softmax for output

Why CNNs Work:

  • Convolutional filters capture local patterns
  • Parameter sharing reduces parameters
  • Translation invariance through pooling
  • Hierarchical feature learning (low-level to high-level)

Popular Architectures:

  • LeNet: Early CNN (1998)
  • AlexNet: Deep CNN breakthrough (2012)
  • VGG: Simplicity and depth (16/19 layers)
  • ResNet: Skip connections enabling very deep networks (152+ layers)
  • Inception: Multi-scale feature extraction
  • MobileNet: Efficient for mobile devices

Applications:

  • Image classification
  • Object detection
  • Semantic segmentation
  • Face recognition
  • Medical imaging

Q13: Explain Recurrent Neural Networks (RNN)

Answer:

Definition: Neural networks with connections forming cycles, suitable for sequential data

Key Concept:

  • Maintains hidden state that captures information from previous time steps
  • h_t = f(h_(t-1), x_t)
  • Can model temporal dependencies

Variants:

Basic RNN:

  • Simple recurrent connections
  • Suffers from vanishing/exploding gradients
  • Hard to learn long-range dependencies

LSTM (Long Short-Term Memory):

  • Introduces cell state and gates (forget, input, output)
  • Forget gate: σ(W_f·[h_(t-1), x_t] + b_f) - Controls what to forget
  • Input gate: σ(W_i·[h_(t-1), x_t] + b_i) - Controls new information
  • Output gate: σ(W_o·[h_(t-1), x_t] + b_o) - Controls output
  • Cell state: C_t = F_t ⊙ C_(t-1) + I_t ⊙ C̃_t
  • Advantages: Learns long-range dependencies, stable gradients
  • Disadvantages: More parameters, slower training

GRU (Gated Recurrent Unit):

  • Simpler than LSTM (fewer gates)
  • Reset gate and update gate
  • Similar performance to LSTM with fewer parameters
  • Faster training

Applications:

  • Machine translation
  • Text generation
  • Speech recognition
  • Time series prediction
  • Sentiment analysis
  • Machine summarization

Challenges:

  • Vanishing/exploding gradients
  • Slower training than feedforward networks
  • Difficulty capturing very long-range dependencies

Q14: What is Transformer Architecture?

Answer:

Definition: Neural network architecture based on self-attention mechanism, replacing RNNs

Key Innovation: Self-Attention Mechanism

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V
  • Query (Q): "What am I looking for?"
  • Key (K): "What information do I have?"
  • Value (V): "What should I output?"

Architecture Components:

  1. Multi-Head Attention:

    • Multiple parallel attention heads
    • Each head attends to different representation subspaces
    • Outputs concatenated and projected
  2. Feed-Forward Network:

    • Two linear transformations with activation
    • FFN(x) = ReLU(x·W_1 + b_1)·W_2 + b_2
  3. Layer Normalization:

    • Normalizes inputs before each sub-layer
    • Stabilizes training
  4. Positional Encoding:

    • Encodes position information (RNNs have inherent ordering)
    • PE(pos, 2i) = sin(pos/10000^(2i/d_model))
    • PE(pos, 2i+1) = cos(pos/10000^(2i/d_model))
  5. Residual Connections:

    • Each sub-layer: output = LayerNorm(x + SubLayer(x))
    • Aids gradient flow in deep networks

Advantages:

  • Parallel processing (RNNs process sequentially)
  • Long-range dependencies captured effectively
  • Efficient computation
  • Scalable to large datasets

Disadvantages:

  • Requires large amounts of data
  • Quadratic memory/computation w.r.t. sequence length
  • Needs positional encoding

Popular Transformer Models:

  • BERT: Bidirectional Encoder Representations from Transformers
  • GPT: Generative Pre-trained Transformer
  • T5: Text-to-Text Transfer Transformer
  • Vision Transformer (ViT): Applies transformers to images

Natural Language Processing

Q15: What is Word Embedding?

Answer:

Definition: Represents words as dense vectors capturing semantic meaning

Importance:

  • Captures semantic relationships
  • Reduces dimensionality (vs. one-hot encoding)
  • Enables transfer learning
  • Word analogy: king - man + woman ≈ queen

Common Approaches:

1. Word2Vec:

  • CBOW (Continuous Bag of Words): Predicts target word from context
  • Skip-gram: Predicts context from target word
  • Uses shallow neural networks
  • Fast training
  • Most popular pre-trained embeddings

2. GloVe (Global Vectors):

  • Combines count-based and prediction-based methods
  • Minimizes difference between dot product and co-occurrence probability
  • Better captures global statistics

3. FastText:

  • Extension of Word2Vec
  • Uses character n-grams
  • Handles out-of-vocabulary words
  • Better for morphologically rich languages

4. BERT/Contextual Embeddings:

  • Different embedding for same word based on context
  • Bidirectional context consideration
  • Superior performance on NLP tasks

Dimensions:

  • Typical: 100-300 dimensions
  • Trade-off: More dimensions = better representation but more computation

Evaluation:

  • Cosine similarity for word relatedness
  • Analogy tasks
  • Downstream task performance

Q16: Explain Tokenization and Text Preprocessing

Answer:

Tokenization: Breaking text into smaller units (words, subwords, characters)

Types:

  1. Word Tokenization: Split by whitespace/punctuation

    • Issue: "don't" becomes separate tokens
  2. Subword Tokenization:

    • Byte Pair Encoding (BPE): Merges frequent character pairs
    • WordPiece: Similar to BPE, used in BERT
    • SentencePiece: Language-independent
  3. Character Tokenization: Each character is token

    • Handles unknown words
    • More computation needed

Text Preprocessing Steps:

  1. Lowercasing: Convert to lowercase

    • Treats "USA" and "usa" same
    • May lose information (proper nouns)
  2. Removing Punctuation: Remove special characters

    • "Hello!" → "Hello"
  3. Removing Stopwords: Remove common words (the, is, at)

    • Reduces noise
    • May lose context for some tasks
  4. Stemming: Reduce words to root form

    • "running", "runs", "ran" → "run"
    • Fast but crude (overstemming)
  5. Lemmatization: Convert to dictionary form

    • "better" → "good"
    • More accurate but slower
  6. Handling Special Tokens:

    • [CLS]: Classification token (start)
    • [SEP]: Separator token
    • [PAD]: Padding token
    • [UNK]: Unknown token
    • [MASK]: Masking token (BERT)

Sequence Padding/Truncation:

  • Pad shorter sequences to max length
  • Truncate longer sequences
  • Necessary for batch processing

Q17: What is Attention Mechanism?

Answer:

Definition: Mechanism allowing models to focus on relevant parts of input

Problem It Solves:

  • RNNs compress entire input into fixed context vector
  • Long sequences → information loss
  • Attention allows dynamic focus

Mechanism:

  1. Alignment Score: Compute similarity between decoder state and encoder outputs

    • Multiplicative: score = s·e / √d_k
    • Additive: score = v·tanh(W_s·[s, e])
  2. Normalization: Apply softmax to get attention weights

    • Weights sum to 1
    • Focus on most relevant parts
  3. Context Vector: Weighted sum of encoder outputs

    • context = Σ(attention_weight × encoder_output)

Attention Types:

  1. Self-Attention: Query, Key, Value from same sequence

    • Captures dependencies within sequence
  2. Cross-Attention: Query from one sequence, Key/Value from another

    • Encoder-decoder architecture
  3. Multi-Head Attention: Multiple attention heads in parallel

    • Different representation subspaces
    • Concatenate and project outputs

Advantages:

  • Interpretable (can visualize attention weights)
  • Efficient gradient flow
  • Captures long-range dependencies
  • Enables parallel processing

Applications:

  • Machine translation
  • Question answering
  • Summarization
  • Image captioning

Q18: Explain BERT and its Training

Answer:

BERT (Bidirectional Encoder Representations from Transformers)

Key Features:

  • Bidirectional: Considers both left and right context
  • Pre-trained on large unlabeled data
  • Transfer learning to downstream tasks

Training Objectives:

1. Masked Language Modeling (MLM):

  • Randomly mask 15% of tokens
    • 80%: Replace with [MASK]
    • 10%: Replace with random token
    • 10%: Keep original
  • Predict masked tokens using bidirectional context
  • Objective: Minimize cross-entropy loss

2. Next Sentence Prediction (NSP):

  • Binary classification: Is second sentence next to first?
  • 50% positive: Actual next sentence
  • 50% negative: Random sentence from corpus
  • Helps model understand sentence relationships
  • Later found less important, removed in RoBERTa

Architecture:

  • 12-24 transformer encoder layers
  • 768-1024 hidden dimensions
  • 12 attention heads
  • Trained on English Wikipedia + BookCorpus

Fine-tuning for Downstream Tasks:

  1. Classification:

    • Add linear layer on top of [CLS] token
    • Fine-tune on task-specific data
  2. Token Labeling (NER, POS tagging):

    • Add linear layer on top of each token
    • Classify each token independently
  3. Question Answering:

    • Two linear layers: Start and End position prediction
    • Span containing answer is [start, end]

Fine-tuning Hyperparameters:

  • Learning rate: 2e-5 to 5e-5
  • Batch size: 16 or 32
  • Epochs: 2-4
  • Warmup steps: 10% of training steps

Advantages:

  • Pre-trained on massive data
  • Contextual representations
  • Fine-tune for various tasks
  • Strong baselines

Disadvantages:

  • Computational expense (training from scratch)
  • Requires labeled data for fine-tuning
  • Black-box model
  • Limited to 512 token input length

Computer Vision

Q19: How does Object Detection work? (YOLO, Faster R-CNN)

Answer:

Object Detection Task: Localize and classify objects in images Output: Bounding boxes + class labels + confidence scores

Approaches:

1. Two-Stage Detectors (Faster R-CNN):

Process:

  1. Feature Extraction: CNN (VGG, ResNet) extracts features
  2. Region Proposal Network (RPN): Generates candidate bounding boxes
  3. RPN Output: Class scores (object/background) + bounding box adjustments
  4. RoI Pooling: Extract fixed-size feature maps from proposals
  5. Classification & Regression: Classify objects and refine boxes

Advantages:

  • Accurate
  • Fewer false positives
  • Better for small objects

Disadvantages:

  • Slower (two-stage process)
  • More complex
  • Training more involved

2. One-Stage Detectors (YOLO - You Only Look Once):

Process:

  1. Divide image into S×S grid
  2. Each grid cell predicts:
    • B bounding boxes with confidence scores
    • C class probabilities
  3. Non-Maximum Suppression: Remove overlapping boxes

Loss Function:

  • Localization loss: MSE for coordinates
  • Confidence loss: Log loss for objectness
  • Classification loss: Cross-entropy for class labels

Advantages:

  • Fast real-time detection
  • Global context understanding
  • Simpler end-to-end training

Disadvantages:

  • Lower accuracy than two-stage
  • Struggles with small objects
  • Issues with closely packed objects

Performance Metrics:

  • Precision: TP/(TP+FP) - Accuracy of detections
  • Recall: TP/(TP+FN) - Coverage of objects
  • mAP (mean Average Precision): Average precision across classes
  • IoU (Intersection over Union): Overlap of predicted and ground-truth boxes

Other Detectors:

  • SSD (Single Shot MultiBox Detector)
  • RetinaNet (focal loss for class imbalance)
  • EfficientDet (efficient architecture)
  • Mask R-CNN (instance segmentation)

Q20: Explain Image Segmentation

Answer:

Definition: Assigning class label to each pixel in image

Types:

1. Semantic Segmentation:

  • Assigns same label to all pixels of same object
  • No distinction between individual objects of same class
  • Output: Single class map

2. Instance Segmentation:

  • Distinguishes between different objects of same class
  • Combines object detection + semantic segmentation
  • Output: Separate masks for each object

3. Panoptic Segmentation:

  • Combines semantic and instance segmentation
  • Both "stuff" (amorphous regions) and "things" (countable objects)

Architecture: U-Net

Structure:

  • Encoder: Downsampling path (convolutions + pooling)
    • Extracts features, reduces spatial dimensions
  • Decoder: Upsampling path (transposed convolutions)
    • Restores spatial dimensions
  • Skip Connections: Concatenate encoder features to decoder
    • Preserves low-level details
    • Enables precise localization

Why Skip Connections:

  • Earlier layers capture edge/texture information
  • Direct connection helps decoder reconstruct spatial details
  • Addresses information loss from downsampling

Loss Functions:

  • Dice Loss: 2×|X∩Y|/(|X|+|Y|)
    • Good for imbalanced segmentation
  • Cross-Entropy Loss: Standard classification loss
  • Focal Loss: Down-weights easy examples
    • Handles class imbalance

Popular Architectures:

  • FCN (Fully Convolutional Networks): Early approach
  • U-Net: Medical image segmentation (skip connections)
  • DeepLab: Atrous convolutions for context
  • Mask R-CNN: Instance segmentation

Applications:

  • Medical image analysis (tumor segmentation)
  • Autonomous driving (road/pedestrian segmentation)
  • Satellite imagery analysis
  • Video object segmentation

Reinforcement Learning

Q21: What is Reinforcement Learning?

Answer:

Definition: Agent learns to make decisions by interacting with environment, receiving rewards/penalties

Key Components:

  1. Agent: Entity making decisions
  2. Environment: System the agent interacts with
  3. State (s): Current situation
  4. Action (a): Choice the agent makes
  5. Reward (r): Feedback from environment
  6. Policy (π): Mapping from state to action
  7. Value Function (V): Expected cumulative reward from state

Objective: Maximize cumulative reward over time

Markov Decision Process (MDP):

  • Markov Property: Future depends only on current state, not history
  • Transition Probability: P(s'|s,a) probability of reaching s' from s with action a
  • Reward Function: R(s,a) immediate reward for action a in state s

Approaches:

1. Value-Based Methods:

  • Learn value function V(s) or Q(s,a)

  • Q-Learning: Off-policy algorithm

    • Q(s,a) ← Q(s,a) + α[r + γ·max_a'Q(s',a') - Q(s,a)]
    • Learn from exploration but act greedily
    • Guaranteed convergence with discrete state-action spaces
  • Deep Q-Network (DQN): Q-learning with neural networks

    • Experience replay: Store and sample past transitions
    • Target network: Separate network for stability
    • Handles large state spaces

2. Policy-Based Methods:

  • Learn policy π(a|s) directly

  • Policy Gradient: ∇J(θ) = E[∇log π(a|s)·Q(s,a)]

    • Update policy toward higher-value actions
    • Can handle continuous action spaces
  • REINFORCE: Monte Carlo policy gradient

    • Use full episode return as baseline
    • High variance, slow learning
  • Actor-Critic: Combine value and policy methods

    • Actor: Policy network
    • Critic: Value network
    • Lower variance than pure policy gradient

3. Model-Based Methods:

  • Learn environment model (transition probabilities)
  • Plan actions using model
  • Sample efficient but computationally expensive

Exploration vs Exploitation:

  • Exploration: Try new actions to discover rewards
  • Exploitation: Use known good actions
  • ε-Greedy: With probability ε, explore; else exploit

Applications:

  • Game playing (AlphaGo, DQN for Atari)
  • Robotics control
  • Resource allocation
  • Autonomous driving
  • Recommendation systems

Data Preprocessing & Feature Engineering

Q22: What is Feature Scaling and Normalization?

Answer:

Why Scale Features:

  • Different feature ranges affect distance-based algorithms
  • Gradient descent converges faster with scaled features
  • Some algorithms require normalized inputs

Normalization (Min-Max Scaling):

X_normalized = (X - X_min) / (X_max - X_min)
  • Output range: [0, 1]
  • Preserves original distribution shape
  • Use when: Need bounded range, sensitive to outliers
  • Problem: Sensitive to outliers

Standardization (Z-score):

X_standardized = (X - mean) / std_dev
  • Output: Mean 0, Std 1 (standard normal distribution)
  • Outliers still included but with less influence
  • Use for: Most machine learning algorithms
  • Assumed: Data is normally distributed

Robust Scaling:

X_robust = (X - median) / IQR
  • IQR: Interquartile range (Q3 - Q1)
  • Robust to outliers
  • Use when: Data has outliers

Log Scaling:

X_log = log(X)
  • Compress large values
  • Handle skewed distributions
  • Use for: Highly skewed positive values

When to Use:

Algorithm Scaling Needed
Linear/Logistic Regression Yes
Decision Trees No
SVM Yes
Neural Networks Yes
KNN Yes
Naive Bayes Generally No
Gradient Boosting No

Q23: How to Handle Missing Data?

Answer:

Causes:

  • Data collection errors
  • Equipment failures
  • Non-response in surveys
  • Incomplete databases

Strategies:

1. Deletion:

  • Remove Rows (Listwise Deletion): Delete if any value missing

    • Pros: Simple, preserves relationships
    • Cons: Loses data, biased if not MCAR
  • Remove Columns: Delete if too many missing values (>50%)

    • Pros: Simple
    • Cons: Lose information
  • Remove specific columns: If column has >threshold missing

    • Threshold: 50-80%

2. Imputation (Fill Missing Values):

  • Mean/Median/Mode Imputation:

    • Mean/Median for continuous, Mode for categorical
    • Pros: Simple, fast
    • Cons: Reduces variance, ignores relationships
  • Forward/Backward Fill (Time Series):

    • Propagate last known value forward or next value backward
    • Use for temporal data with trends
  • K-Nearest Neighbors Imputation:

    • Use K nearest neighbors' values
    • Average (continuous) or majority (categorical)
    • Respects local patterns
    • Con: Slow for large datasets
  • Multiple Imputation by Chained Equations (MICE):

    • Model missing values using other features
    • Create multiple imputed datasets
    • Combine results
    • Sophisticated but complex
  • Deep Learning Imputation:

    • Train autoencoder to reconstruct missing values
    • Can capture complex relationships
    • Computationally expensive

3. Using Algorithm-Specific Approaches:

  • XGBoost: Handles missing values internally
  • Some tree-based models: Can handle NaN values

Missing Data Mechanisms:

  • MCAR (Missing Completely At Random): Deletion OK
  • MAR (Missing At Random): Imputation preferred
  • MNAR (Missing Not At Random): Most problematic, requires domain knowledge

Practical Guidelines:

  • <5% missing: Deletion or simple imputation
  • 5-20% missing: MICE or KNN imputation
  • 20% missing: Consider feature importance/removal

  • Always explore patterns in missing data

Q24: Explain Dimensionality Reduction

Answer:

Why Reduce Dimensions:

  • Reduce computation and memory
  • Remove noise and redundant features
  • Prevent overfitting (curse of dimensionality)
  • Visualize high-dimensional data

Principal Component Analysis (PCA):

Process:

  1. Standardize features to mean 0, std 1
  2. Compute covariance matrix
  3. Calculate eigenvalues and eigenvectors
  4. Sort by eigenvalues (descending)
  5. Select top K eigenvectors
  6. Project data onto selected components

Key Concepts:

  • PC1 (First Principal Component): Direction of maximum variance
  • PC2: Orthogonal to PC1, maximum remaining variance
  • Covariance: Captures feature relationships
  • Eigenvalues: Variance along each PC
  • Explained Variance Ratio: (λ_i / Σλ) - How much variance each PC captures

Advantages:

  • Removes correlated features
  • Unsupervised (no label info needed)
  • Computationally efficient
  • Linear transformation

Disadvantages:

  • Loses interpretability (new features are combinations)
  • Assumes linear relationships
  • Requires feature standardization
  • Sensitive to outliers

t-SNE (t-Distributed Stochastic Neighbor Embedding):

  • Preserves local structure
  • Good for visualization (2-3D)
  • Non-linear dimensionality reduction
  • Computationally expensive
  • Prone to weird artifacts

UMAP (Uniform Manifold Approximation and Projection):

  • Faster than t-SNE
  • Preserves both global and local structure
  • Better for exploration

Feature Selection vs Reduction:

  • Feature Selection: Choose subset of original features
    • Interpretable
    • Methods: Filter (correlation), Wrapper (recursive), Embedded (L1)
  • Feature Reduction: Combine features into new ones
    • May lose interpretability
    • PCA, autoencoders

Curse of Dimensionality:

  • Distance metrics become meaningless
  • Sparse data in high dimensions
  • More parameters → overfitting
  • Exponential increase in computational complexity

Model Evaluation & Validation

Q25: Explain Cross-Validation

Answer:

Problem: Using test set multiple times causes overfitting to test set

Solution: Cross-Validation - Multiple train/test splits

Types:

1. K-Fold Cross-Validation: Process:

  1. Divide data into K equal parts
  2. For each fold:
    • Use fold as test set
    • Use remaining K-1 as training set
    • Train model and evaluate
  3. Average metrics across all folds

Example (5-fold):

  • Fold 1: Train on [2,3,4,5], Test on [1]
  • Fold 2: Train on [1,3,4,5], Test on [2]
  • ... and so on
  • Average 5 scores

Advantages:

  • Uses all data for both training and testing
  • More stable estimate of model performance
  • Reduces variance in evaluation

Disadvantages:

  • Computationally expensive (train K times)
  • Slower for large datasets

Typical K: 5 or 10 Stratified K-fold: For imbalanced data, maintain class ratios

2. Leave-One-Out Cross-Validation (LOOCV):

  • K = number of samples
  • Train on all except 1 sample, test on that 1
  • Repeat for all samples
  • Very computationally expensive but unbiased

3. Time Series Cross-Validation:

  • Forward chaining: Expanding window
  • Fold t uses data from periods 1 to t-1 for training, t for testing
  • Respects temporal ordering (no future data leakage)

4. Stratified Cross-Validation:

  • For imbalanced classification
  • Maintains class distribution in each fold
  • Each fold has similar class proportions to full dataset

Metrics to Report:

  • Mean and standard deviation across folds
  • Individual fold scores

Q26: Classification Metrics - Precision, Recall, F1

Answer:

Confusion Matrix:

                Predicted Positive    Predicted Negative
Actual Positive      TP                   FN
Actual Negative      FP                   TN

Metrics:

Accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Proportion of correct predictions
  • Issue: Misleading for imbalanced data
  • Example: 99% accuracy when 99% of data is negative class

Precision:

Precision = TP / (TP + FP)
  • "Of predicted positives, how many are correct?"
  • Answers: "How precise are positive predictions?"
  • High precision: Few false alarms
  • Use when: False positives are costly (spam detection, medical alerts)

Recall (Sensitivity, True Positive Rate):

Recall = TP / (TP + FN)
  • "Of actual positives, how many did we find?"
  • Answers: "Did we catch the positive cases?"
  • High recall: Few missed positives
  • Use when: False negatives are costly (disease detection, fraud)

F1 Score (Harmonic Mean):

F1 = 2 × (Precision × Recall) / (Precision + Recall)
  • Balances precision and recall
  • Useful when: Need balance, data imbalanced
  • Range: 0 to 1 (higher is better)

Precision-Recall Tradeoff:

  • Increasing threshold → Higher precision, lower recall
  • Decreasing threshold → Lower precision, higher recall
  • Adjust threshold based on business needs

ROC-AUC:

ROC: Receiver Operating Characteristic
X-axis: False Positive Rate = FP / (FP + TN)
Y-axis: True Positive Rate = TP / (TP + FN)
AUC: Area Under the Curve
  • Measures classifier's ability to distinguish classes
  • AUC = 0.5: Random classifier
  • AUC = 1.0: Perfect classifier
  • Good metric for imbalanced data
  • Threshold-independent

When to Use Which:

  • Imbalanced data: Precision, Recall, F1, ROC-AUC (not Accuracy)
  • Balanced data: Accuracy, Precision, Recall, F1
  • Spam detection: High precision
  • Cancer detection: High recall
  • General purpose: F1 or ROC-AUC

Q27: Regression Metrics

Answer:

Mean Squared Error (MSE):

MSE = (1/n) Σ(y_actual - y_predicted)²
  • Penalizes large errors heavily (squared term)
  • Same units as squared target
  • Sensitive to outliers
  • Use when: Outliers should be penalized heavily

Root Mean Squared Error (RMSE):

RMSE = √MSE
  • Same units as target variable
  • More interpretable than MSE
  • Still sensitive to outliers
  • Most popular for regression

Mean Absolute Error (MAE):

MAE = (1/n) Σ|y_actual - y_predicted|
  • Linear penalty for errors
  • Robust to outliers
  • Same units as target
  • Use when: Treat all errors equally

R² (Coefficient of Determination):

R² = 1 - (SS_res / SS_tot)

Where:

  • SS_res = Σ(y_actual - y_predicted)² (residual sum of squares)
  • SS_tot = Σ(y_actual - ȳ)² (total sum of squares)

Interpretation:

  • R² = 1: Perfect fit
  • R² = 0: Model as good as mean baseline
  • R² < 0: Model worse than baseline
  • 0 to 1 is typical range
  • Tells proportion of variance explained

Adjusted R²:

Adjusted R² = 1 - [(1 - R²) × (n-1) / (n-p-1)]
  • Penalizes for adding features
  • p: number of features, n: number of samples
  • Use for: Model comparison with different feature counts
  • Higher values favor simpler models

MSE vs MAE:

Metric Outlier Sensitivity Interpretability
MSE/RMSE High Squared units
MAE Low Same units as target

Optimization Techniques

Q28: Explain Gradient Descent Variants

Answer:

Gradient Descent: Iteratively move in direction of negative gradient

θ = θ - α × ∇J(θ)

Where:

  • θ: Parameters/weights
  • α: Learning rate
  • ∇J(θ): Gradient of loss function

Variants:

1. Batch Gradient Descent:

  • Update after computing loss on entire dataset
  • Pros: Stable convergence, smooth loss curve
  • Cons: Slow (requires full dataset in memory)
  • Use when: Small datasets, offline learning

2. Stochastic Gradient Descent (SGD):

  • Update after each individual sample
  • Pros: Fast, online learning, can escape local minima
  • Cons: Noisy updates, oscillating loss curve
  • Use when: Large datasets, online learning

3. Mini-Batch Gradient Descent:

  • Update after processing batch of B samples
  • Pros: Balance between batch and stochastic
  • Cons: Hyperparameter B to tune
  • Use when: Most practical approach

4. Momentum:

v = β × v + (1-β) × ∇J(θ)
θ = θ - α × v
  • Accumulate velocity in gradient direction
  • β ≈ 0.9
  • Pros: Faster convergence, dampens oscillations
  • Cons: One more hyperparameter

5. Nesterov Accelerated Gradient (NAG):

v = β × v + (1-β) × ∇J(θ - α × v)
θ = θ - α × v
  • "Look-ahead" gradient at next position
  • Faster convergence than momentum
  • More stable

6. Adaptive Learning Rate Methods:

AdaGrad (Adaptive Gradient):

g_t² = g_t² + (∇J(θ))²
θ = θ - (α / √(g_t² + ε)) × ∇J(θ)
  • Higher learning rate for sparse gradients
  • Pros: Good for sparse data, self-adjusting
  • Cons: Learning rate always decreases, may become too small

RMSprop:

g_t² = β × g_t² + (1-β) × (∇J(θ))²
θ = θ - (α / √(g_t² + ε)) × ∇J(θ)
  • Uses moving average of squared gradients
  • Fixes AdaGrad's monotonic decay
  • Good for non-stationary problems

Adam (Adaptive Moment Estimation):

m = β₁ × m + (1-β₁) × ∇J(θ)          [First moment estimate]
v = β₂ × v + (1-β₂) × (∇J(θ))²       [Second moment estimate]
m̂ = m / (1 - β₁^t)                    [Bias correction]
v̂ = v / (1 - β₂^t)
θ = θ - α × m̂ / (√v̂ + ε)
  • Combines momentum and adaptive learning rates
  • Default: β₁=0.9, β₂=0.999, α=0.001
  • Works well across many problems
  • Most popular optimizer in practice

Comparison:

Method Convergence Stability Memory Best For
SGD Slow Moderate Low Simple problems
Momentum Fast Good Low General purpose
AdaGrad Medium Good Medium Sparse data
Adam Fast Excellent Medium Most problems

Q29: What is Regularization?

Answer:

Problem: Overfitting - Model learns training data too well

Solution: Add penalty term to loss function

Loss = Data Loss + Regularization Term

Types:

L1 Regularization (Lasso):

Loss = MSE + λ × Σ|θ|
  • λ: Regularization strength (hyperparameter)
  • Penalty proportional to absolute parameter values
  • Can drive weights to exactly zero (feature selection)
  • Sparse solutions

L2 Regularization (Ridge):

Loss = MSE + λ × Σ(θ²)
  • Penalty proportional to squared parameter values
  • Shrinks weights proportionally (never exactly zero)
  • Smoother solutions
  • Handles multicollinearity better

L1 vs L2:

Aspect L1 L2
Penalty Absolute Squared
Sparsity Yes (zero weights) No
Multicollinearity No Yes
Feature Selection Yes No
Solution Corner (sparse) Ridge (smooth)

Elastic Net:

Loss = MSE + λ₁ × Σ|θ| + λ₂ × Σ(θ²)
  • Combination of L1 and L2
  • Best of both worlds
  • Hyperparameters: λ₁, λ₂ (or α, l1_ratio)

Other Regularization Techniques:

Dropout (Neural Networks):

  • Randomly set activations to 0 during training
  • Probability p = 0.5 (typical)
  • Forces network to learn redundant representations
  • Acts as ensemble of thinned networks
  • Turn off during inference

Early Stopping:

  • Monitor validation loss
  • Stop training when validation loss stops improving
  • Prevents overfitting without explicit penalty term
  • Simple and effective

Data Augmentation:

  • Generate synthetic training data
  • Image: rotation, flip, crop, zoom
  • Text: paraphrase, back-translation
  • Increases effective training set size

Hyperparameter λ (regularization strength):

  • λ = 0: No regularization (overfitting)
  • λ = large: Heavy regularization (underfitting)
  • Tune via cross-validation
  • Grid search or random search

Practical Interview Questions

Q30: How would you approach building a recommendation system?

Answer:

Types of Recommendation Systems:

1. Content-Based Filtering: Process:

  1. Extract features of items (movies: genre, director, actors)
  2. Create user preference profile (weighted feature vectors)
  3. Recommend items similar to user's past interactions
  4. Similarity metric: Cosine similarity

Advantages:

  • No cold-start problem for items (use content)
  • Interpretable recommendations
  • No need for user-user interactions

Disadvantages:

  • Cold-start for new users (no history)
  • Limited discovery (similar items only)
  • Feature extraction required

2. Collaborative Filtering:

User-Based:

  • Find similar users (based on rating history)
  • Recommend items liked by similar users
  • Formula: rating = weighted average of similar users' ratings

Item-Based:

  • Find similar items (based on user ratings)
  • Recommend items similar to user's liked items
  • Often more stable than user-based

Advantages:

  • Captures complex relationships
  • Enables discovery (collaborative signal)
  • No content features needed

Disadvantages:

  • Cold-start: New users/items have no history
  • Sparsity: Rating matrix is sparse
  • Scalability: O(users × items) for similarity

3. Matrix Factorization:

  • Decompose user-item rating matrix into lower-rank matrices
  • User matrix: U (n_users × k factors)
  • Item matrix: V (n_items × k factors)
  • Predicted rating: R̂ = U × V^T
  • Learn U, V by minimizing MSE + regularization

Advantages:

  • Handles sparsity better
  • Computational efficient
  • Latent factors capture complex patterns

Disadvantages:

  • Cold-start still problematic
  • Less interpretable

4. Deep Learning Approaches:

  • Neural Collaborative Filtering: Learn user/item embeddings
  • RNNs: Capture sequential patterns (next item prediction)
  • Transformers: Multi-head attention for recommendations
  • Advantages: Complex non-linear relationships

Combining Approaches (Hybrid):

  • Content + Collaborative: Use both information sources
  • Matrix factorization + side information (user features, item features)
  • Multiple models with weighted ensemble

Challenges & Solutions:

Challenge Solution
Cold-start (new user) Use content features, user demographics
Cold-start (new item) Use content features, item metadata
Sparsity Matrix factorization, deep learning
Diversity Re-rank recommendations, diversity loss
Popularity bias Down-weight popular items, calibration
Exploitation vs Exploration Bandit algorithms, Thompson sampling

Evaluation Metrics:

  • RMSE/MAE: Rating prediction accuracy
  • Precision@K: Fraction of top-K recommendations relevant
  • Recall@K: Fraction of relevant items in top-K
  • NDCG: Discounted cumulative gain (ranking quality)
  • Coverage: Fraction of items ever recommended
  • Diversity: Dissimilarity among recommendations

Q31: Explain how you would detect and handle outliers

Answer:

Detection Methods:

1. Statistical Methods:

Z-Score:

z = (x - mean) / std_dev
  • Flag if |z| > 3 (99.7% of data in normal distribution)
  • Issue: Assumes normality, sensitive to outliers themselves

IQR (Interquartile Range):

Lower bound = Q1 - 1.5 × IQR
Upper bound = Q3 + 1.5 × IQR
  • Flag if value outside bounds
  • Robust to outliers
  • Common choice

2. Visualization:

  • Box plots: Visual representation of outliers
  • Histograms: Identify unusual values
  • Scatter plots: Detect multivariate outliers

3. Machine Learning Methods:

Isolation Forest:

  • Random Forest variant
  • Isolates anomalies (fewer splits needed)
  • Good for multivariate outliers
  • Fast and scalable

Local Outlier Factor (LOF):

  • Density-based method
  • Compares local density to neighbors
  • Detects local outliers
  • Parameter: n_neighbors

Mahalanobis Distance:

  • Accounts for feature correlations
  • D = √((x - μ)^T × Σ^(-1) × (x - μ))
  • Better for multivariate data

4. Domain Knowledge:

  • Validate with subject matter expert
  • Business logic (impossible values)
  • Context matters (e.g., outlier in test, normal in production)

Handling Outliers:

1. Deletion:

  • Remove outlier samples
  • Cons: Lose information, reduce sample size
  • Use when: Few outliers, not critical information

2. Transformation:

  • Log, square root, Box-Cox transformations
  • Makes distribution less skewed
  • Reduces impact without removing data

3. Capping (Winsorization):

  • Cap at percentile (e.g., 99th, 1st)
  • Extreme value → nearest percentile value
  • Preserves sample size and variance

4. Robust Methods:

  • Algorithms resistant to outliers
  • Robust scaling (median, IQR)
  • Robust regression
  • Median instead of mean

5. Separate Model:

  • Different model for outliers
  • Identify outliers, train separate predictor
  • Use ensemble to combine predictions

Guidelines:

  • Investigate cause: Error, fraud, or real phenomenon?
  • Preserve for analysis, remove for modeling
  • Use robust methods if uncertain
  • Document decisions for reproducibility

Q32: Walk me through your Machine Learning project

Answer:

Project Example: Predicting house prices

1. Problem Definition:

  • Objective: Predict house prices given features
  • Type: Regression
  • Success metric: RMSE, R² on test set
  • Business value: Price estimation tool

2. Data Collection & Exploration:

  • Dataset: Boston Housing (506 samples, 13 features)
  • Features: Square footage, location, crime rate, etc.
  • Target: Price
  • EDA:
    • Missing values: Check distribution
    • Correlation: Feature-target relationships
    • Distributions: Histograms (identify skewness)
    • Outliers: Box plots
    • Sample size: Adequate?

3. Data Preprocessing:

  • Missing values: Imputation (median for continuous)
  • Outliers: Investigated → keep (real estate variation)
  • Feature scaling: StandardScaler (for linear models)
  • Encoding: One-hot for categorical features

4. Feature Engineering:

  • Create new features:
    • rooms_per_person = rooms / population
    • price_per_sqft = price / sqft
  • Feature selection: Correlation, VIF, permutation importance
  • Remove: Low variance, highly correlated features

5. Train/Test Split:

  • 80-20 split (or time-based for temporal data)
  • Stratified for imbalanced regression

6. Model Selection:

  • Baseline: Mean prediction (R² = 0)
  • Simple: Linear Regression
  • Complex: Random Forest, Gradient Boosting
  • Justification: Comparison and complexity-performance tradeoff

7. Model Training:

  • Hyperparameter tuning: GridSearchCV, RandomizedSearchCV
  • Cross-validation: 5-fold to estimate generalization
  • Parameter examples:
    • Random Forest: n_estimators, max_depth
    • Gradient Boosting: learning_rate, n_estimators

8. Model Evaluation:

  • Metrics: RMSE, MAE, R² on test set
  • Residual analysis:
    • Errors normally distributed?
    • Errors independent of predictions?
    • Heteroscedasticity present?
  • Learning curves: Detect bias/variance
  • Error analysis: Large errors on which samples?

9. Results Interpretation:

  • Feature importance: Which features matter?
  • SHAP values: Individual prediction explanations
  • Prediction intervals: Uncertainty quantification

10. Model Deployment:

  • Serialization: Save model (pickle, joblib)
  • API: Flask/FastAPI for predictions
  • Monitoring: Track performance in production
  • Retraining: Schedule periodic updates

11. Documentation:

  • Code: Clean, commented, modular
  • Report: Methods, results, insights
  • Reproducibility: Random seeds, package versions

Common Challenges & Solutions:

Issue Solution
High RMSE Check data quality, engineer features, try complex model
Train/test gap Regularization, more training data, cross-validation
Predictions biased Feature engineering, different model
Slow predictions Feature selection, model simplification, caching
Unstable model Ensemble methods, regularization, more data

Q33: How would you handle imbalanced data?

Answer:

Problem: Class distribution skewed (e.g., 95% negative, 5% positive)

  • Accuracy misleading
  • Minority class underrepresented
  • Model biased toward majority

Evaluation Metrics (First!):

  • Don't use: Accuracy
  • Use: Precision, Recall, F1, ROC-AUC
  • Confusion matrix: Identify false negatives

Solutions:

1. Data Level Approaches:

Oversampling (Increase minority):

Original: 95% negative, 5% positive
After: 50% negative, 50% positive
  • Random Oversampling: Duplicate minority samples

    • Issue: Overfitting (duplicate exact samples)
  • SMOTE (Synthetic Minority Oversampling):

    • Generate synthetic samples between nearest minority neighbors
    • Interpolate: x_new = x_i + rand(0,1) × (x_neighbor - x_i)
    • More generalizable than duplication
    • Most popular approach
  • Variants: ADASYN (adaptive threshold), Borderline-SMOTE

Undersampling (Decrease majority):

Original: 95% negative, 5% positive
After: 50% negative, 50% positive
  • Random Undersampling: Remove majority samples

    • Issue: Lose information
  • Stratified: Keep representative subset

  • NearMiss: Remove majority samples far from minority

Hybrid: Combine over + undersampling (SMOTE + ENN)

2. Algorithm Level Approaches:

Class Weights:

  • Penalize misclassification of minority class more
  • Loss = Σ(class_weight_i × loss_i)
  • Parameters: balanced (auto), custom weights
  • Works with most algorithms (SVM, Logistic Regression, Neural Networks)

Threshold Adjustment:

  • Default threshold: 0.5
  • Lower threshold (e.g., 0.3) → More positive predictions
  • Adjust based on precision-recall tradeoff
  • Use when: Cost of FN > cost of FP

Ensemble Methods:

  • Random Forest: Can handle imbalance naturally
  • XGBoost: scale_pos_weight parameter
  • Balanced Bagging: Bootstrap with class balancing

3. Hybrid Approaches:

  • Combine sampling with ensemble
  • Pipeline: SMOTE → Train Random Forest
  • Example: Easy Ensemble

4. Cost-Sensitive Learning:

  • Assign higher misclassification cost to minority
  • Total Cost = FN_cost × FN + FP_cost × FP
  • Optimize different cost matrix

Best Practices:

  1. SMOTE BEFORE train-test split:

    X_train, X_test, y_train, y_test = train_test_split(X, y)
    smote = SMOTE()
    X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
    # Then train model on balanced training data
    • Avoid data leakage
    • Test set remains imbalanced (real distribution)
  2. Use appropriate metrics:

    • Precision/Recall/F1 over Accuracy
    • Confusion matrix analysis
    • Cost-based metrics if available
  3. Experiment with combinations:

    • SMOTE + Ensemble often works best
    • Try different ratios (oversample to what level?)
    • Cross-validation to avoid overfitting
  4. Domain context:

    • Cost of false positives vs false negatives
    • Business requirements (tolerable error rates)

Q34: Explain Hyperparameter Tuning

Answer:

Hyperparameters: Settings configured before training (not learned from data)

Examples:

  • Learning rate (α)
  • Number of layers, units
  • Tree depth, min samples split
  • Regularization strength (λ)
  • Batch size
  • Number of epochs
  • Activation function

Tuning Methods:

1. Grid Search:

param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'max_depth': [3, 5, 7],
    'n_estimators': [50, 100, 200]
}
# Test all 3×3×3 = 27 combinations
  • Exhaustively search defined grid
  • Computationally expensive: O(grid_size × CV_folds)
  • Best for small number of hyperparameters
  • Guaranteed to find best in grid

2. Random Search:

param_dist = {
    'learning_rate': uniform(0.001, 0.1),
    'max_depth': randint(3, 10),
    'n_estimators': randint(50, 300)
}
# Sample N random combinations
  • Sample random combinations
  • Faster than grid search: O(N × CV_folds)
  • Good for high-dimensional spaces
  • May miss optimal

3. Bayesian Optimization:

  • Model performance as probability distribution
  • Use past evaluations to guide next search
  • Expected Improvement acquisition function
  • Smart sampling (explore promising regions)
  • Tools: Optuna, Hyperopt, Ray Tune
  • More efficient than grid/random search

4. Evolutionary Algorithms:

  • Genetic algorithm approach
  • Population of hyperparameters
  • Selection, crossover, mutation
  • Slower but can find good solutions

Workflow:

  1. Identify important hyperparameters:

    • Focus on sensitive parameters first
    • Less important: Learning rate vs activation
  2. Understand parameter effects:

    • Increasing max_depth: Overfit risk ↑
    • Increasing λ: Underfitting risk ↑
    • Decreasing learning_rate: Slower convergence
  3. Define search space:

    • Grid: When you know approximate range
    • Random/Bayesian: Larger continuous spaces
  4. Choose CV strategy:

    • K-fold (typically 5)
    • Stratified for classification
    • Time-series for temporal data
  5. Optimize metric:

    • Regression: RMSE or R²
    • Classification: F1, ROC-AUC (not Accuracy if imbalanced)
  6. Monitor overfitting:

    • Validation curve: Training vs validation score
    • If validation plateaus, stop
    • Learning curve: Score vs dataset size

Example: Learning Curves

X-axis: Training set size
Y-axis: Model score

High Bias (Underfitting):
Training error: High
Validation error: High (gap small)
Solution: More complex model

High Variance (Overfitting):
Training error: Low
Validation error: High (gap large)
Solution: More data or regularization

Tips:

  1. Coarse-to-fine: Start broad, narrow down
  2. Parallelization: Use multiple processors
  3. Early stopping: Stop unpromising combinations
  4. Combine with feature engineering:
    • Features matter more than hyperparameters
  5. Avoid multiple comparisons: Use hold-out test set
  6. Document results: For reproducibility and understanding

Q35: How would you explain model predictions to stakeholders?

Answer:

Importance:

  • Build trust in model
  • Identify biases
  • Regulatory compliance (GDPR, HIPAA)
  • Debug model failures
  • Stakeholder buy-in

Methods:

1. Feature Importance:

  • Which features influence predictions most?
  • Methods:
    • Tree models: Information gain
    • Permutation: Decrease in performance when feature shuffled
    • Coefficient magnitude: Linear models

Example:

Top 3 important features for house price prediction:
1. Square footage (importance: 0.45)
2. Location (importance: 0.35)
3. Age (importance: 0.20)

Pros: Simple, global view Cons: Only global importance, not individual predictions

2. SHAP (SHapley Additive exPlanations):

  • Game-theoretic approach
  • Contribution of each feature to prediction
  • Property: Sum of contributions = prediction

SHAP Value Interpretation:

  • Positive: Pushes prediction up
  • Negative: Pushes prediction down
  • Magnitude: How much it contributes

Types:

  • SHAP Summary Plot: Feature importance + direction
  • SHAP Force Plot: Individual prediction breakdown
  • SHAP Dependence Plot: Feature relationship with prediction

Pros: Theoretically sound, individual explanations, global insights Cons: Computationally expensive for large datasets

3. LIME (Local Interpretable Model-Agnostic Explanations):

  • Explain individual prediction
  • Fit simple model (linear) locally around prediction
  • Coefficients show feature contributions

Example:

This email is classified as SPAM because:
- Contains "free" (weight: +0.3)
- Unknown sender (weight: +0.25)
- Multiple links (weight: +0.2)

Pros: Model-agnostic, easy to understand Cons: Local approximation, less theoretically rigorous

4. Decision Rules:

  • Extract interpretable rules from complex models
  • Example:
    IF age > 35 AND income > 50k AND credit_score > 700 THEN approve_loan
    
  • Pros: Human-interpretable
  • Cons: Loss of accuracy vs original model

5. Partial Dependence Plots:

  • Show average prediction as feature varies
  • X-axis: Feature value
  • Y-axis: Prediction
  • Reveals non-linear relationships

6. Counterfactual Explanations:

  • "What if feature X was different?"
  • Example:
    Current: Loan DENIED
    If income was $60k instead of $40k → Loan APPROVED
    
  • Pros: Actionable insights
  • Cons: May be unrealistic

7. Confidence/Uncertainty:

  • Model often outputs single prediction
  • Stakeholders need confidence level
  • Methods:
    • Prediction intervals: [lower, upper] bounds
    • Calibration: Probability ≈ actual frequency
    • Ensemble confidence: Variance across models

Communication Strategy:

1. Know Your Audience:

  • Technical: Detailed metrics, visualizations
  • Non-technical: Simple analogies, business impact

2. Tailor Explanation:

  • High-stakes decisions: More detailed explanation
  • Real-time systems: Faster methods acceptable
  • Regulatory: Complete audit trail

3. Visualization:

  • Feature importance bar chart
  • SHAP plots for individual predictions
  • Confusion matrix for classification
  • Calibration curves

4. Validation:

  • Cross-validation metrics
  • Error on known cases
  • Edge case analysis
  • Robustness checks

5. Document Limitations:

  • Model assumptions
  • Known biases
  • Data limitations
  • Conditions for use

Example Explanation Narrative:

Our recommendation system predicts which customers will churn with 87% accuracy.

For customer #12345:
- Predicted to churn with probability 0.75
- Key reasons:
  1. No purchase in last 3 months (contribution: +0.35)
  2. Older customer segment (contribution: +0.25)
  3. Below average satisfaction (contribution: +0.15)

Confidence: Medium (similar customers have 73% actual churn rate)

Recommendation: Proactive outreach with discount offer

Summary Table: When to Use Different Techniques

Task Best Approach
Classification Logistic Regression (baseline), SVM, Random Forest, Gradient Boosting
Regression Linear Regression (baseline), Ridge/Lasso, Random Forest, Gradient Boosting
Clustering K-means (simple), DBSCAN (arbitrary shape), Hierarchical (dendrograms)
Dimensionality Reduction PCA (linear), t-SNE (visualization), UMAP (preservation)
Text Classification BERT, Logistic Regression with TF-IDF
Sequence Modeling LSTM/GRU for short-range, Transformer for long-range
Image Classification CNN (ResNet, EfficientNet)
Object Detection YOLO (speed), Faster R-CNN (accuracy)
Imbalanced Data SMOTE + Ensemble, cost-weighted learning
Outlier Detection IQR, Isolation Forest, LOF
Recommendation Collaborative Filtering, Content-based, Matrix Factorization

Key Takeaways

  1. Always start with simple baselines before complex models
  2. Data quality matters more than model complexity
  3. Use appropriate evaluation metrics for your problem
  4. Cross-validation for reliable estimates of generalization
  5. Feature engineering is crucial for model performance
  6. Interpretability important for stakeholder trust
  7. Hyperparameter tuning requires systematic approach
  8. Imbalanced data needs special handling
  9. Monitor train-test gap to detect overfitting
  10. Document assumptions and limitations clearly

Recommended Reading & Resources

  • Papers:

    • "Attention Is All You Need" (Transformers)
    • "Deep Residual Learning for Image Recognition" (ResNet)
    • "BERT: Pre-training of Deep Bidirectional Transformers"
  • Books:

    • "Hands-On Machine Learning" by Aurélien Géron
    • "Deep Learning" by Goodfellow, Bengio, Courville
    • "The Hundred-Page Machine Learning Book" by Andriy Burkov
  • Platforms:

    • Coursera: Andrew Ng's ML course, Deep Learning specialization
    • Fast.ai: Practical Deep Learning
    • Kaggle: Competitions and datasets

Good luck with your AI interviews! Remember to practice, stay updated with recent developments, and understand the fundamentals deeply.

https://github.com/ktwillcode/AI-Agents-Interview

https://github.com/Devinterview-io/llms-interview-questions

https://github.com/KalyanKS-NLP/LLM-Interview-Questions-and-Answers-Hub/blob/main/Interview_QA/QA_1-3.md https://skphd.medium.com/interview-questions-and-answers-on-agentic-ai-9069ed148f3d

About

AI topics with interview questions and answers related to ML Basics, GenAI, Agentic AI, and Prompt Engineering.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors