Devinterview.io - SQL
SQL is the standardized, domain-specific, declarative language for managing and querying Relational Database Management Systems (RDBMS). Beyond the classic relational model, modern SQL dialects (e.g., PostgreSQL 18, DuckDB 1.2) also incorporate support for semi-structured data (JSONB) and vector embeddings for RAG (Retrieval-Augmented Generation) workflows.
- DDL (Data Definition Language): Defines schema objects (e.g., `CREATE`, `ALTER`, `DROP`).
- DML (Data Manipulation Language): Manages row-level data (e.g., `INSERT`, `UPDATE`, `DELETE`, `MERGE`).
- DCL (Data Control Language): Manages authorization and security (e.g., `GRANT`, `REVOKE`).
- TCL (Transaction Control Language): Ensures state consistency (e.g., `COMMIT`, `ROLLBACK`, `SAVEPOINT`).
- DQL (Data Query Language): The subset focused on retrieval (e.g., `SELECT`).
- Data Retrieval: Complex aggregation and window functions for analytics.
- ACID Compliance: Ensures Atomicity, Consistency, Isolation, and Durability in high-concurrency environments.
- Vector Search: 2026-standard databases (e.g., those with `pgvector`) utilize SQL extensions to perform $k$-nearest-neighbor ($k$-NN) searches on high-dimensional vectors, enabling AI-driven semantic similarity queries.
- Normalization & Optimization: Usage of B-Tree, LSM-Tree, and BRIN indexes to achieve $O(\log n)$ query complexity for point lookups.
- Distributed Scaling: Leveraging sharding (horizontal partitioning) and replication (read replicas) to manage multi-terabyte datasets.
- SELECT: Primary data retrieval; supports Common Table Expressions (CTEs) and window functions (`OVER`, `PARTITION BY`).
- MERGE: An upsert operation (INSERT or UPDATE) that minimizes round-trips to the server.
- LATERAL JOIN: Allows subqueries to reference preceding tables in the `FROM` clause; critical for complex analytical transformations.
- MATERIALIZED VIEW: Caches the result of a complex query; refreshed asynchronously for high-performance reporting.
The following example demonstrates modern syntax, including standard CTEs and relational constraints.
-- Schema Definition
CREATE TABLE Department (
dept_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
dept_name VARCHAR(100) NOT NULL
);
CREATE TABLE Employee (
emp_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
emp_name VARCHAR(255) NOT NULL,
dept_id INT REFERENCES Department(dept_id)
);
-- Recursive/Complex Query using CTE
WITH DepartmentStats AS (
SELECT dept_id, COUNT(*) as emp_count
FROM Employee
GROUP BY dept_id
)
SELECT
e.emp_name,
d.dept_name,
ds.emp_count
FROM Employee e
JOIN Department d ON e.dept_id = d.dept_id
JOIN DepartmentStats ds ON d.dept_id = ds.dept_id
WHERE ds.emp_count > 0;

- Index Scan vs. Seq Scan: Query planners choose indexes based on cardinality and selectivity to avoid $O(n)$ full table scans.
- Execution Plans: Use `EXPLAIN ANALYZE` to inspect the query execution tree and identify high-cost nodes.
- Concurrency: In 2026 architectures, MVCC (Multi-Version Concurrency Control) is standard, ensuring that readers do not block writers and maximizing throughput.
The dichotomy between Relational Database Management Systems (RDBMS/SQL) and Non-Relational (NoSQL) systems has matured into a hybrid landscape. While fundamental constraints remain, modern distributed systems have blurred the lines via multi-model support.
- SQL (Relational): Built on the Relational Model (Codd, 1970). Data is organized in tables with rigid schemas. Optimized for ACID compliance and complex relational algebra.
- NoSQL (Non-Relational): Optimized for horizontal scaling, high throughput, and varied data structures. Architecture typically follows the CAP Theorem (Consistency, Availability, Partition Tolerance), often favoring AP (Availability/Partition Tolerance) and Eventual Consistency.
- Hybridization: Modern RDBMS (e.g., PostgreSQL 18+) now natively support JSONB, enabling semi-structured data handling within ACID-compliant environments. Conversely, many NoSQL stores (e.g., MongoDB 8.0+) now support multi-document ACID transactions.
- Scalability: SQL is no longer strictly "vertical." Distributed SQL engines (e.g., CockroachDB, TiDB, YugabyteDB) bring horizontal sharding and global replication to the SQL standard, rendering the "SQL = vertical / NoSQL = horizontal" heuristic obsolete.
- Mechanism: Stores data as BSON/JSON binary documents.
- 2026 Context: Integration of vector search capabilities (e.g., MongoDB Atlas Vector Search) for RAG (Retrieval-Augmented Generation) workflows.
- Mechanism: Partitioned hash tables.
- Use Case: Sub-millisecond latency for session management, caching, and feature stores in ML pipelines.
- Performance: $O(1)$ lookup complexity.
- Mechanism: Sparse, multidimensional sorted maps.
- Architecture: Optimized for write-heavy workloads and time-series data at exabyte scale.
- Mechanism: Index-free adjacency (nodes and edges).
- 2026 Context: Critical for Knowledge Graph construction and complex relationship inference in LLM orchestration.
| Feature | SQL (Modernized) | NoSQL |
|---|---|---|
| Schema | Schema-on-write | Schema-on-read |
| Integrity | Enforcement via constraints/FKs | Application-layer validation |
| Joins | Efficient via relational algebra | Often discouraged/denormalized |
| Scaling | Vertical/Distributed SQL (Horizontal) | Native Horizontal (Sharding) |
| Querying | Declarative (SQL Standard) | Proprietary API / GraphQL / SQL-like |
- Normalization vs. Denormalization: In 2026, storage is cheap, but compute and network latency are not. SQL still defaults to 3NF (Third Normal Form) to eliminate update anomalies. NoSQL favors denormalization to ensure that a single request can retrieve all required data, avoiding expensive cross-node joins (which carry $O(\log N)$ to $O(N)$ overhead depending on indexing).
- ACID Compliance: The traditional view that NoSQL sacrifices ACID is outdated. Most enterprise-grade NoSQL solutions now provide tunable consistency, allowing developers to opt in to strong consistency at the cost of latency.
- SQL: Employs Two-Phase Commit (2PC) or Paxos/Raft protocols in distributed versions to guarantee serializability.
- NoSQL: Leverages Vector Clocks or Last-Write-Wins (LWW) conflict resolution in AP systems. Engineers must explicitly manage idempotency at the application level when eventual consistency is utilized.
Select SQL when data integrity, complex transactional reporting, and strict schema enforcement are paramount. Select NoSQL for high-velocity streaming data, sparse datasets (e.g., IoT logs), or when the schema must evolve iteratively without database-level downtime. Consider Distributed SQL if the project requires the familiarity of SQL with the elastic scalability of NoSQL.
Modern SQL classification includes a fifth critical category: Transaction Control Language (TCL), which handles atomic state transitions. DQL is commonly treated as a subset of DML, though it is categorized separately here for pedagogical clarity.
Focused on data retrieval. Modern analytical engines prioritize set-based operations and window functions.
- SELECT: Projects data from relations.
- FROM: Identifies data sources (Tables, CTEs, Subqueries).
- WHERE: Boolean filtering predicates.
- GROUP BY: Aggregation grouping.
- HAVING: Post-aggregation filtering.
- WINDOW FUNCTIONS (e.g., `OVER()`, `PARTITION BY`): Perform calculations across related rows.
- JOIN: Relational algebra combining datasets (Inner, Left, Right, Full, Cross).
- CTE (`WITH`): Recursive or non-recursive temporary result sets for improved query readability and modularity.
Manages schema evolution and metadata persistence.
- CREATE: Defines schemas, tables, indexes, or stored procedures.
- ALTER: Modifies structural definitions (e.g., `ADD COLUMN`, `RENAME TO`, `SET DATA TYPE`).
- DROP: Removes schema objects.
- TRUNCATE: Resets table state (DML-like impact, DDL-metadata overhead).
- COMMENT: Annotates objects, supporting documentation-as-code practices.
Handles row-level operations. 2026 standards emphasize "Upsert" patterns and atomic modifications.
- INSERT: Populates rows. Includes `ON CONFLICT` (PostgreSQL) or `MERGE` (SQL standard) for idempotent data ingestion.
- UPDATE: Modifies existing row state.
- DELETE: Removes rows.
- MERGE: Synchronizes source and target tables; the preferred 2026 standard for ETL/ELT pipelines, completing synchronization in a single $O(n)$ pass.
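The idempotent upsert pattern above can be sketched with `sqlite3`, whose `ON CONFLICT` clause stands in for PostgreSQL's upsert / standard `MERGE` (table and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INT NOT NULL)")

upsert = """
    INSERT INTO inventory (sku, qty) VALUES (?, ?)
    ON CONFLICT (sku) DO UPDATE SET qty = excluded.qty
"""
conn.execute(upsert, ("A-100", 5))  # first run inserts the row
conn.execute(upsert, ("A-100", 8))  # re-running updates in place: idempotent

rows = conn.execute("SELECT sku, qty FROM inventory").fetchall()
print(rows)  # [('A-100', 8)]
```

Because re-running the statement converges on the same final state, ingestion jobs can be safely retried after partial failures.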
Governs the state of transactions to ensure ACID properties.
- COMMIT: Persists all changes within a transaction to disk.
- ROLLBACK: Reverts state to the last checkpoint upon failure.
- SAVEPOINT: Defines checkpoints within a transaction, enabling partial rollback.
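The `SAVEPOINT` / partial-rollback mechanism can be demonstrated with `sqlite3` in manual transaction mode (the table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manage transactions manually
conn.execute("CREATE TABLE audit (msg TEXT)")

conn.execute("BEGIN")
conn.execute("INSERT INTO audit VALUES ('step 1')")
conn.execute("SAVEPOINT sp1")              # checkpoint inside the transaction
conn.execute("INSERT INTO audit VALUES ('step 2')")
conn.execute("ROLLBACK TO SAVEPOINT sp1")  # partial rollback: undoes step 2 only
conn.execute("COMMIT")                     # COMMIT persists step 1

rows = [r[0] for r in conn.execute("SELECT msg FROM audit")]
print(rows)  # ['step 1']
```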
Governs security and governance via RBAC (Role-Based Access Control).
- GRANT: Assigns specific privileges (`SELECT`, `INSERT`, `EXECUTE`) to users/roles.
- REVOKE: Removes previously assigned privileges.
- DENY: Explicitly restricts access, overriding implicit permissions (provider-specific, e.g., T-SQL).
The SELECT statement is the primary Data Query Language (DQL) construct in SQL, designed for retrieving, projecting, and transforming data from relational structures. In modern architectures, it serves as the interface between storage engines and application layers, supporting complex analytical processing and vector-relational hybrid queries.
The SELECT statement is evaluated by the database engine in a specific logical sequence, distinct from the written syntax order:
- FROM / JOIN: Determines the source data set.
- WHERE: Applies predicate filtering.
- GROUP BY: Partitions the data into summary sets.
- HAVING: Filters aggregated result sets.
- SELECT: Evaluates expressions and projects columns.
- DISTINCT: Removes duplicate rows.
- ORDER BY: Defines the final presentation sequence.
- LIMIT / OFFSET: Constrains the final result set volume.
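The logical pipeline above can be observed end to end with a minimal sketch (Python's `sqlite3` as a stand-in engine; the table and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("east", 10), ("east", 40), ("west", 5), ("west", 7), ("north", 100),
])

# FROM -> WHERE (row filter) -> GROUP BY -> HAVING (group filter)
# -> SELECT (projection) -> ORDER BY -> LIMIT
rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount > 6              -- drops ('west', 5) before grouping
    GROUP BY region               -- east=50, west=7, north=100
    HAVING SUM(amount) < 100      -- drops the north group after aggregation
    ORDER BY total DESC
    LIMIT 1
""").fetchall()
print(rows)  # [('east', 50)]
```

Note that `ORDER BY` can reference the `total` alias because projection is evaluated before sorting in the logical order.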
Modern SQL engines (e.g., PostgreSQL 18, DuckDB 1.2) extend traditional SELECT capabilities to support JSONB traversal, Vector embeddings (via extensions like pgvector), and Window Functions.
- Vector Proximity Search: Modern SELECT statements now frequently involve distance operators (e.g., `<=>` for cosine distance in pgvector) to query high-dimensional embeddings: `SELECT content FROM docs ORDER BY embedding <=> '[...]' LIMIT 5;`
- CTEs (Common Table Expressions): The `WITH` clause is now standard, improving query readability and enabling recursion over subqueries.
- Window Functions: Replacing many traditional self-joins with `OVER (PARTITION BY ...)` for performant ranking and running totals at $O(n \log n)$ complexity.
- Relational Integrity Exploration: Validating constraints across normalized schemas.
- Complex Transformation: Utilizing `CASE` expressions or `COALESCE` for business logic injection.
- Analytical Processing (OLAP): Leveraging window functions to perform time-series analysis without destructive row grouping.
- Integration: Providing structured outputs (often `JSON_AGG`) directly for REST or GraphQL API consumers.
Using modern syntax, including standard JOIN formatting and a CTE for improved readability:
WITH OrderSummary AS (
SELECT
o.OrderID,
o.OrderDate,
c.CustomerName,
e.LastName AS SalesRep,
SUM(od.UnitPrice * od.Quantity) AS TotalValue
FROM Orders o
JOIN Customers c ON o.CustomerID = c.CustomerID
JOIN Employees e ON o.EmployeeID = e.EmployeeID
JOIN OrderDetails od ON o.OrderID = od.OrderID
GROUP BY 1, 2, 3, 4
)
SELECT *
FROM OrderSummary
WHERE TotalValue > 1000
ORDER BY OrderDate DESC
LIMIT 100;

The execution cost of a SELECT query typically scales with the volume of data scanned and joined. Use `EXPLAIN ANALYZE` to ensure that join strategies (Nested Loop vs. Hash Join) are optimized for the data distribution.
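The window-function pattern mentioned above (ranking within partitions instead of a self-join) can be sketched with `sqlite3`, which supports window functions from SQLite 3.25 onward (table and rows are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (dept TEXT, name TEXT, salary INT)")
conn.executemany("INSERT INTO staff VALUES (?, ?, ?)", [
    ("eng", "Ada", 120), ("eng", "Grace", 140), ("ops", "Linus", 90),
])

# RANK() is computed per partition without collapsing rows, unlike GROUP BY.
rows = conn.execute("""
    SELECT dept, name,
           RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dept_rank
    FROM staff
    ORDER BY dept, dept_rank
""").fetchall()
print(rows)  # [('eng', 'Grace', 1), ('eng', 'Ada', 2), ('ops', 'Linus', 1)]
```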
The fundamental difference lies in the logical order of operations within the SQL execution pipeline. While both clauses filter rows, they function at distinct stages of the query lifecycle.
The WHERE clause acts as a pre-aggregation filter. It evaluates conditions against individual rows before any GROUP BY operation occurs.
- Execution Order: Applied immediately after the `FROM`/`JOIN` phase.
- Performance: Highly efficient when used with SARGable (Search ARGumentable) expressions. Indexing columns referenced in a `WHERE` clause allows $O(\log n)$ lookup performance via B-tree index traversal.
- Restrictions: Cannot reference aggregate functions (e.g., `SUM()`, `AVG()`) because the aggregation state has not yet been computed.
The HAVING clause serves as a post-aggregation filter. It evaluates conditions against the resulting sets generated by a GROUP BY clause.
- Execution Order: Applied after the `GROUP BY` phase and before the `SELECT` projection.
- Performance: Higher computational cost, as it operates on the intermediate result after data has been aggregated into buckets; this generally forces a scan or sort of that result set, typically $O(n \log n)$ complexity.
- Capabilities: Designed to filter based on aggregate results (e.g., `HAVING COUNT(*) > 100`).
The logical processing order in modern RDBMS (PostgreSQL 17+, SQL Server 2026) follows this sequence:
- `FROM`/`JOIN`
- `WHERE` (row filtering)
- `GROUP BY` (aggregation)
- `HAVING` (group filtering)
- `SELECT` (projection)
- `ORDER BY` (sorting)
Note: In modern analytical query optimization, if a HAVING clause contains no aggregates, most optimizers will rewrite it to a WHERE clause internally to benefit from indexing, though writing explicit WHERE clauses remains the standard for query readability and intent.
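The row-level versus group-level distinction can be verified directly with `sqlite3` (table and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 50), ("alice", 200), ("bob", 300), ("bob", 10)])

# WHERE filters individual rows *before* aggregation: amounts <= 20 never
# enter the SUM. HAVING filters whole groups *after* aggregation, so it may
# reference SUM(amount) while WHERE may not.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 20
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY customer
""").fetchall()
print(rows)  # [('alice', 250), ('bob', 300)]
```

Note that bob's total is 300, not 310: the `WHERE` clause removed the 10-unit row before the group was summed.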
A JOIN clause in SQL is a relational operator used to combine rows from two or more tables based on a join predicate. Conceptually, JOINs implement the relational algebra Join operation. While foreign keys define referential integrity, joins operate on any expression satisfying the join predicate.
Given two tables (such as `Table1` and `Table2` in the examples below), the principal join types are:
- INNER JOIN: Returns only rows where the join predicate evaluates to `TRUE`.
- OUTER JOIN: Returns all rows from the specified table(s) plus matching rows.
  - LEFT JOIN: Preserves all rows from the left table.
  - RIGHT JOIN: Preserves all rows from the right table.
  - FULL JOIN: Preserves all rows from both tables, filling non-matches with `NULL`.
- CROSS JOIN: Produces the Cartesian product of two tables. Complexity: $O(n \times m)$.
- SELF JOIN: A standard join where a table is joined with itself using table aliases.
The predicate evaluates equality or inequality. Use INNER keyword explicitly for semantic clarity.
SELECT T1.A, T1.B, T2.C
FROM Table1 AS T1
INNER JOIN Table2 AS T2 ON T1.B = T2.B;

When using FULL JOIN, use `COALESCE()` or `IFNULL()` to normalize result sets if the join keys differ.
SELECT COALESCE(T1.B, T2.B) AS B, T1.A, T2.C
FROM Table1 AS T1
FULL JOIN Table2 AS T2 ON T1.B = T2.B;

Self joins are frequently used in Recursive Common Table Expressions (CTEs) for hierarchy traversals (e.g., organizational charts).
Modern Pattern (Recursive CTE):
WITH RECURSIVE Hierarchy AS (
SELECT EmpID, Name, ManagerID FROM Employee WHERE ManagerID IS NULL
UNION ALL
SELECT E.EmpID, E.Name, E.ManagerID
FROM Employee E
INNER JOIN Hierarchy H ON E.ManagerID = H.EmpID
)
SELECT * FROM Hierarchy;

- Performance: Avoid `SELECT *` in joins. Explicitly list columns to reduce I/O overhead and minimize the risk of schema drift.
- Optimizer Hints: In distributed SQL environments (e.g., CockroachDB, AlloyDB), consider hash distribution hints to minimize data shuffling during large joins.
- Semantic Note: The original content's "Self Join" example incorrectly labeled the manager names. In practice, a self-join to retrieve human-readable manager names would be:

SELECT E.Name AS Employee, M.Name AS Manager FROM Employee E LEFT JOIN Employee M ON E.ManagerID = M.EmpID;

This creates a semantic link between the `ManagerID` column of the primary alias and the `EmpID` of the secondary alias.
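The recursive CTE pattern above is runnable as-is in SQLite (shown here driven from Python's `sqlite3`; the three-row hierarchy is hypothetical, with a `depth` column added to make the traversal order visible):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmpID INT PRIMARY KEY, Name TEXT, ManagerID INT)")
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)", [
    (1, "Root", None), (2, "Mid", 1), (3, "Leaf", 2),
])

rows = conn.execute("""
    WITH RECURSIVE Hierarchy AS (
        -- anchor member: the root of the org chart
        SELECT EmpID, Name, ManagerID, 0 AS depth
        FROM Employee WHERE ManagerID IS NULL
        UNION ALL
        -- recursive member: join children onto rows found so far
        SELECT E.EmpID, E.Name, E.ManagerID, H.depth + 1
        FROM Employee E
        INNER JOIN Hierarchy H ON E.ManagerID = H.EmpID
    )
    SELECT Name, depth FROM Hierarchy ORDER BY depth
""").fetchall()
print(rows)  # [('Root', 0), ('Mid', 1), ('Leaf', 2)]
```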
A Primary Key is a constraint that uniquely identifies each tuple (record) within a relation (table). It serves as the immutable anchor for Entity Integrity.
- Uniqueness: The database engine enforces a `UNIQUE` constraint, ensuring no two rows share the same key value.
- Non-Nullity: The column must be `NOT NULL`; absence of a value is logically incompatible with identification.
- Immutability: Once assigned, the PK value should not change. Updates to PKs trigger cascading updates across dependent indexes and foreign key relationships, incurring significant overhead.
- Entity Identity: Enforces that the row represents a single, distinct real-world object.
- Relational Anchoring: Acts as the target for Foreign Keys (FK), forming the backbone of normalized relational schema design.
- Clustered Indexing: In clustered-storage engines (e.g., SQL Server, MySQL/InnoDB), the PK is the default key for the clustered index, physically ordering data on disk; PostgreSQL heap tables, by contrast, are not clustered by default.
- Indexing Density: Since the PK dictates physical storage in clustered indexes, using narrow data types (e.g., `BIGINT` or time-ordered UUIDv7) minimizes index fragmentation and I/O latency.
- Join Optimization: Highly performant joins rely on PK-FK alignment. Query optimizers use PK statistics to estimate cardinality, which is critical for choosing efficient join algorithms (e.g., Hash Join vs. Merge Join).
- UUIDs vs. Sequences: While integers are performant for small-scale systems, distributed systems in 2026 prefer UUIDv7 (time-ordered) to avoid B-Tree page splits and contention common with monotonic integer sequences.
- Prefer Surrogate Keys: Avoid "natural keys" (e.g., email, SSN). Business rules evolve; `student_email` may change, breaking references. Use synthetic identifiers (e.g., `BIGINT IDENTITY` or UUIDv7).
- UUIDv7 Adoption: Move away from random UUIDv4 to UUIDv7. The monotonic timestamp prefix ensures index-friendly insertion patterns, mitigating the B-tree rebalancing costs of random inserts.
- Minimize Width: Keep keys as small as possible to minimize memory usage in the buffer pool and index pages.
- Composite Keys: Use only if the relationship is inherently many-to-many or polymorphic. Ensure the leading column is the most frequently filtered attribute.
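The UUIDv7 layout described above (a 48-bit millisecond timestamp followed by random bits) can be sketched in Python. This is a simplified illustration of the idea, not a full standards-compliant implementation:

```python
import os
import time

def uuid7_like() -> str:
    """48-bit ms timestamp + 80 random bits, with UUID version/variant set."""
    ts_ms = int(time.time() * 1000)                 # monotonic-ish prefix
    rand = int.from_bytes(os.urandom(10), "big")    # 80 random bits
    raw = bytearray(((ts_ms << 80) | rand).to_bytes(16, "big"))
    raw[6] = (raw[6] & 0x0F) | 0x70  # version nibble -> 7
    raw[8] = (raw[8] & 0x3F) | 0x80  # RFC variant bits
    h = raw.hex()
    return f"{h[:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:]}"

ids = [uuid7_like()]
time.sleep(0.002)  # advance the millisecond clock
ids.append(uuid7_like())
print(ids[0] < ids[1])  # True: later keys sort later, inserting at the B-tree's right edge
```

Because the timestamp occupies the most significant bits, lexicographic order of the hex form tracks creation time, which is exactly the index-friendly property the bullet above describes.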
Using BIGINT with an identity sequence for high-performance single-node databases, or UUID for distributed environments.
-- Standard sequence-based PK (High Performance)
CREATE TABLE Students (
student_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
student_uuid UUID DEFAULT gen_random_uuid(), -- For distributed systems
grade_level INT NOT NULL,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);
-- Optimization: Adding a covering index for lookup on the UUID
CREATE UNIQUE INDEX idx_students_uuid ON Students(student_uuid);

A Foreign Key (FK) is a database constraint that enforces referential integrity between two tables. It identifies a column (or set of columns) in a referencing table (the child) that links to the primary or unique key of a referenced table (the parent).
- Referential Integrity: Guarantees that every value in the FK column either matches an existing value in the parent table's key or is `NULL` (unless defined as `NOT NULL`).
- Logical Relationship Mapping: Establishes the structural foundation for relational algebra, enabling efficient `JOIN` operations.
- Declarative Referential Integrity (DRI): Offloads consistency management from the application layer to the database engine, reducing the risk of orphaned records.
- Lifecycle Management: Automates consistency via `ON DELETE` and `ON UPDATE` referential actions (e.g., `CASCADE`, `SET NULL`, `RESTRICT`).
- Referential Consistency: The foreign key must match the Primary Key or a Unique Key of the parent table.
- Nullability: FK columns allow `NULL` by default. If the business logic mandates a mandatory relationship, the FK column must be constrained with `NOT NULL`.
- Performance Note: In modern RDBMS (e.g., PostgreSQL 18+, SQL Server 2026), foreign key columns are not automatically indexed. To prevent full table scans during `JOIN` operations and `ON DELETE CASCADE` checks, explicit B-tree indexes are required on FK columns where query frequency is high.
  - Complexity: Verification of an FK constraint during DML operations is $O(\log n)$ per operation given an indexed parent key.
- Normalization: Essential for adhering to 3rd Normal Form (3NF), minimizing redundancy by splitting data into related tables.
- Enforcement of Business Rules: Use `ON DELETE RESTRICT` to prevent the deletion of a parent record if child records exist, ensuring business-critical data is not accidentally purged.
- Distributed Systems: In distributed SQL architectures (e.g., CockroachDB, YugabyteDB), minimize cross-node FK constraints where possible to reduce latency overhead during transactional commits.
-- Parent Table
CREATE TABLE departments (
department_id INT PRIMARY KEY,
name VARCHAR(100) NOT NULL
);
-- Child Table with explicit constraint naming for better error handling
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
department_id INT,
CONSTRAINT fk_department
FOREIGN KEY (department_id)
REFERENCES departments(department_id)
ON DELETE RESTRICT
ON UPDATE CASCADE
);
-- Optimization: Explicit index on the FK column to optimize JOIN performance
CREATE INDEX idx_employees_department_id ON employees(department_id);

The original text incorrectly stated that the foreign key itself must be unique. Correction: The foreign key column in the child table does not need to be unique (as in a one-to-many relationship, many children reference one parent). It is the referenced key in the parent table that must be unique.
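The `ON DELETE RESTRICT` behavior can be verified with `sqlite3`, using a schema mirroring the example above (note that SQLite enforces foreign keys only after an explicit pragma):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK checks
conn.execute("CREATE TABLE departments (department_id INT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE employees (
        employee_id INT PRIMARY KEY,
        department_id INT REFERENCES departments(department_id) ON DELETE RESTRICT
    )
""")
conn.execute("INSERT INTO departments VALUES (1, 'R&D')")
conn.execute("INSERT INTO employees VALUES (10, 1)")

# Deleting a parent that still has children violates ON DELETE RESTRICT.
try:
    conn.execute("DELETE FROM departments WHERE department_id = 1")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True
print(blocked)  # True: the engine refused to orphan the employee row
```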
SQL Injection (SQLi) occurs when an application improperly neutralizes user-supplied input, allowing an attacker to interfere with query logic. Defense relies on the principle of separation of data from code.
Description: The industry-standard primary defense. The SQL query structure is sent to the database engine first, and user inputs are bound later as parameters. The database driver ensures inputs are treated strictly as data, never as executable code, reducing the injection risk to effectively zero.
- Python (SQLAlchemy 2.0 / `psycopg3`) example:
# Using modern Pythonic ORM practices
from sqlalchemy import text
stmt = text("SELECT * FROM users WHERE username = :u AND password = :p")
result = session.execute(stmt, {"u": username, "p": password})

- Benefits:
  - Eliminates the category of "in-band" SQLi.
  - Improves performance via query execution plan caching.
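The data-versus-code separation can be demonstrated with `sqlite3` (a hypothetical users table; the unsafe branch is shown only to illustrate the vulnerability):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('admin', 's3cret')")

payload = "' OR '1'='1"  # classic tautology injection attempt

# Bound parameters: the payload is compared as a literal string, so the
# tautology never reaches the SQL parser and no row matches.
safe = conn.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    ("admin", payload),
).fetchall()
print(safe)  # []

# Naive string concatenation (never do this): the payload rewrites the query.
unsafe = conn.execute(
    f"SELECT * FROM users WHERE username = 'admin' AND password = '{payload}'"
).fetchall()
print(unsafe)  # [('admin', 's3cret')] -- authentication bypassed
```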
Description: Stored procedures isolate data access logic. However, they only prevent SQLi if the procedure body itself uses parameterized logic. Constructing strings dynamically inside a procedure remains vulnerable.
- Implementation Note: In 2026, favor Application-Level ORMs (e.g., Prisma, SQLAlchemy, or Entity Framework) over Stored Procedures for business logic to maintain Type Safety and database portability.
Description: Validate input against a defined schema. In 2026, utilize Pydantic (Python) or Zod (TypeScript/Node.js) for runtime type checking and constraint enforcement.
- Example (Pydantic v2.10+):
from pydantic import BaseModel, StringConstraints
from typing import Annotated
class UserLogin(BaseModel):
# Enforce strict regex at the boundary
username: Annotated[str, StringConstraints(pattern=r'^[a-zA-Z0-9_]{3,20}$')]

- Benefits:
  - Provides "defense in depth."
  - Catches malicious payloads before they hit the database layer.
Description: A critical 2026 infrastructure requirement. The database user account assigned to the web application should possess only the minimum permissions necessary.
- Requirements:
  - The application account should never run as `db_owner` or `superuser`.
  - Use Role-Based Access Control (RBAC) to restrict access to sensitive tables (e.g., `audit_logs` or `sys_users`).
  - Drop permissions for `DROP`, `TRUNCATE`, or `GRANT` operations.
Deprecated/Caution: Manual Code Filtering (sanitization/escaping via regex) is now considered an anti-pattern. It is error-prone, difficult to maintain, and often bypassable via encoding tricks (e.g., Unicode normalization).
- 2026 Standard: Utilize modern Object-Relational Mappers (ORMs). High-quality ORMs default to parameterized queries and provide an abstraction layer that makes it difficult to accidentally write raw, insecure SQL. If custom raw queries are necessary, use strictly enforced parameter binding APIs only.
Normalization is a database design process that minimizes data redundancy and eliminates insertion, update, and deletion anomalies by organizing columns and tables according to functional dependencies.
- 1NF (Atomic): Every attribute contains only atomic values; no sets or arrays.
- 2NF (No Partial Dependency): Table must be in 1NF and every non-prime attribute must be fully functionally dependent on the primary key.
- 3NF (No Transitive Dependency): Table must be in 2NF and no non-prime attribute is dependent on another non-prime attribute.
- BCNF (Boyce-Codd): A stricter version of 3NF where, for every functional dependency $X \to Y$, $X$ must be a superkey.
- 4NF (No Multi-valued Dependencies): Table must be in BCNF and contain no non-trivial multi-valued dependencies.
Storing data in a single flat structure forces redundancy, leading to update, insertion, and deletion anomalies:

| ID | Name | Invoice No. | Item No. | Description | Price |
|---|---|---|---|---|---|
Enforces atomicity and uniqueness. We identify the composite key (InvoiceNo, ItemNo).
- Customers: `(ID [PK], Name)`
- Invoices: `(InvoiceNo [PK], Customer_ID [FK], Date)`
- Invoice_Items: `(InvoiceNo [PK, FK], ItemNo [PK], Description, Price)`
If we stored ItemDescription in the Invoice_Items table, it would depend only on ItemNo, not the full PK (InvoiceNo, ItemNo). To reach 2NF, we move Description to a Products catalog table.
We ensure no non-prime attribute depends on another. If Invoice table included Customer_Address, that would be a transitive dependency (InvoiceNo -> Customer_ID -> Customer_Address). We move address data exclusively to the Customers table.
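The update anomaly that normalization prevents can be demonstrated with a minimal `sqlite3` sketch (the flat invoice table and its values are hypothetical):

```python
import sqlite3

# A flat, denormalized invoice table: the customer's address is repeated on
# every row, so a partial update leaves the data self-contradictory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flat_invoices (invoice_no INT, customer TEXT, address TEXT)")
conn.executemany("INSERT INTO flat_invoices VALUES (?, ?, ?)", [
    (1, "ACME", "Old Street 1"),
    (2, "ACME", "Old Street 1"),
])

# Update anomaly: only one of the two redundant copies gets changed.
conn.execute("UPDATE flat_invoices SET address = 'New Road 9' WHERE invoice_no = 1")
addresses = {r[0] for r in conn.execute(
    "SELECT DISTINCT address FROM flat_invoices WHERE customer = 'ACME'")}
print(addresses)  # two conflicting addresses for the same customer
```

In 3NF the address would live exactly once in a `Customers` table, so a single `UPDATE` keeps every invoice consistent by construction.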
- OLTP vs. OLAP: While 3NF remains the gold standard for OLTP (Online Transaction Processing) to ensure ACID compliance, modern OLAP (Online Analytical Processing) systemsβsuch as those utilizing Snowflake or Databricksβoften utilize Star Schema or Data Vault modeling. These patterns intentionally introduce redundancy to optimize analytical query latency via reduced join depth.
- Performance: Normalization increases join complexity. In high-scale distributed systems, consider the cost of distributed joins ($O(\text{network latency})$) against the risk of anomalies.
Using standardized constraints to enforce integrity:
-- Core Entity: Products (Normalization prevents update anomalies)
CREATE TABLE Products (
ProductID INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
Description TEXT NOT NULL,
UnitPrice DECIMAL(12, 2) CHECK (UnitPrice >= 0)
);
-- Normalized Relationship Tables
CREATE TABLE Customers (
CustomerID INT PRIMARY KEY,
CustomerName VARCHAR(255) NOT NULL
);
CREATE TABLE Invoices (
InvoiceNo INT PRIMARY KEY,
CustomerID INT NOT NULL REFERENCES Customers(CustomerID),
InvoiceDate TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE Invoice_Items (
InvoiceNo INT REFERENCES Invoices(InvoiceNo),
ProductID INT REFERENCES Products(ProductID),
Quantity INT CHECK (Quantity > 0),
PRIMARY KEY (InvoiceNo, ProductID)
);

Audit Note: The implementation above utilizes TIMESTAMPTZ and GENERATED ALWAYS AS IDENTITY, aligning with modern RDBMS capabilities (PostgreSQL 17+, SQL Server 2025) for superior concurrency and type safety.
Denormalization is the intentional introduction of redundancy into a schema to optimize read latency and reduce computational overhead for complex join operations. In 2026, this remains a critical architectural pattern, particularly when balancing read performance against write-side consistency costs.
- Flattening Relationships: Consolidating tables to avoid $N$-way joins. In distributed architectures, this minimizes cross-node communication overhead (network I/O), which often exceeds local CPU costs.
- Pre-aggregation and Materialized Views: Moving from runtime `SUM()` or `COUNT()` calculations to precomputed columns or materialized views. Modern RDBMS (e.g., PostgreSQL 18+, SQL Server 2026) support automated concurrent refresh, mitigating traditional staleness issues.
- Redundant Attribute Replication: Storing frequently accessed dimension attributes (e.g., `Customer_Region`) directly in the fact table. This creates a "Star Schema" optimized for OLAP (Online Analytical Processing) workloads.
- OLAP vs. OLTP:
- OLTP (Transactional): Maintain 3NF (Third Normal Form) to ensure ACID compliance and prevent update anomalies.
- OLAP (Analytical): Adopt Star/Snowflake Schemas. Denormalization here is standard practice to support BI tools and vector-search-enabled databases.
- High-Volume Distributed Systems: In sharded architectures, data is denormalized to ensure that queries satisfy the "locality principle", retrieving all required fields from a single shard without cross-shard `JOIN` operations.
- Vector Search Integration:
- Modern SQL engines now store embeddings alongside business metadata. Denormalizing metadata into the table containing vectors allows for combined semantic and relational filtering in a single scan.
- Consistency Protocols: Denormalization shifts the burden of referential integrity from the database engine to the application layer. Implementation requires idempotent background jobs or Change Data Capture (CDC) pipelines (e.g., Debezium) to ensure eventual consistency across redundant fields.
- Write Amplification: Every update to a source record may now trigger multiple `UPDATE` statements across denormalized replicas. Engineers must calculate the Write Amplification Factor (WAF): $$WAF = \frac{\text{Total writes to storage}}{\text{Application-level writes}}$$ If $WAF$ exceeds an acceptable threshold, normalization should be re-evaluated.
- Schema Evolution: Denormalized structures are notoriously fragile. Use schema registry services (e.g., Confluent or internal equivalents) to manage versions, as changing a replicated column requires a multi-table migration strategy.
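The WAF formula above can be made concrete with a small calculation; the replica count and write volume here are hypothetical:

```python
# One application-level update fans out to the source row plus every
# denormalized replica of the attribute.
def write_amplification_factor(storage_writes: int, app_writes: int) -> float:
    return storage_writes / app_writes

replicas = 5                                  # denormalized copies of, e.g., Customer_Region
app_writes = 100                              # logical updates issued by the application
storage_writes = app_writes * (1 + replicas)  # source row + all replicas

waf = write_amplification_factor(storage_writes, app_writes)
print(waf)  # 6.0
```

A WAF of 6.0 means every logical write costs six physical writes; whether that is acceptable depends on the write volume and the read savings the denormalization buys.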
Denormalization is a read-optimization strategy, not a design default. In 2026, favor Normalization for transactional integrity, and utilize Materialized Views or Read Replicas for denormalized query performance. Reserve physical denormalization only when profiling indicates that query execution plans are constrained by join depth rather than I/O throughput.
Indexes are auxiliary data structures that decouple physical data storage from logical retrieval patterns. By maintaining a pointer-based reference system, they reduce the search space from $O(n)$ sequential scans to $O(\log n)$ (or better) targeted lookups.
- Logarithmic Complexity: Using balanced tree structures, the engine navigates a height-constrained path to locate specific row identifiers (RIDs) or clustered keys.
- Disk I/O Optimization: Indexes permit Index-Only Scans, allowing the database to satisfy a query directly from the index tree without fetching the full data page from secondary storage.
- Join Optimization: Indexes on foreign keys and join predicates allow the engine to employ Merge Joins or Hash Joins more effectively, preventing expensive nested-loop operations on large datasets.
- Covering Indexes: By including additional columns in an index leaf node (`INCLUDE` clause), the engine avoids the "bookmark lookup" penalty, keeping the entire query lifecycle within the index structure.
-
B+Tree: The industry standard for range-based queries. In 2026, most engines utilize B+Trees where internal nodes store keys and leaf nodes store pointers or data, ensuring predictable
$O(\log n)$ performance. - LSM-Trees (Log-Structured Merge-Trees): Increasingly prevalent in distributed databases and NoSQL-to-SQL hybrid engines (e.g., RocksDB-based backends). Optimized for write-heavy workloads by buffering updates in memory.
- In-Memory Hash Indexes: Optimized for $O(1)$ point lookups in high-concurrency, volatile in-memory tables.
- BRIN (Block Range Indexes): The 2026 standard for massive datasets (e.g., time-series data); they store the minimum/maximum values per page range, providing a massive footprint reduction compared to B-Trees.
- GIN/GiST (Generalized Inverted Index / Generalized Search Tree): Standard for full-text search, JSONB documents, and multidimensional spatial data.
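A minimal sketch of the specialized index types above, in PostgreSQL syntax (the `events` and `docs` tables are illustrative assumptions):

```sql
-- BRIN: compact min/max summaries per block range; ideal for append-only
-- time-series tables where timestamps correlate with physical row order.
CREATE INDEX idx_events_ts ON events USING BRIN (created_at);

-- GIN: inverted index over JSONB for containment queries.
CREATE INDEX idx_docs_payload ON docs USING GIN (payload jsonb_path_ops);

SELECT * FROM docs WHERE payload @> '{"status": "active"}';
```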
- Write Amplification: Every `INSERT`, `UPDATE`, or `DELETE` must also update each index structure. An excessive index count increases the write penalty ($O(k \cdot \log n)$, where $k$ is the number of indexes).
- Storage Overhead: Indexes consume primary storage, and index fragmentation can lead to "bloat," necessitating periodic `REINDEX` or `VACUUM` operations to maintain optimal fill factors.
- Optimizer Limitations: The query optimizer may ignore an index if the column's selectivity (the ratio of distinct values to total rows) is low, defaulting to a sequential scan when the estimated cost of random disk I/O exceeds that of sequential I/O.
- Selectivity Analysis: Only index columns with high cardinality. Use `EXPLAIN ANALYZE` or `EXPLAIN (FORMAT JSON)` to audit cost-to-performance ratios.
- Partial Indexing: Utilize filtered/partial indexes (e.g., `CREATE INDEX ... WHERE active = true`) to significantly reduce index size and maintenance overhead.
- Composite Index Ordering: Follow the Left-Prefix Rule. A composite index on `(a, b)` is usable only for filters on `a` or `(a, b)`, not on `b` alone.
- Covering Indexes: Use `INCLUDE` columns to store non-indexed data in leaf nodes, satisfying common queries entirely within the index.
- Automated Statistics: Ensure the database's `ANALYZE` daemon (autovacuum/auto-stats) is configured so the optimizer does not make stale decisions based on outdated distribution statistics.
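The partial-index and Left-Prefix guidance above can be sketched as follows (table and column names are illustrative assumptions):

```sql
-- Partial index: only "active" rows are indexed, shrinking the structure
-- and its write-maintenance cost.
CREATE INDEX idx_users_active_email ON users (email) WHERE active = true;

-- Composite index (region, product): usable for filters on region alone
-- or on (region, product) together, but NOT for product alone
-- (Left-Prefix Rule).
CREATE INDEX idx_sales_region_product ON sales (region, product);
```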
- Primary/Unique Keys: These enforce logical constraints and automatically create a backing index.
- Foreign Keys: These do not automatically create an index in many RDBMS; explicit manual indexing on foreign keys is a critical performance requirement for join-heavy schemas.
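A sketch of explicit foreign-key indexing in PostgreSQL syntax, which indexes primary keys automatically but not foreign keys (the `customers`/`orders` schema is an illustrative assumption):

```sql
-- The PRIMARY KEY gets a backing index automatically; the FOREIGN KEY does not.
CREATE TABLE orders (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers (id)
);

-- Explicit FK index: keeps joins and ON DELETE referential checks
-- off sequential scans.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
```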
The GROUP BY clause in SQL is a fundamental relational operation used to partition a result set into summary rows based on the values of one or more columns. It is the standard companion to aggregate functions such as SUM, COUNT, and AVG.
- Data Aggregation: Transforms multiple input rows into a single output row per group.
- Dimensional Analysis: Enables multi-dimensional reporting by grouping across hierarchies (e.g., `Region`, `Product`).
- Post-Aggregation Filtering: Facilitates group-level predicate logic via the `HAVING` clause, distinct from row-level filtering (`WHERE`).
Consider a Sales table with columns: Product, Region, and Amount.
The engine collapses non-aggregated columns into the grouping set.
-- Standard aggregation for regional metrics
SELECT Region, SUM(Amount) AS TotalSales
FROM Sales
GROUP BY Region;

The WHERE clause filters data before aggregation, whereas the HAVING clause filters result sets after aggregation.
-- Filtering records > 100 before grouping, then excluding regions with low volume
SELECT Region, COUNT(*) AS HighValueTransactions
FROM Sales
WHERE Amount > 100
GROUP BY Region
HAVING COUNT(*) > 50;

For relative contributions, 2026 industry standards favor Window Functions over subqueries for performance efficiency, since the partition total is computed in the same pass rather than by re-scanning the table.
-- Optimized relative contribution using Analytic Window Functions
SELECT Region, Product,
SUM(Amount) / SUM(SUM(Amount)) OVER(PARTITION BY Region) AS RelativeContribution
FROM Sales
GROUP BY Region, Product;

- Indexing: `GROUP BY` operations typically trigger internal sort or hash-aggregate operations. Indexing the grouping columns significantly reduces the cost of the `Sort` phase.
- Hash vs. Stream Aggregation: Modern query optimizers (e.g., PostgreSQL 18+, SQL Server 2026) prefer Hash Aggregation when the input is unsorted and the group states fit in memory, as it avoids expensive disk-based sorting.
- Cardinality Estimation: Incorrect statistics on grouping columns can lead the optimizer to choose an inefficient plan (e.g., a `Nested Loop` instead of a `Hash Aggregate`).
- Complexity: The time complexity of a `GROUP BY` operation is generally $O(N \log N)$ when sorting is required, or $O(N)$ with hash-based aggregation, where $N$ is the number of rows processed.
- Strict Mode: Ensure `ONLY_FULL_GROUP_BY` is enabled in your SQL dialect (e.g., MySQL/MariaDB) to prevent non-deterministic query results where non-aggregated, non-grouped columns are selected.
- Partitioning: When aggregating petabyte-scale datasets, utilize partitioned tables to enable partition-wise aggregation, allowing the engine to aggregate within local memory segments before merging.
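A minimal sketch of the strict-mode behavior, using the `Sales` table from the earlier examples:

```sql
-- Rejected under ONLY_FULL_GROUP_BY (and by the SQL standard): Product is
-- neither grouped nor aggregated, so its value per group is ambiguous.
SELECT Region, Product, SUM(Amount)
FROM Sales
GROUP BY Region;

-- Deterministic rewrite: aggregate the extra column (or add it to GROUP BY).
SELECT Region, MAX(Product) AS sample_product, SUM(Amount) AS TotalSales
FROM Sales
GROUP BY Region;
```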
A Subquery (or nested query) is an inner SELECT statement embedded within an outer query (e.g., SELECT, INSERT, UPDATE, or DELETE). In modern RDBMS architectures, subqueries are classified based on their interaction with the outer scope: Non-Correlated (independent, executed once) and Correlated (dependent on the outer query, executed once per row).
Scalar Subqueries: Return exactly one row and one column.
- Use Case: Comparisons against atomic values.
- Optimization: Modern optimizers utilize memoization to avoid redundant execution if the subquery is non-correlated.
Multi-Row Subqueries: Return one or more rows/columns.
- Use Case: Providing sets for `IN`, `EXISTS`, or `ANY`/`ALL` operators.
- Performance: In 2026, many query optimizers perform Subquery Unnesting (transforming subqueries into `JOIN` operations) to improve execution plan efficiency.
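A sketch of the `EXISTS` pattern that unnesting typically rewrites as a semi-join (the `customers`/`orders` schema is an illustrative assumption):

```sql
-- Customers with at least one order. Most optimizers unnest this into a
-- hash semi-join rather than executing the inner query once per outer row.
SELECT c.customer_name
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.id
);
```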
While subqueries provide logical encapsulation, they are increasingly superseded by Common Table Expressions (CTEs) for readability and recursive operations.
- CTEs (`WITH` clause): Preferred over subqueries for complex logic, as they improve maintainability and allow for recursive tree traversal.
- Window Functions: Should be used instead of correlated subqueries for calculating running totals, ranks, or moving averages, reducing complexity from $O(n^2)$ to $O(n \log n)$ via window partitioning.
- `LATERAL` Joins (PostgreSQL/modern SQL): Replace correlated subqueries in `FROM` clauses, allowing columns from the outer query to be referenced within a subquery or table-valued function.
- Performance: Unoptimized correlated subqueries introduce $O(n \cdot m)$ complexity, where $n$ is the outer table size and $m$ is the inner. Always check `EXPLAIN ANALYZE` output to identify unnecessary row-by-row scanning.
- Debugging: Deeply nested subqueries (e.g., nesting depth > 3) significantly increase cognitive load. CTEs are the 2026 standard for modularizing these patterns.
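A sketch of the window-function rewrite of a correlated subquery: a running total computed in one partitioned pass instead of a per-row `SUM(...)` subquery (the `orders` columns are illustrative assumptions):

```sql
-- Running total per customer in a single pass over the table, rather than
-- re-executing a correlated aggregate subquery for every row.
SELECT customer_id, order_date, amount,
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
       ) AS running_total
FROM orders;
```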
-- 1. Scalar Subquery: Retrieving records above average using CTE for clarity
WITH AvgValue AS (
SELECT AVG(salary) AS global_avg FROM employees
)
SELECT emp_name, salary
FROM employees, AvgValue
WHERE salary > AvgValue.global_avg;
-- 2. Correlated Subquery/Lateral Join:
-- Fetching the most recent order for each customer
SELECT c.customer_name, o.order_date
FROM customers c
CROSS JOIN LATERAL (
SELECT order_date
FROM orders
WHERE customer_id = c.id
ORDER BY order_date DESC
LIMIT 1
) o;
The ORDER BY clause specifies the sorting order of a query's result set. In relational theory, rows in a table are inherently unordered; ORDER BY provides a deterministic sequence for the output.
- Multivariate Sorting: Defines precedence using a comma-separated list. Evaluation proceeds from left to right.
- Directional Directives: ASC (ascending, default) and DESC (descending).
- NULL Handling: Modern SQL standards (ISO/IEC 9075) provide NULLS FIRST and NULLS LAST to explicitly dictate the positioning of null values, whose default position varies across engines (e.g., PostgreSQL defaults to NULLS LAST for ASC).
Sorting is a blocking operation: the engine must consume its entire input before emitting the first row. The performance complexity is typically $O(N \log N)$ for a comparison-based sort.
- Index Utilization: Engines utilize B-Tree indices to optimize sorting. If the ORDER BY columns match the index prefix, the engine can read rows in index order and skip the sort entirely, reducing complexity to $O(N)$.
- Memory Constraints: Large datasets exceeding the sort buffer (e.g., `work_mem` in PostgreSQL) trigger External Merge Sorts involving disk I/O, which significantly degrades latency.
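A sketch of diagnosing a spilled sort in PostgreSQL, using the `sales` table from the examples in this section (the buffer size shown is illustrative):

```sql
-- The plan output shows whether the sort fit in memory:
--   "Sort Method: external merge  Disk: ..." vs. "Sort Method: quicksort  Memory: ..."
EXPLAIN ANALYZE
SELECT * FROM sales ORDER BY units_sold DESC;

-- Raising the per-session sort buffer can keep the sort in memory.
SET work_mem = '256MB';
```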
The use of Column Position (e.g., ORDER BY 1) is discouraged in 2026. This practice is considered technical debt; it breaks if the SELECT list is modified and hinders static analysis and query plan stability. Use explicit column names or aliases.
For "Top-N" queries, modern standards prefer Window Functions over LIMIT for complex partitioning:
-- Standard Top-N pattern using Window Functions
SELECT product_name, units_sold
FROM (
SELECT product_name, units_sold,
RANK() OVER (ORDER BY units_sold DESC) as rnk
FROM sales
WHERE sale_date = '2026-05-20'
) t
WHERE rnk <= 3;

The RAND() function is engine-specific and non-deterministic. For production-grade randomization, use the standard RANDOM() function (PostgreSQL) or RAND() (MySQL/MariaDB), with a cryptographically secure seed if available. Note that ORDER BY RANDOM() performs a full table scan followed by a full sort and is computationally expensive ($O(N \log N)$).
-- Explicit aliasing and NULLS handling
SELECT product_name, sale_date, units_sold
FROM sales
WHERE sale_date = '2026-05-20'
ORDER BY
units_sold DESC NULLS LAST,
product_name ASC
LIMIT 3;
Explore all 100 answers here: Devinterview.io - SQL