Top 15 DBMS Interview Questions

Vasu Deo Sankrityayan | Last Updated: 05 Oct, 2025

Recruiters no longer ask you to recite the six normal forms. They want to hear you reason about data at 2 a.m., when the primary shard is hot and the CFO is fielding questions from stakeholders. The questions in this article have been harvested from real interviews at Google, Amazon, Stripe, Snowflake, and a handful of YC unicorns. Each answer is long enough to build muscle memory, short enough to fit in the conversational window before the interviewer nods or interrupts. Use these DBMS interview questions as a checklist, and a non-exhaustive one at that.

How the Questions Are Grouped

I’ve grouped the questions into three levels, each tailored to a specific amount of experience and rising in difficulty. The list mixes the theoretical questions you’ll actually hear in interviews with hands-on additions to cover the pragmatics.


Beginners

These questions are relevant for those still learning the ropes.

Q1. What is a primary key, and why can’t we just use ROWID?

A. A primary key is a logical, unique identifier chosen by the designer. ROWID (or CTID, _id, etc.) is a physical locator maintained by the engine and can change after maintenance operations such as VACUUM, cluster re-ordering, or shard re-balancing. Exposing a physical pointer would break foreign-key relationships the moment the storage layer reorganises pages. A primary key, by contrast, is immutable and portable across storage engines, which is exactly what referential integrity needs.
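A minimal sketch, assuming PostgreSQL (table and column names are illustrative), showing why the physical locator can't serve as an identifier:

CREATE TABLE customers (
    customer_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- logical, stable identifier
    email       TEXT NOT NULL UNIQUE
);

SELECT ctid, customer_id FROM customers;    -- ctid is the physical row locator
UPDATE customers SET email = lower(email);  -- rewrites the tuples
SELECT ctid, customer_id FROM customers;    -- ctid may have moved; customer_id never does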

Q2. Explain logical data independence vs physical data independence.

A. Logical data independence means you can change the logical schema (e.g., adding attributes or new tables) without rewriting application programs. Physical data independence means you can change the storage structure (e.g., indexes, file organization) without affecting the logical schema or queries.
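A two-statement illustration in SQL (orders is a hypothetical table):

-- Logical change: existing queries that name their columns keep working unchanged.
ALTER TABLE orders ADD COLUMN coupon_code TEXT;
-- Physical change: no schema or query rewrite; only the access path changes.
CREATE INDEX idx_orders_customer ON orders (customer_id);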

Q3. Define 1NF, 2NF, and 3NF in one paragraph each, then tell me which one you would relax first for analytics.

A. 1NF: every column contains atomic, indivisible values (no arrays or nested tables). 2NF: 1NF plus every non-key column is fully dependent on the entire primary key (no partial dependency). 3NF: 2NF plus no transitive dependency—non-key columns may not depend on other non-key columns. In star-schema analytics, we usually drop 3NF first: we happily duplicate the customer’s segment name in the fact table to save a join, accepting update anomalies for read speed.
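A sketch of that deliberate 3NF violation in a star schema (all names are illustrative):

CREATE TABLE fact_sales (
    sale_id      BIGINT PRIMARY KEY,
    customer_id  BIGINT NOT NULL,   -- still points at the customer dimension
    segment_name TEXT   NOT NULL,   -- copied from the dimension: a transitive dependency, on purpose
    amount       NUMERIC(12,2) NOT NULL
);
-- Reads skip the join to the dimension; the cost is updating every fact row if a segment is renamed.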

Q4. What’s the difference between a schema and an instance in a DBMS?

A. The schema is the database’s overall design (its blueprint), usually fixed and rarely changed. The instance is the actual content of the database at a given moment. The schema is stable; the instance changes every time data is updated.
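A quick illustration (users is a hypothetical table):

CREATE TABLE users (user_id INT PRIMARY KEY, name TEXT);  -- schema: the blueprint
INSERT INTO users VALUES (1, 'Ada');                      -- instance: the content right now
DELETE FROM users WHERE user_id = 1;                      -- the instance changed; the schema did not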

Q5. State the four ACID properties and give a one-sentence war story that violates each.

A. One war story per property:

  • Atomicity: a debit posts, but the credit disappears, and the money vanishes.
  • Consistency: a negative balance is written; the check constraint fires, and the whole transaction rolls back.
  • Isolation: two concurrent bookings grab the last seat; both commit, leading to an oversold flight.
  • Durability: commit returns success, power fails, the write-ahead log is on the corrupted SSD, leading to data loss.
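The classic transfer transaction, sketched in SQL (accounts is a hypothetical table), shows where each property bites:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;  -- atomicity: both updates or neither
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
-- consistency: a CHECK (balance >= 0) constraint aborts the whole transaction on violation
COMMIT;  -- durability: success is reported only after the WAL record reaches stable storage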

Q6. What are the different types of data models in DBMS?

A. Common models include:

  • Object-oriented model (objects, classes, inheritance).
  • Hierarchical model (tree structure, parent-child).
  • Network model (records connected by links).
  • Relational model (tables, keys, relationships).
  • Entity-Relationship model (high-level conceptual).

Intermediate

These questions assume some hands-on experience with databases.

Q7. What is a deadlock in DBMS? How can it be handled?

A. Deadlock occurs when two transactions each hold a resource and wait for the other’s resource, blocking forever. Solutions:

  • Avoidance (Banker’s algorithm).
  • Prevention (lock ordering, timeouts; see the sketch after this list).
  • Detection (wait-for graph, cycle detection).
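A prevention sketch in SQL (accounts is a hypothetical table). Two sessions that update rows 1 and 2 in opposite orders can deadlock; acquiring row locks in a canonical key order sidesteps the cycle:

BEGIN;
-- Lock both rows up front, in ascending key order, before touching either:
SELECT * FROM accounts WHERE account_id IN (1, 2) ORDER BY account_id FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;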

Q8. What is checkpointing in DBMS recovery?

A. A checkpoint is a marker where the DBMS flushes dirty pages and logs to stable storage. During crash recovery, the system can start from the last checkpoint instead of scanning the entire log, making recovery faster.
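In PostgreSQL, the relevant knobs and the manual trigger look like this (values shown are defaults, not recommendations):

SHOW checkpoint_timeout;  -- maximum time between automatic checkpoints (default 5min)
SHOW max_wal_size;        -- WAL volume that forces an earlier checkpoint
CHECKPOINT;               -- force an immediate checkpoint (superuser command)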

Q9. What does the optimizer really do during a cost-based join choice between nested-loop, hash, and merge?

A. It estimates the cardinality of each child, consults column statistics (most common values, histograms), and considers available memory (work_mem), indexes, and sort order. If the outer side is tiny (after filters) and the inner side has a selective index, nested-loop wins. If both sides are large and unsorted, the planner builds an in-memory hash table on the smaller side (hash join). If both are already sorted (index scan or a previous sort step), merge join is O(n+m) and memory-cheap. The final cost number is I/O + CPU weighted by empirical constants, fed by the statistics stored in pg_statistic or mysql.column_stats.
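You can watch the choice happen with EXPLAIN (the tables here are hypothetical):

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.created_at > now() - interval '1 day';
-- A tiny filtered outer side plus an index on customers.customer_id usually yields
-- a Nested Loop; widen the filter and the same query typically flips to a Hash Join.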

Q10. Explain phantom read and which isolation level prevents it.

A. Transaction A runs SELECT SUM(amount) WHERE status='PENDING' twice; between runs, transaction B inserts a new pending row. A sees a different total—phantom. Only SERIALIZABLE (or Snapshot Isolation with predicate locks) prevents phantoms; REPEATABLE READ does not (contrary to folklore in MySQL).
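A two-session sketch (payments is a hypothetical table; this follows the SQL-standard definition of REPEATABLE READ, which permits phantoms, while actual engine behaviour varies):

-- Session A:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT SUM(amount) FROM payments WHERE status = 'PENDING';
-- Session B commits meanwhile: INSERT INTO payments(amount, status) VALUES (50, 'PENDING');
SELECT SUM(amount) FROM payments WHERE status = 'PENDING';  -- may include the phantom row
COMMIT;
-- Under BEGIN ISOLATION LEVEL SERIALIZABLE; the anomaly cannot commit:
-- one of the two transactions fails with a serialization error and must retry.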

Advanced

You’ve deleted production data and lived through that.

Q11. Your 2 TB table must be sharded. Give the exact shard-key decision tree you would defend to the CTO.

A. 1: List the top 10 queries by frequency and by bandwidth; the shard key must serve both.
2: Choose a high-cardinality, uniformly distributed column (user_id, not country_code).
3: Ensure the column appears in every multi-row transaction; otherwise, two-phase commit becomes inevitable.
4: Check for hot-spot risk (e.g., one celebrity user): use hash-shard plus per-shard autoincrement, not range-shard (see the sketch after this list).
5: Prove you can re-shard online with logical replication; present a dry-run cut-over script. Only when all five boxes are ticked do you sign the design doc.
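A hash-distribution sketch using PostgreSQL's declarative partitioning syntax (events and user_id are illustrative; the same idea applies to cross-node sharding):

CREATE TABLE events (
    user_id  BIGINT NOT NULL,  -- high-cardinality, uniformly distributed shard key
    payload  JSONB
) PARTITION BY HASH (user_id);

CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ...repeat for remainders 1 through 3. A celebrity user still lands on one
-- partition, but hash distribution avoids the hot tail that range-sharding creates.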

Q12. Walk me through the internal steps PostgreSQL takes from INSERT statement to a durable disk byte.

A. 1: Parser → raw parse tree.
2: Analyzer → query tree with types.
3: Planner → one-node ModifyTable plan.
4: Executor grabs a buffer pin on the target page, inserts the tuple, and sets xmin/xmax system columns.
5: The WAL record describing the change is appended to the in-memory WAL buffers.
6: COMMIT writes WAL to disk via XLogWrite—now crash-safe.
7: Background writer later flushes dirty data pages; if the server dies before that, redo recovery replays WAL. Durability is guaranteed at step 6, not step 7.
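Several of these steps are observable from plain SQL (t is a hypothetical table; the functions and settings shown are standard PostgreSQL):

BEGIN;
INSERT INTO t(val) VALUES ('x');
SELECT xmin, xmax, ctid FROM t WHERE val = 'x';  -- step 4: tuple header columns set by the executor
SELECT pg_current_wal_lsn();                     -- WAL insert position after step 5
COMMIT;                                          -- step 6: WAL flushed; now crash-safe
SHOW synchronous_commit;                         -- 'on' means COMMIT waits for that flush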

Q13. Design a bitemporal table that keeps valid time (when the fact was true in reality) and transaction time (when the database knew it). Write the primary key and the SQL to correct a retroactive price change.

A. Primary key: (product_id, valid_from, transaction_from). Correction is an append-only insert with a new transaction_from; business rows are never rewritten, and the only permitted mutation is closing transaction_to on the superseded row.

INSERT INTO price(product_id, price, valid_from, valid_to, transaction_from, transaction_to)
VALUES (42, 19.99, '2025-07-01', '2025-12-31', now(), '9999-12-31');

To end the previous incorrect assertion:

UPDATE price SET transaction_to = now()
WHERE product_id = 42
  AND valid_from <= '2025-07-01' AND valid_to > '2025-07-01'
  AND transaction_to = '9999-12-31';

Selects then use FOR SYSTEM_TIME AS OF (SQL:2011 syntax; in engines without it, filter on transaction_from/transaction_to explicitly) together with a BETWEEN valid_from AND valid_to predicate to retrieve the correct temporal slice.
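One way to declare the table itself (a sketch; range types with exclusion constraints would be a stricter alternative):

CREATE TABLE price (
    product_id       BIGINT        NOT NULL,
    price            NUMERIC(10,2) NOT NULL,
    valid_from       DATE          NOT NULL,  -- valid time: when the fact was true in reality
    valid_to         DATE          NOT NULL,
    transaction_from TIMESTAMPTZ   NOT NULL DEFAULT now(),         -- transaction time: when the DB learned it
    transaction_to   TIMESTAMPTZ   NOT NULL DEFAULT '9999-12-31',  -- open-ended until superseded
    PRIMARY KEY (product_id, valid_from, transaction_from)
);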

Conclusion

The list consists of an eclectic mix of questions from hands-on to purely theoretical. What this really means is you’re being tested on DBMS thinking, not syntax: keys and normalization, ACID and isolation anomalies, query planning, recovery and WAL, deadlocks, shard-key strategy, Postgres internals, and bitemporal modeling. The goal is to surface trade-offs, invariants, failure modes, and operational judgment.

Skip memorizing clauses. Show why primary keys outlive ROWIDs, when REPEATABLE READ still leaks phantoms, why a hash join beats nested loops, and how you’d reshard without downtime. If you can walk through those decisions out loud, you’ll come across as a data systems engineer.

