Data Engineering Interview Questions That Decide Indian Hiring Loops (2026)
The internet has roughly nine hundred "Top 100 Data Engineer Interview Questions" articles, and they all share a flaw: they're dictionaries. "What is Apache Spark?" followed by a definition you could paste from the docs. Memorize all hundred and you'll still fail the loop, because Indian data engineering interviews don't test definitions — each round tests one specific capability, and the questions are just instruments for measuring it.
So this post is organized the way the interview actually is: four rounds, the questions that decide each one, what the interviewer is really measuring, and the trap hidden inside the familiar-sounding ones. It pairs with the interview-loop table in our data engineer roadmap; this is the round-by-round detail.
Round 1 — The SQL screening (where most candidacies end)
A timed online test, usually HackerRank or a company platform, 45–90 minutes. This round exists to shrink three hundred applicants to fifteen, and it does so with a small set of recurring patterns. If you can do these four cold, you pass; if you can't, nothing else in this article matters yet.
Top-N-per-group. The expected shape is ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) in a CTE, filtered in the outer query.
The trap: solving it with a correlated subquery or self-join. It can work, but under time pressure it breaks on ties and edge cases, and it signals you learned SQL before window functions existed. Also know when RANK vs DENSE_RANK vs ROW_NUMBER changes the answer — interviewers deliberately put ties in the data.
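A minimal sketch of that shape, run through Python's built-in sqlite3 so you can try it locally. The salaries table, its rows, and the top_two helper are made up for illustration; the ties are planted on purpose so you can watch the three ranking functions disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salaries (dept TEXT, emp TEXT, salary INTEGER);
INSERT INTO salaries VALUES
  ('eng', 'asha', 300), ('eng', 'bala', 300), ('eng', 'chitra', 200),
  ('ops', 'dev', 150), ('ops', 'esha', 120), ('ops', 'farid', 120);
""")

# Rank inside a CTE, filter in the outer query.
TOP_N_SQL = """
WITH ranked AS (
  SELECT dept, emp, salary,
         ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC, emp) AS rn,
         RANK()       OVER (PARTITION BY dept ORDER BY salary DESC)      AS rnk,
         DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC)      AS drnk
  FROM ranked_source
)
SELECT dept, emp
FROM ranked
WHERE {col} <= 2
ORDER BY dept, salary DESC, emp
""".replace("ranked_source", "salaries")

def top_two(rank_column):
    # rank_column is one of our own literals: 'rn', 'rnk' or 'drnk'
    return conn.execute(TOP_N_SQL.format(col=rank_column)).fetchall()

for col in ("rn", "rnk", "drnk"):
    print(col, top_two(col))
```

On this data ROW_NUMBER returns exactly two people per department, RANK lets the tie at second place in ops through, and DENSE_RANK also admits the third eng row. Which one is "right" depends on how the question words the tie, so ask.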
Deduplication. The bread and butter of real pipelines, which is why it shows up everywhere. The expected shape: ROW_NUMBER() partitioned by the business key, ordered by the timestamp descending, keeping rn = 1.
The trap: DISTINCT. The moment you reach for it, the interviewer knows you haven't cleaned production data — DISTINCT removes identical rows, not business-duplicate rows with different timestamps.
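Here is a small sketch of why DISTINCT falls short, again using sqlite3. The customer_updates table and its rows are hypothetical: customer 1 has two versions with different timestamps, customer 2 has one exact duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_updates (customer_id INTEGER, email TEXT, updated_at TEXT);
INSERT INTO customer_updates VALUES
  (1, 'a@old.example', '2026-01-01 09:00:00'),
  (1, 'a@new.example', '2026-01-03 10:30:00'),
  (2, 'b@x.example',   '2026-01-02 08:00:00'),
  (2, 'b@x.example',   '2026-01-02 08:00:00');
""")

# DISTINCT only collapses rows that are identical in every column.
distinct_rows = conn.execute("""
SELECT DISTINCT customer_id, email, updated_at
FROM customer_updates
ORDER BY customer_id, updated_at
""").fetchall()

# ROW_NUMBER keeps one row per business key: the most recent one.
latest_rows = conn.execute("""
WITH ranked AS (
  SELECT customer_id, email, updated_at,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY updated_at DESC) AS rn
  FROM customer_updates
)
SELECT customer_id, email, updated_at
FROM ranked
WHERE rn = 1
ORDER BY customer_id
""").fetchall()

print(len(distinct_rows), "rows after DISTINCT")
print(len(latest_rows), "rows after ROW_NUMBER dedup")
```

DISTINCT still leaves both versions of customer 1, while the window-function version leaves one row per customer. If two rows can share the same timestamp, add a tiebreaker column to the ORDER BY so the result is deterministic.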
Gaps-and-islands. The classic technique: the difference between a row's date and its ROW_NUMBER() is constant within a consecutive run — group on that difference.
The trap is panic. This pattern looks impossible until you've solved it twice, after which it's mechanical. It appears in a large share of product-company screens precisely because it can't be improvised.
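To make the technique concrete, here is a sketch that finds login streaks with sqlite3. The logins table is invented for the example, and it assumes at most one row per user per day; deduplicate first if that doesn't hold. The date arithmetic uses SQLite's julianday, so other engines will need their own date-difference function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, login_date TEXT);
INSERT INTO logins VALUES
  (1, '2026-01-01'), (1, '2026-01-02'), (1, '2026-01-03'),
  (1, '2026-01-05'), (1, '2026-01-06'),
  (2, '2026-01-10');
""")

# Within a consecutive run, (day number - row number) stays constant,
# so that difference works as a group key for the run.
ISLANDS_SQL = """
WITH numbered AS (
  SELECT user_id, login_date,
         CAST(julianday(login_date) AS INTEGER)
           - ROW_NUMBER() OVER (PARTITION BY user_id
                                ORDER BY login_date) AS grp
  FROM logins
)
SELECT user_id,
       MIN(login_date) AS streak_start,
       MAX(login_date) AS streak_end,
       COUNT(*)        AS days
FROM numbered
GROUP BY user_id, grp
ORDER BY user_id, streak_start
"""

streaks = conn.execute(ISLANDS_SQL).fetchall()
for row in streaks:
    print(row)
```

User 1's missing 4 January splits the logins into a three-day streak and a two-day streak. Once you see the grp column as "which run am I in," the rest is an ordinary GROUP BY.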
Month-over-month change and rolling averages. LAG for the former, a frame clause (ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) for the latter.
The trap: NULLs and division. The first month has no previous month; dividing by zero or returning NULL unhandled is exactly the kind of edge the auto-grader's hidden test cases check. Read the expected output format twice — more candidates fail on output format than on logic.
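A sketch of both edges, assuming the question asks for month-over-month growth and a 7-row rolling average. The monthly_revenue and daily_orders tables and their numbers are made up. NULLIF turns a zero denominator into NULL instead of an error, which is the portable guard even though some engines are forgiving here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly_revenue (month TEXT, revenue REAL);
INSERT INTO monthly_revenue VALUES
  ('2026-01', 100.0), ('2026-02', 0.0), ('2026-03', 50.0), ('2026-04', 75.0);
CREATE TABLE daily_orders (day TEXT, orders REAL);
""")
# Eight days of made-up order counts: 10, 20, ..., 80
conn.executemany(
    "INSERT INTO daily_orders VALUES (?, ?)",
    [(f"2026-01-{d:02d}", 10.0 * d) for d in range(1, 9)],
)

# First month: LAG is NULL. Month after a zero month: NULLIF makes it NULL.
growth = conn.execute("""
SELECT month,
       ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
             / NULLIF(LAG(revenue) OVER (ORDER BY month), 0), 1) AS growth_pct
FROM monthly_revenue
ORDER BY month
""").fetchall()

# Current row plus the six before it: a 7-row window.
rolling = conn.execute("""
SELECT day,
       AVG(orders) OVER (ORDER BY day
                         ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
FROM daily_orders
ORDER BY day
""").fetchall()

print(growth)
print(rolling[-2:])
```

January and March come back as NULL growth: one has no previous month, the other follows a zero. Note that the first six rolling rows average over fewer than seven days; if the expected output wants NULL there instead, that is a format detail worth checking before you submit.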
How to prepare this round: not by reading. Pick one practice platform, solve the medium-difficulty SQL set until window-function problems take you under ten minutes, then do timed mock tests. Volume of reps is the only variable that moves this round.
Round 2 — Live coding (where silent coders fail)
A shared editor, a human watching, usually one SQL problem and one Python problem. The secret of this round: the code is only half the score. The other half is whether you think out loud, because the interviewer is simulating what you'd be like to work with.
Pandas or plain Python are both fine; announce your choice and why. Narrate the messy-data decisions: what happens to rows with missing keys? To malformed dates? State your assumption, ask if it's acceptable, and proceed.
The trap: candidates prep leetcode trees and graphs, then meet a dirty CSV. Indian data engineering loops rarely run FAANG-style algorithm rounds — the Python bar is "clean, readable, defensive scripting," and the differentiator is handling the bad rows without being told.
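As an illustration of "clean, readable, defensive," here is a plain-Python sketch that cleans a small dirty CSV. The column names, the sample rows, and the decision to reject bad rows rather than repair them are all assumptions; in the room, each of those is something you would say out loud and confirm.

```python
import csv
import io
from datetime import datetime

# Hypothetical input: one good row, three bad ones, one with stray whitespace.
RAW_CSV = """order_id,customer_id,order_date,amount
1001,C1,2026-01-05,250.00
1002,,2026-01-06,99.50
1003,C2,06/01/2026,120.00
1004,C3,2026-01-07,abc
1005,C4,2026-01-08, 75.25
"""

def clean_orders(text):
    """Return (good_rows, rejected_rows); each rejection carries a reason."""
    good, rejected = [], []
    reader = csv.DictReader(io.StringIO(text))
    # Data starts on file line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        customer_id = (row.get("customer_id") or "").strip()
        if not customer_id:
            rejected.append((line_no, "missing customer_id"))
            continue
        try:
            raw_date = (row.get("order_date") or "").strip()
            order_date = datetime.strptime(raw_date, "%Y-%m-%d").date()
        except ValueError:
            rejected.append((line_no, "malformed order_date"))
            continue
        try:
            amount = float((row.get("amount") or "").strip())
        except ValueError:
            rejected.append((line_no, "malformed amount"))
            continue
        good.append({
            "order_id": (row.get("order_id") or "").strip(),
            "customer_id": customer_id,
            "order_date": order_date,
            "amount": amount,
        })
    return good, rejected

good, rejected = clean_orders(RAW_CSV)
print(len(good), "clean rows")
for line_no, reason in rejected:
    print(f"line {line_no}: {reason}")
```

Keeping the rejected rows with a line number and a reason, instead of silently dropping them, is the part that tends to impress: it shows you expect someone to ask where the missing rows went.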
The follow-up is the real question. Talk partitioning, predicate pushdown, why SELECT * hurts on columnar storage, when an index helps and when the warehouse doesn't even have one in the OLTP sense.
The trap: treating the follow-up as criticism and getting defensive. It's an invitation to show depth — interviewers escalate until you reach your edge, and saying "I'd need to check the query plan, here's what I'd look for" at that edge scores better than bluffing.
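If "check the query plan" feels abstract, this sketch shows the habit on a toy scale using SQLite's EXPLAIN QUERY PLAN. The orders table and index name are hypothetical, and SQLite is a row store: a columnar warehouse has its own plan output and relies on partition pruning rather than this kind of index, so treat it as practice for reading plans, not as a model of a warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER, amount REAL)"
)

QUERY = "SELECT * FROM orders WHERE customer_id = 42"

def plan(sql):
    # The fourth column of each EXPLAIN QUERY PLAN row is a readable detail string.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

plan_before = plan(QUERY)   # no usable index yet: expect a full scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(QUERY)    # the planner can now search via the index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the shift from a scan to a search that names the index is what to look for, and saying so is the kind of answer the follow-up is fishing for.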
Round 3 — System design (where tool-name-droppers fail)
"Design a pipeline that ingests payment events from 50 services and makes them queryable for analysts within 15 minutes." One question, forty-five minutes, and a consistent failure mode: candidates who answer with an instant shopping list — "Kafka, Spark, S3, Airflow, Snowflake" — and nothing else.
The framework that passes this round has five moves, in order.
1. Clarify scale and freshness: how many events per second, how fresh is "fresh," what breaks if data is late? The numbers change everything, and asking for them is itself scored.
2. Choose batch vs streaming and defend it: 15-minute freshness is micro-batch territory; saying "streaming everywhere" signals you've never paid a streaming bill.
3. Design the storage layers: raw immutable landing zone, cleaned/conformed layer, serving layer. Mention file formats (Parquet) and partitioning strategy, and why.
4. Handle failure explicitly: what happens when a source sends garbage, when a job dies mid-run, when the same event arrives twice. Idempotency and reprocessing are the words interviewers wait for.
5. Talk cost and operations: who gets paged, what's monitored, what this costs at this scale.
The single highest-signal sentence in a design round is some version of: "I'd make the load idempotent so reruns are safe — here's how." Exactly the thing pure course-theory candidates never say, and the thing anyone who has operated a real pipeline can't stop saying.
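One way to back up the "here's how," sketched with sqlite3: key the target table on the event's unique ID and upsert, so a rerun or a duplicate event overwrites instead of appending. The payments table, the event IDs, and load_batch are hypothetical, and a real warehouse would more likely use MERGE or a partition overwrite; the point is the property, not the syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (event_id TEXT PRIMARY KEY, amount REAL, status TEXT)"
)

# Insert new events; on a repeated event_id, overwrite with the incoming values.
UPSERT_SQL = """
INSERT INTO payments (event_id, amount, status) VALUES (?, ?, ?)
ON CONFLICT(event_id) DO UPDATE SET
  amount = excluded.amount,
  status = excluded.status
"""

def load_batch(batch):
    # 'with conn' commits on success and rolls back if anything raises,
    # so a job that dies mid-run leaves no half-loaded batch behind.
    with conn:
        conn.executemany(UPSERT_SQL, batch)

batch = [
    ("evt-1", 500.0, "captured"),
    ("evt-2", 120.0, "captured"),
    ("evt-1", 500.0, "captured"),  # the same event delivered twice
]

load_batch(batch)
load_batch(batch)  # simulate a rerun after a failure

row_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(row_count, "rows after two loads")
```

Two loads of a batch containing a duplicate still leave two rows. In the interview, name the key you would deduplicate on and what happens when the source has no reliable one; that trade-off is where the conversation usually goes next.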
The follow-up challenges probe whether your design survives contact with reality. Good answers involve versioned raw data, a late-arriving-data strategy, and a debugging narrative that starts from the serving layer and walks backward through lineage.
The trap: redesigning from scratch when challenged. Defend the design, amend it where the challenge is right, and say which trade-off you're consciously accepting. Conviction with flexibility is the grading rubric.
Round 4 — The project defense (where padded resumes die)
Hiring managers run this round like an audit, and it's brutally effective because it can't be crammed: a thirty-minute conversation about projects on your resume. Five questions appear in nearly every loop:
"Walk me through the architecture — draw it." Then: "What was the hardest bug, and how did you find it?" Then: "What would you change if you rebuilt it?" Then: "How much data, how often, and what did it cost to run?" And finally some version of: "Which parts did you build versus your teammates?" — the integrity check. Vague answers to the volume-and-cost question are the most common giveaway that a project is a tutorial wearing a resume's clothes; real builders know their numbers, even small ones. A small honest project defended in depth beats an impressive-sounding one you can only describe at brochure level.
This round is also where the offer is shaped. When salary comes up, the difference between accepting the first number and negotiating competently is routinely ₹1–3 LPA — we've covered the market rates to anchor against in the salary guide.
What's conspicuously absent from real loops
Worth saying because prep time is finite: almost nobody asks you to define the V's of big data, recite Hadoop daemon names, compare every NoSQL database, or write a red-black tree. The hundred-question listicles weight heavily toward exactly this dead material. If your prep plan mirrors the four rounds above — SQL reps, narrated coding, the design framework, and a defensible project — you're preparing for the interview that actually happens. Building projects worth defending is the slow part; that's the gap a good course can compress, and weekly code review is the mechanism — it's the core of how our data engineering course runs, and the feature we'd tell you to demand from any course you compare.
Practice the loop before the loop
Our batches run mock interviews with recorded feedback for every round above — including the project defense.