Tech Startup · Expert-built kit

Data Engineer

Builds data pipelines, develops transformation models, and implements quality tests across the warehouse.

Calibrated for the level you’re hiring

What’s inside this kit

21 Competency interview questions
13 Attitude interview questions
8 Resume screening criteria
3 Video screening prompts
1 Hands-on work simulation
1 Presentation prompt
Progression framework, Junior–Principal
Ready-to-use job description

Why this role is hard · Ryan Mahoney

Hiring at this level is hard because you need someone who takes ownership rather than just closing tickets. Most candidates can write code, but few can design a system that survives real traffic without breaking or needing constant fixes. Look for curiosity: evidence that they dug into a failure instead of hiding it, and that they understand how their models affect downstream analytics and costs. The role is less about knowing every tool than about knowing why a tool fits the problem, because the gap between writing a script and owning a pipeline is wide. You need people who treat data quality as a product feature rather than an afterthought, which means shifting focus from task completion to system reliability.

Core Evaluation

Critical questions for this role

The competency and attitude questions below are where the hiring decision is made. They run in the live interview rounds and are calibrated to the level selected above.

21 Competency Questions

1 of 21

Discipline
Data Platform & Infrastructure
Job requirement
Data Orchestration
Design workflow dependencies, implement retry logic, and optimize scheduling for resource efficiency.
Expected at Mid
2 / 5
While orchestration is valuable, mid-level engineers typically work within templated or established frameworks rather than architecting resilient systems from scratch. Guided proficiency lets them design dependencies and implement retry logic under senior oversight, balancing growth opportunities against the risk of cascade failures or wasted compute from inefficient scheduling. This supports reliability without overcomplicating their core delivery scope.

Interview round: Hiring Manager Technical

Describe a workflow where you managed dependencies between multiple data tasks.

Positive indicators

Uses directed acyclic graph concepts
Implements alerting on workflow failures
Plans for catch-up processing
Ensures idempotent task design
Documents dependency logic

Negative indicators

Relies on manual task triggering
No handling for task failures
Hard-codes execution delays
Ignores resource contention
No visibility into workflow status
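
As a reference point for interviewers, the positive indicators above (dependency graph, retries, idempotent tasks) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the API of any particular orchestrator. The names `run_with_retries`, `run_workflow`, and the three pipeline tasks are hypothetical.

```python
import time
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_with_retries(task, max_attempts=3, base_delay=0.0):
    """Call task() until it succeeds or attempts run out, backing off between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # surface the failure so alerting can pick it up
            time.sleep(base_delay * 2 ** (attempt - 1))


def run_workflow(tasks, dependencies, max_attempts=3):
    """Run tasks in dependency order; return the order in which they completed.

    tasks:        {name: zero-argument callable}
    dependencies: {name: set of names that must finish first}
    """
    # Include every task, even those with no dependencies
    graph = {name: dependencies.get(name, set()) for name in tasks}
    completed = []
    # static_order() raises graphlib.CycleError if the graph is not acyclic
    for name in TopologicalSorter(graph).static_order():
        run_with_retries(tasks[name], max_attempts=max_attempts)
        completed.append(name)
    return completed


# Hypothetical three-step pipeline. Each task overwrites its own key in `store`,
# so a retry or a catch-up rerun leaves the same result (idempotent).
store = {}
attempts = {"load": 0}


def extract():
    store["raw"] = [1, 2, 3]


def load():
    attempts["load"] += 1
    if attempts["load"] == 1:
        raise RuntimeError("transient failure")  # simulated flaky first attempt
    store["loaded"] = list(store["raw"])


def report():
    store["total"] = sum(store["loaded"])


order = run_workflow(
    {"extract": extract, "load": load, "report": report},
    {"load": {"extract"}, "report": {"load"}},
)
```

A strong answer would typically go further than this sketch, for example by adding alerting when retries are exhausted and by explaining how missed runs are caught up.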

13 Attitude Questions

1 of 13

Accountability Mindset

The consistent willingness to accept full responsibility for the reliability, integrity, and outcomes of data infrastructure and pipelines, proactively addressing errors and ensuring trustworthiness of data assets without shifting blame.

Interview round: Hiring Manager Technical

How do you approach ownership when an incident occurs outside of your direct working hours?

Positive indicators

Mentions supporting on-call person
Follows up next day
Respects rotation

Negative indicators

Ignores incident completely
Takes over without being asked
Blames on-call person for issues

Supporting Evaluation

How candidates earn the selection conversation

The goal is to reduce effort for everyone by collecting more useful signal before adding more interviews. Lightweight application prompts and structured screens help the panel focus live time on the candidates most likely to succeed.

Stage 1 · Application

Filter at the door

Runs the moment a candidate hits Submit. Disqualifying answers end the application; everything else is captured for review.

Video-Response Questions

1 of 3

Application Screen: Video Response

Imagine a Product Manager is pushing to instrument high-volume events rapidly to meet a sprint deadline, but your analysis shows this will compromise data quality and require significant rework later. Describe how you would communicate the technical constraints and propose a feasible alternative during a meeting with them.

Candidate experience

1. Record
2. Review
3. Submit

Response time

2 min

Format

Recorded video

Stage 2 · Resume Screening

Read the resume against fixed criteria

Reviewers score every application that clears the door against the same criteria. Higher-scoring applications advance to live interviews; lower-scoring ones are archived without further screening.

Resume Review Criteria

8 criteria

Independent ownership of designing, deploying, and maintaining production data workflows with automated testing, deployment pipelines, and scheduling.

Development of structured data models and self-serve analytics layers that translate raw data into business-ready datasets for analysts and product teams.

Provisioning, configuring, and optimizing cloud data warehouse resources using infrastructure-as-code tools, with attention to cost and performance.

Collaboration with product and business stakeholders to design event tracking, translate business logic into data pipelines, and align deliverables with product roadmaps.

Does the resume show relevant prior work experience?

Does the cover letter or personal statement convey clear relevance and familiarity with the job?

Does the resume indicate required academic credentials, relevant certifications, or necessary training?

Is the resume complete, well-organized, and free from formatting, spelling, and grammar mistakes?

Stage 3 · During Interviews

Where the hire is decided

Interview rounds use the competency and attitude questions outlined above, then add tests, work simulations, and presentations that reveal deeper evidence about how the candidate thinks and works.

Coding Test

1 of 2

Live Interview · Coding Test

Without AI

Write a Python function that checks if today's data partition exists and contains expected row counts. Return a status dict.
Implement a validation function `check_pipeline_health(partition_date, expected_min_rows)` that queries a mock data warehouse API and returns a dict with 'status', 'actual_rows', and 'message'. Handle missing partitions and low row counts gracefully.

With AI

Use AI to draft the validation function, then critically review its error handling and return contract. Improve it for production use.
Draft a pipeline health check function using AI. Critically review its error handling and return structure. Refactor it to clearly distinguish between 'missing data', 'low volume', and 'API failure' states.

Response time

20 min

Positive indicators

Clear separation of API call and validation logic
Explicit handling of missing data and threshold breaches
Returns structured, actionable status
Identifies AI's tendency to oversimplify error states
Adds explicit state differentiation
Improves return contract for downstream consumers

Negative indicators

Hardcoded thresholds
No error handling for API failures
Returns ambiguous boolean instead of structured status
Accepts generic AI error handling
Fails to distinguish failure modes
Cannot explain state transitions
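
For calibration, one possible shape of a passing answer is sketched below. It is an illustration under stated assumptions, not a model solution. `MockWarehouseClient`, `WarehouseAPIError`, `get_row_count`, and the status strings are hypothetical, and the `client` argument is an addition to the two-parameter signature in the prompt so that the API call can be swapped out in tests.

```python
class WarehouseAPIError(Exception):
    """Raised by the mock client when the warehouse cannot be reached."""


class MockWarehouseClient:
    """Hypothetical stand-in for a warehouse API: maps partition date -> row count."""

    def __init__(self, partitions, fail=False):
        self._partitions = partitions
        self._fail = fail

    def get_row_count(self, partition_date):
        """Return the row count, or None if the partition does not exist."""
        if self._fail:
            raise WarehouseAPIError("warehouse unavailable")
        return self._partitions.get(partition_date)


def _result(status, actual_rows, message):
    """Build the return contract in one place so every branch has the same keys."""
    return {"status": status, "actual_rows": actual_rows, "message": message}


def check_pipeline_health(partition_date, expected_min_rows, client):
    """Classify a partition as ok, missing_data, low_volume, or api_failure."""
    # Keep the API call separate from the validation logic below
    try:
        actual_rows = client.get_row_count(partition_date)
    except WarehouseAPIError as exc:
        return _result("api_failure", None, f"Could not query {partition_date}: {exc}")

    if actual_rows is None:
        return _result("missing_data", None, f"Partition {partition_date} not found")
    if actual_rows < expected_min_rows:
        return _result(
            "low_volume",
            actual_rows,
            f"Partition {partition_date} has {actual_rows} rows, "
            f"expected at least {expected_min_rows}",
        )
    return _result("ok", actual_rows, f"Partition {partition_date} looks healthy")


# Example usage with hypothetical data
client = MockWarehouseClient({"2024-03-01": 1200, "2024-03-02": 40})
healthy = check_pipeline_health("2024-03-01", 1000, client)
low = check_pipeline_health("2024-03-02", 1000, client)
missing = check_pipeline_health("2024-03-03", 1000, client)
failed = check_pipeline_health("2024-03-01", 1000, MockWarehouseClient({}, fail=True))
```

The point to probe in the interview is that a missing partition, a low row count, and an unreachable API each produce a distinct status, so a downstream consumer can react differently to each.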

Presentation Prompt

Prepare a short deck walking us through a past data modeling project you owned or significantly contributed to. Discuss how you translated ambiguous business requirements into a dimensional model, the trade-offs you made between normalization and query performance, and how you handled schema evolution over time.

Format

Deck and walkthrough · 20 min · ~2 hr prep

Audience

Senior Data Engineer and Analytics Lead

What to prepare

3-5 slides summarizing a real-world data modeling initiative from past work or a significant academic/side project.
Include context, your approach, key architectural decisions, and measurable outcomes or lessons learned.

Deliverables

A 3-5 slide presentation deck
A structured verbal walkthrough focusing on your decision-making process

Ground rules

Use only work you are permitted to share. Redact sensitive company data, metrics, or proprietary schemas.
Focus on your reasoning and stakeholder collaboration, not on delivering a polished final artifact.
Do not create new models for this exercise; reflect on past experience.

Scoring anchors

Exceeds: Demonstrates deep understanding of modeling patterns, clearly links technical choices to business outcomes, anticipates future scalability needs, and shows strong stakeholder collaboration and documentation practices.
Meets: Presents a coherent model with clear rationale, addresses basic performance and evolution concerns, and shows adequate stakeholder alignment and iterative improvement.
Below: Struggles to explain the 'why' behind modeling decisions, ignores downstream query patterns or performance impacts, or fails to acknowledge trade-offs and stakeholder feedback.

Response time

20 min

Positive indicators

Clearly articulates initial business requirements and maps them to specific technical schema decisions.
Explains trade-offs (e.g., star vs snowflake, partitioning strategy, denormalization) with data-backed reasoning.
Demonstrates iterative refinement based on stakeholder feedback and changing business needs.
Proactively addresses backward compatibility, data lineage, and downstream query patterns.

Negative indicators

Presents a model without explaining the underlying business problem or user personas.
Glosses over performance or maintenance trade-offs, treating the chosen approach as universally optimal.
Cannot defend design choices when questioned about alternative patterns or scalability limits.
Treats stakeholder input as an afterthought rather than a core requirement driver throughout the lifecycle.

Work Simulation Scenario

Scenario. A critical dbt model that powers the company's core subscription revenue and churn dashboards has started failing SLAs and causing massive BigQuery cost spikes. You are tasked with diagnosing the root cause and designing a performance optimization plan. The model joins multiple large fact tables, uses several CTEs, and runs daily. Downstream analysts are complaining about stale data, and finance needs the numbers by 9 AM daily. Walk us through your investigation and optimization strategy.

Problem to solve. Identify the performance bottleneck in a critical daily pipeline and design a cost-effective optimization strategy that restores SLA compliance without breaking downstream dashboards.

Format

Discovery interview · 35 min · ~2 hr prep

Success criteria

Systematically diagnoses query performance issues using available tooling
Evaluates tradeoffs between query refactoring, partitioning/clustering, and compute scaling
Maintains backward compatibility for existing Looker dashboards
Communicates technical constraints clearly to non-technical stakeholders

What to review beforehand

BigQuery query execution plans and cost drivers
Common dbt performance anti-patterns and optimization techniques
Overview of current data warehouse architecture and SLA definitions

Ground rules

You will ask clarifying questions to shape your diagnostic approach
The interviewer will answer honestly but will not volunteer information or coach
Focus on your reasoning, tradeoffs, and decision process
No need to produce a final optimization script

Roles in scenario

Senior Data Platform Lead (informed partner, played by the hiring manager)

Motivation. Restore 9 AM finance SLA, reduce warehouse spend, and prevent future cost spikes.

Constraints

Cannot alter source system schemas or ingestion pipelines
Limited budget for additional compute slots
Must maintain exact backward compatibility for existing Looker dashboards

Tensions to introduce

Quick fix (e.g., increasing slots) vs long-term refactor (e.g., incremental models)
Cost optimization requires breaking changes that downstream teams resist
Finance demands immediate accuracy while engineering needs time to implement fixes

In-character guidance

Provide current query plan details, execution times, and cost metrics only when asked
Explain downstream dashboard dependencies and stakeholder pressure honestly
Acknowledge budget and timeline constraints without steering the solution

Do not

Do not suggest specific partitioning, clustering, or materialization strategies unless asked
Do not hand over a finished optimization plan or query rewrite
Do not resolve the stakeholder pressure for the candidate

Scoring anchors

Exceeds: Systematically isolates the root cause using execution metrics, designs a phased optimization plan balancing cost and compatibility, and proactively communicates risk and timelines to stakeholders.
Meets: Identifies likely performance bottlenecks, proposes reasonable optimization techniques, and acknowledges downstream constraints and cost implications.
Below: Jumps to expensive or disruptive solutions without diagnostics, ignores compatibility risks, or struggles to articulate a structured troubleshooting approach.

Response time

35 min

Positive indicators

Asks targeted questions about query execution plans, table sizes, and join patterns before proposing fixes
Evaluates multiple optimization paths (refactoring, partitioning, compute scaling) with clear tradeoff analysis
Surfaces assumptions about downstream dashboard compatibility and proposes safe migration steps
Communicates technical constraints and ETA clearly to business stakeholders

Negative indicators

Guesses at the bottleneck without requesting execution metrics or query plans
Proposes high-cost compute scaling as the only solution without exploring query refactoring
Ignores backward compatibility risks when suggesting schema or model changes
Freezes under pressure from finance deadlines or defaults to vague troubleshooting steps
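
To make the "quick fix vs long-term refactor" tension concrete, the sketch below shows the idea behind an incremental build: each run reads only the partitions newer than what the model already contains, plus a short lookback for late-arriving rows. It is plain Python rather than dbt or BigQuery code, and `partitions_to_scan`, the 90-day table, and the one-day lookback are hypothetical.

```python
from datetime import date, timedelta


def partitions_to_scan(available, last_processed, lookback_days=1):
    """Return the partitions an incremental run would read.

    available:      iterable of partition dates present in the source table
    last_processed: latest partition date already built into the model, or None
    lookback_days:  extra days to reprocess so late-arriving rows are picked up
    """
    if last_processed is None:
        return sorted(available)  # first run: full refresh
    cutoff = last_processed - timedelta(days=lookback_days)
    return sorted(d for d in available if d > cutoff)


# Hypothetical source table with 90 daily partitions (2024-01-01 to 2024-03-30)
start = date(2024, 1, 1)
available = [start + timedelta(days=i) for i in range(90)]

full_refresh = partitions_to_scan(available, last_processed=None)
incremental = partitions_to_scan(available, last_processed=date(2024, 3, 29))
```

With these inputs the full refresh reads all 90 partitions while the incremental run reads two. Reprocessing the lookback window is only safe if the build overwrites those partitions rather than appending to them, which is a useful follow-up question for the candidate.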

Progression Framework

This table shows how competencies evolve across experience levels. Each cell describes what the competency looks like at that level.

Data Platform & Infrastructure

4 competencies

Competency	Junior	Mid	Senior	Principal
Data Orchestration	Configure basic workflow schedules, monitor job execution, and respond to alerts under guidance.	Design workflow dependencies, implement retry logic, and optimize scheduling for resource efficiency.	Build resilient orchestration frameworks, implement cross-pipeline dependencies, and establish SLA monitoring.	Define orchestration strategy across the organization, evaluate platform alternatives, and drive automation maturity.
Data Pipeline Development	Execute predefined pipeline tasks under supervision, write basic SQL transformations, and troubleshoot common pipeline failures.	Design and implement moderately complex pipelines independently, optimize query performance, and establish monitoring for data flows.	Architect scalable pipeline solutions, mentor junior engineers on best practices, and drive pipeline standardization across teams.	Define enterprise pipeline strategy, evaluate emerging technologies, and establish organization-wide data engineering standards.
Performance Optimization	Identify slow-running queries, apply basic indexing strategies, and follow optimization guidelines.	Analyze execution plans, implement partitioning strategies, and optimize resource utilization.	Lead performance audits, design caching strategies, and establish performance baselines across systems.	Define performance standards, evaluate infrastructure investments, and drive optimization culture organization-wide.
Streaming Architecture	Consume streaming data using predefined patterns, monitor stream health, and handle basic stream failures.	Build streaming pipelines, implement windowing operations, and manage stateful stream processing.	Architect streaming platforms, ensure exactly-once processing guarantees, and optimize stream throughput.	Define streaming strategy, evaluate real-time technologies, and establish event-driven architecture standards.

Data Quality, Governance & Analytics

6 competencies

Competency	Junior	Mid	Senior	Principal
Data Governance	Document data lineage, maintain metadata catalogs, and follow access control procedures.	Implement governance workflows, manage data classifications, and ensure policy compliance.	Design governance frameworks, lead compliance audits, and establish data ownership models.	Define governance strategy, align with regulatory requirements, and drive data culture transformation.
Data Modeling	Create basic dimensional models, follow established modeling conventions, and document schema changes.	Design star/snowflake schemas, normalize/denormalize appropriately, and optimize for query patterns.	Define modeling standards, lead data architecture reviews, and balance competing stakeholder requirements.	Establish enterprise data modeling strategy, drive data mesh/fabric initiatives, and align models with business strategy.
Data Quality Testing	Write basic data quality tests, execute validation scripts, and report quality issues.	Design comprehensive test suites, implement automated quality gates, and establish quality metrics.	Define quality frameworks, lead quality incident response, and establish data quality SLAs.	Set enterprise quality standards, integrate quality into data culture, and drive continuous improvement.
ML Feature Engineering	Create basic features from existing data, follow feature engineering guidelines, and document feature definitions.	Design feature transformations, implement feature validation, and manage feature versioning.	Architect feature stores, establish feature quality standards, and enable ML team collaboration.	Define feature platform strategy, integrate with MLOps pipelines, and drive feature reuse across teams.
Reverse ETL	Configure basic sync jobs, monitor data delivery, and troubleshoot common sync failures.	Design sync workflows, implement transformation logic, and ensure data consistency across systems.	Architect activation platforms, establish sync patterns, and optimize for operational system constraints.	Define activation strategy, evaluate sync technologies, and drive operational data integration standards.
Team Enablement	Create basic documentation, respond to data requests, and support team members with data access.	Develop self-service tools, conduct training sessions, and establish knowledge sharing practices.	Design enablement programs, mentor team members, and drive adoption of data best practices.	Define enablement strategy, scale knowledge transfer, and build data literacy across organization.
