
Essential Skills for ETL and ELT Developers
November 11, 2025
Data rarely arrives tidy. Formats clash, timestamps drift, and yesterday’s rules change today. The craft of an ETL or ELT developer is to turn that reality into something people can trust, every day, without drama. Titles and tools evolve, but the fundamentals stay the same: you move data from sources to a safe home, you shape it so questions have clear answers, and you keep the whole system predictable, observable, and affordable.
ETL vs ELT, what are the differences?
ETL means extract, transform, then load. ELT flips the last two steps: load first, then transform inside your warehouse or lake environment. Pick the pattern that reduces friction. When sensitive cleanup or strict validations must happen before shared storage, ETL fits. When you want faster iteration and easier backfills, ELT is often simpler. Think about governance, latency, and where the heavy work should live. Consistency is more valuable than perfection: choose once for a domain, write it down, and stick to it.
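The two orderings can be sketched as the same three steps wired in a different sequence. A minimal illustration in Python; the function names, sample row, and the email-normalization transform are ours, not any particular tool’s:

```python
# Sketch contrasting ETL and ELT over the same three steps; names illustrative.

def extract(source):
    """Pull raw rows from a source system."""
    return list(source)

def transform(rows):
    """Clean and shape rows, e.g. normalize email addresses."""
    return [{**r, "email": r["email"].strip().lower()} for r in rows]

def load(rows, destination):
    """Write rows into the destination store."""
    destination.extend(rows)
    return destination

source = [{"id": 1, "email": "  Amy@Example.COM "}]

# ETL: transform before the data reaches shared storage.
etl_dest = load(transform(extract(source)), [])

# ELT: land raw data first, then transform inside the destination.
elt_dest = load(extract(source), [])
elt_dest = transform(elt_dest)

assert etl_dest == elt_dest  # same result, different order of work
```

Either way the rows end up identical; what differs is where the cleanup runs and who can see the data before it happens.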
Once you’ve chosen your pattern, the real work begins. The skills that follow turn any approach into something reliable.
SQL and modeling, the core craft
SQL is the bread and butter of data work – it’s how you ask questions of your data and get reliable answers back. Think of it as a declarative language where you describe what you want, not how to get it. You tell the database “show me all customers who placed orders last month,” and the engine figures out the fastest path through indexes and joins. SQL lets you think in sets rather than loops, making it natural for data transformations.
Data modeling is the strategic thinking that comes before you write any SQL. It’s about deciding how to structure your data so it tells a clear story. You choose the grain of each table (one row per what?), pick stable business keys, and design dimensions and facts that match how people actually think about the business. Good modeling makes the right questions easy to ask and the wrong questions hard to accidentally answer.
Simple SQL example – finding active customers
SELECT customer_id,
       customer_name,
       email
FROM customers
WHERE status = 'active'
  AND last_order_date >= '2024-01-01'
ORDER BY last_order_date DESC;
Data modeling example – order facts and customer dimensions
-- Fact table: one row per order line
fact_order_lines:
order_id (FK)
line_id
customer_id (FK)
product_id (FK)
quantity
unit_price
extended_price
order_date
-- Dimension table: one current row per customer
dim_customers:
customer_id (PK)
customer_name
email
segment
created_date
updated_date
Moving data, batch, streaming, and change capture
This skill is about choosing how and when to move data from source systems to your data warehouse or lake. You need to understand three fundamental patterns: batch processing (moving data in scheduled chunks), streaming (moving data in real-time as events happen), and change data capture (efficiently tracking only what changed). Each pattern fits different business needs – batch for daily reports, streaming for real-time dashboards, and CDC for keeping systems in sync without overwhelming them.
Some pipelines breathe nightly, some pulse every second. Batch gives you windows and checkpoints; streaming asks you to think in events and clocks. In streaming you will decide how long to wait for late arrivals, and you will deduplicate without losing real updates. Change data capture is the bridge: you ship only what changed, and you apply those changes so the destination mirrors the source. The mindset is simple: every run should be safe to repeat, and each record should know how to find its correct home.
Example: Customer updates using three different patterns
Batch approach – nightly customer sync:
Every night at 2 AM, extract all customer records from the CRM system and load them into the warehouse. Simple but can miss intraday changes that matter for real-time personalization.
Streaming approach – real-time customer events:
Stream every customer update as it happens using Kafka. Perfect for live recommendation engines, but requires handling late arrivals and duplicate events.
Change Data Capture – efficient incremental sync:
Capture only customers that changed since the last run using database transaction logs. Efficient and near real-time, ideal for keeping multiple systems synchronized without full table scans.
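The change-data-capture pattern above can be sketched with a watermark over an update timestamp, applying changes as keyed upserts so a rerun is harmless. A minimal illustration; the table shape, column names, and integer timestamps are assumptions for the sketch:

```python
# Sketch of watermark-based incremental sync; reruns are safe because
# rows are applied as upserts keyed on a business key, not blind inserts.

def sync_changes(source_rows, target, last_watermark):
    """Apply only rows changed since last_watermark; return the new watermark."""
    changed = [r for r in source_rows if r["updated_at"] > last_watermark]
    for row in changed:
        target[row["customer_id"]] = row  # upsert by business key
    return max((r["updated_at"] for r in changed), default=last_watermark)

source = [
    {"customer_id": 1, "name": "Amy", "updated_at": 10},
    {"customer_id": 2, "name": "Ben", "updated_at": 25},
]
target = {}

wm = sync_changes(source, target, last_watermark=0)   # first run loads both rows
wm = sync_changes(source, target, last_watermark=wm)  # rerun applies nothing new

assert len(target) == 2 and wm == 25
```

Real CDC tools read the database transaction log rather than a timestamp column, but the destination-side discipline is the same: track what you have seen, and apply each change to its keyed home.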
Orchestration, testing, and observability
This skill is about making your data pipelines reliable, trustworthy, and easy to operate. Orchestration means scheduling and coordinating pipeline tasks with proper dependencies and error handling. Testing ensures your data meets quality standards before it reaches end users. Observability gives you visibility into what’s happening – tracking performance, catching failures early, and providing the information needed to fix problems quickly. Together, these practices turn fragile scripts into production-ready systems that teams can depend on.
Pipelines are promises, not just tasks on a clock. You promise arrival times, quality bars, and calm recovery when something fails. Orchestration wires those promises together with dependencies, retries, and backfills. Testing gives you confidence: not only tests of the logic, but checks on the data itself. Some columns must never be null, some values must be unique, some relationships must always match. Observability turns the system into a live dashboard: you track rows read and written, durations, failure rates, and cost. You alert when schemas drift, you capture lineage so fixes find the right owners, and you keep short runbooks so anyone on call can restore order quickly.
Example: A customer order pipeline with orchestration, testing, and observability
Orchestration:
Schedule the pipeline to run after the CRM export completes, with automatic retries on failure and notifications to the data team if all retries fail.
Testing:
Check that every order has a valid customer ID, all prices are positive numbers, and order totals match the sum of line items before promoting data to production tables.
Observability:
Track daily row counts, processing time trends, and set up alerts when order volumes drop below expected ranges or processing takes longer than usual – helping catch upstream system issues before stakeholders notice missing data.
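The testing checks described above, valid customer IDs, positive prices, totals matching line items, can be written as plain assertions that run before data is promoted. A minimal sketch with invented rows and thresholds:

```python
# Sketch of pre-promotion data quality checks; rows and fields are illustrative.

orders = [{"order_id": "A1", "total": 30.0}]
order_lines = [
    {"order_id": "A1", "customer_id": 7, "quantity": 2, "unit_price": 10.0},
    {"order_id": "A1", "customer_id": 7, "quantity": 1, "unit_price": 10.0},
]

def run_checks(orders, lines):
    """Return a list of failed checks; empty means safe to promote."""
    failures = []
    # Every line must carry a customer id.
    if any(l["customer_id"] is None for l in lines):
        failures.append("null customer_id")
    # All prices must be positive.
    if any(l["unit_price"] <= 0 for l in lines):
        failures.append("non-positive price")
    # Each order total must match the sum of its line items.
    for o in orders:
        line_sum = sum(l["quantity"] * l["unit_price"]
                       for l in lines if l["order_id"] == o["order_id"])
        if abs(line_sum - o["total"]) > 0.01:
            failures.append(f"total mismatch on {o['order_id']}")
    return failures

assert run_checks(orders, order_lines) == []  # all checks pass
```

In practice these checks live in your testing framework and gate the promotion step, so bad data stops in staging instead of reaching dashboards.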
Performance and cost, the quiet levers
This skill is about optimizing your data pipelines to run efficiently without breaking the budget. You need to balance speed and cost by making smart choices about data storage formats, query patterns, compute resources, and processing schedules. Good performance optimization means your pipelines finish on time and your data is fresh when people need it. Good cost management means you’re not paying for unused resources or inefficient operations that could be redesigned.
Speed and budget travel together. Design tables that help queries skip what they do not need. Store data in columnar formats so scans read only the columns in play. Keep files and partitions healthy in size: not too many tiny files, not giant single blobs. Avoid work that does not change results, process only new or changed records, and right-size compute for the job. A fast pipeline that burns money is not finished; a cheap pipeline that misses the morning report is not finished either. Aim for steady efficiency and make it boring to operate.
Example: Optimizing a daily sales report pipeline
Storage optimization:
Partition sales data by date and store in Parquet format so the daily report only reads one day’s partition instead of scanning the entire table.
Compute optimization:
Use a smaller cluster for the initial data extraction, then scale up only for the heavy aggregation step, and scale back down for the final report formatting.
Query optimization:
Pre-aggregate common metrics into summary tables so dashboards query small rollup tables instead of scanning millions of raw transaction records every time.
Avoid functions on partition columns, let pruning work
-- Good, allows partition pruning on order_date
SELECT COUNT(*)
FROM fact_orders
WHERE order_date >= DATE '2025-11-01'
  AND order_date < DATE '2025-12-01';
-- Risky, wrapping the column can block pruning
-- WHERE DATE(order_ts) = DATE '2025-11-15'
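The pre-aggregation idea pairs naturally with “process only new or changed records”: fold fresh transactions into an existing rollup instead of rescanning history every run. A minimal sketch; the table shape and per-day revenue metric are assumptions:

```python
# Sketch of incremental pre-aggregation: fold only new transactions
# into a per-day revenue rollup instead of rebuilding it from scratch.

from collections import defaultdict

def update_rollup(rollup, new_transactions):
    """Add new transactions into the daily revenue rollup in place."""
    for tx in new_transactions:
        rollup[tx["order_date"]] += tx["amount"]
    return rollup

rollup = defaultdict(float)

# First run processes the initial batch.
update_rollup(rollup, [
    {"order_date": "2025-11-01", "amount": 40.0},
    {"order_date": "2025-11-01", "amount": 10.0},
])

# The next run touches only the new records, not the full table.
update_rollup(rollup, [{"order_date": "2025-11-02", "amount": 25.0}])

assert rollup["2025-11-01"] == 50.0 and rollup["2025-11-02"] == 25.0
```

Dashboards then read the small rollup, and each pipeline run pays only for the new data it actually saw.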
Backfills, migrations, and the reality of change
You will reprocess history and you will change storage patterns. Plan for it. Keep a raw layer that lets you rebuild without begging upstream teams to resend. When you introduce a new model, run it beside the old one, compare row counts and business totals, then switch when both lines match. Tag runs, record inputs and outputs, and write down what you did so the next migration feels routine.
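The side-by-side comparison described above, run the new model next to the old one and switch only when both lines agree, can be reduced to a small check on row counts and business totals. A sketch with invented numbers and a tolerance we chose for illustration:

```python
# Sketch of a parallel-run check before cutting over to a new model.

def models_match(old, new, tolerance=0.01):
    """Compare row counts and a business total between two model outputs."""
    if old["row_count"] != new["row_count"]:
        return False
    return abs(old["revenue_total"] - new["revenue_total"]) <= tolerance

old_model = {"row_count": 1000, "revenue_total": 52340.75}
new_model = {"row_count": 1000, "revenue_total": 52340.75}
assert models_match(old_model, new_model)  # counts and totals agree, safe to switch

new_model_bad = {"row_count": 998, "revenue_total": 52101.00}
assert not models_match(old_model, new_model_bad)  # keep the old model live
```

In a real migration you would compare several business totals per day over the parallel-run window, not a single pair of numbers, but the gate is the same shape.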
Security and governance people can live with
Security works best when it is invisible to honest users. Protect secrets, never hardcode credentials, give teams only the access they need, and mask sensitive fields by default. Keep an audit trail of who changed what and why, and put access reviews on a real calendar. Governance is not a wall, it is a set of guardrails that lets the organization move quickly without falling.
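“Mask sensitive fields by default” can be as simple as redacting values on the way out unless the caller is explicitly entitled to see them. A minimal sketch; the field names, masking rule, and entitlement flag are assumptions for illustration:

```python
# Sketch of default-on masking for sensitive fields; names are illustrative.

SENSITIVE_FIELDS = {"email", "phone"}

def mask_value(value):
    """Keep a small hint of the value while hiding the rest."""
    s = str(value)
    return s[:2] + "***" if len(s) > 2 else "***"

def read_record(record, can_see_pii=False):
    """Return a copy of the record, masking sensitive fields by default."""
    if can_see_pii:
        return dict(record)
    return {k: (mask_value(v) if k in SENSITIVE_FIELDS else v)
            for k, v in record.items()}

row = {"customer_id": 7, "email": "amy@example.com"}
assert read_record(row)["email"] == "am***"                      # masked by default
assert read_record(row, can_see_pii=True)["email"] == row["email"]  # explicit grant
```

The point is the default: honest users who do not need PII never see it, and the ones who do must hold a grant that shows up in the audit trail.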
Common FAQs
Is ETL better than ELT?
Neither wins by default. Pick based on governance needs, iteration speed, and where your team wants to spend compute. The right choice is the one that keeps data trustworthy and teams productive.
How do I make pipelines safe to rerun?
Design each step so running it again gives the same result. Track change keys, apply updates carefully, and keep inputs that let you rebuild on demand. If a job fails in the middle, you should not fear pressing the button twice.
What should I learn first?
Start with SQL and clear modeling, add a scheduler, then add testing and observability. Learn one good way to do each, and you will be effective on any stack.
Closing thoughts
The job is not to make pipelines flashy; it is to make them reliable. Pick patterns that last, write SQL your future self will understand, and let the data tell a consistent story every day. When the system is quiet, accurate, and a little bit boring, you did it right.
Join the Team
Mandrill Tech Sdn Bhd is hiring passionate Data Engineers to help us build reliable, scalable, and intelligent data systems. If you’re excited about transforming raw data into actionable insights and love working with modern data tools, we’d love to hear from you.
📩 Email your resume to hr@mandrill.com.my
