
The lost art of data engineering 3: Where data lies

  • Writer: Ara Islam
  • 4 min read

This article explores the evolution of data storage, from the early days of punch cards to the rise of data warehouses, lakes, and the modern lakehouse. Ara traces how each stage shaped today’s data landscape and explains how the lakehouse unites flexibility, governance, and trust in a single architecture.


The evolution of data storage


The explosion of data has demanded a continuous evolution in how data is stored. Going back as far as we can, to "prehistoric" times, data was stored in manual records, punch cards and magnetic tapes.


Then came the 1980s, when Bill Inmon formalised the data warehouse concept, giving structure to chaos. Inmon wrote the script and IBM built the stage. Data warehousing centralised data from different systems in a standard format, allowing businesses to draw insight from data that had previously been disconnected and siloed.


But data warehousing came with its own limitations. The data needed to behave like a polished gentleman: it had to fit neatly into predefined schemas, which made warehouses expensive for handling unstructured and semi-structured data such as logs, images and JSON.


As organisations began to collect more and more data, this limitation was exacerbated, and they demanded cheaper, more flexible storage solutions. In the early 2010s, technologies like Apache Hadoop and services like Amazon S3 popularised the data lake: a centralised repository that stores data cheaply in its native format until it is needed.


The rise of the data lakehouse


Thus, we enter the next chapter of storage technologies. Organisations needed the flexibility of a data lake combined with the reliability and structure of a data warehouse. This is where the data lakehouse enters the scene. The data lakehouse, pioneered by Databricks, combines the best of both worlds: it allows raw and structured data to live side by side while ensuring governance, schema enforcement and transactional consistency.


It is worth noting that the data lakehouse is a data architecture rather than a stand-alone service like Amazon S3. It builds on open table formats such as Delta Lake, Apache Iceberg and Apache Hudi.


Let's run through a quick example by empowering a life science organisation with the data lakehouse architecture. The organisation has genome data siloed in an S3 data lake, clinical trial data in SQL Server and lab experiments in CSV files. Scientists struggle to join these datasets, and machine learning teams face long waits to prepare consistent inputs, leading to slower analytics, low trust in the data and, ultimately, delayed time to market.


With the data lakehouse, the organisation now ingests the raw files as they are but converts them to the Delta format. This instantly adds ACID (atomicity, consistency, isolation and durability) guarantees, giving scientists reliability when publishing results to the same folder. The Delta format also maintains a transaction log, which allows teams to audit changes or even roll data back to previous versions.
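To make this concrete, here is a minimal PySpark sketch of the idea, assuming a Spark session with Delta Lake configured (as on Databricks) and hypothetical lakehouse paths:

```python
# Minimal sketch: assumes an existing `spark` session with Delta Lake enabled
# (e.g. on Databricks); the paths below are hypothetical.
from delta.tables import DeltaTable

raw_path = "/landing/lab_experiments"       # hypothetical landing zone (CSV)
delta_path = "/lakehouse/lab_experiments"   # hypothetical Delta location

# Ingest the raw files as they are and convert them to the Delta format.
(spark.read.format("csv").option("header", "true").load(raw_path)
      .write.format("delta").mode("append").save(delta_path))

# Every write is an ACID transaction recorded in the Delta transaction log.
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.history().select("version", "timestamp", "operation").show()

# Time travel: audit what the data looked like at an earlier version...
v0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)

# ...or roll the table back entirely if a bad batch was published.
delta_table.restoreToVersion(0)
```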


The shift from ETL to ELT and the medallion architecture


Another major shift that came with the cloud was in how data is moved and transformed — the shift from ETL to ELT. Traditionally, in ETL (Extract → Transform → Load), data was cleaned and reshaped before being loaded into a warehouse. This worked well when compute power was expensive and storage limited.


In the modern cloud world, we first bring all the raw data into the lakehouse and then transform it inside scalable cloud compute engines. This not only simplifies the process but also preserves the raw data for future use. If an organisation has a governance layer like Microsoft Purview or Unity Catalog, all of its data assets can be discovered easily in a centralised catalog.
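As a rough illustration of the ELT pattern (the paths, tables and columns here are purely hypothetical), the raw data is landed first, untouched, and only reshaped afterwards inside the lakehouse's own compute:

```python
# ELT sketch: assumes an existing Delta-enabled `spark` session and that the
# `raw` and `curated` schemas already exist; names and columns are hypothetical.

# 1. Load: bring the raw JSON into the lakehouse exactly as it arrives.
(spark.read.json("/landing/clinical_trials/")
      .write.format("delta").mode("append")
      .saveAsTable("raw.clinical_trials"))

# 2. Transform: reshape it afterwards, inside scalable cloud compute, while
#    the raw table stays available for future use cases.
spark.sql("""
    CREATE OR REPLACE TABLE curated.trial_outcomes
    USING DELTA AS
    SELECT trial_id,
           patient_id,
           CAST(outcome_score AS DOUBLE) AS outcome_score
    FROM raw.clinical_trials
    WHERE outcome_score IS NOT NULL
""")
```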


Tools like Azure Data Factory, Databricks Jobs and dbt (data build tool) have made it easier than ever to orchestrate these data pipelines. A simple way to structure an orchestrated pipeline is to follow the medallion architecture, as sketched below. You start in the bronze layer, where your raw, unfiltered data lands exactly as it is, perhaps with the addition of audit columns. In the silver layer you clean the data and apply filtering. Finally, the gold layer holds your polished data, with all of the aggregations, purpose-built for your use case.
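A simplified PySpark sketch of the first two layers might look like this (again assuming a Delta-enabled Spark session; the paths, tables and column names are made up for illustration):

```python
from pyspark.sql import functions as F

# Bronze: raw, unfiltered data exactly as it is, plus a couple of audit columns.
bronze = (spark.read.format("csv").option("header", "true")
               .load("/landing/experiments/")                  # hypothetical path
               .withColumn("_ingested_at", F.current_timestamp())
               .withColumn("_source_file", F.input_file_name()))
bronze.write.format("delta").mode("append").saveAsTable("bronze.experiments")

# Silver: light cleaning and filtering on top of bronze.
silver = (spark.table("bronze.experiments")
               .dropDuplicates(["experiment_id"])               # hypothetical key
               .withColumn("measured_at", F.to_timestamp("measured_at"))
               .filter(F.col("experiment_id").isNotNull()))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.experiments")
```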


In our life science example, all experiment data is collected as-is in the bronze layer. Suppose an experiment produced invalid results due to incorrect equipment calibration. In the silver layer, business logic adds a flag, is_valid_experiment = False. These records remain stored and traceable but are excluded from the gold layer, where only valid experiments are aggregated and used to train machine-learning models.
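Continuing the sketch above, the silver and gold steps for this scenario could look roughly like the following (the calibration, response and grouping columns are illustrative rather than taken from a real schema):

```python
from pyspark.sql import functions as F

# Silver: flag experiments whose equipment was not correctly calibrated,
# rather than deleting them, so the records remain stored and traceable.
silver = (spark.table("silver.experiments")
               .withColumn("is_valid_experiment",
                           F.col("calibration_status") == "calibrated"))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.experiments_flagged")

# Gold: only valid experiments feed the aggregate used to train ML models.
gold = (spark.table("silver.experiments_flagged")
             .filter(F.col("is_valid_experiment"))
             .groupBy("compound_id")                            # hypothetical grouping
             .agg(F.avg("response").alias("avg_response"),
                  F.count("*").alias("n_experiments")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.experiment_features")
```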


This approach maintains full lineage and auditability so that nothing is lost and only trusted, high-quality data informs critical decisions. That’s the real power of the modern data lakehouse: uniting flexibility, governance, and trust in a single architecture built for today’s data-driven world.


A timeline of the evolution of data storage and architecture: manual records in the 1960s, data warehouses in the 1980s, data lakes in the 2010s and data lakehouses in the 2020s.


Contact information

If you have any questions about our Data Engineering services, or you want to find out more about other services we provide at Solirius Reply, please get in touch.
