Case Study:

From Spreadsheets to a Scalable Open-Source Data Lake

How a global research organization modernized its data workflows to enable reproducible, collaborative research at scale

Client name withheld for privacy and confidentiality.


Overview

Client Details

Our client is a specialized office within a major international intergovernmental organization. The office is uniquely dedicated to advancing the priorities of a specific global region and strengthening international support for that region’s long-term development.

Its core functions include advisory services, advocacy, inter-agency coordination, and performance monitoring. The office works across a large, complex institutional system to ensure coherence in efforts related to peace, security, and sustainable development.

Before working with Database Tycoon, this international research and policy advisory team lacked a reliable way to share datasets or reproduce prior work. Data lived across CSVs and Excel files, making collaboration slow, inconsistent, and difficult to scale.

There was no standardized pathway for onboarding new developers, integrating new datasets, or ensuring research outputs were reproducible across contributors.

Database Tycoon designed and delivered a fully serverless, open-source data lake using S3, DuckDB, SQLMesh, and Ibis, giving the team a scalable foundation for reproducible research and long-term collaboration. The infrastructure was built to extend beyond the initial office and support broader institutional adoption.

The Stack

S3, DuckDB, SQLMesh, and Ibis

Client Challenge

This global research organization needed to improve how datasets were shared, transformed, and reproduced across distributed research teams.

Their goal was to prototype a platform that could scale across regions while remaining fully open-source and cost-efficient. However, existing workflows made this difficult.

Researchers were repeatedly downloading public datasets manually, transforming data in spreadsheets, and recreating the same logic across projects without version control or reproducibility safeguards.

The team lacked:

A shared transformation layer
Dataset versioning
A reliable way to reproduce results across contributors

Without structured infrastructure, collaboration was slow, inconsistent, and difficult to scale.

Our Approach

Database Tycoon partnered with the team to design a lightweight, fully serverless data foundation that addressed reproducibility, collaboration, and scalability without introducing vendor lock-in or operational overhead. The platform met strict technical and governance requirements while supporting modern analytics workflows.

Rather than adding complex infrastructure, the solution centered on composable, Python-first tools that researchers could understand, extend, and maintain independently. The result was an open-source architecture that provided a shared transformation layer, dataset versioning, and a reliable pathway for reproducible research, all within a cost-efficient, scalable framework designed for long-term institutional growth.

We designed a fully serverless data lake using cloud object storage as the foundation, DuckDB for querying Parquet files directly in storage, and SQLMesh to orchestrate transformations without persistent compute.
Transformations were implemented using a Python-based modeling layer, enabling portable logic without vendor-specific SQL dependencies.
Storage-level versioning ensured historical datasets remained accessible, allowing prior analyses to be reproduced without reprocessing or overwriting data.
We implemented role-based cloud authentication to enable secure, auditable access across environments without long-lived credentials.
Instead of traditional metadata catalogs, we implemented a lightweight persistence pattern to keep the platform simple, portable, and cost-efficient.
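
To make the storage-level versioning idea concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the project's code and not the S3 API: the class VersionedStore, the function total_population, the file name regions.csv, and the sample figures are all hypothetical and exist only for illustration. In the architecture described above, object storage with versioning enabled would play the role of the store, and the Python-based modeling layer would play the role of the transformation.

```python
from typing import Dict, List, Optional


class VersionedStore:
    """Toy in-memory stand-in for an object store with versioning enabled.

    Hypothetical illustration only; it does not mirror any real cloud API.
    """

    def __init__(self) -> None:
        self._objects: Dict[str, List[bytes]] = {}

    def put(self, key: str, data: bytes) -> int:
        """Append a new version instead of overwriting; return its index."""
        versions = self._objects.setdefault(key, [])
        versions.append(data)
        return len(versions) - 1

    def get(self, key: str, version: Optional[int] = None) -> bytes:
        """Return the requested version, or the latest when none is given."""
        versions = self._objects[key]
        return versions[-1] if version is None else versions[version]


def total_population(csv_bytes: bytes) -> int:
    """A reusable transformation: sum the 'population' column of a small CSV."""
    lines = csv_bytes.decode("utf-8").strip().splitlines()
    column = lines[0].split(",").index("population")
    return sum(int(line.split(",")[column]) for line in lines[1:])


store = VersionedStore()

# First publication of the dataset (made-up sample data).
v0 = store.put("regions.csv", b"region,population\nA,100\nB,250\n")

# A later update adds a row and revises figures; the old version is kept.
v1 = store.put("regions.csv", b"region,population\nA,110\nB,260\nC,40\n")

# An analysis pinned to v0 reproduces its original result after the update.
original_result = total_population(store.get("regions.csv", version=v0))

# An unpinned read picks up the latest data.
latest_result = total_population(store.get("regions.csv"))

print(original_result, latest_result)
```

The point of the sketch is the pairing: because writes append rather than overwrite, an analysis that records which version it read can be rerun later and yield the same answer, while new work reads the latest data through the same transformation.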

“Our goal was to make research reproducible and accessible without depending on expensive vendors. This architecture proved it could be done with open-source tools, and it lays the groundwork for a scalable, community-driven platform.”

— Project Lead, Data Scientist & Researcher, Global Research Organization

Client Results

Reproducible Research at Scale

Teams can now publish datasets with confidence, knowing prior versions remain accessible and analyses can be reliably reproduced.

Reduced Research Overhead

Reusable transformations replaced manual spreadsheet workflows, eliminating duplicated effort and accelerating analysis.

Foundation for Global Collaboration

The platform provides a scalable foundation for expanding participation across contributors without increasing infrastructure complexity.

Reduced Operational Burden

By resolving ingestion failures and addressing legacy technical debt, dashboards are now maintainable with minimal effort, allowing analysts to focus on insights instead of firefighting data issues.

Related Resources

Debating the Data Stack: Why We Chose SQLMesh Over dbt + DuckDB
DuckDB Meets SQLMesh: Building a Lean Data Architecture For One of the World’s Leading Intergovernmental Organizations
AWS Cross-Account Authentication in a Data Pipeline
What We Delivered for Our Serverless Data Stack