Case Study: From Spreadsheets to a Scalable Open-Source Data Lake
How a global research organization modernized its data workflows to enable reproducible, collaborative research at scale
Client name withheld for privacy and confidentiality.
Overview
Client Details
Our client is a specialized office within a major international intergovernmental organization. The office is uniquely dedicated to advancing the priorities of a specific global region and strengthening international support for that region’s long-term development.
Its core functions include advisory services, advocacy, inter-agency coordination, and performance monitoring. The office works across a large, complex institutional system to ensure coherence in efforts related to peace, security, and sustainable development.
Before working with Database Tycoon, this international research and policy advisory team lacked a reliable way to share datasets or reproduce prior work. Data lived across CSVs and Excel files, making collaboration slow, inconsistent, and difficult to scale.
There was no standardized pathway for onboarding new developers, integrating new datasets, or ensuring research outputs were reproducible across contributors.
Database Tycoon designed and delivered a fully serverless, open-source data lake using S3, DuckDB, SQLMesh, and Ibis, giving the team a scalable foundation for reproducible research and long-term collaboration. The infrastructure was designed to extend beyond the initial office and support broader institutional adoption.
The Stack
S3, DuckDB, SQLMesh, Ibis
Client Challenge
This global research organization needed to improve how datasets were shared, transformed, and reproduced across distributed research teams.
Their goal was to prototype a platform that could scale across regions while remaining fully open-source and cost-efficient. However, existing workflows made this difficult.
Researchers were repeatedly downloading public datasets manually, transforming data in spreadsheets, and recreating the same logic across projects without version control or reproducibility safeguards.
The team lacked:
A shared transformation layer
Dataset versioning
A reliable way to reproduce results across contributors
Without structured infrastructure, collaboration was slow, inconsistent, and difficult to scale.
Our Approach
Database Tycoon partnered with the team to design a lightweight, fully serverless data foundation that addressed reproducibility, collaboration, and scalability without introducing vendor lock-in or operational overhead. We architected a future-proof data platform that met strict technical and governance requirements while enabling modern analytics workflows.
Rather than adding complex infrastructure, the solution centered on composable, Python-first tools that researchers could understand, extend, and maintain independently. The result was an open-source architecture that provided a shared transformation layer, dataset versioning, and a reliable pathway for reproducible research, all within a cost-efficient, scalable framework designed for long-term institutional growth.
- We designed a fully serverless data lake using cloud object storage as the foundation, DuckDB for querying Parquet files directly in storage, and SQLMesh to orchestrate transformations without persistent compute.
- Transformations were implemented using a Python-based modeling layer, enabling portable logic without vendor-specific SQL dependencies.
- Storage-level versioning ensured historical datasets remained accessible, allowing prior analyses to be reproduced without reprocessing or overwriting data.
- We implemented role-based cloud authentication to enable secure, auditable access across environments without long-term credentials.
- Instead of traditional metadata catalogs, we implemented a lightweight persistence pattern to keep the platform simple, portable, and cost-efficient.
“Our goal was to make research reproducible and accessible without depending on expensive vendors. This architecture proved it could be done with open-source tools, and it lays the groundwork for a scalable, community-driven platform.”
Client Results
- Reproducible Research at Scale: Teams can now publish datasets with confidence, knowing prior versions remain accessible and analyses can be reliably reproduced.
- Reduced Research Overhead: Reusable transformations replaced manual spreadsheet workflows, eliminating duplicated effort and accelerating analysis.
- Foundation for Global Collaboration: The platform provides a scalable foundation for expanding participation across contributors without increasing infrastructure complexity.
- Reduced Operational Burden: With ingestion failures resolved and legacy technical debt addressed, dashboards are now maintainable with minimal effort, allowing analysts to focus on insights instead of firefighting data issues.
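The reproducibility result depends on one idea: publishing a dataset never overwrites an earlier version, so a past analysis can pin the exact version it read. The sketch below is a toy in-memory model of that behavior, not an object storage client; the class `VersionedStore`, its methods, and the sample data are all hypothetical.

```python
from typing import Dict, List, Optional


class VersionedStore:
    """Toy in-memory model of a versioned object store (hypothetical)."""

    def __init__(self) -> None:
        # Maps each key to the full history of its contents.
        self._objects: Dict[str, List[bytes]] = {}

    def put(self, key: str, data: bytes) -> int:
        """Append a new version and return its ID; earlier versions are kept."""
        versions = self._objects.setdefault(key, [])
        versions.append(data)
        return len(versions) - 1

    def get(self, key: str, version_id: Optional[int] = None) -> bytes:
        """Return the pinned version, or the latest when none is given."""
        versions = self._objects[key]
        return versions[-1] if version_id is None else versions[version_id]


store = VersionedStore()
v0 = store.put("indicators.csv", b"country,value\nA,1\n")
v1 = store.put("indicators.csv", b"country,value\nA,2\n")  # a later revision

latest = store.get("indicators.csv")                 # newest data
pinned = store.get("indicators.csv", version_id=v0)  # what an older analysis saw
```

An analysis that records the version ID alongside its results can be rerun later against identical inputs, even after the dataset has been revised.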