Building a Distributed Storage Metadata Service in Go

A Deep Dive into Chunk-Based Deduplication, gRPC Streaming, and Garbage Collection


Modern distributed storage systems like Amazon S3, Google Drive, and HDFS are built on a deceptively simple idea: separate metadata from data, and manage storage in chunks instead of files.

In this project, I built a production-style metadata service in Go that mimics core concepts of distributed object storage systems:

  • Chunk-based storage (4MB fixed-size blocks)

  • Content-addressed storage using SHA-256 hashing

  • Deduplication using reference counting

  • gRPC streaming upload/download APIs

  • PostgreSQL-backed metadata layer

  • Background garbage collection system

This article breaks down the architecture, design decisions, and internal mechanics of the system.


System Overview

At a high level, the system is composed of five components:

Client
  ↓
gRPC Metadata Service (Control Plane)
  ↓
PostgreSQL (Metadata Store)
  ↓
Local Disk Storage (Chunk Store)
  ↓
Garbage Collection Worker

The system is designed to simulate how real-world storage engines separate:

  • Control plane - metadata, indexing, lifecycle

  • Data plane - actual file storage

  • Background maintenance - GC, cleanup


Core Design Goals

The system was designed with the following goals:

1. Efficient large file handling

Files are split into 4MB chunks and streamed.

2. Deduplication at the storage layer

Identical chunks are detected by their SHA-256 hash and stored only once.

3. Strong metadata consistency

PostgreSQL acts as the single source of truth.

4. Fault-tolerant deletion

Deletion is delayed and handled via garbage collection.

5. Streaming-first design

No full file buffering on the server side.


Architecture Breakdown

1. gRPC API Layer (Metadata Service)

The API layer exposes three core operations:

  • UploadObject (streaming upload)

  • DownloadObject (streaming download)

  • DeleteObject (logical deletion)

Why gRPC streaming?

Instead of uploading files in a single request, the system uses streaming:

  • Handles large files efficiently

  • Avoids memory pressure

  • Enables real-time chunk processing
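To make the memory argument concrete, here is a minimal sketch of a server-side receive loop. It is not the project's actual handler: `chunkStream` is a hypothetical stand-in for the stream type that the gRPC code generator would produce for a client-streaming `UploadObject` RPC, and `receiveUpload`, `handle`, and `sliceStream` are illustrative names. The point is that only one chunk is in memory at a time.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"io"
)

// chunkStream is a hypothetical stand-in for the generated gRPC server
// stream. Recv returns io.EOF once the client has finished sending.
type chunkStream interface {
	Recv() ([]byte, error)
}

// receiveUpload consumes the stream one chunk at a time and returns the
// SHA-256 hex digest of each chunk in arrival order. handle runs before the
// next Recv, so at most one chunk is held in memory.
func receiveUpload(stream chunkStream, handle func(hash string, data []byte) error) ([]string, error) {
	var hashes []string
	for {
		data, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			return hashes, nil // client closed the stream cleanly
		}
		if err != nil {
			return nil, err
		}
		sum := sha256.Sum256(data)
		hash := hex.EncodeToString(sum[:])
		if err := handle(hash, data); err != nil {
			return nil, err
		}
		hashes = append(hashes, hash)
	}
}

// sliceStream is an in-memory chunkStream used to exercise receiveUpload.
type sliceStream struct{ chunks [][]byte }

func (s *sliceStream) Recv() ([]byte, error) {
	if len(s.chunks) == 0 {
		return nil, io.EOF
	}
	next := s.chunks[0]
	s.chunks = s.chunks[1:]
	return next, nil
}
```

In a real service, `handle` is where the chunk would be written to disk and recorded in PostgreSQL before the next message is read.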


2. Upload Pipeline

The upload flow is the most critical part of the system.

Step-by-step flow

  1. Client streams the file in 4MB chunks

  2. Server receives a chunk

  3. SHA-256 hash is computed

  4. Chunk is written to disk (if not already present)

  5. Chunk metadata is stored in PostgreSQL

  6. Object -> chunk mapping is created
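Step 1 of the flow above, fixed-size chunking, can be sketched as follows. This is an illustration under assumptions, not the project's client code: `forEachChunk` and `splitBytes` are hypothetical helpers, and 4MB is taken to mean 4 MiB.

```go
package main

import (
	"bytes"
	"errors"
	"io"
)

// chunkSize is the fixed chunk size from the article, taken here as 4 MiB.
const chunkSize = 4 << 20

// forEachChunk reads r in pieces of at most size bytes and passes each piece
// to fn. The final piece may be shorter. The buffer is reused between calls,
// so fn must not retain data after it returns.
func forEachChunk(r io.Reader, size int, fn func(index int, data []byte) error) error {
	if size <= 0 {
		return errors.New("chunk size must be positive")
	}
	buf := make([]byte, size)
	for index := 0; ; index++ {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			if ferr := fn(index, buf[:n]); ferr != nil {
				return ferr
			}
		}
		switch {
		case err == nil:
			continue
		case errors.Is(err, io.EOF), errors.Is(err, io.ErrUnexpectedEOF):
			return nil // input exhausted; a short final chunk is normal
		default:
			return err
		}
	}
}

// splitBytes is a convenience wrapper for small in-memory inputs. It copies
// each chunk because forEachChunk reuses its buffer.
func splitBytes(data []byte, size int) ([][]byte, error) {
	var chunks [][]byte
	err := forEachChunk(bytes.NewReader(data), size, func(_ int, c []byte) error {
		chunks = append(chunks, append([]byte(nil), c...))
		return nil
	})
	return chunks, err
}
```

A client would call `forEachChunk(file, chunkSize, send)`, where `send` pushes each piece onto the upload stream.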


Key design decision: content addressing

Each chunk is stored using:

SHA256(chunk_data) → filename

This ensures:

  • Deterministic storage

  • Deduplication across all objects

  • Fast lookup without scanning disk
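As a small illustration of content addressing, the key for a chunk can be derived purely from its bytes. `chunkKey` is a hypothetical helper name, and the lowercase hex encoding is an assumption about how the hash is rendered as a filename.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// chunkKey returns the lowercase hex SHA-256 digest of data. The same bytes
// always produce the same key, which is what makes storage deterministic and
// lets identical chunks from different objects collapse into one entry.
func chunkKey(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}
```

For example, `chunkKey([]byte("hello"))` returns `2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824`, no matter which object the chunk belongs to.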


Storage layout

data/chunks/ab/cd/<sha256-hash>

Sharding chunk files into subdirectories keyed by hash prefix keeps any single directory from accumulating too many files, which improves filesystem performance.


3. Metadata Schema (PostgreSQL)

PostgreSQL acts as the system's source of truth.

Objects table

Stores file-level metadata:

  • object_id (UUID)

  • name

  • status (PENDING -> READY)

  • deleted flag

  • timestamps


Chunks table

Stores deduplicated chunks:

  • chunk_id

  • hash

  • storage_path

  • ref_count


object_chunks table

Maps each object to its chunks and records their order:

  • object_id

  • chunk_id

  • order_index

This is essential for reconstructing files correctly during download.


Deduplication System

The system implements hash-based deduplication with reference counting.

How it works

When a chunk is uploaded:

  • If the chunk already exists -> increment ref_count

  • Else -> create a new file and DB entry

Example

Object A → [chunk1, chunk2, chunk3]
Object B → [chunk2, chunk3, chunk4]

Chunks 2 and 3 are shared.

This reduces storage usage significantly for:

  • Backups

  • Versioned files

  • Repeated uploads
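The Object A / Object B example can be modelled in memory as below. This is a simplified sketch, not the project's storage code: `chunkStore` is a hypothetical type whose two maps stand in for the chunk files on disk and the `ref_count` column.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// chunkStore is an in-memory stand-in for the chunk files plus the chunks
// table: it maps a content hash to its bytes and to a reference count.
type chunkStore struct {
	data     map[string][]byte
	refCount map[string]int
}

func newChunkStore() *chunkStore {
	return &chunkStore{data: map[string][]byte{}, refCount: map[string]int{}}
}

// put stores data if its hash is unseen; otherwise it only increments the
// reference count. It returns the hash and whether a new copy was stored.
func (s *chunkStore) put(data []byte) (hash string, stored bool) {
	sum := sha256.Sum256(data)
	hash = hex.EncodeToString(sum[:])
	if _, ok := s.data[hash]; !ok {
		s.data[hash] = append([]byte(nil), data...)
		stored = true
	}
	s.refCount[hash]++
	return hash, stored
}

// putObject uploads every chunk of an object and returns the ordered list of
// hashes, mirroring the object -> chunk mapping.
func (s *chunkStore) putObject(chunks ...string) []string {
	hashes := make([]string, 0, len(chunks))
	for _, c := range chunks {
		h, _ := s.put([]byte(c))
		hashes = append(hashes, h)
	}
	return hashes
}
```

After `putObject("chunk1", "chunk2", "chunk3")` and `putObject("chunk2", "chunk3", "chunk4")`, the store holds four chunks rather than six, and the two shared chunks each have a reference count of 2. Against PostgreSQL, the same check-and-increment could plausibly be a single `INSERT ... ON CONFLICT (hash) DO UPDATE SET ref_count = chunks.ref_count + 1`, assuming a unique constraint on `hash`.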


Deletion Model (Two-Phase Design)

Deletion is not immediate. Instead, it follows a two-phase lifecycle.

Phase 1: Logical deletion

  • Mark object as deleted = true

  • Remove object -> chunk mappings

  • Decrement chunk reference counts

At this stage, data is still physically present on disk.

Phase 2: Garbage collection

A background worker periodically:

  • Finds orphan chunks

  • Deletes files with ref_count == 0

  • Removes metadata entries

This ensures:

  • Fewer race conditions, since chunk files are never removed on the request path

  • Safe concurrent uploads/deletes

  • Crash-safe cleanup
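The two phases can be illustrated with a small in-memory model. It is a sketch under assumptions, not the service's implementation: the `store` type and its maps are hypothetical stand-ins for the three tables and the chunk directory.

```go
package main

// store models the metadata tables and the chunk directory in memory.
type store struct {
	objects  map[string][]string // object ID -> ordered chunk hashes (object_chunks)
	deleted  map[string]bool     // objects.deleted flag
	refCount map[string]int      // chunks.ref_count
	onDisk   map[string]bool     // chunk files physically present
}

func newStore() *store {
	return &store{
		objects:  map[string][]string{},
		deleted:  map[string]bool{},
		refCount: map[string]int{},
		onDisk:   map[string]bool{},
	}
}

// addObject registers an object and takes one reference per chunk.
func (s *store) addObject(id string, hashes ...string) {
	s.objects[id] = hashes
	for _, h := range hashes {
		s.refCount[h]++
		s.onDisk[h] = true
	}
}

// deleteObject is phase 1: mark the object deleted, drop its mappings, and
// decrement reference counts. No file is touched. Repeated calls are no-ops.
func (s *store) deleteObject(id string) {
	hashes, ok := s.objects[id]
	if !ok {
		return
	}
	s.deleted[id] = true
	for _, h := range hashes {
		s.refCount[h]--
	}
	delete(s.objects, id)
}

// collectGarbage is phase 2: remove every chunk whose reference count has
// dropped to zero, both from "disk" and from the metadata. It returns the
// number of chunks removed.
func (s *store) collectGarbage() int {
	removed := 0
	for h, n := range s.refCount {
		if n == 0 {
			delete(s.onDisk, h)
			delete(s.refCount, h)
			removed++
		}
	}
	return removed
}
```

With objects A = [c1, c2, c3] and B = [c2, c3, c4], deleting A leaves all four chunks on disk until `collectGarbage` runs, at which point only c1 is removed because c2 and c3 are still referenced by B.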


Garbage Collection System

The GC worker runs periodically in the background and performs the following.

1. Deleted object cleanup

  • Finds objects marked as deleted

  • Processes associated chunks

  • Removes metadata safely

2. Orphan chunk detection

Chunks not referenced by any object:

NOT EXISTS (SELECT 1 FROM object_chunks oc WHERE oc.chunk_id = chunks.chunk_id)

...are considered garbage.

3. Physical deletion

  • Removes the file from disk

  • Deletes the DB record

Why background GC instead of immediate deletion?

Because immediate deletion can cause:

  • Race conditions during uploads

  • Inconsistent reference counts

  • Partial reads during streaming


Download Flow

The download process reconstructs files deterministically.

Steps

  1. Fetch object metadata

  2. Retrieve ordered chunks

  3. Stream each chunk via gRPC

  4. Write sequentially to the client file

Guarantee

The reconstructed file is byte-identical to the original input, because chunks are streamed back in order_index order.
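The ordering guarantee can be sketched as follows. This is an illustration, not the project's download handler: `chunkRef`, `reassemble`, and `fetch` are hypothetical names, and the code assumes `order_index` starts at 0 and is contiguous.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sort"
)

// chunkRef mirrors one row of object_chunks joined to chunks.
type chunkRef struct {
	OrderIndex int
	Hash       string
}

// reassemble writes an object's chunks to w in order_index order. fetch is a
// hypothetical loader, for example one that reads the file at storage_path.
// Indexes are assumed to start at 0 with no gaps; anything else is an error.
func reassemble(w io.Writer, refs []chunkRef, fetch func(hash string) ([]byte, error)) error {
	ordered := append([]chunkRef(nil), refs...)
	sort.Slice(ordered, func(i, j int) bool {
		return ordered[i].OrderIndex < ordered[j].OrderIndex
	})
	for i, ref := range ordered {
		if ref.OrderIndex != i {
			return fmt.Errorf("missing or duplicate chunk at index %d", i)
		}
		data, err := fetch(ref.Hash)
		if err != nil {
			return fmt.Errorf("chunk %d (%s): %w", i, ref.Hash, err)
		}
		if _, err := w.Write(data); err != nil {
			return err
		}
	}
	return nil
}

// reassembleBytes is a convenience wrapper that collects the output in memory.
func reassembleBytes(refs []chunkRef, fetch func(hash string) ([]byte, error)) ([]byte, error) {
	var buf bytes.Buffer
	if err := reassemble(&buf, refs, fetch); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```

In the streaming case, `w` would wrap the gRPC send side so that each chunk is forwarded as soon as it is read, without assembling the whole file on the server.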


Consistency & Transactions

All critical operations use PostgreSQL transactions:

  • Upload object

  • Delete object

  • Chunk mapping updates

This ensures:

  • Atomicity

  • No partial writes

  • Safe failure recovery


Performance Characteristics

What this system optimizes for

  • Large file streaming

  • Storage deduplication

  • Metadata consistency

  • Sequential reconstruction speed

What it intentionally does NOT optimize

  • Distributed scaling

  • Multi-node replication

  • High availability

  • Global consistency protocols


Key System Properties

Property                        Description
Content Addressable Storage     Chunks identified by SHA-256 hash
Deduplication                   Same chunk stored only once
Streaming Upload/Download       No full-file buffering
Safe Garbage Collection         Reference-count + background cleanup
Transactional Metadata Layer    PostgreSQL ensures correctness

Limitations

This is a single-node prototype system, not a production storage engine.

Limitations include:

  • No distributed chunk replication

  • No fault tolerance across nodes

  • No erasure coding

  • No consensus layer (Raft, etc.)

  • GC is periodic, not event-triggered

  • Local disk storage only


Even as a prototype, this system demonstrates real-world engineering concepts used in:

  • Amazon S3 - object storage design

  • HDFS - chunk-based storage model

  • Git - content-addressed storage principles

  • Ceph - metadata and object separation


Conclusion

This project is a miniature storage engine, built to explore how real distributed storage systems are designed internally.

It brings together the following:

  • Systems programming (Go)

  • Database design (PostgreSQL)

  • Storage architecture (CAS + deduplication)

  • Network programming (gRPC streaming)

  • Background processing (GC systems)