Memory-Efficient LLM Inference via MoH-Guided Head-wise Offloading (In Progress)

Routing trace analysis -> head importance & regularity -> offloading/prefetch strategies -> accuracy/performance trade-off validation

Figure: MoH-guided head-wise offloading visualization

Overview

This project explores reducing GPU memory pressure during LLM inference by exploiting Mixture-of-Head (MoH) attention routing: heads that the router activates rarely and predictably are candidates for offloading to host memory and on-demand prefetching, and the resulting accuracy/performance trade-off is validated end to end.

Key Contributions

System Architecture

Bring-up Pipeline

Validation & Evidence

Debugging Case Study

Symptom:

Hypotheses:

Tests:

Fix:

Takeaway:

Skills Demonstrated