EPFL · VILAB · Educational Release

nanoMFM

Learn to build multimodal generative foundation models from scratch.

Minimal, readable implementations of GPT, MaskGIT, 4M, Flow Matching, and a VLM, each built step by step through guided notebooks and trained on toy datasets with modest compute.

Taught in CS-503 Visual Intelligence & COM-304 Intelligent Systems at EPFL.

The five nanoMFM modules: nanoGPT (text), nanoMaskGIT (masked tokens), nano4M (any-to-any multimodal), nanoFlowMatching (continuous denoising), and nanoVLM (vision-language QA)
The five modules: from a single-stream language model to multimodal any-to-any generation and vision-language understanding.

What is this?

nanoMFM is an educational codebase that teaches the modeling fundamentals behind modern generative and multimodal models. It strips each idea down to its essentials so you can read every line, fill in the gaps, and train a working model yourself, with no infrastructure rabbit holes.

✍️ Exercises & notebooks

Each module ships a Jupyter notebook that walks you through the implementation, training, and inference. You fill in the TODO gaps in the exercise stubs, train your model, and visualize its predictions.

🚀 Get started right away

Prefer to tinker? A complete reference implementation is included. Switch between modes with a single environment variable: NANOMFM_EXERCISES=1 for stubs, unset for solutions.
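A minimal sketch of how an environment-variable switch like this could work. It is not taken from the nanoMFM source: the `_exercises` subpackage name and both helper functions are hypothetical, and only `NANOMFM_EXERCISES` and `_solutions/` come from this page. It also assumes that only the value `1` enables exercise mode.

```python
import os


def exercises_enabled(env=None):
    """Return True when NANOMFM_EXERCISES is set to "1" (assumed convention)."""
    env = os.environ if env is None else env
    return env.get("NANOMFM_EXERCISES") == "1"


def resolve_module(name, env=None):
    """Build the dotted path of the stub or solution module for `name`.

    The `_exercises` subpackage is a hypothetical name used for illustration.
    """
    subpackage = "_exercises" if exercises_enabled(env) else "_solutions"
    return f"nanomfm.{subpackage}.{name}"


# Passing a dict instead of os.environ makes the behavior easy to inspect.
stub_path = resolve_module("gpt", {"NANOMFM_EXERCISES": "1"})
solution_path = resolve_module("gpt", {})
```

With the variable unset, the lookup falls through to the solutions, which matches the default described above.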

Designed to be minimal. Toy datasets (TinyStories, MNIST, CIFAR-10, MultimodalCLEVR) and small models keep compute modest; everything was tested on 1–4 V100 GPUs. The fundamentals you learn here translate directly when scaled up.

The Modules

Five self-contained modules, building from a single-stream language model up to multimodal any-to-any generation and vision-language understanding.

01

nanoGPT

Autoregressive transformer for language & image generation

Start with the foundation: a decoder-only transformer trained with next-token prediction on TinyStories and MNIST. Implement attention, the transformer trunk, and efficient generation with a KV-cache.
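To make the KV-cache idea concrete, here is a small NumPy sketch, assuming a single attention head and no batch dimension. The names `causal_attention` and `KVCache` are illustrative and are not the nanoMFM API. It checks that attending one token at a time against cached keys and values gives the same output as full causal attention over the whole sequence.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention(x, Wq, Wk, Wv):
    """Full causal self-attention over a sequence x of shape (T, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Mask out future positions (strictly above the diagonal).
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


class KVCache:
    """Stores keys and values of past tokens so each step only projects one token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x_t, Wq, Wk, Wv):
        """Attend from the newest token x_t (shape (d,)) to all cached tokens."""
        q = x_t @ Wq
        self.keys.append(x_t @ Wk)
        self.values.append(x_t @ Wv)
        K = np.stack(self.keys)    # (t, d)
        V = np.stack(self.values)  # (t, d)
        scores = K @ q / np.sqrt(q.shape[-1])
        return softmax(scores) @ V


rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

full = causal_attention(x, Wq, Wk, Wv)
cache = KVCache()
incremental = np.stack([cache.step(x[t], Wq, Wk, Wv) for t in range(T)])
```

The cached version never needs an explicit mask, because at step t the cache only contains tokens up to t.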

nanoGPT generating a TinyStories story one token at a time
Text: autoregressive generation, one token at a time.
nanoGPT generating MNIST digits patch by patch
Image: MNIST digit generated patch by patch.
02

nanoMaskGIT

Generation via unimodal masked modeling

Move from left-to-right to parallel, iterative decoding. Inspired by MaskGIT, you generate images (and text) by progressively unmasking tokens, the same masked-modeling recipe that also powers recent diffusion-style language models like LLaDA.
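The decoding loop can be sketched as follows, under simplifying assumptions: a random-logit `toy_model` stands in for a trained network, predictions are greedy, and a cosine schedule decides how many tokens stay masked after each step. `MASK`, `toy_model`, and `maskgit_decode` are hypothetical names for illustration and may not match the notebook code.

```python
import math

import numpy as np

MASK = -1  # hypothetical sentinel id marking a masked position


def toy_model(tokens, vocab_size, rng):
    """Stand-in for a trained masked model: random logits for every position."""
    return rng.normal(size=(len(tokens), vocab_size))


def maskgit_decode(seq_len, vocab_size, num_steps, rng):
    """Start fully masked, then commit the most confident predictions each step."""
    tokens = np.full(seq_len, MASK)
    for step in range(1, num_steps + 1):
        masked = tokens == MASK
        logits = toy_model(tokens, vocab_size, rng)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        pred, conf = probs.argmax(axis=-1), probs.max(axis=-1)

        # Cosine schedule: number of tokens that should remain masked after this step.
        keep_masked = int(seq_len * math.cos(0.5 * math.pi * step / num_steps))
        num_to_unmask = max(int(masked.sum()) - keep_masked, 0)

        # Only masked positions compete; already committed tokens are never changed.
        conf = np.where(masked, conf, -np.inf)
        chosen = np.argsort(-conf)[:num_to_unmask]
        tokens[chosen] = pred[chosen]
    return tokens


decoded = maskgit_decode(seq_len=16, vocab_size=10, num_steps=8,
                         rng=np.random.default_rng(0))
```

Because the schedule reaches zero at the final step, every position is committed by the end, and the number of forward passes is `num_steps` rather than `seq_len`.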

All 10 MNIST digits being unmasked step by step by nanoMaskGIT
Image: all 10 digits unmasking in parallel.
nanoMaskGIT iteratively unmasking a TinyStories text sequence
Text: iterative masked decoding from the same prompt as GPT above.
05

nanoVLM

Vision-language model for image understanding & VQA

Bring it together into a vision-language model. Following LLaVA, you connect a vision encoder to a language model through a lightweight projector and train for captioning and visual question answering.
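As a rough illustration of the projector idea, the NumPy sketch below maps vision-encoder patch features into the language model's embedding space and prepends them to the question's token embeddings. Everything here is an assumption made for illustration: the dimensions, the random arrays standing in for a real encoder and embedding table, the single linear layer, and the function name `build_lm_input`. The projector used in nanoVLM may differ.

```python
import numpy as np


def build_lm_input(patch_features, question_ids, W_proj, b_proj, token_embedding):
    """Project image patches to the LM width and prepend them to the text embeddings."""
    image_tokens = patch_features @ W_proj + b_proj  # (num_patches, d_model)
    text_tokens = token_embedding[question_ids]      # (num_text, d_model)
    return np.concatenate([image_tokens, text_tokens], axis=0)


rng = np.random.default_rng(0)
num_patches, d_vision, d_model, vocab_size = 16, 32, 64, 100

# Random stand-ins for a vision encoder's output and an LM embedding table.
patch_features = rng.normal(size=(num_patches, d_vision))
token_embedding = rng.normal(size=(vocab_size, d_model))

# The lightweight projector: here a single linear layer (an assumed choice).
W_proj = rng.normal(size=(d_vision, d_model)) * 0.02
b_proj = np.zeros(d_model)

question_ids = np.array([5, 17, 42, 8])  # made-up token ids for a question
lm_input = build_lm_input(patch_features, question_ids, W_proj, b_proj, token_embedding)
```

The language model then processes `lm_input` as one sequence, so the answer tokens can attend to both the image patches and the question.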

nanoVLM architecture: tokenizer and vision encoder feeding a language model that answers a question about an image
Vision encoder + projector + language model → answer.

Get Started

Clone, set up the environment, and open the notebooks. Datasets for the first modules download automatically.

Install
git clone https://github.com/EPFL-VILAB/nanoMFM.git
cd nanoMFM
bash scripts/setup_env.sh
source activate nanomfm

Work through the exercises

export NANOMFM_EXERCISES=1
jupyter lab   # then fill in the TODOs

Imports resolve to stub implementations with TODOs. Reference _solutions/ if you get stuck.

Run the complete models

from nanomfm.models import (
    GPT, MaskGIT, FourM, RectifiedFlow
)

Solutions are the default, so no environment variable is needed. Train via scripts/run_training.py.

Roadmap

nanoMFM is actively evolving. On the near-term list:

Citation

If you find nanoMFM useful, please consider citing it.

BibTeX
@misc{bachmann2026nanomfm,
  author    = {Bachmann, Roman and Gao, Zhitong and Khattak, Muhammad Uzair and Ye, Mingqiao and Zamir, Amir},
  title     = {nanoMFM: Educational Multimodal Foundation Models},
  publisher = {GitHub},
  year      = {2026},
  url       = {https://github.com/EPFL-VILAB/nanoMFM}
}