EPFL · VILAB · Educational Release

nanoMFM

Learn to build multimodal generative foundation models from scratch.

Minimal, readable implementations of GPT, MaskGIT, 4M, Flow Matching, and a VLM — each built step by step through guided notebooks, trained on toy datasets with modest compute.


Taught in CS-503 Visual Intelligence & COM-304 Intelligent Systems at EPFL.

Figure: the five nanoMFM modules, from a single-stream language model to multimodal any-to-any generation and vision-language understanding: nanoGPT (text), nanoMaskGIT (masked tokens), nano4M (any-to-any multimodal), nanoFlowMatching (continuous denoising), and nanoVLM (vision-language QA).

What is this?

nanoMFM is an educational codebase that teaches the modeling fundamentals behind modern generative and multimodal models. It strips each idea down to its essentials so you can read every line, fill in the gaps, and train a working model yourself — no infrastructure rabbit holes.

✏️ Exercises & notebooks

Each module ships a Jupyter notebook that walks you through the implementation, training, and inference. You fill in the TODO gaps in the exercise stubs, train your model, and visualize its predictions.

🚀 Get started right away

Prefer to tinker? A complete reference implementation is included. Switch between modes with a single environment variable — NANOMFM_EXERCISES=1 for stubs, unset for solutions.

Designed to be minimal. Toy datasets (TinyStories, MNIST, CIFAR-10, MultimodalCLEVR) and small models keep compute modest — everything was tested on 1–4 V100 GPUs. The fundamentals you learn here translate directly when scaled up.

The Modules

Five self-contained modules, building from a single-stream language model up to multimodal any-to-any generation and vision-language understanding.

nanoGPT

Autoregressive transformer for language & image generation

Start with the foundation: a decoder-only transformer trained with next-token prediction on TinyStories and MNIST. Implement attention, the transformer trunk, and efficient generation with a KV-cache.
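To make the KV-cache idea concrete, here is a minimal single-head sketch in plain Python. It is not the code in gpt.py: the names attend, causal_attention_full, and KVCache are hypothetical, and queries, keys, and values are passed in directly rather than computed by learned projections. The point it illustrates is that caching past keys and values gives the same output as recomputing causal attention from scratch, while each generation step only adds one new key/value pair.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def causal_attention_full(qs, ks, vs):
    """Reference: position t attends to positions 0..t, recomputed every time."""
    return [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]


class KVCache:
    """Keeps the keys and values of past tokens (hypothetical helper)."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        """Append the new token's key/value, then attend over the whole cache."""
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)
```

During generation you would call step once per new token, so the work per token grows with the cache length instead of reprocessing the full prefix through every layer.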

📓 Notebook 🖥 Rundown Slides </> gpt.py

Text: nanoGPT generating a TinyStories story autoregressively, one token at a time.

Image: nanoGPT generating an MNIST digit patch by patch.

nanoMaskGIT

Generation via unimodal masked modeling

Move from left-to-right generation to parallel, iterative decoding. Following the MaskGIT approach, you generate images (and text) by progressively unmasking tokens, the same masked-modeling recipe that powers recent diffusion-style language models such as LLaDA.
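The progressive unmasking loop can be sketched as follows. This is an illustrative simplification, not the code in maskgit.py: toy_predict is a random stand-in for the trained model, MASK is an arbitrary sentinel, and the loop keeps the most confident predictions greedily under a cosine schedule, whereas practical implementations typically sample tokens and add noise to the confidences.

```python
import math
import random

MASK = -1  # sentinel id for a masked position (illustrative choice)


def num_masked(step, total_steps, seq_len):
    """Cosine schedule: how many tokens stay masked after `step` steps."""
    ratio = math.cos(0.5 * math.pi * step / total_steps)
    return max(0, math.floor(seq_len * ratio))


def toy_predict(tokens, rng):
    """Stand-in for the model: a (token, confidence) guess per position."""
    return [(rng.randrange(10), rng.random()) for _ in tokens]


def iterative_decode(seq_len, total_steps, predict=toy_predict, seed=0):
    """Start fully masked and reveal the most confident tokens each step."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, total_steps + 1):
        preds = predict(tokens, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        n_reveal = len(masked) - num_masked(step, total_steps, seq_len)
        # Most confident masked positions first; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_reveal]:
            tokens[i] = preds[i][0]
    return tokens
```

The schedule reveals few tokens early, when the model has little context, and many tokens late, which is the usual motivation for a cosine shape.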

📓 Notebook 🖥 Rundown Slides </> maskgit.py

Image: nanoMaskGIT unmasking all 10 MNIST digits in parallel, step by step.

Text: nanoMaskGIT iteratively unmasking a TinyStories sequence, starting from the same prompt as the nanoGPT example above.

nano4M

Multimodal any-to-any generation

Generalize masked modeling across modalities. Following 4M, you train a single model that maps any subset of modalities to any other — RGB, depth, surface normals, and scene descriptions — on the MultimodalCLEVR dataset.
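One way to picture the any-to-any objective is as a random split of all tokens, across all modalities, into a visible input set and a disjoint target set. The sketch below assumes a hypothetical helper sample_input_target and a uniform shuffle over the pooled tokens. That is a simplification for illustration and not necessarily how fourm.py allocates its input and target budgets per modality.

```python
import random


def sample_input_target(sample, num_input, num_target, seed=None):
    """Split tokens from all modalities into disjoint input and target sets.

    `sample` maps a modality name to its list of tokens. Each returned entry
    is (modality, position, token), so the origin of every token is kept.
    """
    rng = random.Random(seed)
    pool = [
        (modality, position, token)
        for modality, tokens in sample.items()
        for position, token in enumerate(tokens)
    ]
    rng.shuffle(pool)
    inputs = pool[:num_input]
    targets = pool[num_input:num_input + num_target]
    return inputs, targets
```

Because the split changes for every training sample, a single model sees every combination of input and output modalities, which is what lets it map any subset to any other at inference time.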

📓 Notebook 🖥 Rundown Slides </> fourm.py ＋ 4M Tutorial (pretrained)

A MultimodalCLEVR sample across four aligned modalities: RGB, depth, surface normals, and text.

nanoFlow

Continuous-time generative modeling

Switch from discrete tokens to continuous data with Rectified Flow / Flow Matching and a DiT-style denoiser. Learn the velocity field that transports noise to data on MNIST and CIFAR-10.
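The core of rectified flow fits in a few lines. The sketch below assumes the convention that t = 0 is noise and t = 1 is data (some codebases use the opposite), and the function names are hypothetical rather than taken from rectified_flow.py. Training regresses a network onto target_velocity at points produced by interpolate, and sampling integrates the learned velocity with fixed-size Euler steps.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target: the constant velocity along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, num_steps=50):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

As a sanity check, with a single data point x1 the ideal velocity at (x, t) is (x1 - x) / (1 - t), and plugging that in as velocity_fn makes euler_sample land on x1. In the real model, velocity_fn would be the trained DiT-style denoiser.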

📓 Notebook 🖥 Rundown Slides </> rectified_flow.py

Animation: the flow transports a Gaussian noise distribution into data samples.

Generated MNIST digits: noise evolves into recognizable digits over 50 Euler steps.

nanoVLM

Vision-language model for image understanding & VQA

Bring it together into a vision-language model. Following LLaVA, you connect a vision encoder to a language model through a lightweight projector and train for captioning and visual question answering.
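The projector's job is to map vision features into the language model's embedding space so that image patches can be consumed like ordinary tokens. The sketch below makes several assumptions that the text above does not specify: a single linear layer as the projector, image tokens placed before the text tokens, and the hypothetical names project and build_lm_input. The actual projector in vision_language_model.py may differ.

```python
def project(patch_features, weight, bias):
    """Linear projector: map each vision feature to the LM embedding size.

    `weight` has one row per output dimension, each row as long as a feature.
    """
    return [
        [sum(f * w for f, w in zip(feature, row)) + b
         for row, b in zip(weight, bias)]
        for feature in patch_features
    ]


def build_lm_input(patch_features, text_embeddings, weight, bias):
    """Prefix the projected image tokens to the text token embeddings."""
    return project(patch_features, weight, bias) + text_embeddings
```

The language model then runs over the combined sequence, so the question tokens can attend to the image tokens when predicting the answer.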

📓 Notebook 🖥 Rundown Slides </> vision_language_model.py

nanoVLM architecture: the vision encoder and projector, together with the text tokenizer, feed a language model that answers a question about an image.

Get Started

Clone, set up the environment, and open the notebooks. Datasets for the first modules download automatically.

Install

git clone https://github.com/EPFL-VILAB/nanoMFM.git
cd nanoMFM
bash scripts/setup_env.sh
source activate nanomfm

Work through the exercises

export NANOMFM_EXERCISES=1
jupyter lab   # then fill in the TODOs

Imports resolve to stub implementations with TODOs. Refer to _solutions/ if you get stuck.
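For intuition, a switch like this can be implemented by checking the environment variable at import time and choosing which package to load from. The sketch below is an assumption about how such a mechanism could look, not a description of nanoMFM's internals: the helper names and the nanomfm.exercises package path are hypothetical, and only the variable NANOMFM_EXERCISES and the _solutions directory name come from the text above.

```python
import os


def exercises_enabled(env=None):
    """True when NANOMFM_EXERCISES is set to "1" (hypothetical helper)."""
    env = os.environ if env is None else env
    return env.get("NANOMFM_EXERCISES") == "1"


def resolve_module(name, env=None):
    """Return the dotted path to load for `name` (paths are illustrative)."""
    package = "exercises" if exercises_enabled(env) else "_solutions"
    return f"nanomfm.{package}.{name}"
```

A package's __init__.py could pass the result to importlib.import_module, so the same import statement yields either the stub or the reference implementation.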

Run the complete models

from nanomfm.models import (
    GPT, MaskGIT, FourM, RectifiedFlow
)

Solutions are the default — no environment variable needed. Train via scripts/run_training.py.

Roadmap

nanoMFM is actively evolving. On the near-term list:

More techniques — additional architectures and training tricks as the course material grows.
Richer visualizations — more worked examples and qualitative results across modules.

Citation

If you find nanoMFM useful, please consider citing it.

BibTeX

@misc{bachmann2026nanomfm,
  author    = {Bachmann, Roman and Gao, Zhitong and Khattak, Muhammad Uzair and Ye, Mingqiao and Zamir, Amir},
  title     = {nanoMFM: Educational Multimodal Foundation Models},
  publisher = {GitHub},
  year      = {2026},
  url       = {https://github.com/EPFL-VILAB/nanoMFM}
}