Given the same compute budget, which is better: parameter efficient finetuning (PEFT) or full finetuning?


Published: May 04, 2023

Motivation

If we are allowed to train until convergence, we know that full finetuning is better than parameter-efficient finetuning (PEFT). But what if we have a fixed compute budget? Each PEFT step is cheaper, so under a fixed budget PEFT can go through significantly more tokens. Will full finetuning still be better than PEFT?

For my research problem, it turns out full finetuning is still better than PEFT.

Experiment Setup

For context, my research problem is to adapt English LLMs to other languages. I finetuned LLaMA-7B on the roots_* datasets (the datasets used to train BLOOM) and OpenSubtitles. For PEFT, I used LoRA and additionally trained the embedding and LM head.
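To give a feel for how many parameters this PEFT setup actually trains, here is a back-of-envelope sketch. The model dimensions are the published LLaMA-7B configuration; the LoRA rank (8) and the choice of two adapted attention projections per layer are assumptions for illustration only, since the post does not state them.

```python
# Rough trainable-parameter count: LoRA adapters plus a fully trained
# embedding and LM head. Rank and target modules are ASSUMED, not the
# post's actual settings.

HIDDEN = 4096     # LLaMA-7B hidden size
LAYERS = 32       # number of transformer blocks
VOCAB = 32_000    # tokenizer vocabulary size
FFN = 11_008      # MLP intermediate size


def lora_params(rank, targets_per_layer=2):
    """Each adapted HIDDEN x HIDDEN projection gets A (rank x HIDDEN) and B (HIDDEN x rank)."""
    return LAYERS * targets_per_layer * 2 * rank * HIDDEN


def embedding_and_head_params():
    """Input embedding and LM head are separate matrices in LLaMA, so both count."""
    return 2 * VOCAB * HIDDEN


def total_params():
    attn = 4 * HIDDEN * HIDDEN   # q, k, v, o projections
    mlp = 3 * HIDDEN * FFN       # gate, up, down projections
    norms = 2 * HIDDEN           # two RMSNorm weight vectors per block
    final_norm = HIDDEN
    return LAYERS * (attn + mlp + norms) + embedding_and_head_params() + final_norm


if __name__ == "__main__":
    total = total_params()
    lora = lora_params(rank=8)  # assumed rank
    extra = embedding_and_head_params()
    print(f"total parameters:        {total:,}")
    print(f"LoRA only:               {lora:,} ({100 * lora / total:.3f}%)")
    print(f"LoRA + embedding + head: {lora + extra:,} ({100 * (lora + extra) / total:.2f}%)")
```

Under these assumed settings the LoRA matrices alone are well under 0.1% of the model, while the embedding and LM head add roughly 4%. The frozen base weights still take part in every forward and backward pass.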

Results

Figure: loss plots for full finetuning vs. PEFT

Here, we can see that full finetuning (blue) achieves a better loss than PEFT (orange) given the same compute budget! It seems that full finetuning can, ironically, be more efficient than PEFT. Interesting!

From my colleagues’ experiments, it turns out that PEFT only speeds up training by about 25% compared to full finetuning, despite training <1% of the parameters.
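One way to see why the speedup is modest is a rough FLOP count. The sketch below uses the common approximation of about 2N FLOPs per token for the forward pass and about 4N for the backward pass of an N-parameter model, with the backward pass split evenly between gradients with respect to activations and gradients with respect to weights. It ignores the adapters' own compute, the optimizer step, and memory effects, so treat it as an estimate rather than a measurement. The function names are made up for this illustration.

```python
# Approximate per-token training cost for full finetuning vs. PEFT.
# Assumes ~2N forward FLOPs and ~4N backward FLOPs for N parameters.


def flops_per_token_full(n_params):
    forward = 2 * n_params
    backward_activations = 2 * n_params  # gradients w.r.t. layer inputs
    backward_weights = 2 * n_params      # gradients w.r.t. every weight
    return forward + backward_activations + backward_weights


def flops_per_token_peft(n_params, trainable_fraction):
    forward = 2 * n_params               # frozen weights still run forward
    backward_activations = 2 * n_params  # still needed to reach adapters in early layers
    backward_weights = 2 * n_params * trainable_fraction  # only trainable weights
    return forward + backward_activations + backward_weights


def tokens_within_budget(budget_flops, flops_per_token):
    """Number of tokens that fit in a fixed compute budget."""
    return budget_flops / flops_per_token


if __name__ == "__main__":
    n = 7_000_000_000   # roughly a 7B-parameter model
    budget = 1e21       # arbitrary budget, for illustration only
    full = tokens_within_budget(budget, flops_per_token_full(n))
    peft = tokens_within_budget(budget, flops_per_token_peft(n, 0.01))
    print(f"PEFT sees {peft / full:.2f}x the tokens of full finetuning")
```

Under this estimate, even with only 1% of the weights trainable, PEFT sees at most about 1.5x the tokens for the same budget. Skipping the weight gradients removes only about a third of the work. A measured speedup of 25% is in the same range once the ignored overheads are included.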



Larry Law
