Engineering Tricks
Published:
Compilation of tricks I find useful.
LLM Training
Memory efficient attention
- Flash Attention 2, which has an HF integration for Llama 2 by FastChat here
- xFormers, which also has an HF integration for Llama 2 by FastChat here
Fused Kernels
- matmul + bias, matmul + bias + GELU, cross-entropy loss, rotary embedding, and fused dropout + residual + layer norm, all implemented by FA2 here
- FusedAdam, implemented by torch natively or via DeepSpeed here
- Fused SwiGLU from xFormers.
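One way to see the benefit of fusion: matmul + bias can run as a single kernel with torch.addmm instead of a matmul followed by a separate broadcast add. A minimal sketch for illustration only (the kernels listed above come from FA2, torch, and xFormers, not from this call):

```python
import torch

x = torch.randn(4, 8)    # activations
W = torch.randn(8, 16)   # weight
b = torch.randn(16)      # bias

# unfused: two kernels, a matmul then a broadcast add
y_unfused = x @ W + b

# fused: torch.addmm computes b + x @ W in one kernel
y_fused = torch.addmm(b, x, W)

assert torch.allclose(y_unfused, y_fused, atol=1e-5)
```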
Parallelism
- Difference between 3D parallelism and ZeRO-3, explained here.
Misc
- If an LLM is pre-trained in bf16, finetune it with bf16 (not fp16). See more here.
- Packing with attention segment masking. See this Twitter thread here. Credits to the tweeters in the thread.
- 4-bit optimizer PEFT training. See more here.
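The segment-masking idea behind the packing trick above can be sketched in plain Python (a toy helper of my own, not code from the thread): each token attends only to earlier tokens within its own packed segment, yielding a block-diagonal causal mask.

```python
def packed_causal_mask(segment_ids):
    """mask[i][j] is True iff token i may attend to token j:
    j is in the same packed segment as i and is not in the future."""
    n = len(segment_ids)
    return [
        [segment_ids[i] == segment_ids[j] and j <= i for j in range(n)]
        for i in range(n)
    ]

# two sequences of lengths 2 and 3 packed into one row
mask = packed_causal_mask([0, 0, 1, 1, 1])
# token 1 sees token 0 (same segment), but token 2 cannot see tokens 0 or 1
```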
LLM Loading
- DeepSpeed: need to use
deepspeed.zero.Init
- HF:
model.from_pretrained(low_cpu_mem_usage=True)
LLM Inference
- Use vLLM: paged attention and a 2x speedup compared to vanilla transformers on an internal benchmark.
- FP16 inference.
- 8-bit inference.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME_OR_PATH,
    device_map="auto",
    load_in_8bit=True,
)
model.eval()
To support 8-bit inference for your own models, follow this discussion.
- Generate text excluding the prompt.
import torch

def remove_input_ids_from_output_ids(input_ids, output_ids, tokenizer):
    """Remove input_ids from output_ids. Applicable only to causal LM models,
    whose outputs begin with the input_ids (prompt) tokens."""
    # non-pad and pad token counts per row; their sum is the full input length
    input_ids_lens = torch.tensor([
        tokenized.ne(tokenizer.pad_token_id).sum().item() for tokenized in input_ids])
    padding_lens = torch.tensor([
        (tokenized == tokenizer.pad_token_id).sum().item() for tokenized in input_ids])
    total_lens = input_ids_lens + padding_lens
    # drop the prompt (and its padding) from each generated sequence
    outputs = [op[total_lens[i]:] for i, op in enumerate(output_ids)]
    return outputs
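The same idea in its simplest form, ignoring batching and padding (a toy sketch, not a drop-in replacement for the function above):

```python
def strip_prompt(input_ids, output_ids):
    # causal LMs echo the prompt at the start of generate() output,
    # so the newly generated tokens are everything after it
    assert output_ids[:len(input_ids)] == input_ids
    return output_ids[len(input_ids):]

strip_prompt([5, 6, 7], [5, 6, 7, 42, 43])  # -> [42, 43]
```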
Software Engineering
- Set up passwordless ssh. If you still have issues, it’s likely file permissions. Set permissions according to this.
- Project Setup Boilerplate. Finally doing away with relative imports!
- Gitignore Boilerplate. Initialise gitignore from here.
- Commands to create a venv
python3.9 -m venv name-of-venv-folder
# python3.9 -m venv name-of-venv-folder --system-site-packages
source name-of-venv-folder/bin/activate
- Find the kernel for a Jupyter notebook (general Jupyter notebook, not Jupyter notebook + VS Code).
# activate env
source env/bin/activate
# ensures jupyter and python are in the same environment. See https://stackoverflow.com/questions/48193822/import-of-package-works-in-ipython-shell-but-not-in-jupyter-notebook
pip install notebook --ignore-installed
# these need to be the same
which python3
which jupyter
# use jupyter notebook in venv. See
ipython kernel install --user --name=venv
# to see env, simply refresh
For Jupyter notebook + VS Code, ensure that you’ve installed the following extensions: Jupyter, Python.
HuggingFace
- Sharing datasets. Guides on how to share my own HF datasets.
Nvidia Containers
- Mapping between Nvidia container version, CUDA Toolkit version, and PyTorch version: link here
- How to find the appropriate container to download? Use nvidia-smi to find the driver version, then find the latest Nvidia container that still supports that driver version.
- How to know whether a container is supported by the driver? Check forward compatibility here.