Engineering Tricks


Compilation of tricks I find useful.

LLM Training

Memory efficient attention

  • Flash Attention 2, which has a HF integration for Llama 2 by FastChat here
  • xFormers, which also has a HF integration for Llama 2 by FastChat here
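The point of these kernels is to avoid materialising the full seq × seq attention matrix. A minimal sketch of the same idea in plain PyTorch (not FlashAttention itself): `torch.nn.functional.scaled_dot_product_attention` dispatches to a fused memory-efficient or FlashAttention kernel when one is available, and its math fallback gives the same result as the naive implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch=2, heads=4, seq=128, head_dim=64.
q = torch.randn(2, 4, 128, 64)
k = torch.randn(2, 4, 128, 64)
v = torch.randn(2, 4, 128, 64)

# PyTorch >= 2.0 picks a fused kernel (FlashAttention / memory-efficient)
# when supported; otherwise it falls back to the math path below.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Naive reference that materialises the full 128x128 attention matrix.
scores = (q @ k.transpose(-2, -1)) / (64 ** 0.5)
causal = torch.triu(torch.ones(128, 128, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal, float("-inf"))
ref = scores.softmax(dim=-1) @ v
```

Both paths agree numerically; the fused one just never allocates `scores`.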

Fused Kernels

  • matmul + bias, matmul + bias + GELU, cross-entropy loss, rotary embedding, and fused dropout + residual + layer norm, implemented by FA2 here
  • FusedAdam, implemented natively by torch or via DeepSpeed here
  • Fused SwiGLU from xFormers.
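For the fused optimizer, recent PyTorch exposes this directly: `torch.optim.AdamW` accepts `fused=True`, which runs the whole update in one CUDA kernel instead of one kernel per tensor. A small sketch (the flag requires CUDA parameters, so here we fall back to the default path on CPU):

```python
import torch

model = torch.nn.Linear(16, 16)

# fused=True needs params on CUDA; degrade gracefully on CPU.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3,
                        fused=torch.cuda.is_available())

before = model.weight.detach().clone()
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
opt.step()
after = model.weight.detach().clone()
```

The fused and non-fused paths compute the same AdamW update; only the kernel launch overhead differs.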

Parallelism

  • Difference between 3D parallelism and ZeRO-3, explained here.
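The memory argument behind ZeRO-3 can be sketched with back-of-the-envelope arithmetic (the standard accounting from the ZeRO paper): mixed-precision Adam costs roughly 16 bytes per parameter (2 for fp16 weights, 2 for fp16 grads, 12 for fp32 master weights plus Adam's m and v), and ZeRO-3 shards all of it across the data-parallel ranks instead of replicating it.

```python
def per_gpu_bytes(num_params, num_ranks, zero3=True):
    """Rough per-GPU training memory for mixed-precision Adam.

    2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 master weights
    + Adam m and v) = 16 bytes per parameter. ZeRO-3 shards this
    across ranks; plain data parallelism replicates it everywhere.
    """
    total = 16 * num_params
    return total / num_ranks if zero3 else total

# Illustrative numbers: a 7B-parameter model on 8 GPUs.
replicated = per_gpu_bytes(7e9, 8, zero3=False) / 2**30  # ≈ 104 GiB per GPU
sharded = per_gpu_bytes(7e9, 8, zero3=True) / 2**30      # ≈ 13 GiB per GPU
```

This ignores activations and communication buffers, but it shows why plain data parallelism cannot hold a 7B model in training while ZeRO-3 can.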

Misc

  • If an LLM is pre-trained in bf16, finetune it in bf16 (not fp16). See more here
  • Packing with attention segment masking. See this Twitter thread here. Credits to the tweeters in the thread.
  • 4-bit optimizer PEFT training. See more here.
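The segment-masking trick for packing can be sketched in a few lines: when several sequences are packed into one row, build a block-diagonal causal mask so that tokens only attend within their own segment (a minimal illustration, not any particular library's implementation).

```python
import torch

def segment_attention_mask(segment_ids):
    """Boolean (seq, seq) mask: position i may attend to position j only
    if both belong to the same packed segment and j <= i (causal)."""
    same_segment = segment_ids.unsqueeze(-1) == segment_ids.unsqueeze(-2)
    n = segment_ids.shape[0]
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    return same_segment & causal

# Two sequences of lengths 3 and 2 packed into one row of length 5.
ids = torch.tensor([0, 0, 0, 1, 1])
mask = segment_attention_mask(ids)
```

Without the `same_segment` term, token 4 (second sequence) could attend to tokens 0-2 (first sequence), leaking context across the packing boundary.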

LLM Loading

  • Deepspeed: need to use deepspeed.zero.init
  • HF: model.from_pretrained(low_cpu_mem_usage=True)
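For the DeepSpeed path, a minimal sketch of what the config and call look like (illustrative values; `deepspeed.zero.Init` constructs each layer already sharded across ranks instead of materialising the full model on every process):

```python
# Minimal ZeRO-3 config, illustrative values only.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}

# Usage sketch (requires deepspeed and a distributed launch,
# so shown here as comments):
# import deepspeed
# with deepspeed.zero.Init(config_dict_or_path=ds_config):
#     model = AutoModelForCausalLM.from_pretrained(MODEL_NAME_OR_PATH)
```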

LLM Inference

  • Use vLLM: paged attention and a 2x speed-up over vanilla transformers on an internal benchmark.
  • FP16 inference.
  • 8-bit inference.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME_OR_PATH,
                                             device_map="auto",
                                             load_in_8bit=True)
model.eval()

To support 8 bit inference for your own models, follow this discussion.

  • Generate text excluding prompt.
import torch

def remove_input_ids_from_output_ids(input_ids, output_ids, tokenizer):
    """Strip the prompt tokens from generated output.

    Applicable only to causal LM models, whose outputs start with the
    input_ids (prompt plus padding)."""
    # Total prompt length per row = non-pad tokens + pad tokens.
    input_ids_lens = torch.tensor(
        [row.ne(tokenizer.pad_token_id).sum().item() for row in input_ids])
    padding_lens = torch.tensor(
        [row.eq(tokenizer.pad_token_id).sum().item() for row in input_ids])
    total_lens = input_ids_lens + padding_lens
    return [op[total_lens[i]:] for i, op in enumerate(output_ids)]

Software Engineering

  • Create a Python virtual environment.
python3.9 -m venv name-of-venv-folder
# python3.9 -m venv name-of-venv-folder --system-site-packages
source name-of-venv-folder/bin/activate
  • Find the kernel for a Jupyter notebook (plain Jupyter Notebook, not Jupyter Notebook + VS Code).
# activate env
source env/bin/activate

# ensures jupyter and python is in same environment. See https://stackoverflow.com/questions/48193822/import-of-package-works-in-ipython-shell-but-not-in-jupyter-notebook
pip install notebook --ignore-installed  

# these need to point to the same environment
which python3
which jupyter

# register the venv as a Jupyter kernel
ipython kernel install --user --name=venv

# to see the env in the kernel list, simply refresh

For Jupyter Notebook + VS Code, ensure that you’ve installed the following extensions: Jupyter, Python.

HuggingFace

Nvidia Containers

  • Mapping between Nvidia container version, CUDA Toolkit version, and PyTorch version: link here
  • How to find the appropriate container to download? Use nvidia-smi to find the driver version, then find the latest Nvidia container that still supports that driver.
  • How to know if a container is supported by your driver? Check forward compatibility here.