Engineering Tricks
Published:
Compilation of tricks I find useful.
LLM Training
Memory efficient attention
- Flash Attention 2, which has an HF integration for Llama 2 by FastChat here
- xFormers, which also has an HF integration for Llama 2 by FastChat here
Fused Kernels
- matmul + bias, matmul + bias + GELU, cross-entropy loss, rotary embedding, and fused dropout + residual + layer norm, all implemented by FA2 here
- FusedAdam, implemented by torch natively or via DeepSpeed here
- Fused SwiGLU from xFormers.
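One way to see the benefit of fusion: matmul + bias can run as a single kernel with torch.addmm instead of a matmul followed by a separate broadcast add. A minimal sketch for illustration only (the kernels listed above come from FA2, torch, and xFormers, not from this call):

```python
import torch

x = torch.randn(4, 8)    # activations
W = torch.randn(8, 16)   # weight
b = torch.randn(16)      # bias

# unfused: two kernels, a matmul then a broadcast add
y_unfused = x @ W + b

# fused: torch.addmm computes b + x @ W in one kernel
y_fused = torch.addmm(b, x, W)

assert torch.allclose(y_unfused, y_fused, atol=1e-5)
```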
Parallelism
- Difference between 3D parallelism and ZeRO-3, explained here.
Misc
- If an LLM is pre-trained in bf16, finetune it with bf16 (not fp16). See more here.
- Packing with attention segment masking. See this Twitter thread here. Credits to the tweeters in the thread.
- 4-bit optimizer PEFT training. See more here.
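The segment-masking idea behind the packing trick above can be sketched in plain Python (a toy helper of my own, not code from the thread): each token attends only to earlier tokens within its own packed segment, yielding a block-diagonal causal mask.

```python
def packed_causal_mask(segment_ids):
    """mask[i][j] is True iff token i may attend to token j:
    j is in the same packed segment as i and is not in the future."""
    n = len(segment_ids)
    return [
        [segment_ids[i] == segment_ids[j] and j <= i for j in range(n)]
        for i in range(n)
    ]

# two sequences of lengths 2 and 3 packed into one row
mask = packed_causal_mask([0, 0, 1, 1, 1])
# token 1 sees token 0 (same segment), but token 2 cannot see tokens 0 or 1
```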
LLM Loading
- DeepSpeed: need to use
deepspeed.zero.Init
- HF:
model.from_pretrained(low_cpu_mem_usage=True)
LLM Inference
- Use vLLM: paged attention and a 2x speedup compared to vanilla transformers on an internal benchmark.
- FP16 inference.
- 8-bit inference.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME_OR_PATH,
    device_map="auto",
    load_in_8bit=True,
)
model.eval()
To support 8-bit inference for your own models, follow this discussion.
- Generate text excluding the prompt.
import torch

def remove_input_ids_from_output_ids(input_ids, output_ids, tokenizer):
    """Remove input_ids from output_ids. Applicable only to causal LM models,
    whose outputs begin with the input_ids (prompt) tokens."""
    # non-pad and pad token counts per row; their sum is the full input length
    input_ids_lens = torch.tensor([
        tokenized.ne(tokenizer.pad_token_id).sum().item() for tokenized in input_ids])
    padding_lens = torch.tensor([
        (tokenized == tokenizer.pad_token_id).sum().item() for tokenized in input_ids])
    total_lens = input_ids_lens + padding_lens
    # drop the prompt (and its padding) from each generated sequence
    outputs = [op[total_lens[i]:] for i, op in enumerate(output_ids)]
    return outputs
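The same idea in its simplest form, ignoring batching and padding (a toy sketch, not a drop-in replacement for the function above):

```python
def strip_prompt(input_ids, output_ids):
    # causal LMs echo the prompt at the start of generate() output,
    # so the newly generated tokens are everything after it
    assert output_ids[:len(input_ids)] == input_ids
    return output_ids[len(input_ids):]

strip_prompt([5, 6, 7], [5, 6, 7, 42, 43])  # -> [42, 43]
```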
Software Engineering
- Set up passwordless ssh. If you still have issues, it’s likely file permissions. Set permissions according to this.
- Project Setup Boilerplate. Finally doing away with relative imports!
- Gitignore Boilerplate. Initialise gitignore from here.
- Commands to create a venv
python3.9 -m venv name-of-venv-folder
# python3.9 -m venv name-of-venv-folder --system-site-packages
source name-of-venv-folder/bin/activate
- Find the kernel for a Jupyter notebook (general Jupyter notebook, not Jupyter notebook + VS Code).
# activate env
source env/bin/activate
# ensures jupyter and python are in the same environment. See https://stackoverflow.com/questions/48193822/import-of-package-works-in-ipython-shell-but-not-in-jupyter-notebook
pip install notebook --ignore-installed
# these need to be the same
which python3
which jupyter
# use jupyter notebook in venv. See
ipython kernel install --user --name=venv
# to see env, simply refresh
For Jupyter notebook + VS Code, ensure that you’ve installed the following extensions: Jupyter, Python.
HuggingFace
- Sharing datasets. Guides on how to share my own HF datasets.
Nvidia Containers
- Mapping between Nvidia container version, CUDA Toolkit version, and PyTorch version: link here
- How to find the appropriate container to download? Use nvidia-smi to find the driver version, then find the latest Nvidia container that still supports that driver version.
- How to know whether a container is supported by the driver? Check forward compatibility here.