LLM Study Diary #2: Tokenization
Background

I did some research online and found a nice course that teaches how to build an LLM from scratch. The course is shared publicly online, and all the assignment resources are here: In the following series, I will post summaries and notes, starting from lesson 1.

Tokenization

Tokenization is the very first step of an LLM pipeline. There are many different tokenization algorithms, such as character-based tokenization, byte-based tokenization, word-based tokenization, and Byte Pair Encoding (BPE). Ch...
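
To make the difference between the simpler schemes concrete, here is a minimal sketch of character-based, byte-based, and word-based tokenization. It is not the course's code: the function names are hypothetical, and the word splitter uses a simple regex chosen for illustration.

```python
import re


def char_tokenize(text: str) -> list[str]:
    # Character-based: one token per Unicode character.
    return list(text)


def byte_tokenize(text: str) -> list[int]:
    # Byte-based: one token per UTF-8 byte, so every token is in 0..255.
    return list(text.encode("utf-8"))


def word_tokenize(text: str) -> list[str]:
    # Word-based (assumed rule): runs of word characters become one token,
    # and each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    sample = "Hello, world!"
    print(char_tokenize(sample))
    print(byte_tokenize(sample))
    print(word_tokenize(sample))
```

Character and byte tokenization need only a tiny vocabulary but produce long sequences, while word tokenization gives short sequences but cannot represent words it has never seen. BPE sits between the two by starting from small units and merging frequent pairs.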