Q1 CNNs and RNNs don’t use positional embeddings. Why do Transformers use positional embeddings?
Q2 Tell me the basic steps involved in running an inference query on an LLM.
Q3 Explain how the KV cache accelerates LLM inference.
Q4 How does quantization affect inference speed and memory requirements?
Q5 How do you handle the large memory requirements of the KV cache in LLM inference?
Q6 After tokenization, how are tokens converted into embeddings in the Transformer model?
Q7 Explain why subword tokenization is preferred over word-level tokenization in the Transformer model.
Q8 Explain the trade-offs in using a large vocabulary in LLMs.
Q9 Explain how self-attention is computed in the Transformer model, step by step.
Q10 What is the computational complexity of self-attention in the Transformer model?
Q11 How do Transformer models address the vanishing gradient problem?
Q12 What is tokenization, and why is it necessary in LLMs?
Q13 Explain the role of token embeddings in the Transformer model.
Q14 Explain the working of the embedding layer in the Transformer model.
Q15 What is the role of self-attention in the Transformer model, and why is it called “self-attention”?
Q16 What is the purpose of the encoder in a Transformer model?
Q17 What is the purpose of the decoder in a Transformer model?
Q18 How does the encoder-decoder structure work at a high level in the Transformer model?
Q19 What is the purpose of scaling in the self-attention mechanism in the Transformer model?
Q20 Why does the Transformer model use multiple self-attention heads instead of a single self-attention head?
Q21 How are the outputs of multiple heads combined and projected back in multi-head attention in the Transformer model?
Q22 How does masked self-attention differ from regular self-attention, and where is it used in a Transformer?
Q23 Discuss the pros and cons of the self-attention mechanism in the Transformer model.
Q24 What is the purpose of masked self-attention in the Transformer decoder?
Q25 Explain how masking works in masked self-attention in the Transformer.
Q26 Explain why the second attention sublayer in the Transformer decoder is referred to as cross-attention. How does it differ from self-attention in the encoder?
Q27 What is the softmax function, and where is it applied in Transformers?
Q28 What is the purpose of residual (skip) connections in Transformer layers?
Q29 Why is layer normalization used, and where is it applied in Transformers?
Q30 What is cross-entropy loss, and how is it applied during Transformer training?
Q31 Compare Transformers and RNNs in terms of handling long-range dependencies.
Q32 What are the fundamental limitations of the Transformer model?
Q33 How do Transformers address the limitations of CNNs and RNNs?
Q34 How do Transformer models address the vanishing gradient problem?
Q35 What is the purpose of the position-wise feed-forward sublayer?
Q36 Can you briefly explain the difference between LLM training and inference?
Q37 What is latency in LLM inference, and why is it important?
Q38 What is batch inference, and how does it differ from single-query inference?
Q39 How does batching generally help with LLM inference efficiency?
Q40 Explain the trade-offs between batching and latency in LLM serving.
Q41 How can techniques like mixture-of-experts (MoE) optimize inference efficiency?
Q42 Explain the role of decoding strategy in LLM text generation.
Q43 What are the different decoding strategies in LLMs?
Q44 Explain the impact of the decoding strategy on LLM-generated output quality and latency.
Q45 Explain the greedy search decoding strategy and its main drawback.
Q46 How does Beam Search improve upon Greedy Search, and what is the role of the beam width parameter?
Q47 When is a deterministic strategy (like Beam Search) preferable to a stochastic (sampling) strategy? Provide a specific use case.
Q48 Discuss the primary trade-off between the computational cost and the output quality when comparing Greedy Search and Beam Search.
Q49 When you set the temperature to 0.0, which decoding strategy are you using?
Q50 How is Beam Search fundamentally different from a Breadth-First Search (BFS) or Depth-First Search (DFS)?
Q51 Explain the criteria for choosing different decoding strategies.
Q52 Compare deterministic and stochastic decoding methods in LLMs.
Q53 What is the role of the context window during LLM inference?
Q54 Explain the pros and cons of large and small context windows in LLM inference.
Q55 What is the purpose of temperature in LLM inference, and how does it affect the output?
Q56 What is autoregressive generation in the context of LLMs?
Q57 Explain the strengths and limitations of autoregressive text generation in LLMs.
Q58 Explain how diffusion language models (DLMs) differ from Large Language Models (LLMs).
Q59 Do you prefer DLMs or LLMs for latency-sensitive applications?
Q60 Explain the concept of token streaming during inference.
Q61 What is speculative decoding, and when would you use it?
Q62 What are the challenges in performing distributed inference across multiple GPUs?
Q63 How would you design a scalable LLM inference system for real-time applications?
Q64 Explain the role of Flash Attention in reducing memory bottlenecks.
Q65 What is continuous batching, and how does it differ from static batching?
Q66 What is mixed precision, and why is it used during inference?
Q67 Differentiate between online and offline LLM inference deployment scenarios and discuss their respective requirements.
Q68 Explain the throughput vs latency trade-off in LLM inference.
Q69 What are the various bottlenecks in a typical LLM inference pipeline when running on a modern GPU?
Q70 How do you measure LLM inference performance?
Q71 What are the different LLM inference engines available? Which one do you prefer?
Q72 What are the challenges in LLM inference?
Q73 What are the possible options for accelerating LLM inference?
Q74 What is Chain-of-Thought prompting, and when is it useful?
Q75 Explain the reason behind the effectiveness of Chain-of-Thought (CoT) prompting.
Q76 Explain the trade-offs in using CoT prompting.
Q77 What is prompt engineering, and why is it important for LLMs?
Q78 What is the difference between zero-shot and few-shot prompting?
Q79 What are the different approaches for choosing examples for few-shot prompting?
Q80 Why is context length important when designing prompts for LLMs?
Q81 What is a system prompt, and how does it differ from a user prompt?
Q82 What is In-Context Learning (ICL), and how is few-shot prompting related?
Q83 What is self-consistency prompting, and how does it improve reasoning?
Q84 Why is context important in prompt design?
Q85 Describe a strategy for reducing hallucinations via prompt design.
Q86 How would you structure a prompt to ensure the LLM output is in a specific format, like JSON?
Q87 Explain the purpose of ReAct prompting in AI agents.
Q88 What are the different phases in LLM development?
Q89 What are the different types of LLM fine-tuning?
Q90 What role does instruction tuning play in improving an LLM’s usability?
Q91 What role does alignment tuning play in improving an LLM’s usability?
Q92 How do you prevent overfitting during fine-tuning?
Q93 What is catastrophic forgetting, and why is it a concern in fine-tuning?
Q94 What are the strengths and limitations of full fine-tuning?
Q95 Explain how parameter-efficient fine-tuning addresses the limitations of full fine-tuning.
Q96 When might prompt engineering be preferred over task-specific fine-tuning?
Q97 When should you use fine-tuning vs RAG?
Q98 What are the limitations of using RAG over fine-tuning?
Q99 What are the limitations of fine-tuning compared to RAG?
Q100 When should you prefer task-specific fine-tuning over prompt engineering?
Q101 What is LoRA, and how does it work?
Q102 Explain the key ingredient behind the effectiveness of the LoRA technique.
Q103 What is QLoRA, and how does it differ from LoRA?
Q104 When would you use QLoRA instead of standard LoRA?
Q105 How would you handle LLM fine-tuning on consumer hardware with limited GPU memory?
Q106 Explain different preference alignment methods and their trade-offs.
Q107 What is gradient accumulation, and how does it help with fine-tuning large models?
Q108 What are the possible options to speed up LLM fine-tuning?
Q109 Explain the pretraining objective used in LLM pretraining.
Q110 What is the difference between causal language modeling and masked language modeling?
Q111 How do LLMs handle out-of-vocabulary (OOV) words?
Q112 In the context of LLM pretraining, what are scaling laws?
Q113 Explain the concept of the Mixture-of-Experts (MoE) architecture and its role in LLM pretraining.
Q114 What is model parallelism, and how is it used in LLM pretraining?
Q115 What is the significance of self-supervised learning in LLM pretraining?
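For reference while working through the attention questions (Q9, Q19, Q27), here is a minimal NumPy sketch of scaled dot-product self-attention. The shapes, weight matrices, and function names are illustrative assumptions, not something specified by the question set.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices (illustrative).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # dividing by sqrt(d_k) keeps softmax inputs well-conditioned (Q19)
    weights = softmax(scores, axis=-1)   # each row sums to 1 over the keys (Q27)
    return weights @ V                   # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                               # 5 tokens, d_model = 16
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))  # toy projection weights
print(self_attention(X, Wq, Wk, Wv).shape)                 # (5, 8)
```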
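Similarly, a small sketch of how temperature relates to greedy search and stochastic sampling (Q45, Q49, Q52, Q55). The four-token vocabulary and the logit values are made up for illustration.

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    # Temperature 0 collapses to greedy search (argmax); higher values flatten the distribution.
    if temperature == 0.0:
        return int(np.argmax(logits))                 # deterministic: always the most likely token (Q45, Q49)
    rng = rng if rng is not None else np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())             # softmax over temperature-scaled logits
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))       # stochastic sampling (Q52, Q55)

logits = [2.0, 1.5, 0.2, -1.0]                        # hypothetical next-token scores for a 4-token vocabulary
print(sample_next(logits, temperature=0.0))           # always token 0
print(sample_next(logits, temperature=1.2))           # varies from run to run
```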
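And a rough illustration of the low-rank idea behind LoRA (Q101, Q102): a frozen weight W plus a trainable update B @ A of rank r, so only r * (d_in + d_out) parameters are updated. The dimensions, initialisation, and alpha/r scaling shown here are assumptions for the sketch, not a definitive implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16                 # illustrative dimensions and scaling factor

W = rng.normal(size=(d_out, d_in))                    # pretrained weight, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01                 # trainable low-rank factor
B = np.zeros((d_out, r))                              # trainable, zero-initialised so the update starts at zero

def lora_forward(x):
    # Frozen base path plus the scaled low-rank update B @ A; only A and B are trained.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
print(np.allclose(lora_forward(x), W @ x))            # True until B moves away from zero
```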