LLM

✅ CHAPTER 1 — THEORY ANSWERS (All 26)

1. Language AI

Language AI is a branch of AI that focuses on enabling machines to understand, process, and generate human language using computational methods.

2. Large Language Model (LLM)

An LLM is a deep neural network trained on massive text data to perform tasks like generation, translation, summarization, and reasoning.

3. NLP

Natural Language Processing is the field that handles computational techniques to understand and manipulate human language.

4. Embeddings

Embeddings are dense numerical vectors that represent words or tokens based on meaning and context.

5. Applications of LLMs

Text generation, chatbots, translation, summarization, sentiment analysis, question answering, and code generation.


---

6. Bag-of-Words

A text representation where each sentence/document is converted into a vector of word frequencies, ignoring order and grammar.

7. Limitations of BoW

Ignores word order and meaning, creates sparse vectors, and cannot handle synonyms or context.
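
A minimal scikit-learn sketch of the Bag-of-Words idea (the two example sentences are arbitrary):

```python
# Bag-of-Words with scikit-learn: word counts per document, order ignored.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the rug"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term count matrix
print(vectorizer.get_feature_names_out())    # learned vocabulary
print(X.toarray())                           # one frequency vector per document
```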

8. Dense Embeddings

Compact vectors capturing the semantic meaning of words, unlike sparse one-hot vectors.

9. BoW vs Word2Vec

BoW counts word frequency; Word2Vec learns semantic meaning by predicting word contexts.

10. How Word2Vec captures meaning

By learning which words appear in similar contexts, causing similar words to have similar vectors.
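
A minimal gensim sketch of the same idea; the toy corpus below is only illustrative (real training needs far more text):

```python
# Word2Vec learns vectors by predicting words from their surrounding contexts.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["cat"][:5])                 # a dense 50-dim vector (first 5 values)
print(model.wv.similarity("cat", "dog"))   # words in similar contexts score higher
```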


---

11. Neural Network

A multi-layer computational model where neurons process inputs and learn patterns by adjusting weights.

12. Hidden Layers

Intermediate layers that transform inputs into deeper features useful for prediction.

13. RNNs

Recurrent Neural Networks process sequences by carrying information from previous steps via hidden states.

14. Limitations of RNNs

Slow training, cannot parallelize, and struggle with long-range dependencies (vanishing gradients).


---

15. Attention Mechanism

Attention decides which words in a sequence are most relevant to each other during processing.

16. Self-Attention

Each token attends to every other token in the same sequence to gather contextual information.

17. Self-Attention vs Traditional Attention

Traditional attention works between encoder–decoder; self-attention works within the same sequence.

18. Problem solved by Attention

Long-range dependency and context understanding across entire sentences.

19. Why Transformers are faster

They remove recurrence and allow parallel processing of all tokens at once.

20. Transformer Architecture

Consists of stacked encoder and/or decoder layers containing attention and feed-forward networks.


---

21. Encoder-only model

Uses only the encoder block; used for classification and representation tasks (e.g., BERT).

22. Decoder-only model

Uses only the decoder block; used for text generation (e.g., GPT).

23. Encoder–Decoder model

Separate encoder and decoder; used for translation/summarization (e.g., T5).

24. Use of BERT

Used for classification, embeddings, and understanding tasks (not generation).

25. Why GPT is autoregressive

It generates one token at a time using previous outputs as inputs.

26. Context Window

Maximum number of tokens the model can process at once.


---

✅ CHAPTER 2 — THEORY ANSWERS (All 28)

27. Tokenization

The process of splitting text into smaller units (tokens) for model processing.

28. Token

A basic unit of text such as a word, subword, character, or byte.

29. Why token IDs?

LLMs only understand numbers; token IDs convert text into numerical form.
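
A minimal sketch with the Hugging Face transformers library, assuming the bert-base-uncased tokenizer is available:

```python
# Text -> subword tokens -> integer token IDs.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("Tokenization is fun"))   # subword tokens as strings
print(tok.encode("Tokenization is fun"))     # the numeric IDs the model actually sees
```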

30. Tokens vs Words

Tokens may be full words or subwords; words are natural language units.


---

31. Word-level tokenization

Splits text on spaces into whole words.

32. One advantage + disadvantage

Advantage: simple and intuitive.
Disadvantage: out-of-vocabulary (OOV) words cannot be handled.

33. Subword tokenization

Breaks words into meaningful parts like prefixes, stems, and suffixes.

34. Why LLMs prefer subwords?

Solves OOV, keeps vocabulary small, handles rare words smoothly.

35. Character-level tokenization

Splits text into individual characters; no OOV problem.

36. Byte-level tokenization

Represents text as raw bytes, allowing universal multilingual coverage.


---

37. BPE

Byte Pair Encoding merges most frequent adjacent character pairs to form subwords.

38. BPE merge step

Find the highest-frequency pair and replace it with a new token.

39. Advantages of BPE

Efficient vocabulary, handles rare words, deterministic.

40. BPE Example

(The full worked example appears in the numericals; in brief, BPE starts from individual characters and iteratively merges the most frequent adjacent pair, as in the toy sketch below.)
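
A toy merge loop over a tiny hypothetical corpus of word frequencies; it only illustrates the mechanics of repeatedly merging the most frequent adjacent pair:

```python
# Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}

def best_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):                      # three merge steps
    pair = best_pair(corpus)
    corpus = merge(corpus, pair)
    print("merged", pair, "->", corpus)
```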


---

41. WordPiece

A subword tokenizer that selects merges based on likelihood scoring.

42. WordPiece Score Formula

Score = freq(pair) / (freq(first) × freq(second))
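
For example, with hypothetical counts freq("hu", "g") = 15, freq("hu") = 20, and freq("g") = 30, the score is 15 / (20 × 30) = 0.025; the pair with the highest score is merged first.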

43. Difference BPE vs WordPiece

BPE merges by frequency; WordPiece merges by probability score.

44. WordPiece Tokenization Steps

Match longest substring in vocabulary → emit token → continue.
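
A minimal sketch of this longest-match-first inference step, using a tiny hypothetical vocabulary (the training-time likelihood scoring is a separate step):

```python
# WordPiece inference: greedily take the longest vocabulary match, with "##"
# marking pieces that continue a word.
def wordpiece_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece          # continuation marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]                  # nothing in the vocabulary fits
        tokens.append(match)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "##ly"}
print(wordpiece_tokenize("unaffable", vocab))   # ['un', '##aff', '##able']
```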


---

45. SentencePiece

A tokenizer that trains directly on raw text without whitespace pre-splitting.

46. BPE vs SentencePiece

SentencePiece doesn’t need pre-tokenized text; BPE does.

47. Unigram LM

A probabilistic model with a large vocabulary that gets pruned to maximize likelihood.

48. Viterbi in Unigram

Finds the most probable segmentation of a word.

49. EM Algorithm Use

Updates subword probability distribution by maximizing corpus likelihood.


---

50. Embedding vector

A dense numerical representation capturing semantic properties of a token.

51. Embedding matrix

A V×d matrix, where V is the vocabulary size and d the embedding dimension, in which each row is the embedding for one token.
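
A minimal PyTorch sketch of such a lookup table (the sizes V = 10 and d = 4 are arbitrary):

```python
# An embedding matrix is just a V x d table indexed by token ID.
import torch

emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
token_ids = torch.tensor([3, 7, 3])
print(emb(token_ids).shape)   # torch.Size([3, 4]): one d-dimensional row per ID
```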

52. Static vs Contextual embeddings

Static: one vector per word.
Contextual: vector changes with sentence context.

53. Meaning example

“Bank” in “river bank” differs from “bank” in “money bank” because context modifies embeddings.


✅ CHAPTER 3 — THEORY ANSWERS (All Questions Fully Answered)


---

54. What is a forward pass?

A forward pass is the process where input token embeddings flow through all transformer layers and the LM head to produce the next-token probability distribution.


---

55. Explain autoregressive generation.

In autoregressive generation, the model predicts one token at a time and feeds the generated token back into the input to predict the next token.


---

56. What is the LM head?

The LM head is the final linear layer that converts the transformer's output vectors into a probability distribution over the entire vocabulary.


---

57. Why does GPT add the newly generated token back to input?

Because GPT is autoregressive; each new prediction depends on all previously generated tokens, so the model needs the updated sequence to continue generating.


---

58. Components of a Transformer block.

A transformer block consists of:

1. A self-attention layer
2. A feed-forward network (FFN)

Both are wrapped with residual connections and layer normalization.




---

59. Role of Feed Forward Network (FFN).

The FFN expands and transforms the representations learned by attention, providing non-linear processing and most of the model’s computational capacity.


---

60. What are Query, Key, and Value (Q-K-V)?

Q, K, and V are learned projections of input tokens:

Query: what the token is searching for

Key: what information the token offers

Value: the information passed to the next layer



---

61. Why do we apply Softmax in attention?

Softmax normalizes attention scores into probabilities so the model can assign meaningful weights to different tokens.


---

62. Why scale dot product by √d?

Scaling by √d prevents extremely large dot-products when dimensionality is high, stabilizing gradients and improving training.


---

63. What is an attention score?

It is the similarity between a token’s query and another token’s key, computed using a dot-product and used to weight the value vectors.


---

64. Write the attention formula.

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
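
A minimal NumPy sketch of this formula (random Q, K, V for four tokens, purely illustrative):

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of every query with every key
    weights = softmax(scores)         # each row sums to 1: attention weights
    return weights @ V                # weighted sum of value vectors

Q, K, V = (np.random.randn(4, 8) for _ in range(3))   # 4 tokens, d_k = 8
print(attention(Q, K, V).shape)                       # (4, 8)
```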


---

65. How does attention capture long-range dependencies?

Attention allows tokens to directly attend to any other token in the sequence regardless of distance, avoiding the step-by-step limitations of RNNs.


---

66. Real example where attention helps.

In “The dog chased the squirrel because it was fast,” attention helps identify whether “it” refers to "dog" or "squirrel" by examining context tokens.


---

67. Why are Transformers parallelizable?

Because they process all tokens simultaneously using self-attention instead of sequentially like RNNs, enabling parallel computation.


---

68. What is context length?

It is the maximum number of tokens the model can read at once. Longer context allows LLMs to handle bigger documents.


---

69. What is KV caching?

KV caching stores the previously computed Key and Value matrices so the model doesn’t recompute them for every new generated token.


---

70. Why does caching speed up generation?

By reusing past K/V states, the model only computes attention for the newest token instead of recomputing for all past tokens, drastically reducing time.
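
A minimal NumPy sketch of the caching idea (toy projections, not a real model): each step projects only the newest token and appends its K and V to the cache instead of recomputing the whole sequence.

```python
# Toy KV cache: per step, compute Q/K/V for the new token only and reuse old K/V.
import numpy as np

d = 16
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
cache_K, cache_V = [], []

def generate_step(new_token_emb):
    q = new_token_emb @ Wq
    cache_K.append(new_token_emb @ Wk)   # K/V computed once per token, then cached
    cache_V.append(new_token_emb @ Wv)
    K, V = np.stack(cache_K), np.stack(cache_V)
    scores = q @ K.T / np.sqrt(d)        # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # context vector for the newest token

for _ in range(5):                       # five decoding steps
    context = generate_step(np.random.randn(d))
```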


---

71. Why are positional embeddings needed?

Transformers treat input tokens as unordered, so positional embeddings provide information about each token’s position in the sequence.


---

72. What is Rotary Positional Embedding (RoPE)?

RoPE encodes position by rotating token vectors with trigonometric functions, allowing the model to learn relative positions.


---

73. How does RoPE encode relative positions?

RoPE rotates Q and K vectors based on position indices such that the angle between rotated vectors represents relative position differences.
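
A minimal NumPy sketch of the rotation idea, pairing the first and second halves of the vector (real implementations differ in details such as the pairing layout):

```python
# RoPE sketch: rotate each 2-D slice of the vector by an angle proportional to
# its position; dot products of rotated Q and K then depend only on the offset.
import numpy as np

def rope_rotate(x, pos, base=10000):
    d = x.shape[-1]
    half = d // 2
    theta = base ** (-2.0 * np.arange(half) / d)   # per-pair rotation frequencies
    ang = pos * theta
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)])

q, k = np.random.randn(8), np.random.randn(8)
print(np.dot(rope_rotate(q, 5), rope_rotate(k, 3)))   # offset = 2
print(np.dot(rope_rotate(q, 7), rope_rotate(k, 5)))   # offset = 2 -> same value
```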


---


✅ CHAPTER 4 — TEXT CLASSIFICATION (All Answers, 2-mark style)

────────────────────────

74. Define text classification.

Text classification is the process of assigning predefined labels to text based on content, such as sentiment, topic, intent, or language.


---

75. Two applications of text classification.

Sentiment analysis of reviews and spam detection in emails; other examples include intent detection in chatbots and language identification.


---

76. What is a classifier head?

A classifier head is a final layer added on top of a model (like BERT) that outputs probabilities over classes for a classification task.


---

77. What is zero-shot classification?

Zero-shot classification assigns labels without any training data by comparing text embeddings with label-description embeddings.
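
A minimal sketch with the transformers zero-shot pipeline, assuming the facebook/bart-large-mnli checkpoint (any NLI model works similarly); the review text and labels are arbitrary:

```python
# Zero-shot classification: no task-specific training data, just candidate labels.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The battery dies after an hour of use.",
    candidate_labels=["battery life", "screen quality", "price"],
)
print(result["labels"][0])   # highest-scoring label
```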


---

78. What is a task-specific model?

A pre-trained language model fine-tuned on a particular downstream task (e.g., sentiment analysis). Example: a BERT model fine-tuned on the SST-2 sentiment dataset.


---

79. What is an embedding model?

A model that outputs dense semantic vectors for text, which can be used as features for classification using ML algorithms.


---

80. Difference: task-specific vs embedding model.

Task-specific models directly output class labels; embedding models produce vectors that require another classifier (e.g., Logistic Regression).


---

81. How generative models perform classification?

By prompting them with questions like “Is this positive or negative?” and interpreting the generated text (e.g., “positive”) as the label.
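
A minimal sketch of this prompting approach, assuming the google/flan-t5-small checkpoint (the prompt wording is arbitrary):

```python
# Classification by prompting a generative model and reading its answer.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-small")
prompt = ("Is the following review positive or negative?\n"
          "Review: I loved every minute of it.\nAnswer:")
print(generator(prompt)[0]["generated_text"])   # expected to say "positive"
```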


---

82. Precision

Precision = proportion of predicted positives that are actually correct.

Precision = TP / (TP + FP)


---

83. Recall

Recall = proportion of actual positives correctly identified.

Recall = TP / (TP + FN)


---

84. Accuracy

Accuracy = proportion of all predictions that the model got correct.

Accuracy = (TP + TN) / Total


---

85. F1 Score

F1 is the harmonic mean of precision and recall.

F1 = (2 × P × R) / (P + R)


---

86. Confusion matrix

A table showing TP, FP, TN, FN used to evaluate classification performance.
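
A minimal scikit-learn sketch computing all of these metrics on toy labels:

```python
# Precision, recall, accuracy, F1 and the confusion matrix from scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
```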


---

87. Why F1 better than accuracy?

F1 handles class imbalance by balancing precision and recall, whereas accuracy can be misleading with skewed datasets.


---

88. Steps to train a classifier using embeddings.

1. Convert the text into embeddings using an embedding model.
2. Train a classifier (Logistic Regression, SVM, Random Forest) on the embedding vectors, as sketched below.
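
A minimal sketch of these two steps, assuming the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint; the four labelled reviews are toy data:

```python
# Step 1: text -> embeddings.  Step 2: train a lightweight classifier on them.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

texts  = ["great movie", "awful plot", "loved it", "boring and slow"]
labels = [1, 0, 1, 0]                       # 1 = positive, 0 = negative

embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(texts)                  # dense embedding vectors
clf = LogisticRegression().fit(X, labels)   # any classic ML classifier works here
print(clf.predict(embedder.encode(["what a fantastic film"])))
```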




---

89. Cosine similarity

A measure of similarity between two vectors based on angle, ranging from –1 to 1.


---

90. Cosine vs Euclidean distance

Cosine measures direction similarity; Euclidean measures magnitude/distance in space.


---

91. k-NN classification

k-NN assigns the label of the majority of the k closest neighbors based on a distance metric (e.g., Euclidean).
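
A minimal scikit-learn sketch on toy 2-D points:

```python
# k-NN: the predicted label is the majority label among the k nearest points.
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [0, 1], [5, 5], [6, 5]]
y = ["A", "A", "B", "B"]
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[5, 6]]))   # two of the three nearest neighbors are "B"
```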


---

92. Small cosine similarity explanation

Compute dot product → compute magnitudes → divide. Higher value means more similar. (Theory only.)


---

93. Euclidean distance explanation

Distance between two points found by square root of sum of squared coordinate differences.
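
A minimal NumPy sketch of both measures on two toy vectors that point in the same direction but have different lengths:

```python
# Cosine similarity ignores magnitude; Euclidean distance does not.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
print(cosine)      # 1.0  (identical direction)
print(euclidean)   # ~3.74 (they are still far apart in space)
```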


---

────────────────────────

✅ CHAPTER 5 — TEXT CLUSTERING & TOPIC MODELING (All Answers, 2-mark style)

────────────────────────

94. Define text clustering.

Grouping similar documents based on semantic similarity without using labels.


---

95. Why use embeddings for clustering?

Embeddings capture semantic meaning of text, allowing similar documents to be close in vector space.


---

96. Why dimensionality reduction?

To reduce noise, simplify high-dimensional vectors, and improve cluster separation.


---

97. PCA vs UMAP

PCA preserves global variance linearly.

UMAP preserves both local and global structure with non-linear mapping.



---

98. What is HDBSCAN?

A density-based clustering algorithm that automatically determines the number of clusters and identifies outliers.


---

99. Why HDBSCAN better than k-means?

Does not require pre-setting number of clusters, and can mark points as noise instead of forcing membership.
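
A minimal sketch of the embed → reduce → cluster pipeline, assuming the sentence-transformers, umap-learn, and hdbscan packages; the five documents are toy data, so the exact clusters are not meaningful:

```python
# Embeddings -> UMAP dimensionality reduction -> HDBSCAN density clustering.
from sentence_transformers import SentenceTransformer
import umap
import hdbscan

docs = ["cheap flights to paris", "my flight was delayed again",
        "best pasta recipe", "how to cook risotto",
        "train tickets to rome"]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
reduced = umap.UMAP(n_components=2, n_neighbors=2).fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(reduced)
print(labels)   # cluster IDs; -1 marks outliers/noise
```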


---

100. What is a topic?

A coherent theme represented by a set of keywords that appear frequently together in documents.


---

101. Define topic modeling.

An unsupervised method to discover hidden thematic structures within large text collections.


---

102. Text clustering vs topic modeling

Clustering groups documents; topic modeling groups words into themes.


---

103. What is LDA?

Latent Dirichlet Allocation is a probabilistic model where each document is represented as a mixture of topics, and each topic as a mixture of words.


---

104. Steps of BERTopic

1. Embed documents → reduce dimensions → cluster (HDBSCAN).
2. Extract topic keywords using c-TF-IDF, as sketched below.
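
A minimal BERTopic sketch, assuming the bertopic package and using the 20 Newsgroups corpus from scikit-learn as example data (downloading the data and fitting the model takes a while):

```python
# BERTopic: embed -> reduce -> cluster -> c-TF-IDF keywords, in one call.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic(0))             # top c-TF-IDF keywords for topic 0
print(topic_model.get_topic_info().head())  # overview of all discovered topics
```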




---

105. What is c-TF-IDF?

A class-based TF-IDF variant that computes term importance within each cluster instead of per document.


---

106. Why c-TF-IDF instead of TF-IDF?

Because it captures words representative of an entire cluster, not individual documents.


---

107. How topic keywords extracted in BERTopic?

By computing c-TF-IDF scores for each word in the cluster and selecting the highest-scoring words.


---

108. What is modularity in BERTopic?

Each step (embedding, reduction, clustering, topic extraction) can be replaced independently for flexibility.


---

109. How BERTopic uses clustering + BoW?

BERTopic clusters documents using embeddings, then uses bag-of-words counts within each cluster to extract topics.
