ML Fundamentals for AI Architects

How to Use This Document

AI Architect interviews test ML knowledge at the level of an engineer who can make informed deployment and design decisions — not at the level of a researcher implementing novel training algorithms. The questions here cover what an AI Architect must understand to reason about model selection, evaluation, failure modes, and trade-offs.

The depth expected: you should be able to explain why things work, not just that they work.

Section 1 — Transformers and LLMs

Q1: Explain how attention works and why it matters for long-context understanding.

Category: ML Fundamentals
Difficulty: Senior
Role: AI Architect / ML Engineer

Answer:

Attention is the mechanism that allows a Transformer to weigh the relevance of every token in the input when computing the representation of any given token. For a token at position i, attention computes a weighted sum over all token positions (including i itself; in decoder-only models, only positions up to i), where the weight reflects how relevant each token is to token i.

Mechanism: Given an input sequence, the Transformer projects each token into three vectors: Query (Q), Key (K), and Value (V). The attention score between token i and token j is:

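attention(i, j) = softmax_j(Q_i · K_j / sqrt(d_k))

Here the softmax is taken over j, so the weights for a given token i sum to 1, and d_k is the dimension of the Key vectors. The output for token i is the weighted sum of all Value vectors: output_i = sum over j of attention(i, j) * V_j.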
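The formula above can be sketched in a few lines of plain Python. This is a minimal single-head illustration without masking, batching, or learned projections; the toy Q, K, and V values and the function names are assumptions made for the example, not taken from any particular library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, without masking.

    queries, keys: lists of d_k-dimensional vectors, one per token.
    values: list of d_v-dimensional vectors, one per token.
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Score token i against every token j, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted sum of the Value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, values)) for c in range(d_v)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy example: three tokens with d_k = d_v = 2 (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
```

Each row of weights sums to 1, and every token's output can draw on any other token's Value vector in a single step, which is the property the long-context argument relies on.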

Why this matters for long-context understanding: Unlike RNNs, which process sequences left-to-right and must compress all prior context into a fixed-size hidden state (which degrades over long sequences), attention can directly connect any two tokens regardless of their distance in the sequence. A Transformer reading a 128K-token clinical document can directly attend from a conclusion at position 128,000 back to a test result at position 1,000 without the information being compressed or lost through 127,000 intermediate steps.

The scaling challenge: Attention is O(n²) in sequence length because it computes a score for every pair of tokens. A 128K context window requires 128,000² ≈ 16.4 billion attention scores per head, per layer. FlashAttention and other efficient implementations avoid materializing the full n × n score matrix in GPU memory, which cuts memory traffic and makes long contexts tractable, but the compute cost stays quadratic, which is one reason long-context inference is expensive.
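The quadratic growth is easy to check with arithmetic. The sketch below counts pairwise scores for one head in one layer and estimates the size of the score matrix if it were stored naively; the 2-bytes-per-score default assumes FP16 and is an illustrative assumption.

```python
def attention_score_count(seq_len: int) -> int:
    """Number of pairwise attention scores for one head in one layer."""
    return seq_len * seq_len

def score_matrix_bytes(seq_len: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to store the full n x n score matrix (FP16 assumed by default)."""
    return attention_score_count(seq_len) * bytes_per_score

scores_128k = attention_score_count(128_000)   # 16,384,000,000 scores
naive_bytes = score_matrix_bytes(128_000)      # about 32.8 GB if materialized
growth = attention_score_count(256_000) / scores_128k  # doubling context gives 4x
```

Doubling the context length quadruples the number of scores, which is why context length should be treated as a budgeted resource.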

Practical implication for architects: Every token in the context window costs money. Long-context requests (RAG with large retrieved documents, long conversation history) are expensive because they are billed per token. Design systems to control context window usage; don't just throw everything into the context.

Q2: What is the difference between instruction fine-tuning, RLHF, and RAG? When would you use each?

Category: ML Fundamentals
Difficulty: Senior
Role: AI Architect

Answer:

These are fundamentally different techniques addressing different problems.

Instruction fine-tuning (SFT — Supervised Fine-Tuning): Adjusts the model's weights by training on examples of (instruction, response) pairs. Changes what the model "knows how to do" — its behavioral style and task-specific format adherence. Use when: you need the model to consistently follow a specific output format, adopt domain-specific terminology, or perform a task type not well-covered by the base model's pretraining. Example: fine-tuning on clinical notes to improve SOAP note format adherence.

RLHF (Reinforcement Learning from Human Feedback): Trains a reward model from human preference data (which of two responses is better?) and then fine-tunes the LLM to maximize the reward model's score via RL (typically PPO). Changes the model's preferences, making it more helpful, harmless, and honest. RLHF, usually applied after an SFT stage, is a large part of what turns a pretrained base model into an instruction-following assistant model. As an AI Architect, you almost never implement RLHF; you consume models that have been trained with it (Claude, GPT-4). Use RLHF when: you control the model training pipeline and want to instill specific preferences at scale.

RAG (Retrieval-Augmented Generation): Does not change the model at all. Instead, retrieves relevant information at inference time and injects it into the prompt. Addresses the knowledge limitation problem — the model's weights contain knowledge up to the training cutoff, but RAG provides current, organization-specific knowledge. Use when: the knowledge domain is frequently updated (clinical guidelines, formulary), the knowledge is organization-specific (internal policies), or hallucination on specific facts must be addressed with citations.

Decision framework:

Need	Technique
Model doesn't follow the right output format	SFT
Model doesn't know about recent events or internal docs	RAG
Model behavior is too verbose, unsafe, or unhelpful	RLHF (or use a model already trained with it)
Model doesn't use domain-specific terminology correctly	SFT or domain-specific embedding model
Reduce hallucination on factual queries	RAG (citations)

Key insight for architects: RAG is almost always the right first answer for knowledge gaps because it requires no training, is immediately updatable, and provides citations. SFT is additive — use it on top of RAG when format adherence or domain vocabulary is still inadequate after RAG.

Q3: What is temperature and how does it affect LLM outputs? What setting would you use for clinical AI?

Category: ML Fundamentals
Difficulty: Mid-level
Role: ML Engineer / AI Architect

Answer:

Temperature controls the randomness of the LLM's token sampling. At each generation step, the model produces a probability distribution over all vocabulary tokens. Temperature is a scaling factor applied before softmax:

adjusted_logit_i = logit_i / temperature

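Temperature = 0: The model always selects the highest-probability token (greedy decoding; implementations special-case this, since dividing by zero is undefined). Outputs are deterministic in principle, though hosted APIs can still show small run-to-run differences.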
Temperature = 1: The model samples from the raw probability distribution — expected diversity in outputs.
Temperature > 1: The distribution is flattened — lower-probability tokens become more likely. More creative and unpredictable outputs; also higher hallucination risk.
Temperature < 1: The distribution is sharpened — the highest-probability token dominates. Less diversity, more consistent outputs.
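A minimal sketch of temperature scaling, assuming a toy three-token vocabulary with made-up logits. The function names are illustrative; production inference stacks combine this with top-k or top-p filtering, which is omitted here.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy decoding")
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Return a token index; temperature 0 is handled as greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                        # illustrative values only
sharp = softmax_with_temperature(logits, 0.5)   # top token dominates more
raw = softmax_with_temperature(logits, 1.0)     # unmodified distribution
flat = softmax_with_temperature(logits, 2.0)    # probability mass spreads out
greedy = sample_token(logits, 0, random.Random(0))
```

The probability of the top token falls as temperature rises (sharp[0] > raw[0] > flat[0]), which matches the sharpening and flattening described in the list above.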

For clinical AI: Temperature = 0 or very low (0.1 or below) for all clinical output generation.

Reasoning:

Clinical documentation must be reproducible — if a physician sees the AI's draft summary and approves it, re-running the same prompt should not produce a materially different output
Clinical decisions should not vary based on random sampling — two nurses who trigger the same CDS alert on the same patient should receive the same AI recommendation
Hallucination risk increases with temperature — for clinical content, factual accuracy is non-negotiable

Where higher temperature is appropriate:

Brainstorming use cases: "What other diagnoses should we consider?" — moderate creativity is appropriate
Patient communication drafts: slight variation in phrasing is acceptable; determinism less critical

Q4: Explain token count, context window, and how they interact with cost.

Category: ML Fundamentals
Difficulty: Mid-level
Role: ML Engineer / AI Architect

Answer:

Token: The unit of text that LLMs process. Roughly, 1 token ≈ 4 characters of English text ≈ 0.75 words. But tokenization is vocabulary-dependent: medical abbreviations, code, and non-English text may tokenize differently.

Context window: The maximum total number of tokens (input + output) the model can process in a single inference call. As of current models, context windows range from tens of thousands to hundreds of thousands of tokens. A 200K token context window can hold approximately 150,000 words — roughly a book-length document.

Cost interaction: Managed LLM APIs charge per token: separate rates for input tokens and output tokens. Input (prompt) tokens are cheaper than output (generated) tokens at most providers.

cost = (input_tokens × price_per_input_token) + (output_tokens × price_per_output_token)

For architects — what drives cost:

System prompt length: A 2,000-token system prompt is paid on every API call. At 1 million calls/day, that is 2 billion system-prompt tokens per day. Prompt caching, where the provider offers it, bills cached tokens at a reduced rate and can remove most of this cost.
Retrieved context (RAG chunks): Each retrieved chunk is input tokens. 5 chunks × 500 tokens = 2,500 additional input tokens per query.
Conversation history: In multi-turn conversations, all prior turns are re-sent on each turn, so per-turn input grows linearly and the cumulative input cost of the conversation grows roughly quadratically with the number of turns. Context compression (summarizing older turns) controls this.
Output length: max_tokens is a ceiling, not a target. Set it to the appropriate ceiling for your use case. Setting max_tokens=4096 for tasks that generate 200-token responses wastes nothing (output is billed for actual tokens generated, not max_tokens), but setting it too low truncates outputs.
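The cost drivers above can be turned into a rough estimator. The sketch below is a simplified model: the prices are hypothetical placeholders (not any provider's real rates), and each conversation turn is assumed to add a fixed number of tokens to the history.

```python
def request_cost(input_tokens, output_tokens, price_in_per_mtok, price_out_per_mtok):
    """Cost of one API call, with prices expressed per million tokens."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

def conversation_input_tokens(turns, system_tokens, tokens_per_turn):
    """Total input tokens billed across a conversation when history is re-sent.

    Assumes turn t sends the system prompt plus t turns of history, each of
    roughly tokens_per_turn tokens (a simplification for illustration).
    """
    return sum(system_tokens + t * tokens_per_turn for t in range(1, turns + 1))

# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output tokens.
single_call = request_cost(input_tokens=4_500, output_tokens=200,
                           price_in_per_mtok=3.0, price_out_per_mtok=15.0)

twenty_turns = conversation_input_tokens(20, system_tokens=2_000, tokens_per_turn=300)
forty_turns = conversation_input_tokens(40, system_tokens=2_000, tokens_per_turn=300)
```

Under these assumptions, doubling the conversation from 20 to 40 turns roughly triples the total input tokens (103,000 to 326,000), since the history term grows with the square of the turn count.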

Practical measurement: Always log prompt_tokens and completion_tokens from the API response before building cost optimization strategies. Most teams discover their largest token consumer is not what they expected.

Q5: What is embedding dimensionality and how does it affect retrieval quality and performance?

Category: ML Fundamentals
Difficulty: Senior
Role: ML Engineer / AI Architect

Answer:

An embedding is a dense vector representation of a piece of text, where semantic similarity in the original text space corresponds to geometric proximity in the vector space. Dimensionality is the length of this vector (e.g., 384, 768, 1536, 3072 dimensions).

Higher dimensionality:

More expressive: the vector can encode finer-grained semantic distinctions
Better at capturing nuanced meaning in complex domain-specific text
Higher storage cost: a 3072-dimensional vector is 4× larger than a 768-dimensional vector (in float32)
Slower ANN search: more dimensions = more computation per distance calculation
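Curse of dimensionality: as dimensionality increases, distances between random vectors concentrate and the contrast between near and far neighbors shrinks. Learned embeddings are far from random, so the effect is milder in practice, and product quantization or dimensionality reduction keep storage and search cost manageable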

Lower dimensionality:

Faster ANN search
Lower storage and memory footprint
May lose fine-grained semantic distinctions for specialized domains

Model-specific examples (verify current benchmarks against your use case):

Small/fast models (e.g., 384d): appropriate for large-scale, cost-sensitive retrieval where semantic precision is less critical
Standard models (e.g., 768d–1024d): good balance for most enterprise knowledge bases
Large/specialized models (e.g., 1536d–3072d): appropriate for clinical or legal domains where domain-specific semantic precision matters

Matryoshka embeddings: Some modern embedding models support dimensionality truncation — a 1536-dimensional embedding can be truncated to 768 dimensions with minimal quality loss. This allows applications to trade retrieval quality for search speed without re-embedding the entire corpus.
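A minimal sketch of the truncation step, assuming the embedding model was trained to support it (truncating an arbitrary model's vectors generally degrades quality much more). The 4-dimensional toy vectors and function names are illustrative; queries and documents must be truncated to the same length, and the result is re-normalized so cosine similarity behaves as expected.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" truncated to 2 dimensions (illustrative only).
full_doc = [0.5, 0.5, 0.5, 0.5]
full_query = [0.7, 0.1, 0.5, 0.5]
short_doc = truncate_and_normalize(full_doc, 2)
short_query = truncate_and_normalize(full_query, 2)
similarity = cosine(short_query, short_doc)
```

Because truncation only slices stored vectors, the corpus does not need to be re-embedded; only the index has to be rebuilt at the new dimensionality.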

Practical guidance: Evaluate retrieval quality (MRR@5 on your golden query set) across 2–3 embedding models before committing. The difference between models matters more than dimensionality alone. For clinical AI, use a model with strong medical terminology performance (benchmark on clinical queries specifically).

Section 2 — Evaluation and Benchmarking

Q6: What is the difference between ROUGE, BERTScore, and LLM-as-judge evaluation? When do you use each?

Category: ML Fundamentals
Difficulty: Senior
Role: AI Architect / ML Engineer

Answer:

These metrics evaluate the quality of generated text at different levels of semantic sophistication.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Compares n-gram overlap between the generated text and a reference text. ROUGE-1 is unigram overlap; ROUGE-2 is bigram overlap; ROUGE-L is the longest common subsequence.

Use when: you have reference outputs (summaries, translations) and need a cheap, fast metric
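Limitation: does not capture meaning. "The patient is getting better" and "The patient is not getting better" have high ROUGE overlap but opposite meanings, while "The patient is improving" shares only a few words with the first sentence despite meaning the same thing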
Do not use for: open-ended generation where paraphrase quality matters, or as the sole quality metric for clinical text
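A minimal sketch of ROUGE-1 (unigram overlap) using lowercase whitespace tokenization; real ROUGE implementations add stemming and other normalization, so treat the numbers as illustrative. The sentences are made up for the example. It shows that a negated sentence can score higher against a reference than a faithful paraphrase does.

```python
from collections import Counter

def rouge1(candidate, reference):
    """ROUGE-1 precision, recall, and F1 from clipped unigram counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the patient is getting better"
negated = "the patient is not getting better"   # opposite meaning
paraphrase = "the patient is improving"         # same meaning

negated_score = rouge1(negated, reference)
paraphrase_score = rouge1(paraphrase, reference)
```

Here the negated sentence reaches a recall of 1.0 against the reference while the paraphrase scores lower, which is why n-gram overlap cannot serve as the sole quality metric for clinical text.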

BERTScore: Computes similarity using contextual embeddings (typically from a BERT-family model). Instead of n-gram overlap, it computes token-level embedding similarity between generated and reference text. Captures semantic similarity even when exact phrasing differs.

Use when: paraphrase quality matters; the generated text should convey the same meaning as the reference, not use identical words
Limitation: still requires a reference output; does not evaluate factual accuracy or clinical safety

LLM-as-judge: A separate LLM call evaluates the generated output on specific criteria: faithfulness (is every claim in the response supported by the retrieved context?), relevance (does the response address the question?), safety (does the response contain anything clinically dangerous?). The evaluating LLM is given a rubric and produces a structured score.

Use when: you need criterion-specific evaluation; you need to evaluate faithfulness (hallucination detection); reference outputs are unavailable or too costly to create at scale
Limitation: evaluator LLM has its own biases and failure modes; expensive (additional LLM calls per evaluation); requires careful prompt design for the evaluator

Recommendation for clinical RAG: Use all three in a tiered evaluation pipeline:

ROUGE/BERTScore against reference answers (fast, cheap, catches obvious quality drops)
LLM-as-judge for faithfulness (critical for clinical — every claim must be in the retrieved context)
Human clinical expert review on a random sample (the gold standard, used periodically, not on every deployment)

Q7: What is hallucination in LLMs and what are the root causes?

Category: ML Fundamentals
Difficulty: Senior
Role: AI Architect

Answer:

Hallucination is when an LLM generates plausible-sounding but factually incorrect or unsupported text. It is not a bug in the usual sense — it is an emergent property of how LLMs are trained.

Root causes:

1. Autoregressive generation optimizes for fluency, not factual accuracy. LLMs are trained to predict the next token given the preceding context. The training objective rewards generating text that is stylistically consistent with training data — not text that is factually correct. A model that generates a confident-sounding fabricated fact is, from the training signal's perspective, doing exactly what it was trained to do (if the fabrication looks like plausible text).

2. Knowledge compression. The model must compress world knowledge into a fixed number of parameters. Not everything in the training corpus is memorized with equal fidelity. Rare facts, specific numbers (dosages, dates, citations), and low-frequency entities are more susceptible to hallucination because they have fewer training signal occurrences to encode reliably.

3. Distribution shift at inference time. When asked about topics, time periods, or entities outside the training distribution, the model cannot retrieve what it doesn't have — but its training incentives push it toward generating a confident-sounding response rather than saying "I don't know."

Mitigation strategies and their effectiveness:

RAG: Grounds the generation in retrieved context. Reduces closed-domain hallucination significantly; does not eliminate it (model can still misinterpret context)
Citation enforcement: Requiring the model to cite every claim against retrieved chunks enables post-hoc verification
Temperature = 0: Reduces randomness in generation; does not eliminate hallucination from knowledge gaps
Constrained generation: Instructing the model to say "I cannot find this information in the provided context" when the retrieved content is insufficient — effective when the model is well-calibrated to follow such instructions

For clinical AI: The combination of RAG + citation enforcement + output validation is the minimum viable defense against clinical hallucination. None of these is sufficient alone.
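One piece of output validation can be sketched as a citation check. The example below assumes a hypothetical convention in which the model cites chunks as bracketed IDs such as [c1]; the ID format, function name, and placeholder sentences are all assumptions for illustration. It verifies only that citations are present and point at retrieved chunks, not that the cited chunk actually supports the claim, which still needs an LLM-as-judge or human check.

```python
import re

# Hypothetical citation convention: chunk IDs in square brackets, e.g. [c1].
CITATION = re.compile(r"\[(c\d+)\]")

def validate_citations(answer, retrieved_ids):
    """Flag sentences without citations and citations to chunks never retrieved."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = []
    unknown = set()
    for sentence in sentences:
        ids = CITATION.findall(sentence)
        if not ids:
            uncited.append(sentence)
        unknown.update(i for i in ids if i not in retrieved_ids)
    return {
        "uncited_sentences": uncited,
        "unknown_ids": sorted(unknown),
        "ok": not uncited and not unknown,
    }

# Placeholder text, not real clinical content.
answer = "Guideline A applies here [c1]. Guideline B also applies [c4]. No action is needed."
report = validate_citations(answer, retrieved_ids={"c1", "c2"})
```

In this toy run the report flags one uncited sentence and one citation ([c4]) that does not match any retrieved chunk, so the response would be rejected or routed for review.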

Q8: Explain the difference between precision and recall in the context of an AI retrieval system, and why both matter.

Category: ML Fundamentals
Difficulty: Mid-level
Role: ML Engineer

Answer:

In retrieval:

Precision: Of the documents retrieved, what fraction were actually relevant? Precision = relevant_retrieved / total_retrieved
Recall: Of all the relevant documents that exist, what fraction did the system retrieve? Recall = relevant_retrieved / total_relevant

Why both matter and the trade-off: A system can achieve perfect recall by returning every document in the knowledge base — but precision would be near zero. A system can achieve perfect precision by only returning documents it is extremely confident about — but recall would be low (many relevant documents missed).

For RAG in clinical contexts, the recall/precision trade-off has clinical consequences:

Low recall (missing relevant clinical evidence): The AI's answer is based on incomplete evidence. A drug interaction check that misses a relevant contraindication guideline has the same clinical impact as if the guideline didn't exist.
Low precision (retrieving irrelevant documents): The irrelevant chunks dilute the context. The model may generate answers based on misleading context. Prompt context fills up with noise, leaving less room for the genuinely relevant content.

Practical measurement: For a knowledge base with 10,000 clinical guidelines, manually annotate 100 golden queries with their relevant documents. Run retrieval at k=5 and k=10. Measure:

Precision@5: What fraction of the top-5 retrieved documents are relevant?
Recall@5: What fraction of all relevant documents were in the top-5?
MRR (Mean Reciprocal Rank): What is the average reciprocal of the rank at which the first relevant document appears? Higher MRR means relevant documents appear earlier in the ranked list.

For clinical retrieval, prioritize recall for safety-critical information (drug interactions, contraindications) and precision for reference queries (protocol retrieval where context dilution is the main risk).
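The three metrics can be computed with a few small functions. This is a minimal sketch over a hypothetical golden set; the document IDs and query results are made up, and Precision@k here divides by k even when fewer than k documents are returned.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical golden queries with annotated relevant guideline IDs.
runs = [
    (["g7", "g2", "g9", "g4", "g1"], {"g2", "g4", "g8"}),
    (["g3", "g5", "g6", "g1", "g0"], {"g3"}),
]
p5 = precision_at_k(*runs[0], k=5)   # 2 of 5 retrieved are relevant
r5 = recall_at_k(*runs[0], k=5)      # 2 of 3 relevant were retrieved
mrr = mean_reciprocal_rank(runs)     # (1/2 + 1/1) / 2
```

Running the same functions at k=5 and k=10 across the full golden set shows how much recall is bought by each extra retrieved chunk, and at what cost to precision.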

Section 3 — Fine-Tuning

Q9: When does fine-tuning a model make sense and what are the risks of doing it wrong?

Category: ML Fundamentals
Difficulty: Senior
Role: AI Architect

Answer:

Fine-tuning is expensive to do right (data curation, training compute, evaluation) and risky to do wrong (catastrophic forgetting, overfitting, regulatory non-compliance). Most teams fine-tune too early before exhausting what is achievable through better prompting, RAG, and output constraints.

When fine-tuning is justified:

Format adherence is insufficient with prompting: If the model consistently generates outputs in the wrong format for your use case (wrong structure, wrong terminology, wrong length) despite well-engineered prompts, fine-tuning on examples corrects the behavior.

Domain vocabulary gap: General-purpose models trained predominantly on web text may not reliably produce accurate domain vocabulary. Fine-tuning on clinical text improves medical terminology use.

High-volume, cost-sensitive use case: A smaller fine-tuned model can match a larger general-purpose model on a narrow task — at lower inference cost. Fine-tuning to compress capability into a smaller model is a valid cost strategy.

Latency constraint: Smaller fine-tuned models have lower inference latency. If a use case requires sub-second response and a large model cannot meet that SLA, fine-tuning a smaller model is the path.

Risks of doing it wrong:

Catastrophic forgetting: Fine-tuning on a narrow task can cause the model to lose general capabilities. A model fine-tuned only on SOAP notes may become less capable at other tasks. Mitigate: use LoRA/QLoRA (parameter-efficient fine-tuning) which updates a small adapter rather than base weights; evaluate capability on held-out general tasks after fine-tuning.

Data quality drives model quality: A fine-tuned model is only as good as its training data. Low-quality, inconsistent, or biased training data produces a model that faithfully reproduces those flaws at scale. Invest in data curation before training.

Memorization of PHI: If training data contains PHI, the fine-tuned model may memorize and reproduce specific patient data in other contexts. Training on PHI requires: de-identification before training, DP-SGD (differential privacy) in training, memorization audits before deployment.

Evaluation of fine-tuned models requires its own evaluation pipeline: The standard benchmark scores (MMLU, HellaSwag) do not measure domain-specific quality. Build a domain-specific evaluation set before fine-tuning to measure whether the fine-tuned model actually improved on your target task.

Q10: What is LoRA and why is it preferred over full fine-tuning for enterprise AI?

Category: ML Fundamentals
Difficulty: Senior
Role: ML Engineer / AI Architect

Answer:

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that freezes the original model weights and adds small, low-rank decomposition matrices to specific weight matrices in the Transformer. Instead of updating all weights (full fine-tuning), LoRA updates only the adapter parameters — orders of magnitude fewer parameters.

Mechanism: For a weight matrix W (dimension d × k), LoRA adds two matrices A (d × r) and B (r × k) where r << min(d, k). The adapted weight is W + AB. Only A and B are updated during training.
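A small sketch of the parameter arithmetic and the W + AB update, using the same shapes as the paragraph above (A is d × r, B is r × k). The 4096 × 4096 layer size and rank 8 are illustrative assumptions, and the scaling factor that LoRA implementations usually apply to AB is omitted for brevity.

```python
def lora_param_counts(d, k, r):
    """Return (full_params, adapter_params, adapter_fraction) for one weight matrix."""
    full = d * k
    adapter = r * (d + k)   # A is d x r, B is r x k
    return full, adapter, adapter / full

def matmul(a, b):
    """Plain-Python matrix product of a (m x n) and b (n x p)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def apply_lora(w, a, b):
    """Adapted weight W + AB; W stays frozen, only A and B would be trained."""
    ab = matmul(a, b)
    return [[w[i][j] + ab[i][j] for j in range(len(w[0]))] for i in range(len(w))]

# Illustrative layer: 4096 x 4096 projection with rank 8.
full, adapter, fraction = lora_param_counts(4096, 4096, 8)

# Tiny worked example with d = k = 2 and r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]
B = [[3.0, 4.0]]
adapted = apply_lora(W, A, B)   # [[4.0, 4.0], [6.0, 9.0]]
```

For this layer the adapter holds 65,536 parameters against 16,777,216 in the full matrix, about 0.39%, which is where the compute and storage savings come from.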

Enterprise advantages:

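Compute cost: LoRA typically trains on the order of 0.1–1% as many parameters as full fine-tuning. A full fine-tune of a 70B model needs several times the model's size in memory for gradients and optimizer state (the ~5× figure is a rough rule of thumb that depends on optimizer and precision); LoRA keeps that state only for the adapter parameters.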

Storage and serving: The LoRA adapter is tiny (tens to hundreds of MB vs. 70+ GB for the full model). Multiple LoRA adapters for different use cases can be swapped in/out on the same base model without storing multiple full model copies.

Reduces catastrophic forgetting: Since base model weights are frozen, general capabilities are largely preserved, and removing the adapter restores the base model exactly. The adapter learns the task-specific behavior.

Multi-task serving with adapter switching: In vLLM, multiple LoRA adapters can be loaded simultaneously, with adapter selection per request. A single 70B base model can serve 5 different clinical specialties with 5 different LoRA adapters.

When full fine-tuning is preferred: If the task requires deep model behavioral change (not just format/vocabulary adaptation), or if the fine-tuning dataset is very large and compute is available, full fine-tuning can produce somewhat better results. For most enterprise use cases, LoRA quality is typically close to full fine-tuning at the task level; confirm this on your own evaluation set.

Section 4 — Quantization and Inference Efficiency

Q11: What are the main quantization approaches for LLMs and what are the accuracy trade-offs?

Category: ML Fundamentals
Difficulty: Senior
Role: ML Engineer / AI Infrastructure

Answer:

Quantization reduces the numerical precision of model weights (and sometimes activations), reducing memory footprint and increasing throughput at the cost of some accuracy.

FP16 (16-bit floating point): The standard serving precision. 2 bytes per parameter. This is the baseline — essentially no quantization relative to FP32 training precision. For most tasks, FP16 is indistinguishable from FP32.

INT8 (8-bit integer): 1 byte per parameter. Reduces VRAM requirement by 50% vs. FP16. Most models maintain acceptable quality in INT8 because 256 levels per scaling group are usually enough to represent the weight distribution, with outlier values handled separately by some methods. Some accuracy degradation on complex reasoning tasks is possible. Supported by recent NVIDIA GPUs and by tooling such as TensorRT and bitsandbytes.

GPTQ (4-bit, post-training quantization): 0.5 bytes per parameter (approximately). Reduces VRAM to 25% of FP16. Quality degradation is more noticeable — GPTQ is appropriate for use cases where throughput and memory are the primary constraints and moderate quality degradation is acceptable. Not recommended for safety-critical clinical tasks without thorough evaluation.

AWQ (Activation-aware Weight Quantization, 4-bit): Improved 4-bit quantization method that calibrates quantization based on activation magnitudes. Often produces better quality than naive INT4 rounding, and is frequently reported to match or beat GPTQ at the same bit width. A common choice for production 4-bit deployment when available.

GGUF (for CPU/mixed inference): File format used by llama.cpp for CPU and mixed CPU/GPU inference. Supports multiple quantization levels (Q4_K_M, Q5_K_M, Q8_0). Used when GPU VRAM is insufficient for the model and CPU offloading is acceptable.

Decision framework for production:

Fit FP16 in available VRAM? → Use FP16
Need to reduce VRAM by 50%? → INT8 (minimal quality cost)
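Need to fit a 70B model on 2× A100 40GB? → INT8 brings the weights to ~70 GB vs. ~140 GB in FP16, which fits in 80 GB total but leaves little room for KV cache and activations; consider 4-bit if context lengths or batch sizes are large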
Extremely constrained VRAM or cost-sensitive? → AWQ or GPTQ with thorough evaluation on your task

Always: Evaluate quantized model quality on your specific task before deploying. Generic benchmarks do not predict task-specific quality degradation.
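The VRAM figures in the decision framework follow from bytes per parameter. The sketch below estimates weight memory only; it ignores KV cache, activations, and quantization metadata such as scales, and the 20% headroom default is an illustrative assumption rather than a recommended value.

```python
# Approximate bytes per parameter for each weight precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def fits_in_vram(num_params, precision, vram_gb, headroom_fraction=0.2):
    """True if weights fit while reserving a fraction of VRAM for KV cache etc."""
    usable = vram_gb * (1.0 - headroom_fraction)
    return weight_memory_gb(num_params, precision) <= usable

params_70b = 70e9
fp16_gb = weight_memory_gb(params_70b, "fp16")   # about 140 GB
int8_gb = weight_memory_gb(params_70b, "int8")   # about 70 GB
int4_gb = weight_memory_gb(params_70b, "int4")   # about 35 GB

# 2x A100 40GB = 80 GB total, with 20% held back for runtime memory.
int8_fits = fits_in_vram(params_70b, "int8", vram_gb=80)
int4_fits = fits_in_vram(params_70b, "int4", vram_gb=80)
```

With headroom reserved, the INT8 weights do not fit comfortably in 80 GB while the 4-bit weights do, which is the kind of margin worth checking before committing to a precision.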

Section 5 — Safety and Alignment

Q12: What is differential privacy in the context of ML training and when is it required?

Category: ML Fundamentals
Difficulty: Principal
Role: AI Architect

Answer:

Differential privacy (DP) is a mathematical guarantee about how much information a model training algorithm leaks about any individual training example. A DP-trained model satisfies the guarantee that an adversary who can query the model arbitrarily cannot determine, with high confidence, whether any specific individual's data was in the training set.

DP-SGD (Differentially Private Stochastic Gradient Descent): The standard mechanism for DP training. At each training step, gradients are clipped to a maximum norm (controlling the maximum influence of any single example) and Gaussian noise is added to the aggregated gradients before the weight update. The privacy guarantee is parameterized by epsilon (ε): smaller epsilon means stronger privacy (more noise, less accuracy).
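The clip-then-add-noise step can be sketched in plain Python. This is a simplified single update under stated assumptions: per-example gradients are given as lists, the function and parameter names are illustrative, and no privacy accounting is done, so the code does not by itself yield an epsilon. In practice a library such as Opacus handles both the per-example gradients and the accounting.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example, sum, add Gaussian noise, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * max_norm   # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]

weights = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]   # norms 5.0 and 0.5 (illustrative values)

# With the noise turned off, the step reduces to averaging clipped gradients.
no_noise = dp_sgd_step(weights, grads, lr=1.0, max_norm=1.0,
                       noise_multiplier=0.0, rng=random.Random(0))
private = dp_sgd_step(weights, grads, lr=1.0, max_norm=1.0,
                      noise_multiplier=1.1, rng=random.Random(0))
```

Clipping caps the first example's influence at norm 1.0 while leaving the second untouched, and the noise scale is proportional to that cap, which is what makes the per-example contribution bounded.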

When DP training is required for AI Architects:

Fine-tuning on patient clinical notes (de-identified): Even after de-identification, rare events in the dataset may re-identify individuals if a model memorizes verbatim text. DP-SGD with ε ≤ 8 (a value often used in applied healthcare DP work, not a regulatory threshold) provides formal protection against this.

Models deployed for inference against adversarial users: A model fine-tuned on internal documents that is externally accessible may leak internal data if a sophisticated adversary probes it. DP training limits the leakage.

Regulatory or compliance requirements: HIPAA does not mandate DP training, but organizational privacy officers may require it for any model trained on clinical data as a risk mitigation measure.

Trade-offs: DP-SGD requires more training steps to achieve the same loss as non-private training (the added noise reduces gradient signal). Models trained with strong privacy guarantees (ε ≤ 1) show significant quality degradation on complex tasks. ε ≤ 8 is a commonly accepted practical compromise between privacy and utility in healthcare AI research.

Architect's role: Commission DP training from the ML team; validate that the privacy budget (epsilon, delta) meets organizational requirements; ensure memorization audit is performed before deployment.

Summary — Core ML Concepts for Architects

Concept	What to Know	What to Leave to ML Scientists
Attention	How it enables long-context; O(n²) cost implication	Flash Attention implementation details
Temperature	How it affects output diversity; clinical setting = 0	Exact logit scaling math
Tokens	Cost implications; context window management	Subword tokenization algorithms
Embedding dimensionality	Quality vs. performance trade-off; model selection	HNSW index construction parameters
Hallucination	Root causes; mitigation strategies (RAG, citations)	Loss function design for factual grounding
ROUGE / BERTScore / LLM-as-judge	When to use each; clinical evaluation pipeline	ROUGE correlation with human judgment studies
Fine-tuning vs. RAG	Decision framework; fine-tuning risks	LoRA rank selection, learning rate schedules
LoRA	Why enterprise prefers it; adapter serving pattern	SVD decomposition math
Quantization	Which precision for which constraint	Quantization calibration algorithms
Differential privacy	When required; epsilon as privacy budget	DP proof construction
