ML Fundamentals for AI Architects
How to Use This Document
AI Architect interviews test ML knowledge at the level of an engineer who can make informed deployment and design decisions, not at the level of a researcher implementing novel training algorithms. The questions here cover what an AI Architect must understand to reason about model selection, evaluation, failure modes, and trade-offs.
The depth expected: you should be able to explain why things work, not just that they work.
Section 1: Transformers and LLMs
Q1: Explain how attention works and why it matters for long-context understanding.
Category: ML Fundamentals Difficulty: Senior Role: AI Architect / ML Engineer
Answer:
Attention is the mechanism that allows a Transformer to weigh the relevance of other tokens in the input when computing the representation of any given token. For a token at position i, attention computes a weighted sum over the token positions it can see (all positions in a bidirectional encoder; positions up to i in a decoder-only LLM), where each weight reflects how relevant that token is to token i.
Mechanism: Given an input sequence, the Transformer projects each token into three vectors: Query (Q), Key (K), and Value (V). The attention score between token i and token j is:
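$$\text{attention}(i, j) = \operatorname{softmax}_j\left(\frac{Q_i \cdot K_j}{\sqrt{d_k}}\right)$$
Here d_k is the dimensionality of the Key vectors, and the softmax normalizes over all positions j. The output for token i is the weighted sum of all Value vectors: $\sum_j \text{attention}(i, j)\, V_j$.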
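The mechanism above can be sketched in a few lines of plain Python. This is a minimal single-head illustration with no masking, no batching, and no learned projections; the function names and the toy Q, K, V values are made up for the example and are not taken from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention without masking.

    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score token i against every token j, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        row = softmax(scores)
        # The output is the attention-weighted sum of the Value vectors.
        out = [sum(w * v[c] for w, v in zip(row, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(row)
    return outputs, weights

# Toy example: 3 tokens, d_k = 2 (illustrative numbers only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the Value vectors. Production implementations run the same arithmetic as batched matrix multiplications on accelerators.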
Why this matters for long-context understanding: Unlike RNNs, which process sequences left-to-right and must compress all prior context into a fixed-size hidden state (which degrades over long sequences), attention can directly connect any two tokens regardless of their distance in the sequence. A Transformer reading a 128K-token clinical document can directly attend from a conclusion at position 128,000 back to a test result at position 1,000 without the information being compressed or lost through 127,000 intermediate steps.
The scaling challenge: Attention is O(n²) in sequence length, because it computes a score for every pair of tokens. A 128K context window requires 128,000² ≈ 16.4 billion attention scores per attention head per layer. Flash Attention and other efficient attention implementations reduce memory traffic to make this tractable, but the quadratic compute cost remains, which is one reason long-context inference is expensive.
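The quadratic growth is easy to check with arithmetic. The helper below is a hypothetical name used only for illustration; it counts score computations for one head in one layer and ignores causal masking, which roughly halves the count.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of (query, key) scores for one attention head in one layer."""
    return seq_len * seq_len

for n in (1_000, 8_000, 128_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):,} scores per head per layer")
```

Doubling the context length quadruples the number of scores, which is why context size is a first-order cost lever.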
Practical implication for architects: Every token in the context window costs money. Long-context requests (RAG with large retrieved documents, long conversation history) are expensive under per-token pricing. Design systems to control context window usage; don't just throw everything into the context.
Q2: What is the difference between instruction fine-tuning, RLHF, and RAG? When would you use each?
Category: ML Fundamentals Difficulty: Senior Role: AI Architect
Answer:
These are fundamentally different techniques addressing different problems.
Instruction fine-tuning (SFT, Supervised Fine-Tuning): Adjusts the model's weights by training on (instruction, response) pairs. Changes what the model "knows how to do": its behavioral style and task-specific format adherence. Use when: you need the model to consistently follow a specific output format, adopt domain-specific terminology, or perform a task type not well-covered by the base model's pretraining. Example: fine-tuning on clinical notes to improve SOAP note format adherence.
RLHF (Reinforcement Learning from Human Feedback): Trains a reward model from human preference data (which of two responses is better?) and then fine-tunes the LLM to maximize the reward model's score via RL (typically PPO). Changes the model's preferences, making it more helpful, harmless, and honest. RLHF, typically applied after an SFT stage, is a large part of what converts a pretrained base model into an instruction-following assistant model. As an AI Architect, you almost never implement RLHF; you consume models that have been trained with it (Claude, GPT-4). Use RLHF when: you control the model training pipeline and want to instill specific preferences at scale.
RAG (Retrieval-Augmented Generation): Does not change the model at all. Instead, retrieves relevant information at inference time and injects it into the prompt. Addresses the knowledge limitation problem: the model's weights contain knowledge only up to the training cutoff, while RAG supplies current, organization-specific knowledge. Use when: the knowledge domain is frequently updated (clinical guidelines, formulary), the knowledge is organization-specific (internal policies), or hallucination on specific facts must be addressed with citations.
Decision framework:
| Need | Technique |
|---|---|
| Model doesn't follow the right output format | SFT |
| Model doesn't know about recent events or internal docs | RAG |
| Model behavior is too verbose, unsafe, or unhelpful | RLHF (or use a model already trained with it) |
| Model doesn't use domain-specific terminology correctly | SFT or domain-specific embedding model |
| Reduce hallucination on factual queries | RAG (citations) |
Key insight for architects: RAG is almost always the right first answer for knowledge gaps because it requires no training, is immediately updatable, and provides citations. SFT is additive: use it on top of RAG when format adherence or domain vocabulary is still inadequate after RAG.
Q3: What is temperature and how does it affect LLM outputs? What setting would you use for clinical AI?
Category: ML Fundamentals Difficulty: Mid-level Role: ML Engineer / AI Architect
Answer:
Temperature controls the randomness of the LLM's token sampling. At each generation step, the model produces a probability distribution over all vocabulary tokens. Temperature is a scaling factor applied before softmax:
$$\text{adjusted\_logit}_i = \frac{\text{logit}_i}{\text{temperature}}$$
- Temperature = 0: The model always selects the highest-probability token (greedy decoding). Outputs are deterministic in principle, although hosted APIs can still show small run-to-run differences.
- Temperature = 1: The model samples from the unmodified probability distribution, giving the model's natural output diversity.
- Temperature > 1: The distribution is flattened, so lower-probability tokens become more likely. More creative and unpredictable outputs; also higher hallucination risk.
- Temperature < 1: The distribution is sharpened, so the highest-probability token dominates. Less diversity, more consistent outputs.
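The effect of each setting can be seen by applying the scaling to a small set of logits. This is a minimal sketch: the function name and the logit values are invented for illustration, and temperature 0 is treated as greedy decoding by convention because dividing by zero is undefined.

```python
import math

def token_probabilities(logits, temperature):
    """Softmax over logits after dividing each logit by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: all probability mass on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up logits for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in token_probabilities(logits, t)])
```

As the temperature rises, probability mass moves from the top token toward the alternatives; as it falls, the top token absorbs nearly all of it.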
For clinical AI: Temperature = 0 or very low (0.1 or below) for all clinical output generation.
Reasoning:
- Clinical documentation must be reproducible: if a physician sees the AI's draft summary and approves it, re-running the same prompt should not produce a materially different output
- Clinical decisions should not vary based on random sampling: two nurses who trigger the same CDS alert on the same patient should receive the same AI recommendation
- Hallucination risk increases with temperature, and for clinical content factual accuracy is non-negotiable
Where higher temperature is appropriate:
- Brainstorming use cases: "What other diagnoses should we consider?" Moderate creativity is appropriate here
- Patient communication drafts: slight variation in phrasing is acceptable; determinism less critical
Q4: Explain token count, context window, and how they interact with cost.
Category: ML Fundamentals Difficulty: Mid-level Role: ML Engineer / AI Architect
Answer:
Token: The unit of text that LLMs process. Roughly, 1 token ≈ 4 characters of English text ≈ 0.75 words. But tokenization is vocabulary-dependent: medical abbreviations, code, and non-English text may tokenize differently.
Context window: The maximum total number of tokens (input + output) the model can process in a single inference call. In current models, context windows range from tens of thousands to hundreds of thousands of tokens. A 200K token context window can hold approximately 150,000 words, roughly a book-length document.
Cost interaction: Managed LLM APIs charge per token: separate rates for input tokens and output tokens. Input (prompt) tokens are cheaper than output (generated) tokens at most providers.
$$\text{cost} = (\text{input\_tokens} \times \text{price\_per\_input\_token}) + (\text{output\_tokens} \times \text{price\_per\_output\_token})$$
For architects, what drives cost:
- System prompt length: A 2,000-token system prompt is paid for on every API call. At 1 million calls/day, that is 2 billion system-prompt tokens billed per day. Prompt caching, where the provider supports it, can remove most of this cost.
- Retrieved context (RAG chunks): Each retrieved chunk is input tokens. 5 chunks × 500 tokens = 2,500 additional input tokens per query.
- Conversation history: In multi-turn conversations, all prior turns are re-sent on each turn. Input size grows linearly per turn, so the total input billed over a 20-turn conversation grows roughly quadratically. Context compression (summarizing older turns) controls this.
- Output length: max_tokens is a ceiling, not a target. Set it to an appropriate ceiling for your use case. Setting max_tokens=4096 for tasks that generate 200-token responses wastes nothing (output is billed for tokens actually generated, not for max_tokens), but setting it too low truncates outputs.
Practical measurement: Always log prompt_tokens and completion_tokens from the API response before building cost optimization strategies. Most teams discover their largest token consumer is not what they expected.
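The cost drivers above reduce to simple arithmetic. The sketch below uses hypothetical per-token prices and hypothetical function names purely for illustration; substitute your provider's actual rates and the token counts you log. The conversation model assumes every turn adds the same number of tokens and that each call re-sends all turns so far.

```python
def request_cost(input_tokens, output_tokens,
                 price_per_input_token, price_per_output_token):
    """Cost of one API call under separate input and output rates."""
    return (input_tokens * price_per_input_token
            + output_tokens * price_per_output_token)

def conversation_input_tokens(turns, tokens_per_turn, system_tokens=0):
    """Total input tokens billed across a conversation with full history re-sent."""
    total = 0
    for turn in range(1, turns + 1):
        # Call number `turn` sends the system prompt plus all turns so far.
        total += system_tokens + turn * tokens_per_turn
    return total

# Hypothetical prices, for illustration only (not any provider's real rates).
PRICE_IN = 3.0 / 1_000_000
PRICE_OUT = 15.0 / 1_000_000

# One RAG call: 2,000-token system prompt + 5 chunks x 500 tokens + 200-token question.
single_call = request_cost(2_000 + 5 * 500 + 200, 300, PRICE_IN, PRICE_OUT)
daily_system_prompt_tokens = 2_000 * 1_000_000
twenty_turns = conversation_input_tokens(turns=20, tokens_per_turn=200)

print(f"single call: {single_call:.4f}")
print(f"system prompt tokens per day: {daily_system_prompt_tokens:,}")
print(f"input tokens over 20 turns: {twenty_turns:,} (vs. {20 * 200:,} of new content)")
```

Under these assumptions a 20-turn conversation bills about ten times more input tokens than the new content it contains, which is the case for summarizing older turns.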
Q5: What is embedding dimensionality and how does it affect retrieval quality and performance?
Category: ML Fundamentals Difficulty: Senior Role: ML Engineer / AI Architect
Answer:
An embedding is a dense vector representation of a piece of text, where semantic similarity in the original text space corresponds to geometric proximity in the vector space. Dimensionality is the length of this vector (e.g., 384, 768, 1536, 3072 dimensions).
Higher dimensionality:
- More expressive: the vector can encode finer-grained semantic distinctions
- Better at capturing nuanced meaning in complex domain-specific text
- Higher storage cost: a 3072-dimensional vector is 4× larger than a 768-dimensional vector (in float32)
- Slower ANN search: more dimensions = more computation per distance calculation
- Curse of dimensionality: as dimensionality increases, all vectors in the space become approximately equidistant (mitigated in practice by product quantization or dimensionality reduction)
Lower dimensionality:
- Faster ANN search
- Lower storage and memory footprint
- May lose fine-grained semantic distinctions for specialized domains
Model-specific examples (verify current benchmarks against your use case):
- Small/fast models (e.g., 384d): appropriate for large-scale, cost-sensitive retrieval where semantic precision is less critical
- Standard models (e.g., 768d to 1024d): good balance for most enterprise knowledge bases
- Large/specialized models (e.g., 1536d to 3072d): appropriate for clinical or legal domains where domain-specific semantic precision matters
Matryoshka embeddings: Some modern embedding models support dimensionality truncation: a 1536-dimensional embedding can be truncated to 768 dimensions, typically with only a small quality loss. This allows applications to trade retrieval quality for search speed without re-embedding the entire corpus.
Practical guidance: Evaluate retrieval quality (MRR@5 on your golden query set) across 2-3 embedding models before committing. The difference between models matters more than dimensionality alone. For clinical AI, use a model with strong medical terminology performance (benchmark on clinical queries specifically).
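Two of the points above lend themselves to a quick sketch: the storage cost of higher dimensionality, and Matryoshka-style truncation. The helper names and the 10-million-vector corpus are assumptions for illustration. Truncation is only appropriate for models trained to support it; on an ordinary embedding model it generally degrades quality.

```python
import math

def index_size_gb(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in GB (float32 by default), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Storage for a hypothetical corpus of 10 million chunks.
for dims in (384, 768, 1536, 3072):
    print(f"{dims:>5}d: {index_size_gb(10_000_000, dims):.2f} GB")

# Truncating a toy 4-dimensional vector to 2 dimensions.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_embedding(full, 2)
print(short)
```

Re-normalizing after truncation keeps cosine and dot-product similarity comparable across the shortened vectors.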
Section 2: Evaluation and Benchmarking
Q6: What is the difference between ROUGE, BERTScore, and LLM-as-judge evaluation? When do you use each?
Category: ML Fundamentals Difficulty: Senior Role: AI Architect / ML Engineer
Answer:
These metrics evaluate the quality of generated text at different levels of semantic sophistication.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Compares n-gram overlap between the generated text and a reference text. ROUGE-1 is unigram overlap; ROUGE-2 is bigram overlap; ROUGE-L is the longest common subsequence.
- Use when: you have reference outputs (summaries, translations) and need a cheap, fast metric
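- Limitation: does not capture meaning. "The patient is not improving" and "The patient is improving" share almost every word, so ROUGE scores them as near-identical despite their opposite meanings; conversely, a correct paraphrase that uses different words can score near zero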
- Do not use for: open-ended generation where paraphrase quality matters, or as the sole quality metric for clinical text
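A simplified ROUGE-1 makes the meaning-blindness concrete. This sketch uses whitespace tokenization and lowercasing only, with no stemming, so its numbers will differ from full ROUGE implementations; the function name and example sentences are invented for illustration.

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the patient is improving"
negated = "the patient is not improving"       # opposite meaning
paraphrase = "her condition has gotten better"  # same meaning, different words

print("negated:   ", rouge_1(negated, reference))
print("paraphrase:", rouge_1(paraphrase, reference))
```

The negated sentence scores high and the faithful paraphrase scores zero, which is why n-gram overlap alone is unsafe as a quality gate for clinical text.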
BERTScore: Computes similarity using contextual embeddings (typically from a BERT-family model). Instead of n-gram overlap, it computes token-level embedding similarity between generated and reference text. Captures semantic similarity even when exact phrasing differs.
- Use when: paraphrase quality matters; the generated text should convey the same meaning as the reference, not use identical words
- Limitation: still requires a reference output; does not evaluate factual accuracy or clinical safety
LLM-as-judge: A separate LLM call evaluates the generated output on specific criteria: faithfulness (is every claim in the response supported by the retrieved context?), relevance (does the response address the question?), safety (does the response contain anything clinically dangerous?). The evaluating LLM is given a rubric and produces a structured score.
- Use when: you need criterion-specific evaluation; you need to evaluate faithfulness (hallucination detection); reference outputs are unavailable or too costly to create at scale
- Limitation: evaluator LLM has its own biases and failure modes; expensive (additional LLM calls per evaluation); requires careful prompt design for the evaluator
Recommendation for clinical RAG: Use all three in a tiered evaluation pipeline:
- ROUGE/BERTScore against reference answers (fast, cheap, catches obvious quality drops)
- LLM-as-judge for faithfulness (critical for clinical use: every claim must be supported by the retrieved context)
- Human clinical expert review on a random sample (the gold standard, used periodically, not on every deployment)
Q7: What is hallucination in LLMs and what are the root causes?
Category: ML Fundamentals Difficulty: Senior Role: AI Architect
Answer:
Hallucination is when an LLM generates plausible-sounding but factually incorrect or unsupported text. It is not a bug in the usual sense; it is an emergent property of how LLMs are trained.
Root causes:
1. Autoregressive generation optimizes for fluency, not factual accuracy. LLMs are trained to predict the next token given the preceding context. The training objective rewards generating text that is stylistically consistent with training data, not text that is factually correct. A model that generates a confident-sounding fabricated fact is, from the training signal's perspective, doing exactly what it was trained to do (if the fabrication looks like plausible text).
2. Knowledge compression. The model must compress world knowledge into a fixed number of parameters. Not everything in the training corpus is memorized with equal fidelity. Rare facts, specific numbers (dosages, dates, citations), and low-frequency entities are more susceptible to hallucination because they appear too rarely in the training data to be encoded reliably.
3. Distribution shift at inference time. When asked about topics, time periods, or entities outside the training distribution, the model cannot retrieve what it doesn't have, but its training incentives push it toward generating a confident-sounding response rather than saying "I don't know."
Mitigation strategies and their effectiveness:
- RAG: Grounds the generation in retrieved context. Reduces closed-domain hallucination significantly; does not eliminate it (model can still misinterpret context)
- Citation enforcement: Requiring the model to cite every claim against retrieved chunks enables post-hoc verification
- Temperature = 0: Reduces randomness in generation; does not eliminate hallucination from knowledge gaps
- Constrained generation: Instructing the model to say "I cannot find this information in the provided context" when the retrieved content is insufficient. Effective when the model reliably follows such instructions
For clinical AI: The combination of RAG + citation enforcement + output validation is the minimum viable defense against clinical hallucination. None of these is sufficient alone.
Q8: Explain the difference between precision and recall in the context of an AI retrieval system, and why both matter.
Category: ML Fundamentals Difficulty: Mid-level Role: ML Engineer
Answer:
In retrieval:
- Precision: Of the documents retrieved, what fraction were actually relevant? Precision = relevant_retrieved / total_retrieved
- Recall: Of all the relevant documents that exist, what fraction did the system retrieve? Recall = relevant_retrieved / total_relevant
Why both matter and the trade-off: A system can achieve perfect recall by returning every document in the knowledge base, but precision would be near zero. A system can achieve perfect precision by only returning documents it is extremely confident about, but recall would be low (many relevant documents missed).
For RAG in clinical contexts, the recall/precision trade-off has clinical consequences:
- Low recall (missing relevant clinical evidence): The AI's answer is based on incomplete evidence. A drug interaction check that misses a relevant contraindication guideline has the same clinical impact as if the guideline didn't exist.
- Low precision (retrieving irrelevant documents): The irrelevant chunks dilute the context. The model may generate answers based on misleading context. Prompt context fills up with noise, leaving less room for the genuinely relevant content.
Practical measurement: For a knowledge base with 10,000 clinical guidelines, manually annotate 100 golden queries with their relevant documents. Run retrieval at k=5 and k=10. Measure:
- Precision@5: What fraction of the top-5 retrieved documents are relevant?
- Recall@5: What fraction of all relevant documents were in the top-5?
- MRR (Mean Reciprocal Rank): What is the average reciprocal of the rank at which the first relevant document appears? Higher MRR means relevant documents appear earlier in the ranked list.
For clinical retrieval, prioritize recall for safety-critical information (drug interactions, contraindications) and precision for reference queries (protocol retrieval where context dilution is the main risk).
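The three metrics above are short enough to compute directly from a golden query set. The sketch below assumes each golden query is a pair of (ranked retrieved IDs, set of relevant IDs); the document IDs and function names are made up for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant document for each query."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(runs)

# Two hypothetical golden queries: (ranked results, relevant document IDs).
golden = [
    (["g7", "g2", "g9", "g4", "g1"], {"g2", "g4", "g8"}),
    (["g3", "g5", "g6", "g0", "g11"], {"g3"}),
]
retrieved, relevant = golden[0]
print("P@5:", precision_at_k(retrieved, relevant, 5))
print("R@5:", recall_at_k(retrieved, relevant, 5))
print("MRR:", mean_reciprocal_rank(golden))
```

Reporting the metrics separately for safety-critical and reference queries makes the differing recall and precision priorities visible in the numbers.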
Section 3: Fine-Tuning
Q9: When does fine-tuning a model make sense and what are the risks of doing it wrong?
Category: ML Fundamentals Difficulty: Senior Role: AI Architect
Answer:
Fine-tuning is expensive to do right (data curation, training compute, evaluation) and risky to do wrong (catastrophic forgetting, overfitting, regulatory non-compliance). Most teams fine-tune too early, before exhausting what is achievable through better prompting, RAG, and output constraints.
When fine-tuning is justified:
- Format adherence is insufficient with prompting: If the model consistently generates outputs in the wrong format for your use case (wrong structure, wrong terminology, wrong length) despite well-engineered prompts, fine-tuning on examples corrects the behavior.
- Domain vocabulary gap: General-purpose models trained predominantly on web text may not reliably produce accurate domain vocabulary. Fine-tuning on clinical text improves medical terminology use.
- High-volume, cost-sensitive use case: A smaller fine-tuned model can match a larger general-purpose model on a narrow task at lower inference cost. Fine-tuning to compress capability into a smaller model is a valid cost strategy.
- Latency constraint: Smaller fine-tuned models have lower inference latency. If a use case requires sub-second response and a large model cannot meet that SLA, fine-tuning a smaller model is the path.
Risks of doing it wrong:
- Catastrophic forgetting: Fine-tuning on a narrow task can cause the model to lose general capabilities. A model fine-tuned only on SOAP notes may become less capable at other tasks. Mitigate: use LoRA/QLoRA (parameter-efficient fine-tuning) which updates a small adapter rather than base weights; evaluate capability on held-out general tasks after fine-tuning.
- Data quality drives model quality: A fine-tuned model is only as good as its training data. Low-quality, inconsistent, or biased training data produces a model that faithfully reproduces those flaws at scale. Invest in data curation before training.
- Memorization of PHI: If training data contains PHI, the fine-tuned model may memorize and reproduce specific patient data in other contexts. Training on PHI requires: de-identification before training, DP-SGD (differential privacy) in training, memorization audits before deployment.
- Evaluation of fine-tuned models requires its own evaluation pipeline: The standard benchmark scores (MMLU, HellaSwag) do not measure domain-specific quality. Build a domain-specific evaluation set before fine-tuning to measure whether the fine-tuned model actually improved on your target task.
Q10: What is LoRA and why is it preferred over full fine-tuning for enterprise AI?
Category: ML Fundamentals Difficulty: Senior Role: ML Engineer / AI Architect
Answer:
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that freezes the original model weights and adds small, low-rank decomposition matrices to specific weight matrices in the Transformer. Instead of updating all weights (full fine-tuning), LoRA updates only the adapter parameters, which are orders of magnitude fewer.
Mechanism: For a weight matrix W (dimension d × k), LoRA adds two matrices A (d × r) and B (r × k) where r << min(d, k). The adapted weight is W + AB. Only A and B are updated during training.
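The parameter savings follow directly from the shapes: W has d·k entries, while A and B together have r·(d + k). A minimal sketch in plain Python, using the notation above; the 4096 × 4096 matrix and rank 8 are illustrative values and the function names are hypothetical.

```python
def lora_param_counts(d, k, r):
    """Return (full_params, adapter_params) for one d x k weight matrix."""
    return d * k, d * r + r * k

def matmul(a, b):
    """Plain-Python matrix product of a (m x n) and b (n x p)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def adapted_weight(W, A, B):
    """Compute W + AB; in LoRA, W stays frozen and only A and B are trained."""
    AB = matmul(A, B)
    return [[w + delta for w, delta in zip(w_row, d_row)]
            for w_row, d_row in zip(W, AB)]

full, adapter = lora_param_counts(d=4096, k=4096, r=8)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")

# Tiny rank-1 example: W is 2 x 2, A is 2 x 1, B is 1 x 2.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]
B = [[0.5, 0.0]]
print(adapted_weight(W, A, B))
```

Because the update is just an additive AB term, an adapter can be merged into W for serving or kept separate and swapped per request.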
Enterprise advantages:
- Compute cost: LoRA typically trains roughly 0.1-1% as many parameters as full fine-tuning. Full fine-tuning a 70B model needs roughly 5× the model size in additional memory for gradients and optimizer state; LoRA cuts this to a small fraction because only the adapter parameters carry gradients and optimizer state.
- Storage and serving: The LoRA adapter is tiny (tens to hundreds of MB vs. 70+ GB for the full model). Multiple LoRA adapters for different use cases can be swapped in/out on the same base model without storing multiple full model copies.
- Reduces catastrophic forgetting: Since base model weights are frozen, general capabilities are largely preserved. The adapter learns the task-specific behavior.
- Multi-task serving with adapter switching: In vLLM, multiple LoRA adapters can be loaded simultaneously, with adapter selection per request. A single 70B base model can serve 5 different clinical specialties with 5 different LoRA adapters.
When full fine-tuning is preferred: If the task requires deep model behavioral change (not just format/vocabulary adaptation), or if the fine-tuning dataset is very large and compute is available, full fine-tuning produces marginally better results. For most enterprise use cases, LoRA quality is indistinguishable from full fine-tuning at the task level.
Section 4: Quantization and Inference Efficiency
Q11: What are the main quantization approaches for LLMs and what are the accuracy trade-offs?
Category: ML Fundamentals Difficulty: Senior Role: ML Engineer / AI Infrastructure
Answer:
Quantization reduces the numerical precision of model weights (and sometimes activations), reducing memory footprint and increasing throughput at the cost of some accuracy.
FP16 (16-bit floating point): The standard serving precision. 2 bytes per parameter. This is the baseline, essentially no quantization relative to FP32 training precision. For most tasks, FP16 is indistinguishable from FP32.
INT8 (8-bit integer): 1 byte per parameter. Reduces VRAM requirement by 50% vs. FP16. Most models maintain acceptable quality in INT8 because 256 levels, with appropriate scaling, are usually enough to represent the weight distribution. Some accuracy degradation can appear on complex reasoning tasks. Supported by recent NVIDIA GPUs and by tooling such as TensorRT and bitsandbytes.
GPTQ (4-bit, post-training quantization): 0.5 bytes per parameter (approximately). Reduces VRAM to 25% of FP16. Quality degradation is more noticeable than with INT8. GPTQ is appropriate for use cases where throughput and memory are the primary constraints and moderate quality degradation is acceptable. Not recommended for safety-critical clinical tasks without thorough evaluation.
AWQ (Activation-aware Weight Quantization, 4-bit): Improved 4-bit quantization method that calibrates quantization based on activation magnitudes. Often produces better quality than naive INT4 rounding or GPTQ at the same bit width. Preferred for production 4-bit deployment when available.
GGUF (for CPU/mixed inference): File format used by llama.cpp for CPU and mixed CPU/GPU inference. Supports multiple quantization levels (Q4_K_M, Q5_K_M, Q8_0). Used when GPU VRAM is insufficient for the model and CPU offloading is acceptable.
Decision framework for production:
- Fit FP16 in available VRAM? → Use FP16
- Need to reduce VRAM by 50%? → INT8 (minimal quality cost)
- Need to fit a 70B model on 2× A100 40GB? → INT8 brings the weights to ~70 GB (vs. ~140 GB in FP16), which fits in 80 GB but leaves little headroom for the KV cache
- Extremely constrained VRAM or cost-sensitive? → AWQ or GPTQ with thorough evaluation on your task
Always: Evaluate quantized model quality on your specific task before deploying. Generic benchmarks do not predict task-specific quality degradation.
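The memory figures in this section come from multiplying parameter count by bytes per parameter. The sketch below reproduces that arithmetic and adds a naive symmetric per-tensor INT8 round trip to show where quantization error comes from. The function names and sample weights are assumptions for illustration, and the round trip is deliberately simplistic: it is not how GPTQ or AWQ choose their quantized values.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Memory for weights only; KV cache and runtime overhead are extra."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: returns (int values, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [max(-127, min(127, round(w / scale))) for w in weights], scale

def dequantize(quantized, scale):
    """Map INT8 values back to approximate floats."""
    return [v * scale for v in quantized]

for precision in ("fp16", "int8", "int4"):
    print(f"70B @ {precision}: {weight_memory_gb(70e9, precision):.0f} GB")

sample = [0.12, -0.5, 0.33, 0.0]  # made-up weights
q, scale = quantize_int8(sample)
restored = dequantize(q, scale)
print(q, [round(x, 4) for x in restored])
```

Each restored weight is within half a quantization step of the original; whether that error matters depends on the task, which is why task-specific evaluation is the deciding test.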
Section 5: Safety and Alignment
Q12: What is differential privacy in the context of ML training and when is it required?
Category: ML Fundamentals Difficulty: Principal Role: AI Architect
Answer:
Differential privacy (DP) is a mathematical guarantee about how much information a model training algorithm leaks about any individual training example. A DP-trained model satisfies the guarantee that an adversary who can query the model arbitrarily cannot determine, with high confidence, whether any specific individual's data was in the training set.
DP-SGD (Differentially Private Stochastic Gradient Descent): The standard mechanism for DP training. At each training step, per-example gradients are clipped to a maximum norm (controlling the maximum influence of any single example) and Gaussian noise is added to the aggregated gradients before the weight update. The privacy guarantee is parameterized by epsilon (ε): smaller epsilon means stronger privacy (more noise, less accuracy).
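The clip-then-noise step can be sketched in plain Python. This is a minimal illustration of the aggregation only: the function name, gradients, and parameter values are invented, and the sketch does not compute the resulting epsilon, which requires a separate privacy accountant.

```python
import math
import random

def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(grad):
            summed[i] += x * factor
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]

grads = [[3.0, 4.0], [0.3, 0.4]]  # made-up per-example gradients
rng = random.Random(0)            # seeded only to make the demo repeatable
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng))
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

With the noise multiplier set to 0 the result is just the average of the clipped gradients, which isolates the effect of clipping; a nonzero multiplier is what provides the privacy guarantee.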
When DP training is required for AI Architects:
- Fine-tuning on patient clinical notes (de-identified): Even after de-identification, rare events in the dataset may re-identify individuals if a model memorizes verbatim text. DP-SGD with ε ≤ 8 (commonly cited clinical threshold) provides formal protection against this.
- Models deployed for inference against adversarial users: A model fine-tuned on internal documents that is externally accessible may leak internal data if a sophisticated adversary probes it. DP training limits the leakage.
- Regulatory or compliance requirements: HIPAA does not mandate DP training, but organizational privacy officers may require it for any model trained on clinical data as a risk mitigation measure.
Trade-offs: DP-SGD requires more training steps to achieve the same loss as non-private training (the added noise reduces gradient signal). Models trained with strong privacy guarantees (ε ≤ 1) show significant quality degradation on complex tasks. ε ≤ 8 is a commonly accepted practical compromise between privacy and utility in healthcare AI research.
Architect's role: Commission DP training from the ML team; validate that the privacy budget (epsilon, delta) meets organizational requirements; ensure memorization audit is performed before deployment.
Summary: Core ML Concepts for Architects
| Concept | What to Know | What to Leave to ML Scientists |
|---|---|---|
| Attention | How it enables long-context; O(n²) cost implication | Flash Attention implementation details |
| Temperature | How it affects output diversity; clinical setting = 0 | Exact logit scaling math |
| Tokens | Cost implications; context window management | Subword tokenization algorithms |
| Embedding dimensionality | Quality vs. performance trade-off; model selection | HNSW index construction parameters |
| Hallucination | Root causes; mitigation strategies (RAG, citations) | Loss function design for factual grounding |
| ROUGE / BERTScore / LLM-as-judge | When to use each; clinical evaluation pipeline | ROUGE correlation with human judgment studies |
| Fine-tuning vs. RAG | Decision framework; fine-tuning risks | LoRA rank selection, learning rate schedules |
| LoRA | Why enterprise prefers it; adapter serving pattern | SVD decomposition math |
| Quantization | Which precision for which constraint | Quantization calibration algorithms |
| Differential privacy | When required; epsilon as privacy budget | DP proof construction |
Further Reading
- Architecture Questions: System design questions that apply these concepts
- AI Foundations: Full chapter coverage of foundational concepts
- LLM Serving Infrastructure: Production serving decisions
- Healthcare AI Safety: Clinical safety implications