AI/ML Hiring Glossary
Key terms in AI, ML, and data science — with context for hiring managers and recruiting professionals evaluating technical candidates.
A
- A/B testing
- A controlled experiment where users are randomly split between a control group (A) and a treatment group (B) to measure the causal effect of a change. A core skill for data scientists who need to evaluate product decisions with statistical rigor. [Data Scientist]
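For the technically curious, here is a minimal sketch of the statistics behind an A/B test readout: a two-proportion z-test on conversion counts. The function name `ab_test_p_value` and the example numbers are illustrative, not from any particular library.

```python
from statistics import NormalDist

def ab_test_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in conversion rates (two-proportion z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical experiment: 120/1000 conversions in control, 150/1000 in treatment
p = ab_test_p_value(120, 1000, 150, 1000)
```

A strong data science candidate can explain not just how to compute this p-value but when the test's assumptions (independent samples, adequate sample size) break down.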
C
- Causal inference
- A set of methods for estimating the causal effect of an intervention when randomization is not possible. Goes beyond correlation to ask why something happened, not just whether two things moved together. [Data Scientist]
- Chunking
- The process of splitting a large document into smaller segments before indexing them in a vector database for retrieval-augmented generation. Chunk size and overlap strategy significantly affect retrieval quality. [AI Engineer]
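A sketch of the simplest chunking strategy, a fixed-size sliding window with overlap. `chunk_text` is an illustrative helper, not a specific library's API; production systems often chunk on sentence or section boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows; neighbors share `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap exists so that a fact straddling a chunk boundary still appears whole in at least one chunk, which is why overlap strategy affects retrieval quality.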
- Cross-validation
- A technique for estimating model performance on unseen data by splitting the training dataset into multiple folds and evaluating the model on each fold in turn. Reduces overfitting to any single train/test split. [ML Engineer] [Data Scientist]
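The fold-splitting logic can be sketched in a few lines; `kfold_indices` is an illustrative stand-in for what libraries such as scikit-learn provide as `KFold`.

```python
def kfold_indices(n_samples: int, k: int = 5):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Distribute samples as evenly as possible across the k folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size
```

Each sample appears in exactly one validation fold, so every data point is used for evaluation exactly once across the k rounds.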
D
- Data drift
- A change in the statistical properties of a production model's input data over time, relative to the data it was trained on. Drift often degrades model performance silently, which is a key reason ML models require ongoing monitoring after deployment. [ML Engineer]
E
- Embedding
- A numerical vector representation of text, an image, or another data type that captures semantic meaning in a form that machine learning models can operate on. Used extensively in RAG systems, search, and recommendation. [AI Engineer] [ML Engineer]
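A minimal sketch of how embeddings are compared once computed: cosine similarity, the standard measure of semantic closeness between two vectors. The function name is illustrative; real embeddings have hundreds or thousands of dimensions rather than two.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Candidates working with embeddings should be able to explain why cosine similarity is preferred over raw Euclidean distance when vector magnitudes carry no meaning.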
- Experiment tracking
- The practice of logging parameters, metrics, artifacts, and metadata for each machine learning experiment so that results are reproducible and comparable. Tools like MLflow and Weights & Biases support this. [ML Engineer]
F
- Feature engineering
- The process of transforming raw data into features — inputs that a machine learning model can learn from effectively. Good feature engineering often has more impact on model performance than model architecture choices. [ML Engineer] [Data Scientist]
- Feature store
- A centralized data platform that manages the creation, storage, and serving of features used to train and serve machine learning models. Ensures consistency between training-time and serving-time feature values. [ML Engineer]
- Fine-tuning
- The process of further training a pre-trained model on a smaller, domain-specific dataset to adapt it to a specific task. Improves performance on targeted tasks but requires labeled data, compute, and careful evaluation. [AI Engineer] [ML Engineer]
G
- Guardrails
- Safety mechanisms applied to the outputs of large language models to prevent harmful, off-topic, or unreliable responses from reaching users in a production application. Can be implemented via filtering, classifiers, or constitutional AI approaches. [AI Engineer]
H
- Hallucination
- When a large language model confidently generates text that is factually incorrect or unsupported by its context. A major reliability challenge in production AI applications, particularly in knowledge-intensive domains. [AI Engineer]
- Hire/no-hire recommendation
- A clear, evidence-based conclusion about whether a candidate should advance in the hiring process, delivered alongside a scorecard. Effective recommendations are specific, actionable, and grounded in the evaluation — not vague impressions.
- Hypothesis testing
- A statistical framework for deciding whether observed data is consistent with a null hypothesis or whether there is sufficient evidence to reject it. Fundamental to rigorous A/B testing and analytical decision-making. [Data Scientist]
L
- LLM (Large Language Model)
- A neural network trained on large amounts of text data, capable of generating, summarizing, translating, and reasoning about natural language. Examples include GPT-4, Claude, and Llama. The foundation of most modern AI engineer work. [AI Engineer]
M
- MLOps
- Machine Learning Operations — the set of practices, tools, and cultural norms that make it possible to deploy, monitor, and maintain machine learning models reliably in production. Draws from DevOps principles applied to the ML lifecycle. [ML Engineer]
- Model serving
- The infrastructure responsible for taking a trained machine learning model and making it available for real-time or batch predictions. Involves trade-offs between latency, throughput, cost, and reliability. [ML Engineer]
O
- Overfitting
- When a model performs well on training data but fails to generalize to new, unseen data. Often caused by a model that is too complex for the amount of training data available. Detected through validation set performance. [ML Engineer] [Data Scientist]
P
- Power analysis
- A calculation performed before running an experiment to determine the minimum sample size needed to detect an effect of a given size at a specified significance level and statistical power. Critical for avoiding underpowered A/B tests. [Data Scientist]
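A sketch of the standard sample-size formula for comparing two proportions, under the usual normal-approximation assumptions. `sample_size_per_group` and the example rates are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Minimum n per group to detect a shift from rate p1 to p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a lift from a 10% to a 12% conversion rate requires several
# thousand users per group -- small effects need large samples.
n = sample_size_per_group(0.10, 0.12)
```

A candidate who runs this calculation before launching a test, rather than after a disappointing readout, is demonstrating exactly the rigor this term describes.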
- Prompt engineering
- The practice of crafting and refining instructions given to a large language model to produce more accurate, reliable, or appropriately formatted outputs. A core skill for AI engineers building LLM-powered features. [AI Engineer]
R
- RAG (Retrieval-Augmented Generation)
- An architecture that combines a large language model with a retrieval system, allowing the model to access relevant documents or data at inference time rather than relying solely on its pre-trained knowledge. Reduces hallucination and improves factual accuracy. [AI Engineer]
- RLHF (Reinforcement Learning from Human Feedback)
- A training technique that uses human preference ratings to fine-tune a language model to produce outputs that humans rate as more helpful, harmless, and honest. Used in the alignment of models like ChatGPT and Claude. [AI Engineer] [ML Engineer]
S
- Scorecard
- A structured evaluation tool used to rate a candidate across specific competencies after a technical interview. A well-designed scorecard replaces vague post-interview impressions with clear, comparable evidence — reducing bias and improving hiring consistency.
- Statistical significance
- The property of an observed result that is unlikely to have occurred by chance under the null hypothesis. Typically judged by comparing a p-value against a pre-specified significance level (e.g., p < 0.05). [Data Scientist]
T
- Transfer learning
- The practice of starting with a model pre-trained on a large dataset and adapting it for a related task, rather than training from scratch. Reduces the data and compute requirements for building effective models in specialized domains. [ML Engineer] [AI Engineer]
V
- Vector database
- A database optimized for storing and querying high-dimensional vectors (embeddings), enabling fast similarity search. Commonly used as the retrieval component in RAG systems. [AI Engineer]
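What a vector database accelerates can be shown with the naive version it replaces: a brute-force scan ranking every stored embedding by cosine similarity to the query. `top_k` and the toy two-dimensional index are illustrative; real systems use approximate nearest-neighbor indexes to avoid scanning everything.

```python
import math

def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return ids of the k stored vectors most similar to the query (brute force)."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    return sorted(index, key=lambda doc_id: cos(query, index[doc_id]), reverse=True)[:k]
```

In a RAG pipeline, the document chunks returned by this kind of search are what gets inserted into the LLM's prompt as context.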
Ready to hire with more confidence?
Get a structured technical evaluation delivered by a practitioner who knows the domain — not a generic screener.