Step into the Data Science Glossary — a hub of definitions for the core concepts shaping this dynamic field. Whether you’re exploring statistical models, data pipelines, or machine learning algorithms, our explanations are designed to support both learners and data professionals in making sense of the data-driven world.
Big Data refers to extremely large datasets that are too complex to be handled and processed by traditional data management tools.
Big Data Modeling is the process of structuring large and complex datasets into models that are easier to understand, analyze, and query.
BigQuery is a cloud-based data analysis tool from Google that allows users to quickly process and analyze large datasets using SQL.
CI/CD refers to the processes and tools that automate software development, testing, and deployment to ensure faster and more reliable releases.
A confidence interval is a range of values used in statistics to estimate the uncertainty or variability of a measurement or estimate.
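A minimal sketch of a 95% confidence interval for a sample mean, using only the Python standard library and the normal approximation; the sample values are invented for illustration.

```python
import math
import statistics

# Hypothetical sample of repeated measurements
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# 95% confidence interval under the normal approximation (z = 1.96)
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

For small samples, a t-distribution critical value would be more appropriate than the fixed z = 1.96.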
A data lake is a centralized repository that stores large amounts of raw data in its native format, including structured, semi-structured, and unstructured data.
Data visualization is the practice of displaying data in a visual format, such as charts, graphs, or maps, to make patterns and insights easier to see.
Decision science is an interdisciplinary field that uses data, statistics, and behavioral insights to make informed decisions and solve complex problems.
A decision tree is a visual model in machine learning that splits data into branches based on conditions, helping to classify data or predict outcomes.
Deep learning data refers to the large and diverse datasets used to train deep neural networks, a type of machine learning model with many layers.
A directed acyclic graph (DAG) is a data structure consisting of nodes connected by directed edges, where the connections flow in one direction and never loop back.
Docker is a platform that allows developers to build, package, and run applications in lightweight containers that are consistent across environments.
Ensemble learning is a machine learning technique that combines the predictions of multiple models to improve accuracy, robustness, and performance.
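A hedged ensemble-learning sketch using scikit-learn's random forest, where many decision trees vote on each prediction; the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An ensemble of 50 decision trees; each tree votes, and the majority wins
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)
print(f"accuracy: {forest.score(X_test, y_test):.2f}")
```

Random forests are one ensemble method among several; boosting (e.g., XGBoost, defined below) and stacking are common alternatives.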
FastAPI is a modern, high-performance web framework for building APIs in Python, designed for speed and developer efficiency.
Feature engineering is the process of creating, selecting, or transforming data attributes to improve the performance of machine learning models.
Feature selection is the process of identifying and using only the most relevant attributes in a dataset to improve the performance of a model.
Google Compute refers to Google Cloud's suite of compute services that provide scalable and flexible virtual machines, containers, and serverless options.
Hugging Face is an open-source platform and community that provides tools, models, and libraries for natural language processing (NLP) and machine learning.
Hypothesis testing is a statistical method used to determine whether a hypothesis about a dataset is supported by the evidence.
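A minimal hypothesis-testing sketch: a two-sample t-test with SciPy (assumed available); the two groups of measurements are invented.

```python
from scipy import stats

# Hypothetical measurements from two groups
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
group_b = [13.0, 13.4, 12.9, 13.1, 13.3, 12.8]

# Null hypothesis: the two groups share the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Reject the null hypothesis at the conventional 5% significance level
print("significant" if p_value < 0.05 else "not significant")
```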
Infrastructure as Code (IaC) is the practice of managing and provisioning IT infrastructure using machine-readable configuration files rather than manual processes.
Jupyter Notebooks are interactive, open-source tools that allow users to write, run, and document Python code alongside visualizations and explanatory text.
K-means clustering is an unsupervised machine learning algorithm that groups data points into a specified number of clusters based on their similarity.
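A minimal k-means sketch with scikit-learn (assumed installed); the six 2-D points are invented to form two obvious blobs.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [2, 1.2],   # blob near (1.5, 1.4)
                   [8, 8], [8.5, 9], [9, 8.2]])  # blob near (8.5, 8.4)

# Ask for exactly 2 clusters; labels_ gives each point's assignment
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment per point
print(kmeans.cluster_centers_)  # learned centroids
```

Note that the number of clusters must be chosen up front; that is the "specified number" in the definition above.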
Keras is a high-level, user-friendly library for building and training deep learning models, running on top of TensorFlow.
KNN, or K-Nearest Neighbors, is a machine learning algorithm that classifies data points based on the labels of the "nearest" data points around them.
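A minimal KNN sketch with scikit-learn; the fruit measurements are invented for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features: [weight in grams, diameter in cm]
X = [[150, 7], [160, 7.5], [170, 8], [300, 10], [320, 11], [310, 10.5]]
y = ["apple", "apple", "apple", "grapefruit", "grapefruit", "grapefruit"]

# Classify a new point by majority vote among its 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[165, 7.8]]))  # its 3 nearest neighbors are all apples
```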
Latent Dirichlet Allocation (LDA) is a statistical model used for topic modeling, which identifies abstract topics within a collection of documents.
Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables by fitting a straight line to the data.
Linear regression in machine learning is an algorithm used to predict numerical values by learning a linear relationship between input features and the target variable.
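A minimal sketch with scikit-learn; the toy data follows y = 2x + 1 exactly, so the learned slope and intercept are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 5, 7, 9, 11])  # exactly 2x + 1

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # learned slope and intercept
print(model.predict([[6]]))              # extrapolated prediction: 13
```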
Large Language Models (LLMs) are advanced AI models trained on massive text datasets to understand, generate, and interact using human-like language.
Logistic regression is a statistical model used to predict binary outcomes (e.g., yes/no) based on input features, using a sigmoid function to map outputs to probabilities.
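A minimal logistic-regression sketch with scikit-learn; the pass/fail study-hours data is invented.

```python
from sklearn.linear_model import LogisticRegression

X = [[0.5], [1.0], [1.5], [4.0], [4.5], [5.0]]  # hours of study
y = [0, 0, 0, 1, 1, 1]                          # 0 = fail, 1 = pass

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.8], [4.8]]))   # hard class labels
print(clf.predict_proba([[2.75]]))   # sigmoid-derived class probabilities
```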
Machine learning engineering involves building, deploying, and maintaining machine learning systems that solve real-world problems using data-driven algorithms.
Matplotlib is a popular Python library for creating static, interactive, and animated visualizations such as line graphs, bar charts, and scatter plots.
MLflow is an open-source platform that manages the machine learning lifecycle, including experiment tracking, model deployment, and reproducibility.
MLOps is the practice of combining machine learning development with software engineering and operations to streamline the deployment, monitoring, and maintenance of models in production.
Multivariate regression is a statistical method used to predict the outcome of a target variable based on multiple input variables.
The Naive Bayes classifier is a simple probabilistic algorithm for classification that assumes features are independent, making it fast and effective for many tasks.
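A minimal Naive Bayes sketch using scikit-learn's Gaussian variant; the two well-separated clusters of points are invented.

```python
from sklearn.naive_bayes import GaussianNB

# Hypothetical 2-D points from two well-separated classes
X = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.2],   # class 0
     [5.0, 8.0], [5.2, 7.8], [4.9, 8.1]]   # class 1
y = [0, 0, 0, 1, 1, 1]

# Gaussian Naive Bayes models each feature independently per class
nb = GaussianNB().fit(X, y)
print(nb.predict([[1.1, 2.1], [5.1, 8.0]]))
```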
A neural network is a machine learning model inspired by the structure of the human brain, consisting of layers of interconnected nodes (neurons).
NumPy is a Python library for numerical computing, providing tools to handle large, multi-dimensional arrays and perform mathematical operations efficiently.
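A quick NumPy sketch showing vectorized, element-wise operations on a small array; the numbers are arbitrary.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.shape)        # (2, 3)
print(a * 2)          # element-wise multiplication, no explicit loop
print(a.sum(axis=0))  # column sums: [5 7 9]
print(a.mean())       # overall mean: 3.5
```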
Object-Oriented Programming (OOP) is a programming paradigm that organizes code into "objects," which combine data (attributes) and behaviors (methods) into reusable units.
Pandas is a Python library used for data manipulation and analysis, providing data structures like DataFrames to organize, clean, and explore data efficiently.
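A quick pandas sketch: building a DataFrame, filtering rows, and aggregating by group; the city and temperature values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Lima"],
    "temp": [3, 22, 5, 24],
})

print(df.groupby("city")["temp"].mean())  # average temperature per city
print(df[df["temp"] > 10])                # boolean filtering of rows
```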
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data into fewer dimensions while preserving as much variance as possible.
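A minimal PCA sketch with scikit-learn on synthetic 3-D data whose third column is a linear combination of the first two, so two components capture essentially all the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + X[:, 1]  # third column is redundant by construction

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)
print(reduced.shape)                        # (100, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0: 2 components suffice
```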
Plotly is a Python library for creating interactive, web-based visualizations, such as 3D plots, dashboards, and maps.
SARIMAX is a statistical model used for time series forecasting, incorporating both seasonality and the influence of external (exogenous) variables.
Scikit-learn is a Python library offering tools for machine learning, including algorithms for classification, regression, clustering, and dimensionality reduction.
Seaborn is a Python library for creating statistical data visualizations, built on top of Matplotlib, with a focus on attractive, informative graphics.
Secure Shell (SSH) is a protocol for securely accessing and managing remote computers over a network using encryption.
Shapley values are a game theory concept used in machine learning to fairly distribute credit among features based on their contribution to a model's prediction.
Statistical inference is the process of using data from a sample to make generalizations about a larger population, often with a quantified level of confidence.
Statsmodels is a Python library for performing statistical modeling, hypothesis testing, and data exploration.
Streamlit is an open-source Python library for quickly building interactive web applications for data visualization, machine learning models, and dashboards.
Structured data is highly organized information stored in a fixed format, such as rows and columns in a database or spreadsheet.
Support Vector Machine (SVM) is a machine learning algorithm used for classification and regression tasks by finding a hyperplane that best separates the classes.
Tabular data is structured data stored in rows and columns, commonly found in spreadsheets and databases.
TensorFlow is an open-source library for machine learning and deep learning, designed to build and train neural networks efficiently.
TF-IDF is a statistical method used in text analysis to evaluate how important a word is in a document relative to a collection of documents.
Time series data consists of observations recorded at regular intervals over time, often used to identify trends and patterns.
A transformer neural network is an advanced architecture in machine learning that uses attention mechanisms to process sequential data, such as text.
Unstructured data is information that doesn’t follow a predefined format or structure, such as text, images, videos, and emails.
A virtual machine (VM) is a software-based simulation of a physical computer, allowing multiple operating systems to run on a single physical machine.
VS Code is a lightweight, open-source code editor developed by Microsoft, offering support for multiple programming languages and extensive customization through extensions.
XGBoost (Extreme Gradient Boosting) is a machine learning library that implements a fast, scalable version of gradient boosting, primarily used for structured (tabular) data.