Step into the Data Science Glossary — a hub of definitions for the core concepts shaping this dynamic field. Whether you’re exploring statistical models, data pipelines, or machine learning algorithms, our explanations are designed to support both learners and data professionals in making sense of the data-driven world.
Big Data refers to extremely large datasets that are too complex to be handled and processed by traditional data management tools.
Big Data Modeling is the process of structuring large and complex datasets into models that are easier to understand, analyze, and query.
BigQuery is a cloud-based data analysis tool from Google that allows users to quickly process and analyze large datasets using SQL.
CI/CD refers to the processes and tools that automate software development, testing, and deployment to ensure faster and more reliable releases.
A confidence interval is a range of values used in statistics to estimate the uncertainty or variability of a measurement or estimate.
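A minimal sketch of a 95% confidence interval for a sample mean, using only the Python standard library and the normal approximation; the sample values are invented for illustration.

```python
import math
import statistics

# Hypothetical sample of repeated measurements
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# 95% confidence interval under the normal approximation (z = 1.96)
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

For small samples, a t-distribution critical value would be more appropriate than the fixed z = 1.96.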
A data lake is a centralized repository that stores large amounts of raw data in its native format, including structured, semi-structured, and unstructured data.
Data visualization is the practice of displaying data in a visual format, such as charts, graphs, or maps, to make patterns and insights easier to see.
Decision science is an interdisciplinary field that uses data, statistics, and behavioral insights to make informed decisions and solve complex problems.
A decision tree is a visual model in machine learning that splits data into branches based on conditions, helping to classify data or predict outcomes.
Deep learning data refers to the large and diverse datasets used to train deep neural networks, a type of machine learning model with many layers.
A directed acyclic graph (DAG) is a data structure consisting of nodes connected by directed edges, where the connections flow in one direction and never loop back.
Docker is a platform that allows developers to build, package, and run applications in lightweight containers that are consistent across environments.
Ensemble learning is a machine learning technique that combines the predictions of multiple models to improve accuracy, robustness, and performance.
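A hedged ensemble-learning sketch using scikit-learn's random forest, where many decision trees vote on each prediction; the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An ensemble of 50 decision trees; each tree votes, and the majority wins
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)
print(f"accuracy: {forest.score(X_test, y_test):.2f}")
```

Random forests are one ensemble method among several; boosting (e.g., XGBoost, defined below) and stacking are common alternatives.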
FastAPI is a modern, high-performance web framework for building APIs in Python, designed for speed and developer efficiency.
Feature engineering is the process of creating, selecting, or transforming data attributes to improve the performance of machine learning models.
Feature selection is the process of identifying and using only the most relevant attributes in a dataset to improve the performance of a model.
Google Compute refers to Google Cloud's suite of compute services that provide scalable and flexible virtual machines, containers, and serverless options.
Hugging Face is an open-source platform and community that provides tools, models, and libraries for natural language processing (NLP) and machine learning.
Hypothesis testing is a statistical method used to determine whether a hypothesis about a dataset is supported by the evidence.
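A minimal hypothesis-testing sketch: a two-sample t-test with SciPy (assumed available); the two groups of measurements are invented.

```python
from scipy import stats

# Hypothetical measurements from two groups
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
group_b = [13.0, 13.4, 12.9, 13.1, 13.3, 12.8]

# Null hypothesis: the two groups share the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Reject the null hypothesis at the conventional 5% significance level
print("significant" if p_value < 0.05 else "not significant")
```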
Infrastructure as Code (IaC) is the practice of managing and provisioning IT infrastructure using machine-readable configuration files rather than manual processes.
Jupyter Notebooks are interactive, open-source tools that allow users to write, run, and document Python code alongside visualizations and explanatory text.
K-means clustering is an unsupervised machine learning algorithm that groups data points into a specified number of clusters based on their similarity.
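A minimal k-means sketch with scikit-learn (assumed installed); the six 2-D points are invented to form two obvious blobs.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [2, 1.2],   # blob near (1.5, 1.4)
                   [8, 8], [8.5, 9], [9, 8.2]])  # blob near (8.5, 8.4)

# Ask for exactly 2 clusters; labels_ gives each point's assignment
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment per point
print(kmeans.cluster_centers_)  # learned centroids
```

Note that the number of clusters must be chosen up front; that is the "specified number" in the definition above.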
Keras is a high-level, user-friendly library for building and training deep learning models, running on top of TensorFlow.
KNN, or K-Nearest Neighbors, is a machine learning algorithm that classifies data points based on the labels of the "nearest" data points around them.
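A minimal KNN sketch with scikit-learn; the fruit measurements are invented for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features: [weight in grams, diameter in cm]
X = [[150, 7], [160, 7.5], [170, 8], [300, 10], [320, 11], [310, 10.5]]
y = ["apple", "apple", "apple", "grapefruit", "grapefruit", "grapefruit"]

# Classify a new point by majority vote among its 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[165, 7.8]]))  # its 3 nearest neighbors are all apples
```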
Latent Dirichlet Allocation (LDA) is a statistical model used for topic modeling, which identifies abstract topics within a collection of documents.
Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables by fitting a straight line to the data.
Linear regression in machine learning is an algorithm used to predict numerical values by learning a linear relationship between input features and the target variable.
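A minimal sketch with scikit-learn; the toy data follows y = 2x + 1 exactly, so the learned slope and intercept are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 5, 7, 9, 11])  # exactly 2x + 1

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # learned slope and intercept
print(model.predict([[6]]))              # extrapolated prediction: 13
```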
Large Language Models (LLMs) are advanced AI models trained on massive text datasets to understand, generate, and interact using human-like language.
Logistic regression is a statistical model used to predict binary outcomes (e.g., yes/no) based on input features, using a sigmoid function to map outputs to probabilities.
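A minimal logistic-regression sketch with scikit-learn; the pass/fail study-hours data is invented.

```python
from sklearn.linear_model import LogisticRegression

X = [[0.5], [1.0], [1.5], [4.0], [4.5], [5.0]]  # hours of study
y = [0, 0, 0, 1, 1, 1]                          # 0 = fail, 1 = pass

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.8], [4.8]]))   # hard class labels
print(clf.predict_proba([[2.75]]))   # sigmoid-derived class probabilities
```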
Machine learning engineering involves building, deploying, and maintaining machine learning systems that solve real-world problems using data-driven algorithms.
Matplotlib is a popular Python library for creating static, interactive, and animated visualizations such as line graphs, bar charts, and scatter plots.
MLflow is an open-source platform that manages the machine learning lifecycle, including experiment tracking, model deployment, and reproducibility.
MLOps is the practice of combining machine learning development with software engineering and operations to streamline the deployment, monitoring, and maintenance of models in production.
Multivariate regression is a statistical method used to predict the outcome of a target variable based on multiple input variables.
The Naive Bayes classifier is a simple probabilistic algorithm for classification that assumes features are independent, making it fast and effective for many tasks.
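A minimal Naive Bayes sketch using scikit-learn's Gaussian variant; the two well-separated clusters of points are invented.

```python
from sklearn.naive_bayes import GaussianNB

# Hypothetical 2-D points from two well-separated classes
X = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.2],   # class 0
     [5.0, 8.0], [5.2, 7.8], [4.9, 8.1]]   # class 1
y = [0, 0, 0, 1, 1, 1]

# Gaussian Naive Bayes models each feature independently per class
nb = GaussianNB().fit(X, y)
print(nb.predict([[1.1, 2.1], [5.1, 8.0]]))
```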
A neural network is a machine learning model inspired by the structure of the human brain, consisting of layers of interconnected nodes (neurons).
NumPy is a Python library for numerical computing, providing tools to handle large, multi-dimensional arrays and perform mathematical operations efficiently.
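A quick NumPy sketch showing vectorized, element-wise operations on a small array; the numbers are arbitrary.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.shape)        # (2, 3)
print(a * 2)          # element-wise multiplication, no explicit loop
print(a.sum(axis=0))  # column sums: [5 7 9]
print(a.mean())       # overall mean: 3.5
```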
Object-Oriented Programming (OOP) is a programming paradigm that organizes code into "objects," which combine data (attributes) and behaviors (methods) into reusable units.
Pandas is a Python library used for data manipulation and analysis, providing data structures like DataFrames to organize, clean, and explore data efficiently.
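A quick pandas sketch: building a DataFrame, filtering rows, and aggregating by group; the city and temperature values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Lima"],
    "temp": [3, 22, 5, 24],
})

print(df.groupby("city")["temp"].mean())  # average temperature per city
print(df[df["temp"] > 10])                # boolean filtering of rows
```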
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data into fewer dimensions while preserving as much variance as possible.
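A minimal PCA sketch with scikit-learn on synthetic 3-D data whose third column is a linear combination of the first two, so two components capture essentially all the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + X[:, 1]  # third column is redundant by construction

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)
print(reduced.shape)                        # (100, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0: 2 components suffice
```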
Plotly is a Python library for creating interactive, web-based visualizations, such as 3D plots, dashboards, and maps.
SARIMAX is a statistical model used for time series forecasting, incorporating both seasonality and the influence of external (exogenous) variables.
Scikit-learn is a Python library offering tools for machine learning, including algorithms for classification, regression, clustering, and dimensionality reduction.
Seaborn is a Python library for creating statistical data visualizations, built on top of Matplotlib, with a focus on attractive, informative graphics.
Secure Shell (SSH) is a protocol for securely accessing and managing remote computers over a network using encryption.
Shapley values are a game theory concept used in machine learning to fairly distribute credit among features based on their contribution to a model's prediction.
Statistical inference is the process of using data from a sample to make generalizations about a larger population, often with a quantified level of confidence.
Statsmodels is a Python library for performing statistical modeling, hypothesis testing, and data exploration.
Streamlit is an open-source Python library for quickly building interactive web applications for data visualization, machine learning models, and dashboards.
Structured data is highly organized information stored in a fixed format, such as rows and columns in a database or spreadsheet.
Support Vector Machine (SVM) is a machine learning algorithm used for classification and regression tasks by finding a hyperplane that best separates the classes.
Tabular data is structured data stored in rows and columns, commonly found in spreadsheets and databases.
TensorFlow is an open-source library for machine learning and deep learning, designed to build and train neural networks efficiently.
TF-IDF is a statistical method used in text analysis to evaluate how important a word is in a document relative to a collection of documents.
Time series data consists of observations recorded at regular intervals over time, often used to identify trends and patterns.
A transformer neural network is an advanced architecture in machine learning that uses attention mechanisms to process sequential data, such as text.
Unstructured data is information that doesn’t follow a predefined format or structure, such as text, images, videos, and emails.
A virtual machine (VM) is a software-based simulation of a physical computer, allowing multiple operating systems to run on a single physical machine.
VS Code is a lightweight, open-source code editor developed by Microsoft, offering support for multiple programming languages and extensive customization through extensions.
XGBoost (Extreme Gradient Boosting) is a machine learning library that implements a fast, scalable version of gradient boosting, primarily used for structured (tabular) data.