Process of detecting, correcting, and transforming raw data into a clean dataset suitable for analysis., Data Cleaning and Preprocessing
No data value stored for a variable in an observation. Appears as NULL, NaN, blank, ?, or N/A., Missing Values
Replacing missing values with estimated values instead of removing them., Imputation
Replace a missing value with the mean (average) of the column., Mean Imputation
Replace a missing value with the median of the column; useful when the data has outliers., Median Imputation
Replace a missing value with the most frequent value; used for categorical data., Mode Imputation
Use the previous (forward) or next (backward) row's value to fill missing data; common in time series., Forward / Backward Fill
The same record appearing more than once; leads to incorrect analysis and biased results., Duplicate Data
Data point that differs significantly from other observations (e.g., salary of 500,000 in a list of ~30,000)., Outlier
Statistical measure used to detect outliers: the number of standard deviations a value lies from the mean. Values with |Z| > 3 are usually treated as outliers., Z-score
Changing data into a different format or structure (e.g., Male/Female → 0/1)., Data Transformation
Scaling features to similar ranges so that scale-sensitive ML algorithms perform better., Feature Scaling
Rescales data to the range 0–1 (min-max scaling). Formula: (X - Xmin) / (Xmax - Xmin)., Normalization
Transforms data to mean = 0 and standard deviation = 1. Formula: z = (x - μ) / σ., Standardization
Improve data quality | Remove errors and inconsistencies | Prepare data for analysis and modeling | Ensure reliable results, Objectives of Data Cleaning (4)
Human error | Incomplete data entry | Sensor malfunction | Data corruption | Optional survey questions, Causes of Missing Data (5)
Remove Rows | Remove Columns | Fill Missing Values (Imputation), Methods for Handling Missing Values (3 main)
Data entry errors | Measurement errors | Natural variation | Fraud or anomalies, Causes of Outliers (4)
Visualization (Boxplot, Scatterplot, Histogram) | Statistical Methods (Z-score), Outlier Detection Methods (2)
Remove them | Transform the data | Cap the values, Handling Outliers (3)
Creating new variables (features) from existing data to improve model performance., Feature Engineering
Selecting the most important variables and removing irrelevant ones. Reduces noise, can improve accuracy, and reduces computation time., Feature Selection
Measures the linear relationship between variables using the Pearson correlation coefficient (-1 to 1). Detects redundancy (multicollinearity)., Correlation Analysis
Problem when two or more features are highly correlated and carry largely the same information, which makes model coefficients unstable and hard to interpret., Multicollinearity
Survival of the fittest for features. Train → Rank → Prune → Repeat. Computationally expensive., RFE (Recursive Feature Elimination)
Built-in model intuition (Random Forest, XGBoost). Tracks which features reduce error most. Fast scorecard., Feature Importance from Models
Reducing the number of features while keeping most of the information; used when datasets have too many features., Dimensionality Reduction
Bird's-eye view. Linear, unsupervised. Rotates data onto axes that capture maximum variance. Best for speed, denoising, ML prep., PCA (Principal Component Analysis)
Microscope. Non-linear, unsupervised. Keeps nearby points close in a 2D (or 3D) embedding. Mainly for visualization. Stochastic., t-SNE
PCA with a purpose. Supervised. Maximizes separation between classes. Requires target variable (y)., LDA (Linear Discriminant Analysis)
Assigns an integer to each category (e.g., Red=0, Blue=1)., Label Encoding
Creates one binary column for each category., One-Hot Encoding
Replaces each category with the mean of the target for that category., Target Encoding
Uses median and IQR instead of mean/std dev. Best for datasets with significant outliers., Robust Scaling
Synthetic Minority Over-sampling Technique. Creates synthetic data points along line segments between neighboring minority-class points., SMOTE
Less is more. Deletes majority-class examples until classes balance. Discards potentially useful data., Undersampling
Changes the penalty: misclassifying the minority class costs more. Use class_weight='balanced' (e.g., in scikit-learn)., Class Weighting
ML approach for outliers. Randomly splits data; outliers get isolated in fewer splits. Well suited to high-dimensional data., Isolation Forest
Capping extreme outlier values at a defined limit (e.g., a chosen percentile)., Winsorization
Feature Engineering | Feature Selection | Dimensionality Reduction | Encoding Categorical Variables | Data Scaling Techniques | Handling Imbalanced Data | Outlier Treatment, Advanced Preprocessing Topics (7)
Correlation Analysis | Recursive Feature Elimination (RFE) | Feature Importance from Models, Feature Selection Techniques (3)
PCA | t-SNE | LDA, Dimensionality Reduction Techniques (3)
Label Encoding | One-Hot Encoding | Target Encoding, Encoding Methods (3)
Standardization (Z-score) | Min-Max Scaling | Robust Scaling, Data Scaling Methods (3)
Oversampling (SMOTE) | Undersampling | Class Weighting, Imbalanced Data Methods (3)
Z-score | IQR Method | Isolation Forest, Outlier Detection Methods (3)
Examining datasets using both statistical summaries and visual techniques., EDA (Exploratory Data Analysis)
Summarize the main characteristics of your data., Descriptive Statistics
Average value. Measure of central tendency., Mean
Middle value. Less affected by outliers., Median
Most frequent value. Useful for categorical data., Mode
Difference between max and min. Simplest measure of dispersion., Range
Average of squared differences from the mean. Mathematical foundation of spread., Variance
Square root of variance. Most intuitive spread measure. Empirical Rule (for normal data): 68-95-99.7%., Standard Deviation
Asymmetry of a data distribution. Positively skewed = tail right; Negatively skewed = tail left., Skewness
Tail heaviness and peakedness of a distribution. Leptokurtic = sharp peak / fat tails; Mesokurtic = normal-like; Platykurtic = flat / thin tails., Kurtosis
Bell curve. Perfectly symmetrical. Mean = Median = Mode at center., Normal Distribution
Data leans to one side, creating a long tail on the other., Skewed Distribution
Every outcome has the same probability. Looks like a flat rectangle., Uniform Distribution
Measures how strongly two variables are related. Range: -1 to +1., Correlation
Measures linear relationships., Pearson Correlation
Rank-based measure of monotonic relationships., Spearman Correlation
Show how data changes over time or across variables., Trends
Understand structure | Detect missing values and outliers | Identify patterns and relationships | Guide feature selection and modeling, Objectives of EDA (4)
Mean | Median | Mode, Key Measures: Central Tendency (3)
Range | Variance | Standard Deviation, Key Measures: Dispersion (3)
Skewness | Kurtosis, Key Measures: Shape (2)
Leptokurtic | Mesokurtic | Platykurtic, Types of Kurtosis (3)
Normal Distribution | Skewed Distribution | Uniform Distribution, Common Distribution Types (3)
Pearson (linear) | Spearman (rank-based), Correlation Methods (2)
Histogram | Box Plot | Density Plot, Visualization Tools for Distributions (3)
Line charts (time series) | Scatter plots, Trend Visualization Tools (2)
Communicating insights effectively, not just making charts., Data Visualization
Goal is understanding, not decoration. Avoid 3D charts and excessive colors., Clarity over aesthetics
Remove clutter. Less is more., Simplicity
Represent data truthfully. Avoid misleading scales / truncated axes., Accuracy
Use consistent colors, scales, formats., Consistency
Compares categories. Best for discrete comparisons., Bar Chart
Shows changes over time. Best for time series data., Line Chart
Shows parts of a whole. Use sparingly., Pie Chart
Shows the frequency distribution of a numeric variable., Histogram
Shows spread, median, outliers., Box Plot
Shows the relationship between two variables. Best for finding patterns and correlations., Scatter Plot
Color-coded matrix. Used for correlation matrices., Heatmap
Multi-variable comparison., Bubble Chart
Combines Data + Visuals + Narrative to communicate insights., Data Storytelling
What is the problem? Why does it matter?, Context (Storytelling)
The story connecting insights., Narrative (Storytelling)
Clarity over aesthetics | Simplicity | Accuracy | Consistency | Focus on the message, Core Visualization Principles (5)
Use color intentionally | Label axes and titles clearly | Use readable fonts | Avoid overplotting, Best Practices (4)
Comparison (Bar) | Trend (Line) | Composition (Pie) | Distribution (Histogram, Box Plot) | Relationship (Scatter Plot), Chart Types by Category (5 categories)
Context | Data | Narrative, Data Storytelling Components (3)
Beginning (Setup) | Middle (Insight) | End (Conclusion), Storytelling Structure (3 parts)
Show Trend | Highlight Drop | Compare Regions | Explain Cause | Suggest Solution, Example Flow Steps (5)
Backbone of data science; allows making predictions under uncertainty., Probability
A process that produces outcomes., Experiment
All possible outcomes of an experiment., Sample Space
A subset of the sample space., Event
Based on equally likely outcomes: P(A) = favorable outcomes / total outcomes., Classical Probability
Based on observed data: P(A) = observed occurrences / number of trials., Empirical Probability
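
Worked example for the Imputation, Z-score, Normalization, and Standardization cards: a minimal Python sketch using only the standard library. The salary figures and the function names (impute, z_scores, zscore_outliers, min_max) are made up for illustration. Using the population standard deviation for σ is an assumption, since the cards do not say which one to use, and constant columns (σ = 0 or Xmax = Xmin) are not handled.

```python
import statistics


def impute(values, strategy="median"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def z_scores(values):
    """Standardization: z = (x - mu) / sigma (population sigma, an assumption)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


def min_max(values):
    """Normalization: (X - Xmin) / (Xmax - Xmin), giving values in 0..1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical data: twenty salaries near 30,000 plus one extreme value.
salaries = [28000, 29000, 30000, 31000, 32000] * 4 + [500000]
with_gap = [28000, None, 30000, 31000, 500000]

filled_median = impute(with_gap, "median")  # gap filled with 30500.0
filled_mean = impute(with_gap, "mean")      # gap filled with 147250
outliers = zscore_outliers(salaries)        # only 500000 is flagged
scaled = min_max(salaries)
standardized = z_scores(salaries)

print("median fill:", filled_median[1])
print("mean fill:", filled_mean[1])
print("outliers:", outliers)
print("scaled range:", min(scaled), max(scaled))
```

The two fills show why median imputation is preferred when outliers are present: the single 500,000 salary drags the mean fill far above a typical value, while the median fill stays near 30,000.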
