AI & Machine Learning Basics
Data Science Concepts:
1. Fundamentals of Data Science
Definition
Data science is a multidisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Components
Statistics: Understanding data distributions, correlation, and probability.
Mathematics: Linear algebra, calculus, optimization.
Programming: Python, R, SQL for manipulating and analyzing data.
Domain knowledge: Understanding the field you’re applying data science in.
Machine learning: Algorithms that improve from experience.
Data engineering: Handling, storing, and retrieving large volumes of data.
Visualization: Communicating insights clearly (e.g. with matplotlib, seaborn, Tableau).
2. Data Collection & Cleaning
Data Collection
Sources: APIs, web scraping, databases, CSV/Excel files, surveys.
Tools: Python libraries (e.g. requests, BeautifulSoup, Pandas), SQL.
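A minimal collection sketch (assuming the requests and pandas libraries; the URL and file path are hypothetical placeholders):
import requests
import pandas as pd
# Fetch JSON records from an API endpoint (hypothetical URL)
response = requests.get('https://api.example.com/users')
users = pd.DataFrame(response.json())
# Load tabular data from a CSV file (hypothetical path)
sales = pd.read_csv('data/sales.csv')
print(users.head())
print(sales.head())
Concept shown: Collecting data from APIs and files into DataFrames.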
Data Cleaning
Missing values: Impute, drop, or flag.
Outliers: Detected using Z-score, IQR or visual inspection.
Data types: Ensuring correct types (e.g. int, float, datetime).
Encoding: Converting categorical features to numerical (one-hot encoding, label encoding).
Normalization: Scaling data (MinMaxScaler, StandardScaler).
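A minimal cleaning sketch (assuming pandas and scikit-learn; the columns are made up for illustration):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame({'age': [25, None, 40, 120], 'city': ['NY', 'LA', 'NY', 'SF']})
# Missing values: impute with the median
df['age'] = df['age'].fillna(df['age'].median())
# Outliers: flag with the IQR rule
q1, q3 = df['age'].quantile([0.25, 0.75])
iqr = q3 - q1
df['age_outlier'] = (df['age'] < q1 - 1.5 * iqr) | (df['age'] > q3 + 1.5 * iqr)
# Encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=['city'])
# Normalization: scale the numeric column to [0, 1]
df[['age']] = MinMaxScaler().fit_transform(df[['age']])
print(df)
Concept shown: Imputation, outlier flagging, encoding, and scaling.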
3. Exploratory Data Analysis (EDA)
Goals
Understand patterns, spot anomalies, test assumptions.
Tools
Visualization: Histograms, Scatter plots, box plots.
Statistics: Correlation matrices, mean, median, mode, skewness, kurtosis.
Tools: matplotlib, seaborn, pandas-profiling.
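A brief EDA sketch (assuming pandas, matplotlib, and seaborn; the data is invented for illustration):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'height': [150, 160, 170, 180, 190], 'weight': [50, 60, 65, 80, 90]})
# Summary statistics: mean, std, quartiles
print(df.describe())
# Correlation matrix between numeric columns
print(df.corr())
# Histogram and scatter plot to inspect distributions and relationships
sns.histplot(df['height'])
plt.show()
sns.scatterplot(data=df, x='height', y='weight')
plt.show()
Concept shown: Summary statistics, correlation, and basic plots.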
4. Feature Engineering
Techniques
Creation: Combining or transforming existing features.
Selection: Removing irrelevant or redundant features.
Dimensionality Reduction: PCA, t-SNE.
Binning: Grouping continuous variables into intervals.
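A short sketch of creation, binning, and dimensionality reduction (assuming pandas and scikit-learn; the feature names are illustrative):
import pandas as pd
from sklearn.decomposition import PCA
df = pd.DataFrame({'length': [2.0, 3.5, 5.0, 6.5], 'width': [1.0, 1.5, 2.0, 2.5], 'age': [3, 17, 42, 65]})
# Creation: combine existing features into a new one
df['area'] = df['length'] * df['width']
# Binning: group a continuous variable into intervals
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 45, 100], labels=['young', 'adult', 'senior'])
# Dimensionality reduction: project the numeric features onto 2 components
components = PCA(n_components=2).fit_transform(df[['length', 'width', 'area']])
print(df)
print(components)
Concept shown: Feature creation, binning, and PCA.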
5. Machine Learning
Types
Supervised Learning: With labeled data (Regression, Classification).
Unsupervised Learning: Without labeled data (Clustering, Association).
Reinforcement Learning: Agent learns through rewards/punishments.
Common Algorithms
Regression: Linear Regression, Ridge, Lasso.
Classification: Logistic Regression, Decision Trees, SVM, Random Forest, XGBoost.
Clustering: K-Means, DBSCAN, Hierarchical Clustering.
Neural Networks: Deep learning with TensorFlow/Keras/PyTorch.
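A minimal supervised-learning sketch with scikit-learn (assuming it is installed; the built-in Iris dataset stands in for labeled data):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Labeled data: features X and target labels y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a classifier and score it on held-out data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
Concept shown: Supervised classification with a train/test split.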
6. Model Evaluation
Metrics
Classification: Accuracy, Precision, Recall, F1 score, ROC-AUC.
Regression: MSE, RMSE, MAE, R² score.
Cross Validation: K-fold, Stratified K-fold.
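A sketch of classification metrics and K-fold cross-validation with scikit-learn (assuming it is installed; Iris is again used as example data):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
# Classification metrics on the held-out set
print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred, average='macro'))
# 5-fold cross-validation on the full dataset
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())
Concept shown: Evaluation metrics and cross-validation.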
7. Model Deployment
Steps
Convert model to a production-ready format.
Tools: Flask, FastAPI, Docker, cloud platforms (AWS, GCP, Azure).
Monitor model performance over time (model drift, retraining).
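A minimal serving sketch (assuming Flask and a model saved with joblib; the file name and endpoint are hypothetical):
import joblib
from flask import Flask, request, jsonify
app = Flask(__name__)
model = joblib.load('model.joblib')  # hypothetical path to a trained model
@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body like {'features': [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()['features']
    prediction = model.predict(features).tolist()
    return jsonify({'prediction': prediction})
if __name__ == '__main__':
    app.run(port=5000)
Concept shown: Exposing a trained model behind an HTTP endpoint.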
8. Data Visualization
Goals
Make complex data understandable.
Communicate insights to stakeholders.
Tools
Python: matplotlib, seaborn, plotly.
BI tools: Tableau, Power BI.
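A short plotting sketch (assuming matplotlib, seaborn, and pandas; the data is invented for illustration):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar', 'Apr'], 'revenue': [120, 135, 160, 150]})
# Bar chart of revenue per month, with a title and labeled axis
sns.barplot(data=df, x='month', y='revenue')
plt.title('Monthly revenue')
plt.ylabel('Revenue (k$)')
plt.show()
Concept shown: Turning a table into a labeled chart for stakeholders.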
9. Big Data & Cloud Computing
Big Data Technologies
Hadoop, Spark, Kafka, Hive.
Cloud Platforms
AWS (S3, Redshift, SageMaker), Azure, Google Cloud.
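A minimal PySpark sketch (assuming pyspark is installed and run locally; the data and column names are illustrative):
from pyspark.sql import SparkSession
# Start a local Spark session
spark = SparkSession.builder.appName('example').getOrCreate()
# Create a small DataFrame and run a distributed aggregation
df = spark.createDataFrame([('NY', 100), ('LA', 80), ('NY', 120)], ['city', 'sales'])
df.groupBy('city').sum('sales').show()
spark.stop()
Concept shown: Distributed DataFrame operations with Spark.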
10. Ethics & Privacy in Data Science
Bias: Avoid discriminatory models.
Fairness: Equitable outcomes for all groups.
Privacy: GDPR, anonymization, secure data handling.
Python Libraries
1. NumPy (Numerical Python)
Core Concept
Provides fast, efficient operations on large arrays and matrices of numeric data.
Adds powerful mathematical functions (linear algebra, Fourier transforms, statistics).
Key ideas
ndarray: A powerful N-dimensional array object.
Vectorization: Operate on entire arrays without writing loops.
Broadcasting: Allows operations on arrays of different shapes.
Example
import numpy as np
# Create a 1D array
arr = np.array([1, 2, 3, 4, 5])
# Do an operation on the array
print(arr * 2) # [ 2 4 6 8 10]
Concept shown: Arrays + Vectorized operations.
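A further sketch of broadcasting, using only NumPy as above (the shapes are chosen for illustration):
import numpy as np
# A (3, 1) column vector and a (1, 4) row vector
col = np.array([[1], [2], [3]])
row = np.array([[10, 20, 30, 40]])
# Broadcasting stretches both to a common (3, 4) shape before adding
print(col + row)
# [[11 21 31 41]
#  [12 22 32 42]
#  [13 23 33 43]]
Concept shown: Broadcasting arrays of different shapes.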
2. Pandas
Core Concept:
Built on top of NumPy to handle labeled data and tabular data easily (like spreadsheets or SQL tables).
Designed for data manipulation and analysis.
Key Ideas:
DataFrame: 2D table (rows and columns) with labels.
Series: 1D labeled array.
Data cleaning, filtering, aggregation: Easier than using raw arrays
Example
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Display the DataFrame
print(df)
Concept shown: Creating and viewing tabular data.
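A further sketch of filtering and aggregation (assuming only pandas; the extra column is invented for illustration):
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
'Team': ['A', 'B', 'A', 'B'],
'Age': [25, 30, 35, 28]})
# Filtering: rows where Age is greater than 27
print(df[df['Age'] > 27])
# Aggregation: average age per team
print(df.groupby('Team')['Age'].mean())
Concept shown: Filtering rows and grouped aggregation.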
3. TensorFlow
Core Concept:
An end-to-end open-source library for machine learning and deep learning.
Originally developed by Google Brain team.
Key Ideas:
Tensor: Multi-dimensional array (similar to NumPy arrays).
Computational Graphs: Operations are represented as nodes; data flows through them.
Automatic Differentiation: Needed for training neural networks (e.g., backpropagation).
GPU/TPU acceleration: For very large models and datasets.
Example
import tensorflow as tf
# Create two tensors
a = tf.constant(2)
b = tf.constant(3)
# Perform a computation
c = a + b
# Print result
print(c.numpy()) # 5
Concept shown: Tensors + Basic Computation.
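A small follow-up sketch of automatic differentiation with tf.GradientTape, which is the mechanism behind backpropagation (assuming TensorFlow 2.x):
import tensorflow as tf
# A trainable variable and a simple function y = x^2
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x * x
# dy/dx = 2x = 6.0 at x = 3.0
print(tape.gradient(y, x).numpy())  # 6.0
Concept shown: Automatic differentiation of a computation.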