AI & Machine Learning Basics
Data Science Concepts:
1. Fundamentals of Data Science
Definition
Data science is a multidisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Components
Statistics: Understanding data distributions, correlation, and probability.
Mathematics: Linear algebra, calculus, optimization.
Programming: Python, R, SQL for manipulating and analyzing data.
Domain knowledge: Understanding the field you’re applying data science in.
Machine learning: Algorithms that improve from experience.
Data engineering: Handling, storing, and retrieving large volumes of data.
Visualization: Communicating insights clearly (e.g. with matplotlib, seaborn, Tableau).
2. Data Collection & Cleaning
Data Collection
Sources: APIs, web scraping, databases, CSV/Excel files, surveys.
Tools: Python libraries (e.g. requests, BeautifulSoup, Pandas), SQL.
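A minimal collection sketch (assuming the requests and pandas libraries; the URL and file path are hypothetical placeholders):
import requests
import pandas as pd
# Fetch JSON records from an API endpoint (hypothetical URL)
response = requests.get('https://api.example.com/users')
users = pd.DataFrame(response.json())
# Load tabular data from a CSV file (hypothetical path)
sales = pd.read_csv('data/sales.csv')
print(users.head())
print(sales.head())
Concept shown: Collecting data from APIs and files into DataFrames.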
Data Cleaning
Missing values: Impute, drop, or flag.
Outliers: Detected using Z-score, IQR or visual inspection.
Data types: Ensuring correct types (e.g. int, float, datetime).
Encoding: Converting categorical features to numerical (one-hot encoding, label encoding).
Normalization: Scaling data (MinMaxScaler, StandardScaler).
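A minimal cleaning sketch (assuming pandas and scikit-learn; the columns are made up for illustration):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame({'age': [25, None, 40, 120], 'city': ['NY', 'LA', 'NY', 'SF']})
# Missing values: impute with the median
df['age'] = df['age'].fillna(df['age'].median())
# Outliers: flag with the IQR rule
q1, q3 = df['age'].quantile([0.25, 0.75])
iqr = q3 - q1
df['age_outlier'] = (df['age'] < q1 - 1.5 * iqr) | (df['age'] > q3 + 1.5 * iqr)
# Encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=['city'])
# Normalization: scale the numeric column to [0, 1]
df[['age']] = MinMaxScaler().fit_transform(df[['age']])
print(df)
Concept shown: Imputation, outlier flagging, encoding, and scaling.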
3. Exploratory Data Analysis (EDA)
Goals
Understand patterns, spot anomalies, test assumptions.
Tools
Visualization: Histograms, Scatter plots, box plots.
Statistics: Correlation matrices, mean, median, mode, skewness, kurtosis.
Tools: matplotlib, seaborn, pandas-profiling.
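A brief EDA sketch (assuming pandas, matplotlib, and seaborn; the data is invented for illustration):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'height': [150, 160, 170, 180, 190], 'weight': [50, 60, 65, 80, 90]})
# Summary statistics: mean, std, quartiles
print(df.describe())
# Correlation matrix between numeric columns
print(df.corr())
# Histogram and scatter plot to inspect distributions and relationships
sns.histplot(df['height'])
plt.show()
sns.scatterplot(data=df, x='height', y='weight')
plt.show()
Concept shown: Summary statistics, correlation, and basic plots.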
4. Feature Engineering
Techniques
Creation: Combining or transforming existing features.
Selection: Removing irrelevant or redundant features.
Dimensionality Reduction: PCA, t-SNE.
Binning: Grouping continuous variables into intervals.
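A short sketch of creation, binning, and dimensionality reduction (assuming pandas and scikit-learn; the feature names are illustrative):
import pandas as pd
from sklearn.decomposition import PCA
df = pd.DataFrame({'length': [2.0, 3.5, 5.0, 6.5], 'width': [1.0, 1.5, 2.0, 2.5], 'age': [3, 17, 42, 65]})
# Creation: combine existing features into a new one
df['area'] = df['length'] * df['width']
# Binning: group a continuous variable into intervals
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 45, 100], labels=['young', 'adult', 'senior'])
# Dimensionality reduction: project the numeric features onto 2 components
components = PCA(n_components=2).fit_transform(df[['length', 'width', 'area']])
print(df)
print(components)
Concept shown: Feature creation, binning, and PCA.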
5. Machine Learning
Types
Supervised Learning: With labeled data (Regression, Classification).
Unsupervised Learning: Without labeled data (Clustering, Association).
Reinforcement Learning: Agent learns through rewards/punishments.
Common Algorithms
Regression: Linear Regression, Ridge, Lasso.
Classification: Logistic Regression, Decision Trees, SVM, Random Forest, XGBoost.
Clustering: K-Means, DBSCAN, Hierarchical Clustering.
Neural Networks: Deep learning with TensorFlow/Keras/PyTorch.
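A minimal supervised-learning sketch with scikit-learn (assuming it is installed; the built-in Iris dataset stands in for labeled data):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Labeled data: features X and target labels y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a classifier and score it on held-out data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
Concept shown: Supervised classification with a train/test split.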
6. Model Evaluation
Metrics
Classification: Accuracy, Precision, Recall, F1 score, ROC-AUC.
Regression: MSE, RMSE, MAE, R² score.
Cross Validation: K-fold, Stratified K-fold.
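A sketch of classification metrics and K-fold cross-validation with scikit-learn (assuming it is installed; Iris is again used as example data):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
# Classification metrics on the held-out set
print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred, average='macro'))
# 5-fold cross-validation on the full dataset
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())
Concept shown: Evaluation metrics and cross-validation.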
7. Model Deployment
Steps
Convert model to a production-ready format.
Tools: Flask, FastAPI, Docker, cloud platforms (AWS, GCP, Azure).
Monitor model performance over time (model drift, retraining).
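A minimal serving sketch (assuming Flask and a model saved with joblib; the file name and endpoint are hypothetical):
import joblib
from flask import Flask, request, jsonify
app = Flask(__name__)
model = joblib.load('model.joblib')  # hypothetical path to a trained model
@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body like {'features': [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()['features']
    prediction = model.predict(features).tolist()
    return jsonify({'prediction': prediction})
if __name__ == '__main__':
    app.run(port=5000)
Concept shown: Exposing a trained model behind an HTTP endpoint.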
8. Data Visualization
Goals
Make complex data understandable.
Communicate insights to stakeholders.
Tools
Python: matplotlib, seaborn, plotly.
BI tools: Tableau, Power BI.
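A short plotting sketch (assuming matplotlib, seaborn, and pandas; the data is invented for illustration):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'month': ['Jan', 'Feb', 'Mar', 'Apr'], 'revenue': [120, 135, 160, 150]})
# Bar chart of revenue per month, with a title and labeled axis
sns.barplot(data=df, x='month', y='revenue')
plt.title('Monthly revenue')
plt.ylabel('Revenue (k$)')
plt.show()
Concept shown: Turning a table into a labeled chart for stakeholders.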
9. Big Data & Cloud Computing
Big Data Technologies
Hadoop, Spark, Kafka, Hive.
Cloud Platforms
AWS (S3, Redshift, SageMaker), Azure, Google Cloud.
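A minimal PySpark sketch (assuming pyspark is installed and run locally; the data and column names are illustrative):
from pyspark.sql import SparkSession
# Start a local Spark session
spark = SparkSession.builder.appName('example').getOrCreate()
# Create a small DataFrame and run a distributed aggregation
df = spark.createDataFrame([('NY', 100), ('LA', 80), ('NY', 120)], ['city', 'sales'])
df.groupBy('city').sum('sales').show()
spark.stop()
Concept shown: Distributed DataFrame operations with Spark.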
10. Ethics & Privacy in Data Science
Bias: Avoid discriminatory models.
Fairness: Equitable outcomes for all groups.
Privacy: GDPR, anonymization, secure data handling.
Python Libraries
1. NumPy (Numerical Python)
Core Concept
Provides fast, efficient operations on large arrays and matrices of numeric data.
Adds powerful mathematical functions (linear algebra, Fourier transforms, statistics).
Key ideas
ndarray: A powerful N-dimensional array object.
Vectorization: Operate on entire arrays without writing loops.
Broadcasting: Allows operations on arrays of different shapes.
Example
import numpy as np
# Create a 1D array
arr = np.array([1, 2, 3, 4, 5])
# Do an operation on the array
print(arr * 2) # [ 2 4 6 8 10]
Concept shown: Arrays + Vectorized operations.
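A further sketch of broadcasting, using only NumPy as above (the shapes are chosen for illustration):
import numpy as np
# A (3, 1) column vector and a (1, 4) row vector
col = np.array([[1], [2], [3]])
row = np.array([[10, 20, 30, 40]])
# Broadcasting stretches both to a common (3, 4) shape before adding
print(col + row)
# [[11 21 31 41]
#  [12 22 32 42]
#  [13 23 33 43]]
Concept shown: Broadcasting arrays of different shapes.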
2. Pandas
Core Concept:
Built on top of NumPy to handle labeled data and tabular data easily (like spreadsheets or SQL tables).
Designed for data manipulation and analysis.
Key Ideas:
DataFrame: 2D table (rows and columns) with labels.
Series: 1D labeled array.
Data cleaning, filtering, aggregation: Easier than using raw arrays
Example
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Display the DataFrame
print(df)
Concept shown: Creating and viewing tabular data.
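A further sketch of filtering and aggregation (assuming only pandas; the extra column is invented for illustration):
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
'Team': ['A', 'B', 'A', 'B'],
'Age': [25, 30, 35, 28]})
# Filtering: rows where Age is greater than 27
print(df[df['Age'] > 27])
# Aggregation: average age per team
print(df.groupby('Team')['Age'].mean())
Concept shown: Filtering rows and grouped aggregation.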
3. TensorFlow
Core Concept:
An end-to-end open-source library for machine learning and deep learning.
Originally developed by Google Brain team.
Key Ideas:
Tensor: Multi-dimensional array (similar to NumPy arrays).
Computational Graphs: Operations are represented as nodes; data flows through them.
Automatic Differentiation: Needed for training neural networks (e.g., backpropagation).
GPU/TPU acceleration: For very large models and datasets.
Example
import tensorflow as tf
# Create two tensors
a = tf.constant(2)
b = tf.constant(3)
# Perform a computation
c = a + b
# Print result
print(c.numpy()) # 5
Concept shown: Tensors + Basic Computation.
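A small follow-up sketch of automatic differentiation with tf.GradientTape, which is the mechanism behind backpropagation (assuming TensorFlow 2.x):
import tensorflow as tf
# A trainable variable and a simple function y = x^2
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x * x
# dy/dx = 2x = 6.0 at x = 3.0
print(tape.gradient(y, x).numpy())  # 6.0
Concept shown: Automatic differentiation of a computation.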