Course Materials for Big Data Analytics - PSAU
About
Comprehensive course materials for teaching Big Data Analytics using modern Python-based tools and technologies. This repository contains:
11 Chapters covering foundations to advanced topics
Interactive Jupyter Notebooks for hands-on learning
11 Practical Labs with real-world datasets
Docker Environment for easy setup
Visualization Examples using Matplotlib, Seaborn, and Plotly
Big Data Processing with Apache Spark
Quick Start
Using Docker (Recommended)
# Clone the repository
git clone https://github.com/chebil/BigData.git
cd BigData
# Start all services (Jupyter, Spark, PostgreSQL)
docker-compose up -d
# Access Jupyter Lab at http://localhost:8888
# Access Spark UI at http://localhost:8080
Local Installation
# Create conda environment
conda env create -f environment.yml
conda activate bigdata-course
# Or use pip
pip install -r requirements.txt
# Start Jupyter Lab
jupyter lab
Build the Book
# Install Jupyter Book
pip install jupyter-book
# Build the book
jupyter-book build .
# Open _build/html/index.html in your browser
Course Structure
Part I: Foundations
Introduction to Big Data - Concepts, lifecycle, data types
Data Analytics Lifecycle - Six-phase approach
Statistical Foundations - Python, NumPy, Pandas, visualization
Part II: Machine Learning
Clustering - K-means, hierarchical, DBSCAN
Association Rules - Market basket analysis, Apriori
Regression - Linear, multiple, regularization
Classification - Logistic regression, Naïve Bayes, decision trees
Time Series - ARIMA, forecasting, Prophet
Text Analytics - NLP, sentiment analysis, topic modeling
Part III: Big Data Technologies
Distributed Computing - Hadoop, Spark, PySpark
Advanced Topics - Deep learning, deployment, cloud platforms
Labs
| Lab | Topic | Duration |
|---|---|---|
| Lab 0 | Environment Setup | 30 min |
| Lab 1 | Data Exploration | 2 hours |
| Lab 2 | Python & Pandas | 2 hours |
| Lab 3 | Statistics & Visualization | 3 hours |
| Lab 4 | Clustering | 2.5 hours |
| Lab 5 | Association Rules | 2 hours |
| Lab 6 | Regression | 2.5 hours |
| Lab 7 | Classification | 3 hours |
| Lab 8 | Time Series | 2.5 hours |
| Lab 9 | Text Analytics | 3 hours |
| Lab 10 | Apache Spark | 3 hours |
| Lab 11 | Capstone Project | 10+ hours |
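For planning purposes, the durations in the table can be totalled. The short Python sketch below copies the figures from the table and treats the capstone's "10+ hours" as a lower bound of 10 hours, which is an assumption; the variable names are illustrative only.

```python
# Lab durations in hours, copied from the table above.
# Lab 11 is listed as "10+ hours"; 10 is used as a lower bound (assumption).
lab_hours = {
    "Lab 0": 0.5,
    "Lab 1": 2.0,
    "Lab 2": 2.0,
    "Lab 3": 3.0,
    "Lab 4": 2.5,
    "Lab 5": 2.0,
    "Lab 6": 2.5,
    "Lab 7": 3.0,
    "Lab 8": 2.5,
    "Lab 9": 3.0,
    "Lab 10": 3.0,
    "Lab 11": 10.0,
}

# Hours for Labs 0-10, excluding the open-ended capstone.
guided_hours = sum(h for lab, h in lab_hours.items() if lab != "Lab 11")

# Minimum total once the capstone's lower bound is included.
minimum_total_hours = sum(lab_hours.values())

print(f"Labs 0-10: {guided_hours} hours")
print(f"Including capstone: at least {minimum_total_hours} hours")
```

Under these assumptions, Labs 0 to 10 come to 26 hours, and the capstone brings the total to at least 36 hours.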
Technologies
Core Stack:
Python 3.10+
Jupyter Lab
NumPy, Pandas, SciPy
scikit-learn
Matplotlib, Seaborn, Plotly
Big Data:
Apache Spark 3.4
PySpark
Dask
Machine Learning:
XGBoost, LightGBM
TensorFlow, Keras, PyTorch
Prophet, statsmodels
NLP:
NLTK, spaCy, Gensim
Transformers
Infrastructure:
Docker & Docker Compose
PostgreSQL
Git
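After installing the environment, it can be useful to confirm that the main Python packages from the stack above are importable. The following is a minimal sketch, not part of the repository: the `CORE_MODULES` list and the `missing_modules` helper are hypothetical names, and the list covers only a subset of the stack. The authoritative package list is whatever `environment.yml` or `requirements.txt` specifies.

```python
import importlib.util

# Import names for a subset of the stack listed above (illustrative selection).
# Note that scikit-learn is imported as "sklearn".
CORE_MODULES = [
    "numpy",
    "pandas",
    "scipy",
    "sklearn",
    "matplotlib",
    "seaborn",
    "plotly",
    "pyspark",
    "dask",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(CORE_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages were found.")
```

The check uses `importlib.util.find_spec`, which locates a module without importing it, so it stays fast even for heavy packages.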
Datasets
All labs use real-world datasets:
US Census 2020 data
Retail transactions
Customer segmentation data
Time series (stocks, weather)
Text corpora (reviews, social media)
Image datasets
Learning Outcomes
After completing this course, students will be able to:
Apply the data analytics lifecycle to real-world problems
Perform exploratory data analysis using Python
Implement machine learning algorithms from scratch
Build and evaluate classification and regression models
Process large datasets using Apache Spark
Perform text analytics and sentiment analysis
Deploy machine learning models
Work with big data technologies
Assessment
Labs: 50% (10 labs × 5% each)
Midterm: 20%
Capstone Project: 25%
Participation: 5%
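The breakdown above amounts to a weighted sum of the four components. The sketch below shows the arithmetic in Python; the weights come from the list above, while the `final_grade` function, the 0-100 scoring scale, and the example scores are hypothetical and used only for illustration.

```python
# Component weights in percent, taken from the assessment breakdown above.
WEIGHTS = {
    "labs": 50,
    "midterm": 20,
    "capstone": 25,
    "participation": 5,
}


def final_grade(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores for one student (not real data).
example_scores = {
    "labs": 90,
    "midterm": 80,
    "capstone": 85,
    "participation": 100,
}

print(final_grade(example_scores))
```

For these example scores the contributions are 45 (labs), 16 (midterm), 21.25 (capstone), and 5 (participation), giving 87.25 overall.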
Contributing
Contributions are welcome! Please:
Fork the repository
Create a feature branch
Make your changes
Submit a pull request
License
This project is licensed under the MIT License; see the LICENSE file.
Instructor
Dr. Chebil Khalil
Department of Computer Science
Prince Sattam bin Abdulaziz University (PSAU)
Email: chebilkhalil@gmail
Links
Course Website
Discussions
Issues
Documentation
Star History
If you find this repository helpful, please consider giving it a star!
Built with ❤️ using Jupyter Book and MyST Markdown