01
AI · LLM · NLP
StarCoder2 Self-Alignment Pipeline
Implemented the SelfOSSInstruct methodology from the StarCoder2 paper to generate TypeScript instruction-tuning datasets. Stage 1 extracts TypeScript functions from The Stack v2 using tree-sitter AST parsing and TypeScript compiler type-checking, reducing 30,000 raw files to 8,372 functions and then to 5,791 quality seed functions. Stage 2 runs an S→C→I→R chain (Seed → Concepts → Instructions → Responses) via StarCoder2-3B. Stage 3 filters outputs using model-based quality assessment. Ran on a Google Colab T4 GPU with vLLM batched inference (batch size 32).
Python
tree-sitter
vLLM
Hugging Face
StarCoder2
Google Colab
5,791
Quality-validated TypeScript seed functions
30,000 raw files → 8,372 TypeScript functions → 5,791 seeds after return-filter + type-check
S→C→I→R chain: 5,791 concept annotations and instructions generated, 448 response pairs completed
StarCoder2-3B via vLLM (batch size 32) on a Google Colab T4 GPU, using Hugging Face Datasets with Arrow format
View on GitHub ↗
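The S→C→I→R chain from the StarCoder2 pipeline above can be sketched as three chained prompts. This is a minimal illustration, not the project's code: the prompt templates and the `run_chain` and `generate` names are hypothetical, and `generate` stands in for whatever prompt-to-completion call wraps StarCoder2-3B.

```python
from typing import Callable, Dict

# Hypothetical prompt templates; the project's real prompts are not shown here.
CONCEPT_PROMPT = (
    "List the programming concepts used in this TypeScript function:\n"
    "{seed}\nConcepts:"
)
INSTRUCTION_PROMPT = (
    "Write a coding task that exercises these concepts: {concepts}\nTask:"
)
RESPONSE_PROMPT = (
    "Solve the following task in TypeScript.\nTask: {instruction}\nSolution:"
)


def run_chain(seed: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Run Seed -> Concepts -> Instruction -> Response for one seed function.

    `generate` is any prompt -> completion callable (an assumed stand-in
    for the model call, which is not reproduced here).
    """
    # Each stage feeds the previous stage's output into the next prompt.
    concepts = generate(CONCEPT_PROMPT.format(seed=seed)).strip()
    instruction = generate(INSTRUCTION_PROMPT.format(concepts=concepts)).strip()
    response = generate(RESPONSE_PROMPT.format(instruction=instruction)).strip()
    return {
        "seed": seed,
        "concepts": concepts,
        "instruction": instruction,
        "response": response,
    }
```

Keeping the model call behind a plain callable makes the chain testable with a stub before any GPU time is spent.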
02
Machine Learning · Healthcare
Heart Failure Prediction
Compared Random Forest, LSTM, and KNN on 918 patient records (12 clinical features) using 10-fold cross-validation. Computed 15+ metrics per fold, including accuracy, F1, TSS, HSS, Brier score, and AUC. Random Forest performed best, with 86.8% accuracy, F1 0.883, and AUC 0.94, versus 84.2% accuracy for LSTM and 71.0% for KNN.
RF 86.8% Acc · AUC 0.94 · 15+ Metrics
Python
Random Forest
LSTM
TensorFlow
10-Fold CV
View on GitHub ↗
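Three of the less common metrics named in the heart failure comparison above (TSS, HSS, and Brier score) can be computed from a fold's confusion-matrix counts and predicted probabilities. The sketch below uses the standard textbook definitions; the function names are illustrative and not taken from the project.

```python
from typing import Sequence


def tss(tp: int, fn: int, fp: int, tn: int) -> float:
    """True Skill Statistic: recall minus false-positive rate."""
    return tp / (tp + fn) - fp / (fp + tn)


def hss(tp: int, fn: int, fp: int, tn: int) -> float:
    """Heidke Skill Score: improvement over chance agreement."""
    numerator = 2 * (tp * tn - fp * fn)
    denominator = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return numerator / denominator


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)
```

For example, a fold with 40 true positives, 10 false negatives, 5 false positives, and 45 true negatives gives a TSS of 0.8 - 0.1 = 0.7.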
03
Machine Learning · Finance
Loan Approval Prediction
End-to-end ML pipeline on 20,000 loan applications with 36 features and severe class imbalance (76.1% rejected). Applied SMOTE to the training data only to prevent leakage, compared 6 models (Logistic Regression, Decision Tree, Random Forest, SVM, KNN, and ANN) with GridSearchCV tuning, and used SHAP to surface the top predictors: CreditScore, AnnualIncome, and DebtToIncomeRatio.
6 Models · SHAP · GridSearchCV
Python
SMOTE
SHAP
TensorFlow
GridSearchCV
Scikit-learn
View on GitHub ↗
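The leakage point in the loan approval project above comes down to ordering: split first, then oversample only the training partition. The sketch below shows that ordering with a simplified, standard-library SMOTE-style interpolation for a binary target. It is not the library SMOTE implementation the project would have used, and `smote_like` and `split_then_oversample` are hypothetical names.

```python
import math
import random
from collections import Counter


def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # Pick one of the k nearest minority neighbours of the base point.
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def split_then_oversample(rows, labels, test_fraction=0.2, seed=0):
    """Hold out the test set first, then balance the training set only."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    X_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]

    counts = Counter(y_train)
    minority_label = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority_label]
    minority = [x for x, y in zip(X_train, y_train) if y == minority_label]
    synthetic = smote_like(minority, deficit, rng=rng)
    # The test rows are returned untouched: no synthetic point is derived from them.
    return (X_train + synthetic, y_train + [minority_label] * deficit,
            X_test, y_test)
```

Because synthetic rows are built only from training points, evaluation on the held-out rows still reflects the original class balance.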
04
Data Mining · Algorithms
Frequent Itemset Mining
Implemented and benchmarked Brute Force (from scratch), Apriori, and FP-Growth on 5 retail datasets — Amazon, BestBuy, Walmart, Target, Kroger — with configurable support and confidence thresholds.
3 Algorithms · 5 Datasets
Python
mlxtend
Apriori
FP-Growth
View on GitHub ↗
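The brute-force baseline in the frequent itemset project above can be sketched as enumerating every candidate itemset and counting its support directly. This is an illustrative version under assumed names (`brute_force_frequent_itemsets`, `confidence`), not the project's implementation.

```python
from itertools import combinations


def brute_force_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*baskets))
    n = len(baskets)
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(1 for b in baskets if itemset <= b) / n
            if support >= min_support:
                frequent[itemset] = support
                found = True
        if not found:
            # No frequent itemset of this size, so none of a larger size exists.
            break
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return frequent[both] / frequent[a]
```

The candidate count grows exponentially with the number of distinct items, which is the cost Apriori and FP-Growth are designed to avoid.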
05
AI · Knowledge Representation
Cluedo — AI Logical Deduction Agent
Complete Cluedo board game with an AI player powered by a custom KnowledgeBase. Uses process-of-elimination inference across refutation patterns and only makes an accusation when the solution is 100% certain. Supports 3–6 mixed human/AI players.
Full Deduction Engine · Mixed H/AI Play
Python
OOP
Knowledge Base
Logical Inference
View on GitHub ↗
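The accuse-only-when-certain behavior of the Cluedo agent above can be illustrated with a small elimination-based knowledge base. This sketch covers only two inference rules (a seen card is eliminated; an unrefuted suggestion pins down any suggested card not in the agent's own hand) and is simpler than the project's KnowledgeBase; all method names here are hypothetical.

```python
class KnowledgeBase:
    """Tracks which cards could still be in the solution envelope."""

    def __init__(self, suspects, weapons, rooms):
        self.candidates = {
            "suspect": set(suspects),
            "weapon": set(weapons),
            "room": set(rooms),
        }

    def eliminate(self, card):
        """A card seen in any hand cannot be part of the solution."""
        for options in self.candidates.values():
            options.discard(card)

    def record_unrefuted(self, suggestion, own_hand):
        """If nobody refuted, each suggested card we do not hold is in the solution."""
        for category, card in suggestion.items():
            if card not in own_hand:
                self.candidates[category] = {card}

    def accusation(self):
        """Return the solution only when every category has one candidate left."""
        if all(len(options) == 1 for options in self.candidates.values()):
            return {cat: next(iter(opts)) for cat, opts in self.candidates.items()}
        return None
```

Returning `None` until every category is resolved is what keeps the agent from making a losing accusation.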
06
Big Data · Distributed Systems
Amazon Reviews Big Data Analysis
Built a 4-node Hadoop cluster on AWS EC2 (1 NameNode + 3 DataNodes, Hadoop 2.6.5) with HDFS. Developed a MapReduce job in Java to parse a 1.2 GB TSV dataset of Amazon video game reviews and compute star-rating distribution across 1.78 million records. Optimized throughput by tuning HDFS block size from 64 MB to 128 MB.
1.78M Records · 1.2 GB Dataset
Hadoop 2.6.5
MapReduce
Java
HDFS
AWS EC2
View on GitHub ↗
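The star-rating job in the Amazon reviews project above was written in Java; the Python sketch below only mirrors the map and reduce logic locally so the data flow is easy to see. The column index of the rating field is an assumption and should be checked against the dataset header, and the function names are illustrative.

```python
from collections import defaultdict

# Assumed position of the star-rating field in each TSV row; verify against the header.
STAR_RATING_COLUMN = 7


def map_rating(line, column=STAR_RATING_COLUMN):
    """Map step: emit (rating, 1) for each well-formed review row."""
    fields = line.rstrip("\n").split("\t")
    # Header rows and malformed rows fail the digit check and are skipped.
    if len(fields) > column and fields[column].isdigit():
        yield fields[column], 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each rating key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def star_distribution(lines, column=STAR_RATING_COLUMN):
    """Run map then reduce over an iterable of TSV lines."""
    pairs = (pair for line in lines for pair in map_rating(line, column))
    return reduce_counts(pairs)
```

On a cluster the shuffle phase groups the emitted pairs by key between the two steps; here a single dictionary plays that role.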
07
Backend · DevOps · CI/CD
User Management System
FastAPI + PostgreSQL backend with JWT-based OAuth2 authentication, role-based access control (Admin, Manager, User), and profile completion tracking. Diagnosed and resolved 5 critical production bugs: Docker Hub CI failures, unique constraint violations, routing 404s, nested transaction errors, and test mocking issues. Added 10 new edge-case tests (138 passing in total) and built a full CI/CD pipeline with GitHub Actions and Docker.
138 Tests · JWT Auth · CI/CD
FastAPI
PostgreSQL
Docker
GitHub Actions
SQLAlchemy
Pytest
View on GitHub ↗
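The role-based access control in the user management system above can be sketched as a reusable role check. The version below is framework-free so it runs on its own; in a FastAPI app a check like this would typically be wired in as a dependency. The three roles come from the project description, while `require_roles` and `Forbidden` are hypothetical names.

```python
from enum import Enum


class Role(str, Enum):
    ADMIN = "ADMIN"
    MANAGER = "MANAGER"
    USER = "USER"


class Forbidden(Exception):
    """Raised when the caller's role is not allowed for an operation."""


def require_roles(*allowed: Role):
    """Build a checker that accepts only the listed roles."""
    allowed_set = frozenset(allowed)

    def check(current_role: Role) -> Role:
        if current_role not in allowed_set:
            raise Forbidden(f"role {current_role.value} is not permitted")
        return current_role

    return check


# Example: an operation restricted to admins and managers.
can_manage_users = require_roles(Role.ADMIN, Role.MANAGER)
```

Listing allowed roles explicitly, rather than assuming a hierarchy, avoids granting access by accident when a new role is added.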
08
Big Data · Distributed Systems · Finance
Cryptocurrency Market Analysis on Hadoop
Built three distributed MapReduce jobs in Java to analyze 2 GB of historical OHLCV tick data across 100+ cryptocurrency pairs (Binance, Apr–Aug 2024) on a multi-node AWS EC2 cluster. Jobs surface volatility rankings, worst-performing assets by open-to-close change, and cumulative volume leaders with peak timestamps.
100+ Crypto Pairs · 2 GB HDFS Dataset
Java
Hadoop
MapReduce
HDFS
AWS EC2
View on GitHub ↗
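The worst-performers job in the cryptocurrency analysis above was a Java MapReduce job; the Python sketch below reproduces only the grouping and ranking logic locally. It assumes open-to-close change means the percentage move from a pair's first open to its last close over the period, which may differ from the project's definition, and the function names are illustrative.

```python
from collections import defaultdict


def open_to_close_change(rows):
    """rows: iterable of (pair, timestamp, open_price, close_price).

    Returns {pair: percent change from first open to last close}.
    """
    by_pair = defaultdict(list)
    for pair, timestamp, open_price, close_price in rows:
        by_pair[pair].append((timestamp, open_price, close_price))

    changes = {}
    for pair, candles in by_pair.items():
        candles.sort(key=lambda c: c[0])  # order by timestamp
        first_open = candles[0][1]
        last_close = candles[-1][2]
        changes[pair] = 100.0 * (last_close - first_open) / first_open
    return changes


def worst_performers(rows, n=3):
    """Return the n pairs with the lowest open-to-close change."""
    changes = open_to_close_change(rows)
    return sorted(changes.items(), key=lambda kv: kv[1])[:n]
```

In the MapReduce setting the pair symbol would be the shuffle key, with each reducer handling the sort and the first-open and last-close lookup for its pairs.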