Data Scientist · MS · NJIT · Open to Work

Prabhath Vipparthi

SQL Analytics

Data Pipelines

Feature Engineering

Machine Learning

Statistical Testing

PySpark ETL

Data Warehousing

SHAP Explainability

Hypothesis Testing

MLOps

RAG Systems

Distributed Systems

Dashboarding

Model Deployment

Manifesto

I ANALYZE DATA, SHIP PIPELINES, AND TRAIN MODELS THAT EXPLAIN THEMSELVES.

Master's in Data Science at NJIT (GPA 3.7, May 2026). I work across the full data stack — SQL analytics and hypothesis testing on real e-commerce data, 5.97M-row medallion pipelines with PySpark and dbt, machine-learning models with SHAP explainability on clinical and financial datasets, and production RAG systems on pgvector. Whatever the layer, the constants stay the same: the math holds up, the tests catch the regressions, and the output explains itself.

By the numbers

Records processed

5.97M

NYC TLC trip data · medallion architecture

Automated tests

625+

Unit, integration, and end-to-end coverage

Symbols / day

~130

FinSight pre-market briefing pipeline

Projects shipped

Analytics, ML, data engineering, and backend

Selected work · 7 projects

My Projects

From production RAG platforms to distributed data pipelines and LLM-alignment research — every build here solves a real problem with math, tests, and shipping discipline.

Featured · Production

AI / RAG

FinSight — Pre-Market Intelligence Platform

Production pre-market briefing system that runs every weekday morning. Ingests ~130 symbols, computes RSI and moving-average indicators, generates commentary with Gemini 2.5 Flash, stores everything in Postgres with pgvector, and emails confirmed subscribers via Resend. A retrieval-augmented Q&A layer answers plain-English questions over the full briefing history with inline date citations.

FastAPI · PostgreSQL · pgvector · Gemini 2.5 Flash · RAG · GitHub Actions · Render · Resend

Symbols

~130 / day

Tests

68 passing

LLM

Gemini 2.5

Vector store

pgvector
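
As a rough illustration of the indicator step described above, here is a minimal sketch of a simple moving average and an RSI calculation. The function names `sma` and `rsi` are hypothetical, and the RSI uses plain averages of gains and losses rather than Wilder smoothing; the source does not say which variant FinSight uses.

```python
from typing import List, Optional


def sma(prices: List[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until a full window is available."""
    out: List[Optional[float]] = []
    running = 0.0
    for i, price in enumerate(prices):
        running += price
        if i >= window:
            # Drop the price that just left the window.
            running -= prices[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out


def rsi(prices: List[float], period: int = 14) -> Optional[float]:
    """RSI over the last `period` price changes, using simple averages.

    Assumption: simple-average variant, not Wilder's smoothed RSI.
    Returns None when there are not enough prices.
    """
    if len(prices) < period + 1:
        return None
    recent = prices[-(period + 1):]
    changes = [b - a for a, b in zip(recent[:-1], recent[1:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        # No losses in the window: RSI is conventionally pinned at 100.
        return 100.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

A steadily rising series gives an RSI of 100, a steadily falling one gives 0, and equal gains and losses give 50.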

Featured · Research

LLM Alignment

StarCoder2 Self-Alignment Pipeline

Implemented the SelfOSSInstruct methodology from the StarCoder2 paper to generate a TypeScript instruction-tuning dataset. Extracted functions from The Stack v2 using tree-sitter AST parsing and TypeScript compiler type-checking, then ran an S→C→I→R chain (Seed → Concepts → Instructions → Responses) via StarCoder2-3B on a T4 GPU. Filtered final outputs with model-based quality scoring.

vLLM · Hugging Face · tree-sitter · StarCoder2-3B · Arrow / Datasets · Python

Quality seeds

5,791

Raw files

30,000

Inference

vLLM · bs 32

Hardware

Colab T4

Medallion Architecture

Data Engineering

NYC Taxi Medallion Data Pipeline

End-to-end data engineering pipeline over 5.97M real NYC TLC Yellow Taxi trips (Jan–Feb 2024): raw Parquet → PySpark cleaning → dbt gold marts → 23 data-quality tests → daily Airflow DAG. Built on a DuckDB warehouse with a Streamlit dashboard layer. Analysis surfaced Manhattan as 75% of total revenue ($111.9M of $149.1M).

PySpark · dbt · DuckDB · Airflow · Streamlit

Clean trips

5.44M

DQ tests

23

Revenue

$149.1M

Warehouse

DuckDB
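
The headline figures above can be sanity-checked with a few lines of arithmetic. This is only a worked check of the numbers quoted on this page; the variable names are illustrative, and treating the 5.44M clean trips as a subset of the 5.97M raw trips is an assumption.

```python
# Figures quoted on this page (millions of USD / millions of trips).
total_revenue_musd = 149.1
manhattan_revenue_musd = 111.9
raw_trips_m = 5.97
clean_trips_m = 5.44  # assumed to be the rows kept after cleaning

manhattan_share = manhattan_revenue_musd / total_revenue_musd
retained = clean_trips_m / raw_trips_m

print(f"Manhattan share of revenue: {manhattan_share:.1%}")
print(f"Trips kept after cleaning:  {retained:.1%}")
```

The revenue share comes out at roughly three quarters, in line with the 75% quoted above.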

Experimentation

Analytics

Olist SQL Analysis & Hypothesis Testing

End-to-end SQL analysis on real Olist Brazilian e-commerce data (99,441 orders across 96,096 customers). Six DuckDB window-function queries power a monthly cohort retention matrix. A two-proportion z-test found no significant difference in retention between payment methods (p = 0.76), so the null hypothesis was not rejected; a formal power analysis showed the test had 80% power to detect a minimum effect of 0.34 pp.

DuckDB · statsmodels · scipy · pandas · Window functions

Customers

96k

p-value

0.76

MDE

0.34 pp

Power

80%
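
For readers who want the mechanics, here is a minimal standard-library sketch of a pooled two-proportion z-test and an approximate minimum detectable effect. The function names are hypothetical, the MDE formula uses the baseline variance for both groups as a simplification, and the project itself lists statsmodels and scipy, so this is not the project's code.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    """Two-sided pooled z-test for the difference of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - _STD_NORMAL.cdf(abs(z)))
    return z, p_value


def minimum_detectable_effect(p_base: float, n1: int, n2: int,
                              alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate absolute MDE for a two-sided test.

    Simplification: both groups use the baseline variance p_base * (1 - p_base).
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    se = sqrt(p_base * (1 - p_base) * (1 / n1 + 1 / n2))
    return (z_alpha + z_beta) * se
```

With made-up counts such as `two_proportion_ztest(50, 1000, 50, 1000)`, the observed rates are identical, so z is 0 and the p-value is 1. The larger the groups, the smaller the effect the test can reliably detect.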

Finance

Machine Learning

Loan Approval Risk Prediction

End-to-end classification pipeline on 20,000 loan applications (36 features, 76.1% rejection baseline). Applied SMOTE on training folds only to avoid data leakage. Compared 6 models — Logistic Regression, Decision Tree, Random Forest, SVM, KNN, and ANN — with GridSearchCV. SHAP identified CreditScore, AnnualIncome, and DebtToIncomeRatio as top predictors, packaged for compliance-ready reporting.

SMOTE · SHAP · Scikit-learn · TensorFlow · GridSearchCV

Records

20k

Models

6

Features

36

Class imbalance

76%
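
The "training folds only" point is the part that prevents leakage, and it can be sketched without any libraries. The code below is a simplified stand-in, not the project's pipeline: it interpolates between random pairs of minority rows, whereas real SMOTE interpolates toward k nearest neighbours. All names are hypothetical and labels are assumed to be 0 or 1.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Row = Tuple[List[float], int]  # (features, label in {0, 1})


def oversample_minority(train: Sequence[Row], rng: random.Random) -> List[Row]:
    """Balance classes by synthesising minority rows (SMOTE-style interpolation)."""
    by_label = {0: [], 1: []}
    for features, label in train:
        by_label[label].append(features)
    minority = min(by_label, key=lambda k: len(by_label[k]))
    majority = 1 - minority
    pool = by_label[minority]
    out = list(train)
    if len(pool) < 2:
        return out  # not enough minority rows to interpolate between
    for _ in range(len(by_label[majority]) - len(pool)):
        a, b = rng.sample(pool, 2)
        t = rng.random()
        synthetic = [ai + t * (bi - ai) for ai, bi in zip(a, b)]
        out.append((synthetic, minority))
    return out


def folds_with_oversampling(rows: Sequence[Row], k: int,
                            seed: int = 0) -> Iterator[Tuple[List[Row], List[Row]]]:
    """Yield (balanced_train, untouched_validation) for each of k folds."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    for fold in range(k):
        val_idx = set(idx[fold::k])
        train = [rows[i] for i in idx if i not in val_idx]
        val = [rows[i] for i in idx if i in val_idx]
        # Oversampling happens after the split, so no synthetic row
        # derived from validation data can reach the training set.
        yield oversample_minority(train, rng), val
```

The validation fold is returned exactly as it was sampled, so metrics are computed on the original class balance.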

Distributed Systems

Big Data

Cryptocurrency Market Analysis on Hadoop

Three distributed MapReduce jobs in Java analyzing 2 GB of historical OHLCV tick data across 100+ cryptocurrency pairs (Binance, Apr–Aug 2024) on a multi-node AWS EC2 Hadoop cluster. Jobs surface volatility rankings, worst-performing assets by open-to-close change, and cumulative volume leaders with peak timestamps.

Hadoop · MapReduce · Java · HDFS · AWS EC2

Data

2 GB

Pairs

100+

MR jobs

3

Cluster

Multi-node
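
To show the shape of one such job, here is a small in-memory simulation of the map, shuffle, and reduce phases for ranking assets by open-to-close change. The actual jobs were written in Java on Hadoop; this Python sketch only mirrors the data flow. The row layout and the definition of change (first open to last close, in percent) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical input row: (pair, timestamp, open, close)
Row = Tuple[str, int, float, float]
Value = Tuple[int, float, float]


def map_phase(rows: Iterable[Row]) -> Iterator[Tuple[str, Value]]:
    """Emit one (key, value) record per input row, keyed by trading pair."""
    for pair, ts, open_, close in rows:
        yield pair, (ts, open_, close)


def shuffle(records: Iterable[Tuple[str, Value]]) -> Dict[str, List[Value]]:
    """Group values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[Value]] = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return grouped


def reduce_phase(pair: str, values: List[Value]) -> Tuple[str, float]:
    """Percent change from the earliest open to the latest close."""
    ordered = sorted(values)  # tuples sort by timestamp first
    first_open = ordered[0][1]
    last_close = ordered[-1][2]
    return pair, (last_close - first_open) * 100.0 / first_open


def worst_performers(rows: Iterable[Row], n: int = 3) -> List[Tuple[str, float]]:
    """Run the three phases and return the n largest declines."""
    grouped = shuffle(map_phase(rows))
    results = [reduce_phase(pair, values) for pair, values in grouped.items()]
    return sorted(results, key=lambda item: item[1])[:n]
```

On a real cluster the shuffle is handled by Hadoop and the final top-n selection would typically be a second pass or a single-reducer step.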

DevOps · CI/CD

Backend

User Management System

FastAPI + PostgreSQL backend with JWT OAuth2 authentication, role-based access control (Admin, Manager, User), and profile-completion tracking. Diagnosed and resolved 5 critical production bugs across CI failures, unique-constraint violations, routing 404s, nested transaction errors, and mocking issues. Added 10 edge-case tests (138 passing) and shipped full CI/CD with GitHub Actions and Docker.

FastAPI · PostgreSQL · Docker · GitHub Actions · SQLAlchemy

Tests

138

Auth

JWT OAuth2

Roles

3

Bugs fixed

5
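
A role check like the one described above can be sketched in plain Python. This is a framework-free illustration, not the project's FastAPI code: the `Role` ordering (User < Manager < Admin), the `require_role` decorator, and the `Forbidden` exception are all hypothetical, and treating the roles as a strict hierarchy is an assumption.

```python
from enum import IntEnum
from functools import wraps


class Role(IntEnum):
    # Assumed hierarchy: higher value means more privileges.
    USER = 1
    MANAGER = 2
    ADMIN = 3


class Forbidden(Exception):
    """Raised when the caller's role is below the required minimum."""


def require_role(minimum: Role):
    """Decorator that rejects callers whose role is below `minimum`."""
    def decorator(func):
        @wraps(func)
        def wrapper(current_role: Role, *args, **kwargs):
            if current_role < minimum:
                raise Forbidden(
                    f"{current_role.name} cannot call {func.__name__}; "
                    f"requires {minimum.name} or higher"
                )
            return func(current_role, *args, **kwargs)
        return wrapper
    return decorator


@require_role(Role.MANAGER)
def list_users(current_role: Role) -> str:
    return "ok"
```

In a FastAPI application the same check would more likely live in a dependency that reads the role from the decoded JWT.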

Toolkit

My Stack

01 Languages

Python·SQL·TypeScript·Java·R

02 Machine Learning

PyTorch·TensorFlow·Scikit-learn·Hugging Face·vLLM·spaCy·SHAP·SMOTE

03 Data & Pipelines

PySpark·dbt·Airflow·DuckDB·Hadoop·MapReduce·Pandas·Tree-sitter

04 Cloud · Backend

FastAPI·AWS·PostgreSQL·pgvector·Docker·GitHub Actions·Render·React / Vite

05 Analytics · Visualization

Streamlit·Tableau·Power BI·Matplotlib·Seaborn·statsmodels·scipy

06 Tooling

Git·Linux·Jupyter·pytest·REST APIs

Experience & Education

The Journey

01 Experience

Jan — May 2026

Newark, NJ

Graduate Data Science Capstone

NJIT — Learning & Development Initiative (LDI)

Built a production classification system (FastAPI, React/Vite, spaCy, SQLite) automating NJIT's institutional taxonomy across 3 dimensions, validated with rule-level traceability and 8-element plain-English decision explanations for every output.
Designed a 4-layer text extraction pipeline — 130+ lexicon phrase patterns, 44 regex rules, spaCy verb extraction, and an LLM stub — handling OBv3 JSON, guided forms, and free-text inputs.
Implemented a deterministic 3-stage rule engine (19 rules across category, type, and cognitive level) with immutable governance audit logs and human-in-the-loop override workflows.
Delivered a 351-test automated validation suite (100% passing) across unit, integration, and 7 end-to-end workflow scenarios.
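
The rule-engine idea above, deterministic stages with a traceable reason for each decision, can be sketched roughly as follows. The `Rule` fields, the first-match-wins policy, and the example stage names are assumptions for illustration; the capstone's actual rules and data model are not shown here.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    rule_id: str   # hypothetical identifier used in the trace
    stage: str     # e.g. "category", "type", "cognitive_level"
    pattern: str   # regular expression tested against the input text
    label: str     # output assigned when the rule fires


def classify(text: str, rules: Sequence[Rule],
             stages: Sequence[str]) -> Tuple[Dict[str, Optional[str]], List[str]]:
    """Run stages in order; within a stage, the first matching rule wins.

    Returns the decision per stage plus a plain-English trace.
    """
    decision: Dict[str, Optional[str]] = {}
    trace: List[str] = []
    for stage in stages:
        decision[stage] = None
        for rule in rules:
            if rule.stage == stage and re.search(rule.pattern, text, re.IGNORECASE):
                decision[stage] = rule.label
                trace.append(
                    f"{stage}: rule {rule.rule_id} matched /{rule.pattern}/ -> {rule.label}"
                )
                break
        else:
            # No rule in this stage fired; record that explicitly.
            trace.append(f"{stage}: no rule matched")
    return decision, trace
```

Because rules are evaluated in a fixed order with no randomness, the same input always yields the same decision and the same trace, which is what makes an audit log meaningful.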

Sep 2025 — May 2026

Newark, NJ

Office Assistant

New Jersey Institute of Technology

Developed a web-based study-room booking system to automate space reservation workflows for departmental staff and students, replacing manual scheduling.

02 Education

Class of 2026

Master of Science in Data Science

New Jersey Institute of Technology · Ying Wu College of Computing

Graduated May 2026 · Newark, NJ · GPA 3.7

Class of 2023

Bachelor of Technology, Electrical & Electronics Engineering

Lendi Institute of Engineering & Technology

Graduated April 2023 · Andhra Pradesh, India

Certifications

Validated by the best

Featured

Supervised Machine Learning: Regression and Classification

DeepLearning.AI · Stanford · Andrew Ng

Covers linear and logistic regression, gradient descent, regularization, and neural network fundamentals — the core curriculum from the original Coursera Machine Learning course.

Feb 2025 · Coursera