AI | ML Portfolio
I’m a Machine Learning R&D Engineer currently working at a startup on bringing intelligence to everyday objects. These are some of my independent projects, including a project I completed as part of CIS 519: Applied Machine Learning at UPenn with Dr. Eric Eaton, where we researched Transfer Learning, NLP, and word embedding methods.
Diabetes Risk Predictor
Tools: Flask, Docker, MLflow, Jupyter Notebook, VSCode, XGBoost, GitHub, AWS ECR, Render
A containerized, cloud-deployed Flask web application that predicts diabetes risk from a user's answers to a questionnaire. The app currently uses an XGBoost (Extreme Gradient Boosting) model trained on a Kaggle dataset of CDC data.
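As a rough illustration of the questionnaire-to-prediction step, here is a minimal sketch. The field names in `FEATURE_ORDER` are hypothetical (the app's real feature list is not given here), and `StubModel` is a stand-in so the example runs without XGBoost installed. It only assumes the trained model exposes a scikit-learn style `predict_proba`, as XGBoost's `XGBClassifier` does.

```python
from typing import Mapping

# Hypothetical questionnaire fields; the real app's features may differ.
FEATURE_ORDER = ("high_bp", "high_chol", "bmi", "smoker", "age_group")


def answers_to_features(answers: Mapping[str, str]) -> list[float]:
    """Convert raw form answers (strings) into an ordered numeric feature row."""
    row = []
    for name in FEATURE_ORDER:
        raw = answers[name].strip().lower()
        if raw in ("yes", "no"):
            # Yes/no questions become 1.0 / 0.0.
            row.append(1.0 if raw == "yes" else 0.0)
        else:
            # Everything else is assumed to be numeric.
            row.append(float(raw))
    return row


def predict_risk(model, answers: Mapping[str, str]) -> float:
    """Return the positive-class probability for one respondent."""
    row = answers_to_features(answers)
    # predict_proba returns one [p_negative, p_positive] pair per row.
    # A real model may expect a NumPy array or DataFrame instead of a list.
    proba = model.predict_proba([row])
    return float(proba[0][1])


class StubModel:
    """Stand-in for a trained classifier; always returns the same probabilities."""

    def predict_proba(self, rows):
        return [[0.7, 0.3] for _ in rows]


if __name__ == "__main__":
    form = {"high_bp": "Yes", "high_chol": "no", "bmi": "27.5",
            "smoker": "no", "age_group": "9"}
    print(predict_risk(StubModel(), form))
```

In a Flask app, a function like `predict_risk` would typically be called from the route that handles the submitted form, with the trained model loaded once at startup.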
Document ChatBot: Retrieval Augmented Generation (RAG)
Tools: OpenAI API, GPT-3.5, Pinecone, Streamlit, Python, VSCode, GitHub
A Streamlit app that uses LangChain, OpenAI Embeddings, GPT-3.5 Turbo, and a Pinecone vector database to process a user-provided document. The document is split into chunks, and each chunk is converted to an embedding vector using OpenAI Embeddings. The embeddings are inserted into a Pinecone index, which is deleted after runtime. LangChain is used to retrieve information through the QA
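The chunk, embed, and retrieve flow described above can be sketched in plain Python. This is not the app's code: `toy_embed` is a bag-of-words stand-in for OpenAI Embeddings, the in-memory list stands in for the Pinecone index, and the chunk size and overlap values are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbours."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a sparse word-count vector (NOT a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    docs = ["The warranty lasts two years.",
            "Shipping takes five days.",
            "Returns are accepted within thirty days."]
    print(retrieve("How long is the warranty?", docs, top_k=1))
```

In the real pipeline, the retrieved chunks would be passed to the language model as context alongside the user's question.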
Research Project @UPenn
Transfer Learning: Opinion Mining in Product Reviews
Tools: Python, Jupyter Notebook, GPU-accelerated Machine Learning
Team project with Tien Pham and Grace Boatman.
We compared transfer learning techniques built on word embeddings to evaluate classification performance for opinion mining. We applied transfer learning at two tiers. First, we used word embedding methods such as GloVe, BERT, and ULMFiT, which have been trained extensively on very large text corpora.
Second, we trained models built on these embeddings at different instance-size combinations of two datasets: a "source" dataset of Amazon tech product reviews and a "target" dataset of TripAdvisor reviews. We then evaluated predictive performance on the target TripAdvisor dataset and compared the models' ability to generalize across non-i.i.d. datasets.
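To make "instance-size combinations" concrete, here is a minimal sketch of how training sets mixing source and target reviews could be assembled. The function name, the size grids, and the toy data are all hypothetical and are not the values used in the project. It also assumes the target evaluation set is held out separately from the `target` pool passed in.

```python
import random
from itertools import product


def build_training_mixes(source, target, source_sizes, target_sizes, seed=0):
    """Build one training set per (n_source, n_target) combination.

    `source` and `target` are lists of (text, label) pairs. Each size must not
    exceed the length of the corresponding list.
    """
    rng = random.Random(seed)  # fixed seed so the mixes are reproducible
    mixes = {}
    for n_src, n_tgt in product(source_sizes, target_sizes):
        if n_src == 0 and n_tgt == 0:
            continue  # an empty training set is not useful
        sample = rng.sample(source, n_src) + rng.sample(target, n_tgt)
        rng.shuffle(sample)  # interleave source and target examples
        mixes[(n_src, n_tgt)] = sample
    return mixes


if __name__ == "__main__":
    # Toy placeholder data: (review text, sentiment label).
    source_pool = [(f"amazon review {i}", i % 2) for i in range(10)]
    target_pool = [(f"tripadvisor review {i}", i % 2) for i in range(6)]
    mixes = build_training_mixes(source_pool, target_pool, [0, 5, 10], [0, 3])
    for (n_src, n_tgt), rows in sorted(mixes.items()):
        print(n_src, n_tgt, len(rows))
```

Each mix would then be used to train a classifier on top of one embedding method, with every resulting model scored on the same held-out target set so the combinations are comparable.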