Hi, I'm Vaibhav. I'm a hands-on technology professional with experience in software engineering (front-end and back-end), data science, and finance & accounting. I'm an experienced practitioner with a proven record in the design, development, and delivery of complex software systems and machine learning models.
In my current role as a Data Scientist, I have created and maintained supervised and unsupervised machine learning models, including some interesting natural language processing use cases.
I am also experienced in building data visualization dashboards with Tableau, D3.js, JavaScript, and jQuery, and I can create full-stack, enterprise-level websites, with deep knowledge of data engineering pipelines and of integrating back-end and front-end modules.
I've also been deeply involved in the annual financial planning & budgeting cycle, month-end accounting close, and FP&A processes, and I have expertise in automating those tasks with Anaplan, a cloud-based EPM software.
Data science capstone
Twitch is a live streaming service and global community for content spanning gaming, entertainment, music, sports, and more. As of August 2022, it had 9.2 million monthly streamers and 140 million monthly active users, and its top streamers earned over $30,000 a month from their streaming and content channels on the site.
Our capstone project, "MakeSense," built for UC Berkeley's MIDS program in fall 2022, attempted to provide additional insights beyond Twitch's own analytics by (1) interpreting chat behavior, running sentiment analysis on livestream chats, and scoring those chats, (2) translating performance metrics of Twitch streams into a scoring system, and (3) allowing streamers the flexibility to customize content creation to maximize growth in one or more areas.
For an interactive and responsive system like ours, we emphasized a scalable data engineering pipeline to consume and process Twitch chats, a BERT-based masked-language-modeling sentiment classification layer, and a jQuery/Ajax/D3.js front-end.
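The ingestion side of the pipeline boils down to listening to Twitch chat, which is exposed over IRC. The sketch below shows roughly what such a listener can look like; the OAuth token, bot nickname, and channel are placeholders, and a production script would add reconnection, batching, and error handling.

```python
# Minimal sketch of a Twitch chat listener over IRC (placeholder credentials and channel).
import socket

HOST, PORT = "irc.chat.twitch.tv", 6667
TOKEN = "oauth:replace_me"      # placeholder credential
NICK = "makesense_bot"          # placeholder bot account
CHANNEL = "#some_streamer"      # placeholder channel

sock = socket.socket()
sock.connect((HOST, PORT))
sock.send(f"PASS {TOKEN}\r\n".encode())
sock.send(f"NICK {NICK}\r\n".encode())
sock.send(f"JOIN {CHANNEL}\r\n".encode())

while True:
    data = sock.recv(2048).decode("utf-8", errors="ignore")
    if data.startswith("PING"):        # answer keep-alive pings from the server
        sock.send("PONG :tmi.twitch.tv\r\n".encode())
    elif "PRIVMSG" in data:            # an actual chat message
        user = data.split("!", 1)[0][1:]
        message = data.split("PRIVMSG", 1)[1].split(":", 1)[1].strip()
        print(user, message)           # downstream, messages would be queued for scoring
```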
We used two different sets of Google Cloud Platform (GCP) instances: (a) a GPU-based instance to train the BERT NLP models and (b) a non-GPU instance to host the real-time Twitch listener scripts along with the back-end webserver code. On the GPU layer, we used a double fine-tuning transfer learning approach to train a base BERT model on masked-language-modeling and sentiment classification tasks using PyTorch and the Transformers library by Hugging Face.
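Our exact training configuration isn't reproduced here, but with Hugging Face Transformers the double fine-tuning flow looks roughly like the sketch below: first continue masked-language-model training on raw chat text to adapt BERT to Twitch language, then fine-tune the adapted encoder on labeled sentiment data. The chat snippets, label scheme, and hyperparameters are placeholders.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForSequenceClassification,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tok = lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

# Placeholder data: the real pipeline used large volumes of collected Twitch chat.
chat = Dataset.from_dict({"text": ["that play was insane PogChamp", "lag again smh"]}).map(tok, batched=True)
labeled = Dataset.from_dict({"text": ["gg well played", "this stream is boring"],
                             "label": [2, 0]}).map(tok, batched=True)

# Stage 1: domain-adaptive masked-language-model fine-tuning on raw chat.
mlm_trainer = Trainer(
    model=AutoModelForMaskedLM.from_pretrained("bert-base-uncased"),
    args=TrainingArguments(output_dir="bert-twitch-mlm", num_train_epochs=1),
    train_dataset=chat,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
mlm_trainer.train()
mlm_trainer.save_model("bert-twitch-mlm")

# Stage 2: fine-tune the domain-adapted encoder for 3-way sentiment classification.
clf_trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained("bert-twitch-mlm", num_labels=3),
    args=TrainingArguments(output_dir="bert-twitch-sentiment", num_train_epochs=1),
    train_dataset=labeled,
)
clf_trainer.train()
```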
Finally, we used a combination of the D3 library and Bootstrap container layouts for our interactive plotting and visualization needs.
Unsupervised natural language processing
Google's Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.
In this project, I used the pre-trained Universal Sentence Encoder model from TensorFlow Hub to create embedding vectors for a large corpus of text documents, and then used those vectors to match against a set of queries, performing classification by finding semantically similar sentences.
The idea here was to download news media articles from Dow Jones and then classify those documents against a set of service lines offered by the company. Each service line was annotated with a set of keywords. The keywords were also encoded into vectors using the sentence encoder model and matched to the media article vectors using cosine similarity. In this manner, we were able to classify the most recent news articles into specific "topics" and make informed decisions.
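A minimal sketch of that matching step is shown below. The encoder URL is the public TensorFlow Hub module; the service lines, keyword annotations, and article text are made-up examples standing in for the real business data.

```python
import numpy as np
import tensorflow_hub as hub

# Load the pre-trained Universal Sentence Encoder from TensorFlow Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Hypothetical service lines, each annotated with a handful of keywords.
service_lines = {
    "cybersecurity": "data breach ransomware network security",
    "tax advisory":  "corporate tax filing deduction compliance",
}
articles = ["Regulators fined the firm after a large customer data breach was disclosed."]

article_vecs = embed(articles).numpy()
keyword_vecs = embed(list(service_lines.values())).numpy()

def cosine(a, b):
    # Pairwise cosine similarity between every article and every keyword vector.
    return a @ b.T / (np.linalg.norm(a, axis=1, keepdims=True) * np.linalg.norm(b, axis=1))

scores = cosine(article_vecs, keyword_vecs)
labels = list(service_lines.keys())
for text, row in zip(articles, scores):
    print(labels[int(np.argmax(row))], "<-", text[:60])
```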
Deep learning and computer vision
In our final project for the Deep Learning on the Edge (W251) course at Berkeley, we developed a model to identify the presence and species of backyard birds. We built a compact model suitable for edge deployment by taking the YOLOv4 Darknet model and fine-tuning it on a large dataset covering the 700 most common bird species found in North America (the NABirds dataset from allaboutbirds.com).
We narrowed the YOLO model training to the 15 most common species in New England, the Mid-Atlantic, and the Northwest for our training dataset. The dataset came preloaded with bounding boxes, which proved to be an excellent time saver for training object detection networks such as YOLO!
We used cuDNN with half precision on a V100 GPU instance on the IBM cloud to train our YOLOv4 model; the entire training took approximately 118.6 hours. We used Keras with TensorFlow for inference, converting the YOLOv4 weights to .h5 for use in Keras and for simplicity of deploying the final model on a Cubietruck 3 edge device.
We used OpenCV in combination with Keras (with TensorFlow as the backend) to read image streams, and then used the converted YOLOv4 weights from training to make predictions.
Our final output was a text file listing every bird detected by the model along with its species and the model's prediction confidence.
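Roughly, the inference loop looked like the sketch below. The file names, video source, and input size are illustrative, and `decode_yolo_output` is a hypothetical stand-in for the usual YOLO post-processing (anchor decoding plus non-max suppression) that turns raw network output into class IDs and confidences.

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("yolov4_birds.h5")                         # converted Darknet weights (placeholder path)
class_names = open("bird_classes.txt").read().splitlines()    # placeholder label file

cap = cv2.VideoCapture("backyard_feed.mp4")                   # or a camera index on the edge device
with open("detections.txt", "w") as log:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Convert BGR -> RGB, resize to the network input size, and scale pixels to [0, 1].
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        blob = cv2.resize(rgb, (416, 416)).astype(np.float32) / 255.0
        raw = model.predict(np.expand_dims(blob, axis=0), verbose=0)
        # decode_yolo_output is a hypothetical helper for YOLO post-processing.
        for class_id, confidence in decode_yolo_output(raw):
            log.write(f"{class_names[class_id]} {confidence:.2f}\n")
cap.release()
```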
Natural language processing
Generalizing improvements in Aspect-Based Sentiment Analysis using SpanBERT, RoBERTa and BaseBERT models.
In this project, we delved into an under-explored yet key subfield of sentiment analysis: Aspect-Based Sentiment Analysis (ABSA). We examined a recent paper by Karimi et al. (2021) and hypothesized that changes to their architecture could still yield equivalent (or even better) results. In addition, we hypothesized that switching the underlying pre-trained BERT model to a more task-specific variant (such as SpanBERT) could produce better results.
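The paper's full architecture isn't reproduced here, but swapping the pre-trained backbone is largely a matter of changing the checkpoint name when loading with Hugging Face Transformers, since ABSA examples are commonly encoded as (sentence, aspect) pairs. The checkpoint names below are public Hugging Face models; the example sentence, aspect, and label count are illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any of the compared backbones can be dropped in by name, e.g. "roberta-base" or "bert-base-uncased".
checkpoint = "SpanBERT/spanbert-base-cased"

# SpanBERT reuses BERT's cased WordPiece vocabulary, so the bert-base-cased tokenizer works with it.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# ABSA input as a (sentence, aspect) pair: the model scores sentiment toward that specific aspect.
inputs = tokenizer("The battery life is great but the screen is dim.",
                   "screen", return_tensors="pt")
logits = model(**inputs).logits     # fine-tuning on ABSA data happens before real use
print(logits.argmax(-1))
```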
Data visualization using D3.js
In this project we used the D3-Force library to visualize and simulate Star Wars character interactions. The D3 library creates a "physical simulation of charged particles to generate a network layout."
For the visualization, we used a dataset with node and link annotations for each episode. This included: (1) a social network with links between characters defined by characters speaking in the same scene, (2) a social network with links defined by the number of times characters are mentioned in the same scene, and (3) a social network with character links and mentions across all six episodes.
Overall, we aimed to help understand the “social network” of the Star Wars franchise and answer questions related to character interactions and importance.
Created an unsupervised information-extraction NLP model with the Universal Sentence Encoder on TensorFlow, using Snowflake, Python, Flask, jQuery, and D3.js. Created and maintained random forest and logistic regression supervised learning models in RStudio and Python. Supervised, coordinated, and led various ad hoc data science discussions and prospective engagements using project management skills, Agile (Scrum) methodologies, and tools such as Jira.
Prepared finance materials for the CFO, monthly operating committee meetings, quarterly board and risk meetings, and other ad hoc projects, with timely analysis of various business lines for corporate finance decision making. Automated legacy Excel models using VBA macros and SQL for planning, budgeting, and forecasting of short-term and long-term net income.
Developed a client-side portfolio management portal (http://360.gs.com) and various trading tools for the Goldman Sachs Asset Management (GSAM) division using advanced Java, Unix, Perl, shell scripts, Sybase SQL, and various open-source technologies. Designed and developed a real-time trading solution, as well as an advanced risk model, to assist the fixed-income trading desk with mortgage-backed security trading, portfolio tracking, and performance monitoring tasks.
Led a 4-member offshore team on various enterprise-level order, revenue, and customer management technology initiatives using advanced Java, J2EE, Unix, MySQL, and open-source tools (CVS, Apache Struts, JUnit, Hibernate ORM). Developed, researched, and documented technical project artifacts and software design documents. Led a first-in-company proof of concept of an end-to-end, enterprise web-based system using the Hibernate object-relational mapping framework for an Africa-based telecom client; the customer was highly appreciative of the effort and engaged us in a long-term contract.