
Research Interests

A pressing question is how to handle sarcasm in natural text and social slang. Social texting and opinions have evolved, and a comment like "wow really impressed!" can be interpreted in two ways: as appreciation or as sarcasm. The challenge is to detect the tone of the comment. The same applies to slang like "FOMO", "LOL", "LMAO", etc.

Hence, my interest lies in extracting useful information based on context and in analyzing the sentiment of free text from social media based on both tone and content.

Further, the overall feel of a text changes with the language. Hence, I want to work with text in different languages and develop techniques for low-resource languages with varied vocabularies using state-of-the-art (SOTA) transfer learning methods.

As a Data Scientist and ML Engineer, I have extensive experience in mining insights from structured and unstructured data from various sources and applying them to business problems, deploying end-to-end ML pipelines to automate processes.


Research Projects

1. Anti-Money Laundering System Using NLP and ML

University of Toronto, IMI Big Data and AI x Scotiabank Case Competition


  • Worked on AI- and ML-based detection of high-risk transactions to strengthen Scotiabank's AML defenses, as part of an initiative against sex trafficking in an AI-based case competition. I was selected as a finalist from a pool of 15,000+ participants.

  • Identified high-risk customers using Transformer-based querying with DistilBERT, together with regex-based querying.

  • Once the targets were identified, I built a customer segmentation model using tree-based clustering and U-Net architectures, which achieved 97% accuracy on our transaction-flagging task.

  • Built a graph-based network model to improve the customer segmentation model, using transaction routes and connections from source to target to track transactions at the customer level, with decision trees and libraries such as graphviz.
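The idea of following transaction routes from source to target can be sketched as a path search over a directed graph. This is a minimal illustration rather than the competition code: the account IDs, the sample transfers, and the `build_graph` and `trace_route` helpers are all hypothetical.

```python
from collections import defaultdict, deque

def build_graph(transfers):
    """Build a directed adjacency list from (sender, receiver) pairs."""
    graph = defaultdict(list)
    for sender, receiver in transfers:
        graph[sender].append(receiver)
    return graph

def trace_route(graph, source, target):
    """Breadth-first search: return the shortest chain of accounts
    linking source to target, or None if no route exists."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        account = path[-1]
        if account == target:
            return path
        for nxt in graph.get(account, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical transfers between made-up account IDs.
transfers = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")]
graph = build_graph(transfers)
route = trace_route(graph, "A", "E")
print(route)
```

Breadth-first search is used here only because it returns the route with the fewest hops; a production system would likely also weigh amounts and timestamps.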

[Figures: training data, validation data, test data]
  • To further optimize the solution and improve computational efficiency, I performed a data sensitivity analysis to find the smallest proportion of data that achieves comparable results. I found that only 10-20% of the data was needed to perform the tasks, which decreased the computational load by 80%.
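The data sensitivity analysis above amounts to finding the smallest training fraction whose score stays close to the full-data score. The sketch below assumes a score has already been measured for each fraction; the function name, the tolerance, and every number in it are hypothetical illustrations, not results from the project.

```python
def smallest_sufficient_fraction(scores, tolerance=0.01):
    """Return the smallest data fraction whose score is within
    `tolerance` of the score obtained with the largest fraction."""
    full_fraction = max(scores)
    baseline = scores[full_fraction]
    for fraction in sorted(scores):
        if baseline - scores[fraction] <= tolerance:
            return fraction
    return full_fraction

# Hypothetical accuracy measured after training on each fraction of the data.
scores = {0.05: 0.90, 0.10: 0.955, 0.20: 0.965, 0.50: 0.968, 1.00: 0.970}
chosen = smallest_sufficient_fraction(scores, tolerance=0.01)
print(chosen)
```

The tolerance makes the meaning of "comparable results" explicit, so the trade-off between accuracy and compute can be adjusted in one place.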

1. Named Entity Recognition using RNN

Mentored by Prof. Minakshi Tomer, MSIT Delhi

  • Worked on one of the first and most important stages of a natural language processing (NLP) pipeline: identifying mentions of entities (e.g. persons, locations and organisations) within unstructured text.

  • Retrieved sentences and their corresponding tags and defined mappings between sentences and tags.

  • Padded the input sentences and trained a bidirectional LSTM model on the data.

  • Researched and implemented a CRF-based approach, one of the best-performing approaches for NER. Conditional random fields (CRFs) predict sequences of labels, using contextual information from neighbouring positions to help the model make a correct prediction.
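The mapping and padding steps for the NER data can be sketched in plain Python. This is a minimal illustration under stated assumptions: the example sentences, the tag set, the `<PAD>` and `<UNK>` tokens, and the `build_index` and `encode_and_pad` helpers are hypothetical and not taken from the project.

```python
def build_index(items, reserved=("<PAD>",)):
    """Map each distinct item to an integer ID, reserving the lowest IDs
    for special tokens such as padding."""
    index = {token: i for i, token in enumerate(reserved)}
    for item in items:
        if item not in index:
            index[item] = len(index)
    return index

def encode_and_pad(sequence, index, max_len, pad_id=0, unk_id=None):
    """Convert tokens to IDs, truncate to max_len, and right-pad with pad_id."""
    ids = [index.get(tok, unk_id) for tok in sequence][:max_len]
    return ids + [pad_id] * (max_len - len(ids))

# Hypothetical tagged sentences in (word, tag) form.
sentences = [
    [("Alice", "B-PER"), ("visited", "O"), ("Toronto", "B-LOC")],
    [("Acme", "B-ORG"), ("hired", "O"), ("Bob", "B-PER"), ("today", "O")],
]

word2idx = build_index((w for s in sentences for w, _ in s),
                       reserved=("<PAD>", "<UNK>"))
tag2idx = build_index((t for s in sentences for _, t in s))

max_len = max(len(s) for s in sentences)
X = [encode_and_pad([w for w, _ in s], word2idx, max_len,
                    unk_id=word2idx["<UNK>"]) for s in sentences]
y = [encode_and_pad([t for _, t in s], tag2idx, max_len) for s in sentences]
print(X)
print(y)
```

Equal-length integer sequences like `X` and `y` are what a bidirectional LSTM layer would typically consume after an embedding layer.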


[Figure: Bidirectional LSTM]

[Figures: BiLSTM-CRF]

2. Generative Adversarial Networks


DCGAN

  • As an introductory project on GANs, I worked on DCGANs using the Fashion-MNIST dataset, which contains 60,000+ images of fashion apparel.

  • Trained the model on noise inputs and generated images, and was able to sample from the complex, high-dimensional distribution of the training data.

[Figure: t-SNE of the final images produced by the GAN]

[Figure: animated GIF of the images being rendered by the GAN]


CYCLEGAN

  • Applied cycle-consistent GANs to image translation on various datasets to analyze image transitions without paired examples. The model was able to use a collection of photographs from each domain and to extract and harness the underlying style of the images in each collection in order to perform the translation.

  • The same model architecture and configuration described in the paper was used across a range of image-to-image translation tasks. The architecture comprised four models: two discriminator models and two generator models.

  • The models were built from Convolution-BatchNorm-LeakyReLU blocks, with the discriminator using InstanceNormalization instead of BatchNormalization. The generator is an encoder-decoder model architecture.
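The two generators in a CycleGAN are tied together by a cycle-consistency term: translating an image to the other domain and back should return something close to the original. The sketch below shows only that term on flat lists of pixel values. The toy generator functions and the input values are hypothetical stand-ins for trained networks.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """Forward cycle (X -> Y -> X) plus backward cycle (Y -> X -> Y)."""
    forward = l1_distance(g_yx(g_xy(x)), x)
    backward = l1_distance(g_xy(g_yx(y)), y)
    return forward + backward

# Toy stand-ins for the two generators; real generators are neural networks.
def g_xy(img):
    return [p + 0.5 for p in img]

def g_yx(img):
    # Deliberately imperfect inverse, so the cycle does not close exactly.
    return [p - 0.25 for p in img]

x = [0.0, 0.25, 0.5, 1.0]   # hypothetical image from domain X
y = [1.0, 0.5]              # hypothetical image from domain Y
loss = cycle_consistency_loss(x, y, g_xy, g_yx)
print(loss)
```

When the second generator exactly inverts the first, both cycles reproduce their inputs and the term drops to zero, which is the behaviour training pushes towards.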


Pix2Pix

  • The dataset comprised satellite images of New York and their corresponding Google Maps pages. The image translation problem involved converting satellite photos to Google Maps format, or the reverse. The implementation used the Keras deep learning framework and was based directly on the model described in the paper.

  • The generator model was trained via the discriminator model. It was updated with an adversarial loss, which encourages the generator to produce plausible images in the target domain, and with an L1 loss measured between the generated image and the expected output image.
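The two generator loss terms described above can be combined as a weighted sum. The following is a sketch on plain Python lists, not the Keras implementation. The L1 weight of 100 follows the value reported in the Pix2Pix paper and is an assumption about this project's configuration, and the function names are hypothetical.

```python
import math

def adversarial_loss(disc_probs):
    """Binary cross-entropy against the 'real' label (1), computed on the
    discriminator's probabilities for generated images."""
    eps = 1e-12  # avoids log(0)
    return -sum(math.log(max(p, eps)) for p in disc_probs) / len(disc_probs)

def l1_loss(generated, expected):
    """Mean absolute error between generated and expected pixel values."""
    return sum(abs(g - e) for g, e in zip(generated, expected)) / len(generated)

def generator_loss(disc_probs, generated, expected, l1_weight=100.0):
    """Adversarial term plus weighted L1 term (weight is an assumed value)."""
    return adversarial_loss(disc_probs) + l1_weight * l1_loss(generated, expected)

# Hypothetical values: the discriminator is fully fooled, one pixel is off by 0.5.
total = generator_loss([1.0, 1.0], [0.5, 0.5], [0.5, 1.0])
print(total)
```

The adversarial term rewards plausibility in the target domain, while the L1 term keeps the output close to the expected image.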

[Figures: source, generated, and expected images]
  • The discriminator design was based on the effective receptive field of the model, which defines the relationship between one output of the model and the number of pixels in the input image. This is called a PatchGAN model.

  • The generator was an encoder-decoder model using a U-Net architecture. The model takes a source image (e.g. a satellite photo) and generates a target image (e.g. a Google Maps image). It did so by encoding the input image down to a bottleneck layer, then decoding the bottleneck representation to the size of the output image. The U-Net architecture means that skip-connections are added between the encoding layers and the corresponding decoding layers, forming a U-shape.
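The receptive-field relationship behind the PatchGAN discriminator can be computed by walking backwards through the convolutional layers. The layer stack below is an assumption: it is the commonly described 70x70 PatchGAN configuration (three 4x4 stride-2 convolutions followed by two 4x4 stride-1 convolutions), which the text above does not confirm. The `receptive_field` helper is hypothetical.

```python
def receptive_field(layers):
    """Return how many input pixels (per side) influence one output unit.

    `layers` is a list of (kernel_size, stride) pairs ordered from the
    first layer to the last."""
    size = 1
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size

# Assumed layer stack: (kernel_size, stride) per convolution.
patchgan_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
patch = receptive_field(patchgan_layers)
print(patch)
```

Each discriminator output therefore judges one patch of the input rather than the whole image, which is where the PatchGAN name comes from.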

Architecture-of-the-U-Net-Generator-Mode

3. Tone Analyzer - Sentiment Analysis


  • The application displays the sentiments of the general public on the lockdown in India during the 2020 pandemic. The data was collected from Twitter using a crawler to scrape tweets on the novel coronavirus posted between April 2020 and June 2020. The model was trained on a corpus of 2M tweets.

  • Wrote code in Python, using techniques like stemming, stopword removal, and n-grams to preprocess the data, which was then passed to a VADER-based model to calculate the sentiment of the text.

  • Techniques like VADER, TextBlob, and RNNs were used to calculate fine-grained sentiment scores that account for negation in sentences.

  • Extending the project to develop an application for sentiment analysis of live tweets in various languages, filtered by keywords entered by the user.

  • Conducted a survey of public opinion on the Citizenship Amendment Act of 2020 as part of research to understand sentence-level negation and context-based sarcasm.
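The preprocessing steps and the role of negation can be sketched in plain Python. This is a toy illustration only: it is not VADER or TextBlob, and the stopword list, lexicon, negation rule, and function names are all hypothetical and far smaller than anything used in practice.

```python
import re

# Tiny hypothetical resources; real lists and lexicons are much larger.
STOPWORDS = {"the", "is", "a", "an", "of", "in", "on", "and", "to"}
NEGATIONS = {"not", "no", "never"}
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(tokens):
    """Drop stopwords while keeping negation words, which carry meaning."""
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n):
    """Return all contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_sentiment(tokens):
    """Sum lexicon scores, flipping the sign of any word that is
    directly preceded by a negation word."""
    score = 0.0
    for i, tok in enumerate(tokens):
        value = LEXICON.get(tok, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score

tokens = remove_stopwords(tokenize("The lockdown is not good"))
bigrams = ngrams(tokens, 2)
score = toy_sentiment(tokens)
print(tokens, bigrams, score)
```

Keeping negation words out of the stopword list matters here: removing "not" would turn a negative tweet into a positive one.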

Skills

Python, Machine Learning, Deep Learning, Data Science, Java, Business Analytics, R Programming
