Research Interests

A pressing question is how to handle sarcasm and social slang in natural text. Social texting and online opinions have evolved, and a comment like "wow really impressed!" can be read in two ways: as genuine appreciation or as sarcasm. The challenge is to detect the tone of the comment. The same applies to slang such as "FOMO", "LOL", and "LMAO".

Hence, my interest lies in extracting useful information based on context and in analyzing the sentiment of free text from social media based on both tone and content.

Further, the overall feel of a text changes with the language. Hence, I want to work with text in different languages and develop techniques for low-resource languages with varied vocabularies using state-of-the-art (SOTA) transfer learning methods.

As a Data Scientist and ML Engineer, I have extensive experience in mining insights from structured and unstructured data from various sources and applying them to business problems by deploying end-to-end ML pipelines that automate processes.

Download Full CV

GitHub Profile

Publications

Benchmarking Transformer networks
A. Singhal, P. Basu, and T. Rawat, “Benchmarking RoBERTa with SOTA Transformer Networks for Sexual Harassment Detection on Twitter”, IEEE ISMAC (2021)

Template-based Natural Language Generation
A. Singhal, C. Siong, “Template-based domain-specific controlled Natural Language Generation for low-data settings”, NTU FYI Conference (2021)

Research Projects

1. Anti-Money Laundering System Using NLP and ML

University of Toronto, IMI Big Data and AI x Scotiabank Case Competition

Worked on AI- and ML-based detection of high-risk transactions to fortify Scotiabank's AML defenses, as an initiative against sex trafficking and as part of the AI-based case competition. I was selected as a finalist from a pool of 15,000+ participants.
Identified high-risk customers using Transformer-based querying with DistilBERT and regex-based querying.
Once the targets were identified, I built a customer segmentation model using tree-based clustering and U-Net architectures, which achieved 97% accuracy on our transaction-flagging task.
Built a graph-based network model to improve the customer segmentation model. It uses transaction routes and connections from source to target, with complete information, to track transactions at the customer level using decision trees and libraries such as graphviz.
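
The route-tracking idea in the last point can be illustrated with a small sketch. It assumes transactions are available as (sender, receiver) pairs; the account names, the `build_graph` and `find_routes` helpers, and the `max_hops` limit are hypothetical and are not taken from the competition code.

```python
from collections import defaultdict


def build_graph(transactions):
    """Map each sender to the list of accounts it sent money to."""
    graph = defaultdict(list)
    for sender, receiver in transactions:
        graph[sender].append(receiver)
    return graph


def find_routes(graph, source, target, max_hops=4):
    """Return every cycle-free route from source to target within max_hops transfers."""
    routes = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            routes.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue  # route is already too long, stop extending it
        for nxt in graph.get(node, []):
            if nxt not in path:  # never revisit an account on the same route
                stack.append((nxt, path + [nxt]))
    return routes


# Hypothetical toy transactions between accounts A-D.
transactions = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("B", "D")]
graph = build_graph(transactions)
routes = find_routes(graph, "A", "D")
for route in sorted(routes):
    print(" -> ".join(route))
```

Each returned route is a list of accounts from source to target, so it could be joined back to customer-level records or passed to a plotting library for inspection.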

Training data

Validation data

Test data

To further optimize the solution and improve computational efficiency, I performed a data sensitivity analysis to find the smallest proportion of the data that achieves comparable results. I found that only 10-20% of the data is needed to perform the tasks, which decreased the computational load by 80%.
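
A data sensitivity analysis of this kind can be sketched as a loop that retrains on growing fractions of the training set and compares held-out accuracy. The example below is only an illustration: it uses synthetic one-dimensional data and a nearest-centroid classifier as stand-ins, and all function names are hypothetical rather than the project's actual model or data.

```python
import random


def make_data(n, seed=0):
    """Synthetic 1-D data: class 0 centred at 0.0, class 1 centred at 3.0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(3.0 * label, 1.0), label))
    return data


def fit_centroids(train):
    """Stand-in 'model': the mean feature value of each class."""
    sums, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for x, y in train:
        sums[y] += x
        counts[y] += 1
    return {c: sums[c] / counts[c] for c in sums if counts[c]}


def accuracy(centroids, test):
    """Fraction of test points whose nearest centroid matches their label."""
    correct = 0
    for x, y in test:
        pred = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += pred == y
    return correct / len(test)


def sensitivity(train, test, fractions, seed=0):
    """Train on the first `frac` of a shuffled training set for each fraction."""
    rng = random.Random(seed)
    shuffled = train[:]
    rng.shuffle(shuffled)
    results = {}
    for frac in fractions:
        subset = shuffled[: max(2, int(len(shuffled) * frac))]
        results[frac] = accuracy(fit_centroids(subset), test)
    return results


train, test = make_data(2000, seed=1), make_data(500, seed=2)
scores = sensitivity(train, test, [0.1, 0.2, 0.5, 1.0])
for frac, acc in scores.items():
    print(f"{frac:.0%} of training data -> accuracy {acc:.3f}")
```

If accuracy at a small fraction is close to accuracy on the full set, the smaller subset is a reasonable candidate for cheaper training.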

1. Named Entity Recognition using RNN

Mentored by Prof. Minakshi Tomer, MSIT Delhi

Worked on one of the first and most important stages in a natural language processing (NLP) pipeline: identifying mentions of entities (e.g. persons, locations and organisations) within unstructured text.
Retrieved sentences and their corresponding tags, and defined mappings between sentences and tags.
Padded the input sentences and trained a Bidirectional-LSTM model on the data.
Researched and implemented Conditional Random Fields (CRFs), one of the best approaches for NER. CRFs are sequence-prediction models that use contextual information to help the model make a correct prediction.
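
The sentence-to-tag mapping and padding steps of this project can be sketched in plain Python as follows. The toy corpus, the `<PAD>` and `<UNK>` tokens, and the helper names are assumptions made for illustration; the original project may have used a library utility instead.

```python
def build_vocab(sequences, specials=("<PAD>", "<UNK>")):
    """Assign an integer id to every distinct token, reserving ids for specials."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode_and_pad(sequences, vocab, max_len, pad="<PAD>", unk="<UNK>"):
    """Convert tokens to ids, then truncate or right-pad each sequence to max_len."""
    encoded = []
    for seq in sequences:
        # Unknown tokens map to <UNK> if the vocabulary has one, else to <PAD>.
        ids = [vocab.get(tok, vocab.get(unk, vocab[pad])) for tok in seq][:max_len]
        ids += [vocab[pad]] * (max_len - len(ids))
        encoded.append(ids)
    return encoded


# Hypothetical toy corpus with one tag per token.
sentences = [["Asha", "visited", "Delhi"], ["Ravi", "works", "at", "Acme"]]
tags = [["B-PER", "O", "B-LOC"], ["B-PER", "O", "O", "B-ORG"]]

word2idx = build_vocab(sentences)
tag2idx = build_vocab(tags, specials=("<PAD>",))
X = encode_and_pad(sentences, word2idx, max_len=5)
y = encode_and_pad(tags, tag2idx, max_len=5)
print(X)
print(y)
```

The padded `X` and `y` have the same fixed length per sentence, which is the shape a Bidirectional-LSTM tagger expects for batched training.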

BIDIRECTIONAL-LSTM

BI LSTM-CRF

2. Generative Adversarial Networks

DCGAN

As an introductory project on GANs, I worked on DCGANs using the Fashion-MNIST dataset, with 60,000+ images of fashion apparel.
Trained the model to generate images from noise, and was able to sample from the complex, high-dimensional training distribution of the data.

t-SNE on the final images produced by the GAN.

Animated GIF of the images being rendered by the GAN.

CYCLEGAN

Applied cycle-consistent GANs (CycleGAN) to image translation on various datasets to analyze the transitions between images without paired examples. The model was able to use a collection of photographs from each domain and to extract and harness the underlying style of the images in each collection in order to perform the translation.
The same model architecture and configuration described in the paper was used across a range of image-to-image translation tasks. The architecture comprised four models: two discriminators and two generators.
A pattern of Convolutional-BatchNorm-LeakyReLU layers was used in the model. The discriminator used InstanceNormalization instead of BatchNormalization. The generator was an encoder-decoder model architecture.
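
The "cycle-consistent" part of the name refers to a loss that penalizes the difference between an image and its reconstruction after being translated to the other domain and back. A minimal sketch, assuming images are flat lists of pixel values and using two toy functions as stand-ins for the trained generators (all names here are hypothetical):

```python
def l1(a, b):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(g_ab, g_ba, batch_a, batch_b):
    """L1 error after translating A -> B -> A plus the error after B -> A -> B."""
    forward = sum(l1(g_ba(g_ab(x)), x) for x in batch_a) / len(batch_a)
    backward = sum(l1(g_ab(g_ba(y)), y) for y in batch_b) / len(batch_b)
    return forward + backward


# Toy stand-ins for the two generators (not trained networks).
def brighten(img):
    return [p + 0.5 for p in img]


def darken(img):
    return [p - 0.5 for p in img]


batch_a = [[0.0, 0.25], [0.25, 0.5]]
batch_b = [[0.5, 0.75], [0.75, 1.0]]
print(cycle_consistency_loss(brighten, darken, batch_a, batch_b))
```

When the two generators are exact inverses of each other, as in this toy case, the loss is zero; in training, this term is added to the adversarial losses of the two generator-discriminator pairs.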

Pix2Pix

The dataset comprised satellite images of New York and their corresponding Google Maps pages. The image translation problem involved converting satellite photos to Google Maps format, or the reverse. The implementation used the Keras deep learning framework and was based directly on the model described in the paper.
The generator model was trained via the discriminator model. It was updated with an adversarial loss, which encourages the generator to produce plausible images in the target domain, and with an L1 loss measured between the generated image and the expected output image.
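
The combined generator objective described above can be written as a small function. This is a sketch under stated assumptions: images are flat lists of pixel values, the discriminator returns one probability per patch, and the weight `lam=100.0` follows the value reported in the Pix2Pix paper rather than a setting confirmed for this project.

```python
import math


def generator_loss(disc_probs_on_fake, generated, target, lam=100.0):
    """Return (total, adversarial, l1) for one generated image.

    disc_probs_on_fake: discriminator outputs in (0, 1) for the generated
        image, one per patch; the generator wants these close to 1.
    generated, target: flat lists of pixel values of equal length.
    """
    eps = 1e-7  # guards against log(0)
    adv = -sum(math.log(max(p, eps)) for p in disc_probs_on_fake) / len(disc_probs_on_fake)
    l1 = sum(abs(g - t) for g, t in zip(generated, target)) / len(target)
    return adv + lam * l1, adv, l1


total, adv, l1 = generator_loss([0.5], [0.0, 1.0], [0.5, 0.5])
print(f"adversarial={adv:.4f}  l1={l1:.4f}  total={total:.4f}")
```

With a large L1 weight, the pixel-level term dominates the total and keeps the output close to the expected image, while the adversarial term pushes toward plausible-looking detail.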

SOURCED

GENERATED

EXPECTED

The discriminator design was based on the effective receptive field of the model, which defines the relationship between one output of the model and the number of pixels in the input image. This is called a PatchGAN model.
The generator was an encoder-decoder model using a U-Net architecture. The model takes a source image (e.g. a satellite photo) and generates a target image (e.g. a Google Maps image). It does this by encoding the input image down to a bottleneck layer, then decoding the bottleneck representation to the size of the output image. The U-Net architecture means that skip connections are added between the encoding layers and the corresponding decoding layers, forming a U-shape.

Architecture of the U-Net Generator Model
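
The receptive field mentioned for the PatchGAN discriminator can be computed by walking backwards through the convolutional layers. The layer stack below is an assumption based on the 70x70 PatchGAN described in the Pix2Pix paper (three 4x4 stride-2 convolutions followed by two 4x4 stride-1 convolutions); the `receptive_field` helper is hypothetical.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of a single output unit.

    layers: list of (kernel_size, stride) tuples ordered from input to output.
    """
    size = 1
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size


# Assumed 70x70 PatchGAN layer stack, see the note above.
patchgan_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(patchgan_layers))  # 70
```

Each output value of the discriminator therefore classifies one 70x70 patch of the input image as real or generated.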

3. Tone Analyzer - Sentiment Analysis

The application displays the sentiments of the general public on the lockdown in India during the 2020 pandemic. The data was collected from Twitter by using a crawler to scrape tweets on the novel coronavirus posted from April 2020 to June 2020. The model was trained on a corpus of 2M tweets.
Wrote code in Python and used techniques like stemming, stopword removal, and n-grams to preprocess the data, which was then passed to a VADER-based model to calculate the sentiment of the text.
Used VADER, TextBlob, and an RNN to calculate fine-grained sentiment scores that account for negation in sentences.
Extending the project to develop an application for sentiment analysis of live tweets in various languages, filtered by keywords entered by the user.
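
To illustrate how negation can change a lexicon-based sentiment score, here is a deliberately simplified sketch. It is not VADER: the tiny lexicon, its scores, the three-token negation window, and the plain sign flip are all assumptions chosen for readability.

```python
import re

# Tiny illustrative lexicon with made-up scores (not VADER's lexicon).
LEXICON = {"good": 1.0, "great": 1.5, "happy": 1.0,
           "bad": -1.0, "terrible": -1.5, "sad": -1.0}
NEGATIONS = {"not", "no", "never", "n't"}


def tokenize(text):
    """Lowercase the text and split it, keeping the n't clitic as its own token."""
    text = re.sub(r"n't\b", " n't", text.lower())
    return re.findall(r"n't|[a-z]+", text)


def sentiment_score(text, window=3):
    """Sum lexicon scores, flipping the sign of any word that has a
    negation among the `window` tokens before it."""
    tokens = tokenize(text)
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            value = LEXICON[tok]
            if any(t in NEGATIONS for t in tokens[max(0, i - window):i]):
                value = -value
            score += value
    return score


for tweet in ["The lockdown was good", "The lockdown was not good", "This isn't great"]:
    print(tweet, "->", sentiment_score(tweet))
```

A fixed window is a crude way to scope negation; sarcasm and longer-range context, which motivate this project, are not captured by a rule like this.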

Conducted a survey of public opinion on the Citizenship Amendment Act of 2020 as part of research to understand sentence-level negation and context-based sarcasm.

Skills

Python, Machine Learning, Deep Learning, Data Science
Java, Business Analytics, R Programming