-
Notifications
You must be signed in to change notification settings - Fork 1
/
index.json
1 lines (1 loc) · 63.3 KB
/
index.json
1
[{"authors":["admin"],"categories":null,"content":" #txt{ padding:15px; padding-bottom:5px; border: 5px double blue; border-radius: 15px; background-color: rgba(0,0,0,0.1); } #txt1{ padding-top:15px; } Hi there!! I am currently pursuing my Masters at the School of Computer Science, McGill University. I am also working as a Graduate Research Assistant in collaboration with Center for Intelligent Machines and AIPHL, Department of Diagnostic Radiology.\nI completed my Bachelors from Birla Institute of Technology, India in Electronics and Communication Engineering. Previously, I have worked as a developer in the AI team of a FinTech startup in India Signzy. My interests primarily lie in the domain of Deep Learning, specifically its applications to Computer Vision and Natural Language Processing tasks.\n Currently, I am working on Deep Learning for Digital Histopathology as part of my Master's thesis. It largely involves using sparse approximation based unsupervised instance segmentation in Whole Slide Images and cross modal feature representation learning.\nIn the past I have worked on a range of projects as part of coursework, internships and job encompassing diverse tasks in areas of Machine Learning like Face Detection, Optical Character Recognition, Information Retrieval from text, Activity Recognition, Image forensics, Speech Recognition among others.\nI intend to pursue a career in AI research, academic or industrial, in order to push the state of the art in this field and solve the contemporary set of real world problems. ","date":-62135596800,"expirydate":-62135596800,"kind":"term","lang":"en","lastmod":1597802042,"objectID":"2525497d367e79493fd32b198b28f040","permalink":"https://mnishant2.github.io/author/nishant-mishra/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/author/nishant-mishra/","section":"authors","summary":"#txt{ padding:15px; padding-bottom:5px; border: 5px double blue; border-radius: 15px; background-color: rgba(0,0,0,0.1); } #txt1{ padding-top:15px; } Hi there!! I am currently pursuing my Masters at the School of Computer Science, McGill University.","tags":null,"title":"Nishant Mishra","type":"authors"},{"authors":["admin2"],"categories":null,"content":"","date":-62135596800,"expirydate":-62135596800,"kind":"term","lang":"en","lastmod":1597802042,"objectID":"ca4549b7186eec7214c27e470e158988","permalink":"https://mnishant2.github.io/author/nishant-mishra/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/author/nishant-mishra/","section":"authors","summary":"","tags":null,"title":"Nishant Mishra","type":"authors"},{"authors":["Nishant Mishra","Ankur Agarwal"],"categories":["multimodal learning","generative learning","deep learning"],"content":"Multimodal learning with latent space models has the potential to help learn deeper, more useful representations that help getting better performance, even in missing modality scenarios. In this project we leverage latent space based model to perform inference and reconstruction in all missing modality combinations. We trained a Multimodal Variational Auto Encoder which uses a product of Experts based inference network on three different modalities consisting of MNIST handwritten digit images in two languages and spoken digit recordings for our experiments. We trained the model in a subsampled training paradigm using an ELBO loss that comprised the modality reconstruction losses, label cross-entropy loss as well as the Kullback-Leibler divergence for the latent distribution. We evaluated the total ELBO loss , individual reconstruction losses, classification accuracy and visual reconstruction outputs as part of our analysis. We observed encouraging results both in terms of successful convergence as well as accurate reconstructions.\nWe approached the missing modality reconstruction and classification based problem using a Multimodal Variational Autoencoder(MVAE). Our model used a tree like graph where the different modalities define the observation nodes. It consists of parallel fully connected encoder and decoder networks associated with each modality as part of a VAE and a product of experts technique for late fusion of the respective latent distribution parameters from each encoder to get a final representation. An additional linear decoder branch was used for label classification.Each modality has its own inference network. This model was trained by optimizing an estimated lower bound (ELBO) on the marginal likelihood of observed data, i.e reconstructions of the modalities as well as the classification loss.\nWe also used a sampling based training scheme such that for each training example containing modalities, we obtained the loss for all combinations of modalities given to the model, this ensured the learned model generalized to perform well in reconstructing given any combination set of the modalities. We used three modalities for experimentation and trained the model on a MNIST dataset with images in two languages, Farsi and Kannada as first two modalities and speech utterances of the MNIST digits as the third modality.\nThe model performed well in terms of the convergence of ELBO loss, individual reconstruction losses, classification accuracy as well as the final visual reconstructions of the modalities. We also performed various analyses in terms of hyperparameter tuning, reconstruction under different modality combinations as well as analysis of disentanglement of representation property.\n","date":1608055645,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1608055645,"objectID":"80b0e8595f3dce02bc9bc9f0d799e397","permalink":"https://mnishant2.github.io/project/multimodalvae/","publishdate":"2020-12-15T13:07:25-05:00","relpermalink":"/project/multimodalvae/","section":"project","summary":"Training a latent variable based variational inference model on multimodal data in order to perform inference with all possible combinations of missing modalities. ","tags":["vae","featured","vision","multimodal learning","generative learning","pgm","variational inference"],"title":"Generative Multimodal Learning for Reconstructing Missing Modality","type":"project"},{"authors":null,"categories":null,"content":"My website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License .\n ","date":1599951600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1599951600,"objectID":"53e892b8b41cc4caece1cfd5ef21d6e7","permalink":"https://mnishant2.github.io/license/","publishdate":"2020-09-13T00:00:00+01:00","relpermalink":"/license/","section":"","summary":"My website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License .\n ","tags":null,"title":"LICENSE: CC-BY-SA","type":"page"},{"authors":["Nishant Mishra"],"categories":["Computer Vision","layout analysis"],"content":"This project involved an automatic highlighter tool for automatic highlighting and extraction of specific form fields from documents for further processing such as Optical Character Recognition, information retrieval from handwritten documents or even to facilitate semi manual digital population of records from forms using a user interface.\nThe tool utilizes document layout detection, classical Computer vision techniques like template matching and mathematical heuristics to create a generalizable automatic highlighting tool using only one sample of the concerned document.\nThe associated repository here is designed for handling a particular bank form and is a command line highlighting tool that can be appropriated/extended for other documents and interfaces.\n","date":1598015887,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1598015887,"objectID":"8a54824059eb2a01e2d09c8a3c54be6f","permalink":"https://mnishant2.github.io/project/highlighter/","publishdate":"2020-08-21T09:18:07-04:00","relpermalink":"/project/highlighter/","section":"project","summary":"A tool to highlight/extract specific form fields from documents using classical Computer Vision and heuristics","tags":["vision","ocr","text extration"],"title":"Highlighter(Auto field detection)","type":"project"},{"authors":null,"categories":null,"content":"","date":1597622400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1597802042,"objectID":"fd36605688ef45e10dc233c860158012","permalink":"https://mnishant2.github.io/cv/","publishdate":"2020-08-17T00:00:00Z","relpermalink":"/cv/","section":"","summary":"List of Projects,Talks and Publications","tags":null,"title":"CV","type":"widget_page"},{"authors":null,"categories":null,"content":"","date":1597622400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1597802042,"objectID":"475f2249a5c02879faf32697c0e89e6e","permalink":"https://mnishant2.github.io/portfolio/","publishdate":"2020-08-17T00:00:00Z","relpermalink":"/portfolio/","section":"","summary":"List of Projects,Talks and Publications","tags":null,"title":"Portfolio","type":"widget_page"},{"authors":["Nishant Mishra"],"categories":null,"content":"","date":1594825200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1594825200,"objectID":"2e58d69ae606bdfe9dfbc1115cd1e132","permalink":"https://mnishant2.github.io/talk/lca/","publishdate":"2020-09-07T21:04:49-04:00","relpermalink":"/talk/lca/","section":"talk","summary":"A detailed discussion on Locally Competitive algorithms used for Sparse Approximation","tags":["sparse coding","LCA","machine learning"],"title":"Locally Competitive Algorithms","type":"talk"},{"authors":["Nishant Mishra"],"categories":null,"content":"","date":1591455600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1591455600,"objectID":"b45d85327da280a44bd7279a4c0c068a","permalink":"https://mnishant2.github.io/talk/histopathology/","publishdate":"2020-09-07T21:04:43-04:00","relpermalink":"/talk/histopathology/","section":"talk","summary":"A detailed literature survey of applications and relevance of Deep Learning in Histopathology","tags":["deep learning","histopathology","vision"],"title":"Histopathology","type":"talk"},{"authors":["Shubham Chopra","Nishant Mishra"],"categories":["Reinforcement Learning","course project"],"content":"This project was done as part of my final project submission for COMP767: Reinforcement Learning course at McGill University\nIn the recent years, significant work has been done in the field of Deep Reinforcement Learning, to solve challenging problems in many diverse domains. One such example, are Policy gradient algorithms, which are ubiquitous in state-of-the-art continuous control tasks. Policy gradient methods can be generally divided into two groups: off-policy gradient methods, such as Deep Deterministic Policy Gradients (DDPG) , Twin Delayed Deep Deterministic (TD3) , Soft Actor Critic (SAC) and on-policy methods, such as Trust Region Policy Optimization (TRPO) .\nHowever, despite these successes on paper, reproducing deep RL results is rarely straightforward. There are many sources of possible instability and variance including extrinsic factors (such as hyper-parameters, noise-functions used) or intrinsic factors (such as random seeds, environment properties).\nIn this project, we perform two different analysis on these policy gradient methods: (i) Reproduction and Comparison: We implement a variant of DDPG, based on the original paper. We then attempt to reproduce the results of DDPG (our implementation) and TD3 and compare them with the well-established methods of REINFORCE and A2C. (ii) Hyper-Parameter Tuning: We also, study the effect of various Hyper-Parameters(namely Network Size, Batch Sizes) on the performance of these methods.\n","date":1588291200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1588291200,"objectID":"9a067d49836df8e7d7f1a8a570b04d0e","permalink":"https://mnishant2.github.io/project/policy_gradient/","publishdate":"2020-05-01T00:00:00Z","relpermalink":"/project/policy_gradient/","section":"project","summary":"Reproducibility and Analysis of Deep Policy Gradient methods for Reinforcement Learning Tasks","tags":["RL","Policy Gradients","featured"],"title":"Policy Gradient","type":"project"},{"authors":["Paniz Bertsch","Albert Orozco Camacho","Nishant Mishra"],"categories":["Graph Representation Learning"],"content":"This project was undertaken as part of the final project for COMP 766: Graph Representation Learning course at McGill University.\nFor many computer science sub-fields, knowledge graphs (KG) remain a constant abstraction whose usefulness relies in their representation power. However, dynamic environments, such as the temporal streams of social media information, brings a greater necessity of incorporating additional structures to KG’s.\nIn this project, we applied currently available solutions to address incremental knowledge graph embedding to several applications to test their efficiency. We also proposed an embedding model agnostic framework to make these models incremental. Firstly, we proposed a window-based incremental learning approach that discards least happening facts and performs link prediction on updated triples. Next, we presented experiments on a GCN model-agnostic meta-learning based approach.\nTo create edge embedding vectors, we experimented with two methods:\n Concatenating head and tail’s 128-dimensional Node2Vec embedding vectors to create 256-dimensional edge embedding Subtracting head embedding from tail embedding vector to create 128-dimensional edge embedding vector Our best model is the Window-based KG Incremental Learning, where edge representations, are calculated from subtraction of embedding vectors of head and tail nodes For the experiment, link prediction adjusted to a binary classification, with 0 and 1 representing link is present or absent respectively, was used, with Random-Forest model for training and prediction. Also, dataset is divided to training set and nine test sets as incremental updates, to generate 9 snapshots of graph with each snapshot, adding new nodes and updating edges compare to previous graph snapshot.\nThe second method we experimented with followed a model-agnostic meta-learning based approach with Graph Convolutional Networks(GCN). The idea here is to learn a GCN to predict the embeddings of new nodes given the old embeddings of its neighboring entities in the old graph and similarly obtain an updated representation of old entities based on the recently learned embedding of new entities. These two predictions are jointly iterated. This can be viewed as learning to learn problem (meta-learning). tsne visualization of top 40 entity embeddings cluster\n ","date":1586870617,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1586870617,"objectID":"c85051f0976a751798eef9680b451936","permalink":"https://mnishant2.github.io/project/online_learning/","publishdate":"2020-04-14T09:23:37-04:00","relpermalink":"/project/online_learning/","section":"project","summary":"In this project, we apply currently available solutions to address incremental knowledge graph embedding to several applications to test their efficiency.","tags":["graph","knowledge bases","gcn","fb20k","featured","cv"],"title":"Online Learning of temporal Knowledge Graphs","type":"project"},{"authors":["Nishant Mishra"],"categories":null,"content":"","date":1579700700,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1579700700,"objectID":"1244fe360cbb8336759d7d52ddafc1cc","permalink":"https://mnishant2.github.io/talk/transe/","publishdate":"2020-09-07T20:33:42-04:00","relpermalink":"/talk/transe/","section":"talk","summary":"In Class presentation for COMP 766:Graph Representation Learning, Winter 2020","tags":["graph representation learning","knowledge graph","embedding"],"title":"TransE","type":"talk"},{"authors":["Shubham Chopra","Nishant Mishra"],"categories":["vision","deep learning","generative modelling","game theory"],"content":"In this project, the final project for COMP551: Applied Machine Learning Course we study the 2014 published paper Generative Adversarial Networks . We have tried to reproduce a subset of the results obtained in the paper and performed ablation studies to understand the model\u0026rsquo;s robustness and evaluate the importance of the various model hyper-parameters. We also extended the model to include newer features in order to improve the model\u0026rsquo;s performance on the featured datasets, by making changes to the model\u0026rsquo;s internal structure, inspired by more recent works in the fi\u000celd.\nGenerative Adversarial Networks (GANs) were first described in this paper and are based on the zero-sum non-cooperative game between a Discriminator (D) and a Generator(G), analysed thoroughly in the field of Game Theory . The framework where both D and G networks are multilayer perceptrons, is referred to as Adversarial Networks.\nThe provided code was implemented using the now obsolete Theano framework and using python2, hence it was really difficult to recon\u000cfigure and get it setup on our system. Nevertheless we managed to hack the code and get it to execute for the task of reproducing the results on MNIST dataset but proceeded to use the much more interpretable and relevant pytorch implementation for ablation studies and extension of the model. The original paper trains the presented GAN network on the MNIST, CIFAR-10 and TFD images. However, the Toronto Faces Database (TFD) is not accessible without permission, and the provided code does not include scripts for it. Hence, we do not reproduce their results on the TFD database.\nGANs have been known to be unstable to train, often resulting in generators that produce nonsensical outputs. We decided to put this notion to test by tuning some of the hyperparameters involved in training the models. As part of the ablation studies, we experimented with different values for\n Learning Rates: We tuned the learning rates of both Generator and Discriminator models. Loss Functions: We decided to experiment with the L2 norm or Mean Squared error loss function. D_steps: Number of steps to apply for the Discriminator, i.e the number of times the Discriminator is trained before updating the Generators. We changed it from 1 to 2 as part of our experiment. As part of extensions of GAN we implemented two variants of GAN\n Deep Convolutional Generative Adversarial Networks or DCGAN are a variation of GAN where the vanilla GAN is upscaled using CNNs. Conditional Generative Adversarial Networks or cGAN which allows us to direct the generation process of the model by conditioning it on certain features, here, the class labels. ","date":1576433516,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1576433516,"objectID":"8a5c126575c0999f87cfbb8e190c9af6","permalink":"https://mnishant2.github.io/project/gan/","publishdate":"2019-12-15T14:11:56-04:00","relpermalink":"/project/gan/","section":"project","summary":"A reproducibility test, ablation studies and extension of the seminal Generative Adversarial Networks paper","tags":["deep learning","vision","game theory","DCGAN","cGAN","course"],"title":"Generative Adversarial Networks: Reproducibility Study","type":"project"},{"authors":["Priyesh Vijayan","Ashita Diwan","Nishant Mishra"],"categories":["graph representation learning","nlp","course project"],"content":"This project was directed towards the final course project requirement for COMP 550: Natural Language Processing course at McGill University.\nKnowledge graphs (KGs) succinctly represent real-world facts as multi-relational graphs. A plethora of work exists in embedding the information in KG to a continuous vector space in order to obtain new facts and facilitate multiple down-stream NLP tasks.\nDespite the popularity of the KG embedding problem, to the best of our knowledge, we find that no existing work handles dynamic/evolving knowledge graphs that incorporates facts about new entities.\nIn this project, we propose this problem as an incremental learning problem and propose solutions to obtain representations for new entities and also update the representations of old entities that share facts with these newer entities. The primary motive of this setup is to avoid relearning the knowledge graph embedding altogether with the occurrence of every new set of facts (triplets).\nWe build our solutions with TransE(Bordes et al.) as our base KG embedding model and evaluate the learned embeddings on facts associated with these new entities.\nTo this aim, we formulated two solutions; the first approach followed a finetuning based transfer-learning solution, and the second followed a model-agnostic meta-learning based approach with Graph Convolutional Networks (GCN). While our model-specific finetuning approach fared well, the proposed model independent approach failed to learn representations for a new entity.\nWe used OpenKE’s implementation for setting our model. For our task, we made changes to the TransE model, so that it can learn the representations of the new entities. We employed the FB20K dataset ( Xie et al., 2016 ) for our task. In addition to containing all the entities and relations from the FB15K dataset, this dataset also contains new entities which was required for our setup. We evaluate the models for link prediction, which aims to predict the missing h or t for a relation fact (h, r, t).\n","date":1576368000,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1576368000,"objectID":"9292e43774e11c9f7cbc975f7c85919c","permalink":"https://mnishant2.github.io/project/transe/","publishdate":"2019-12-15T00:00:00Z","relpermalink":"/project/transe/","section":"project","summary":"In this project, we propose an incremental learning problem for Knowledge Graphs to obtain representations for new entities and also update the representations of old entities that share facts with these newer entities.","tags":["nlp","graph","GCN","TransE","featured"],"title":"Incremental Knowledge Graphs","type":"project"},{"authors":["Nishant Mishra"],"categories":["vision","classical computer vision","SIFT","Image Registration"],"content":"This was the final assignment of COMP558:Fundamentals of Computer Vision course, where we had to implement an image stitching(Panorama) algorithm from scratch. We were given a set of images taken by rotating the camera vertically and horizontally and the goal was to stitch them together to form a panorama exactly like how mobile devices do.\nWe used the SIFT algorithm implemented as part of this project with certain modifications(second order keypoint extraction) for feature extraction. Features along edges are eliminated using eigenvalues of the hessian matrix, and weak features along edges will have low eigenvalues along the edge and are therefore suppressed. The low contrast features are eliminated in this implementation using second order Taylor series based thresholding. Instead of 36 dimension feature histograms, now we had 128 dimensional feature vectors which are intuitively better descriptors.\nFor the extracted features, two different matching strategies viz matchFeatures(MATLAB function) and our own implementation of Bhattacharyya Distance that requires normalized histograms were compared. We decided to proceed with featureMatch for the relative simplicity, even though Bhattacharyya measure was more robust and rich.\nUsing the feature matches we implemented a least squares based Random Sample Consensus(RANSAC) algorithm to find a homography H between corresponding images that puts matched points in exact correspondence. This step is called Image Registration . The homography was found by solving the equation of the form Ax+B given below, using Singular Value Decomposition . Least Squares Estimation equation for finding Homography\n For solving this equation we just need 4 matches, so in our RANSAC algorithm we select 4 random points at each iteration to find homography and then using the Homography Matrix, we find a consensus set, i.e the matches in two images that agree to the homography calculated by using Euclidean Distance . We calculate the distance between transformed points for each match(using H) and corresponding actual matches and threshold them at 0.5 to filter inliers.\nFollowing the sequential image registration we use the matched features from consecutive images to learn geometric transformations between them in order to project them into a panoramic image. This process is called Image Stitching . In order to perform image stitching, an empty panorama is created, then the images are aligned and blended based on the learned homography after which they are warped on to the panorama canvas. Result of our Image stitching algorithm on Real images taken from my OnePlus phone\n ","date":1575396665,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1575396665,"objectID":"bce88feed07c403605a58da756991a47","permalink":"https://mnishant2.github.io/project/image_stitching/","publishdate":"2019-12-03T14:11:05-04:00","relpermalink":"/project/image_stitching/","section":"project","summary":"Implemented an image stitching algorithm for creating panoramas from successive images from a rotating camera from scratch.","tags":["vision","SIFT","RANSAC","course","descriptors","SVD","least squares","stitching","featured"],"title":"Image Stitching (Panorama)","type":"project"},{"authors":["Shubham Chopra","Nishant Mishra"],"categories":["vision","deep learning","course projects"],"content":"This was a competition hosted on Kaggle and was a miniproject for the COMP 551: Applied Machine Learning Course. We analyze different Machine Learning models to process a modified version of the MNIST dataset and develop a supervised classification model that can predict the number with the largest numeric value that is present in an Image.\nWe analyze Images from a modified version of the MNIST dataset (Yann Le Cunn, 2001) . MNIST is a dataset that contains handwritten numeric digits from 0-9 and the goal is to classify which digit is present in an image. The given dataset contains 50,000 modified MNIST images.The images are grayscale images of size 128*128. Each image contains three MNIST style randomly sampled numbers on custom grayscale backgrounds each at various positions and orientations in the image. The task was to train a model in order to identify the number with the highest numerical value in the image.\nWe experimented numerous models with different configurations for this task. The models chosen were primarily pretrained complex neural network models, such as ResNets, VGGNets and EfficientNets . After fine-tuning the best performing models’ hyper-parameters, to further boost the classification accuracy, we used various data augmentation techniques, including Affine Transformation Mappings, Scale-Space blurring, Contrast changes and Perspective transforms. By doing so, we were able to gain a higher accuracy on the test set, as compared to before data augmentation. The final fine-tuned model was able to achieve an accuracy of 99% on the validation data, and an accuracy of 99.166% on the test data in the public leaderboard of the competition. We finished 2nd and 4th out of 105 teams(Group 30) on the public and the private leaderboards of the competition respectively.\n","date":1573755089,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1573755089,"objectID":"e86e040442c1680ff02578cce5bc7379","permalink":"https://mnishant2.github.io/project/modified_mnist/","publishdate":"2019-11-14T14:11:29-04:00","relpermalink":"/project/modified_mnist/","section":"project","summary":"Identifying the highest number present in modified MNIST images containing multiple handwritten digits on random backgrounds using deep learning","tags":["vision","deep learning","kaggle","course","MNIST","CNN"],"title":"Modified MNIST [Kaggle]","type":"project"},{"authors":["Nishant Mishra"],"categories":["vision","classical computer vision","SIFT"],"content":"In this project, which was essentially an assignment in COMP558:Fundamentals of Computer Vision course, I implemented the Scale Invariant Feature Transform(SIFT) algorithm from scratch. SIFT is a traditional computer vision feature extraction technique. SIFT features are scale, space and rotationally invariant.\nSIFT is a highly involved algorithm and thus implementing it from scratch is an arduous tasks. At an abstract level the SIFT algorithm can be described in five steps\n Find Scale Space Extrema: We construct the Laplacian(Difference of Gaussian) pyramid for the given image and using this pyramid, we found local extremas in each level of the laplacian pyramid by taking a local area and comparing the intensities in that local region for the same scale as well as the adjacent(next and previous) levels in the pyramid. Two local neighbourhood sizes(33,55) were tried.\n Keypoint Localization: A large number of keypoints are generated by the first step which might not be useful. Corner cases and low contrast keypoints are discarded. Also a threshold was specified in order to select only strong extremas. A taylor series expansion of scale space is done to get a more accurate value of extrema and those falling below the threshold were discarded.\n Gradient Calculation: For each keypoint detected, a square neigborhood(17x17 in our case) was taken around them at their respective scales. Intensity gradients and orientation were calculated for the given neighborhood. A gaussian mask of the same size as our neighborhood was used as a weighting mask over gradient magnitude matrix.\n SIFT Feature Descriptors: SIFT feature descriptors are created by taking histograms of gradients orientations for each keypoint neighborhoods. Orientations are divided into bins of various ranges(36 bins of 10 deg in our case), and for each gradient falling in a bin the gradient magnitude value is added to that particular bin. Once we have the histogram we find the orientation with the highest weighted value. Its the principle orientation and the desriptors(orientation vectors) are shifted counterclockwise such that principle orientation becomes the first bin. This lends SIFT features their rotational invariance.\n Once we had the SIFT desriptors, I transformed the image and calculated SIFT vectors for the original and transformed images and matched them using bruteforce algorithm i.e Bhattacharyya Distance and visualised(as in figure above) the matches above a certain threshold to test the robustness of the SIFT algorithm.\n","date":1573495871,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1573495871,"objectID":"a25b47476ab29b4dd9dc92044eb90135","permalink":"https://mnishant2.github.io/project/sift/","publishdate":"2019-11-11T14:11:11-04:00","relpermalink":"/project/sift/","section":"project","summary":"Implementing Scale Invariant Feature Transform from scratch and feature matching","tags":["vision","SIFT","laplacian","course","descriptors"],"title":"SIFT","type":"project"},{"authors":["Aarash Feizi","Shubham Chopra","Nishant Mishra"],"categories":["nlp","sentiment analysis","ensemble model","deep learning","course projects"],"content":"This was a competition hosted on Kaggle and was a miniproject for the COMP 551: Applied Machine Learning Course.\nWe analyze text from the website Reddit, and develop a multilabel classification model to predict which subreddit (group) a queried comment came from. Reddit is an online forum, where people discuss various topics from sports to cartoons, technology and video-games. The dataset is a list of comments from 20 different subreddits (groups/topics). This problem can be formulated as a type of Sentiment analysis problem, which is quite well-known in the Natural Language Processing (NLP) literature. Sentiment analysis is a computational approach toward identifying opinion, sentiment, and subjectivity in text.\nFor this dataset, we implemented a Bernoulli Naive Bayes classifier, trained and tested it against the dataset. We also analyzed various models for improving the classification accuracy, including Support Vector Machines, Logistic Regression, k-Nearest Neighbours, the Ensemble method of Stacking and a Deep Learning model ULMFiT (J.Howard and S.Ruder, 2018) . We also tried using the FlairNLP library concatenating several combinations of embeddings such as FlairEmbeddings + BERT to get text features for classification\nWe compare the accuracy of these models for different Feature extraction methods, namely Term Frequency-Inverse document frequency (TF-IDF) , Binary and Non-Binary Count Vectorizer. We also analyze the performance gain/loss after applying Dimensionality reduction methods on the dataset. In particular, we explore the Principle Component Analysis (PCA) inspired method of Latent Semantic Analysis (LSA) .\nWe observed that the best results were obtained by stacking various combinations of the models described above. For the final submission, we used an ensemble classifier with ’soft’ voting by Stacking SVM, Naive Bayes and Logistic Regression at their optimum parameter settings.which gave an accuracy of 57.97% on our validation data and 58.011% on kaggle public leaderboard. Adding ULMFit to the stack and using a logistic regression on top as meta classifier further bolstered the accuracy to 60.1%. We finished 10th and 8th out of 105 teams(Group 60) on the public and the private leaderboards of the competition respectively.\n","date":1571681480,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1571681480,"objectID":"66e690e4ad0c1510d7034350bf75b67d","permalink":"https://mnishant2.github.io/project/reddit_comment/","publishdate":"2019-10-21T14:11:20-04:00","relpermalink":"/project/reddit_comment/","section":"project","summary":"We analyze different Machine Learning models to process Reddit data and develop a supervised classification model that can predict what community a certain comment came from.","tags":["nlp","sentiment analysis","course","PCA","ULMFit","naive bayes","stacking","SVM","machine learning"],"title":"Reddit Comment Classification [Kaggle]","type":"project"},{"authors":[],"categories":["deep learning","nlp"],"content":"The project at Signzy involved training a generalizable model for information retrieval from OCR output of Indian ID cards. We used both character level embeddings and word level embeddings( ELMO ) in a stacked manner for language modelling before passing the concatenated embeddings to a bidirectional Long Short Term Memory neural network with Conditional Random Field modelling on LSTM output ( Huang et al. ) for final classification.\nThe model was trained on a large corpus of text OCR outputs obtained from our own proprietary ID cards dataset for extracting non-trivial information such as Names, dates, numbers, addreses from any card. The training was done in a way to ensure the embeddings were also fine tuned. The FlairNLP library was used to create the preprocessing, text embedding, training and postprocessing pipeline and training was performed using pytorch framework. Multiple combinations of embeddings including FlairEmbeddings( Contextualized string embeddings for sequence labelling ), BERT, CharacterEmbeddings, ELMO, XLNet were benchmarked before settling on the final pair based on accuracy, compute and efficiency considerations.\nNot only did the model perform admirably well on unseen text from ID types part of training data irrespective of variations in OCR output and image layout, but it generalised well for out of sample ID types too when finetuned with just 1-5 samples of these cards.\nThe idea behind this was to build a generic, flexible information retrieval engine thats pretrained to extract important information from OCR output of all ID cards without specifically being trained on them or having seen them, without any rule based processing, that can be easily finetuned on a very small number of samples of any new card type for optimum performance. This was made into a rest API as a plug and play product for clients to finetune the model on their samples and then use it out of the box to extract information from IDs. The performance was measured using precision and recall figures.\n","date":1558444841,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1558444841,"objectID":"f652a66d3c7e33390d39fe1c6bdd2a7c","permalink":"https://mnishant2.github.io/project/gem/","publishdate":"2019-05-21T09:20:41-04:00","relpermalink":"/project/gem/","section":"project","summary":"Trained a biLSTM model using both word and character level embeddings for information retrieval from text OCR outputs of ID cards","tags":["nlp","featured","deep learning","flair","ELMO","cv"],"title":"Generic Extraction Module (G.E.M)","type":"project"},{"authors":["Nishant Mishra","Prasun Anand","Mahesh Chandra","Zainab Feroz"],"categories":[],"content":"","date":1548979200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1599616001,"objectID":"7b1e01b76f631c400242f0f35dfb23f0","permalink":"https://mnishant2.github.io/publication/mishra-performance-2019/","publishdate":"2020-09-09T01:46:41.54397Z","relpermalink":"/publication/mishra-performance-2019/","section":"publication","summary":"Speaker Recognition is one of the principle problems in Speech processing. The performance of speaker recognition systems can be improved by carefully choosing and calculating suitable features, which is an arduous task. Therefore, the learning based approach has been found to be simpler, more general and with the rapid growth in Artificial Intelligence, more accurate. This paper is a comparative study of the performance of different neural networks in speaker recognition. The focus of this work is to find which of these learning algorithms is more accurate, less complex, and more generic when it comes to speaker recognition. A database of 5000 utterances, 100 for each of the 50 different speakers, in both clean and noisy environment, with varying levels of noise was used. The MFCC (Mel Frequency Cepstral Coefficients) of these utterances were used as features to train and evaluate the neural networks. Accuracy of all neural networks was expectedly very high (textgreater90%) for clean data, large variations coming in with introduction and change in the level of noise. RBFNN has been shown to consistently perform well under all conditions. DNN was the other consistent performer and has the potential to outperform other techniques, if trained on more data.","tags":["\"Biological neural networks\"","\"DNN\"","\"Feature extraction\"","\"Mel frequency cepstral coefficient\"","\"mel frequency cepstral coefficients\"","\"MFCC\"","\"neural nets\"","\"Neural Network\"","\"neural networks\"","\"Noise measurement\"","\"PNN\"","\"RBFNN\"","\"SLFN\"","\"speaker recognition\"","\"Speaker recognition\"","\"Speaker Recognition\"","\"speaker recognition systems\"","\"Training\""],"title":"Performance Evaluation of Neural Networks for Speaker Recognition","type":"publication"},{"authors":["Nishant Mishra","AB Saravanan","Christen Miller"],"categories":["vision","ensemble learning"],"content":"Many of the vision based applications or APIs meant for information retrieval/data verification such as Text extraction or face recognition need a minimal quality of image for efficient processing and adequate performance. Hence it becomes imperative to implement an Image quality assessment layer before proceeding with further processing. This will ensure smooth applicaton of the vision algorithms, reliable performance and an overall time reduction by ensuring less redundant computations on oor quality images, and prventing multiple requests and passes through the algorithm.\nThis additional filter helps by ensuring only optimal quality images are passed on and poor quality images are screened at the client/user stage itself saving the users time and the server unnecessary processing, ensuring higher throughput and efficiency.\nWe implemented one such pipeline using an ensemble of models that qualitatively analysed images and produced a quantitative measure for image quality that could then be used as a threshold for decision on whether they are sent for downstream processing or the user is notified to repeat the request with better quality images. This quantitative score ensures flexibility for different tasks and different people tailored to their needs.\nThe model detects the blur in an image( BlurNet ), brightness of the image(a ResNet-18 model trained for binary classification i.e dark vs bright) and the text readability(based on performance of text detection and OCR algorithms along with other filtering and morphological operations on the image to estimate textual region) and a meta layer performed computation on their individual outputs to provide a final cumulative Image Quality Score.\nThe final meta learner was trained taking the outputs of individual models as input with the average image quality scores assigned to each image by annotators being the output score. The annotation was done by assigning each image to atleast five random users and asking them to score the image on the three parameters i.e Blur, Brightness and readability out of 10 solely on their personal discretion. These scores were then fit into a weighting formula to generate a cumulative score. This final score obtained from all the annotators for each image was averaged to output the final ground truth score for the image.\nThe clients get both the final score as well as outputs from each individual model along with a short description about the image quality based on the score for analysis.\n","date":1548028800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1548028800,"objectID":"ca1dccfb31083acf903d05d61fa19fa3","permalink":"https://mnishant2.github.io/project/iqa/","publishdate":"2019-01-21T00:00:00Z","relpermalink":"/project/iqa/","section":"project","summary":"An ensemble model to quantify image quality to filter poor quality images at the client end to prevent redundant processing","tags":["vision","blur","brightness","text readability","resnet18","ocr","featured"],"title":"Image Quality Assessment","type":"project"},{"authors":["Nishant Mishra","AB Saravanan"],"categories":["Deep Learning","vision"],"content":"","date":1538714041,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1597802042,"objectID":"39398fc5cb4903141e4c8ebf3f61553b","permalink":"https://mnishant2.github.io/project/dory-ocr/","publishdate":"2018-10-05T00:34:01-04:00","relpermalink":"/project/dory-ocr/","section":"project","summary":"We created a state of the art Optical Character Recognition Engine specifically for Indian ID cards using a pipeline for document layout detection, foreground extraction, text detection, recognition and postprocessing","tags":["vision","deep learning","object detection","ocr","featured","cv"],"title":"Dory OCR","type":"project"},{"authors":["Zainab Feroz","Prasun Anand","Nishant Mishra"],"categories":["vision","sign language detection","depth","segmentation"],"content":"This was our Undergrad Final Project where we set out to implement a speech Sign Language intercoversion system. More specifically it was Hindi speech- Indian sign Language interconversion system. The speech to sign language subsystem was essentially a derivative of our speech recognition project with detected speech being mapped to corresponding sign language visuals in real time. Here I shall be discussing our Indian Sign Language detection subsystem. Initially we just used a dataset of 7000 2D images of Indian sign language for classification as a proof of concept, we used a modified VGGNet for classification with a 99% accuracy. But using 2D data was impracticable for building a real time and realistic sign language recognition system. To accommodate more complex backgrounds that we could come across in everyday situation instead of the simple backgrounds as in 2-D dataset and also to account for occlusion, various angles arising due to Indian Sign Language being two handed, we decided to use kinect sensor and hence RGB-D dataset to leverage the depth information rendered by Kinect.\nWe collected RGB-D data for 48 different Indian Signs. These include both RGB and Depth images of digits, alphabets and a few common words. The dataset comprises of around 36 images per word in our vocabulary, contributed by 18 different people. We trained a Multivariate Gaussian Mixture Model(GMM) on the HSV pixel values of the data to segment skin region and intensify the skin pixel areas in the RGB-D images. Skin segmentation using Multivariate GMM\n Since per class data was significantly small for training a robust model, we performed significant data segmentation(blurring,affine transforms,colour adjustments) to multiply the data before training. Once we had the data, we adopted two different paradigms. In the first method we stacked the RGB and Depth image vertically before passing them on to a ResNet-50 classifier for training. This method reached a validation accuracy of 71%. Data sample along with Augmentation\n The second approach involved using a Bilinear CNN system, with two parallel ResNet architectures for RGB and Depth images separately followed by bilinear pooling of features output by them before being passed on to subsequent Dense layers. This approach performed better with a validation accuracy of 79% although it was computationally more expensive. Finally we passed the output of the sign language detection system through Google\u0026rsquo;s text to speech(TTS) generation API for getting the final speech output.\n","date":1525975940,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1525975940,"objectID":"8f630a59d464c719b0bbbbc2d9bad7be","permalink":"https://mnishant2.github.io/project/sign_language/","publishdate":"2018-05-10T14:12:20-04:00","relpermalink":"/project/sign_language/","section":"project","summary":"Deep Learning based Indian Sign Language detection for conversion to speech, a subsystem of the Hindi speech-Indian sign language interconversion system","tags":["vision","deep learning","bilinear CNN","kinect","RGB-D","GMM","HSV","speech","featured","cv"],"title":"Sign Language Classification [Bachelor Project]","type":"project"},{"authors":["Nishant Mishra"],"categories":["vision","deep learning","segmentation"],"content":"As the name suggests, this project involved training a model to crop out documents from a background. Essentially this can be classified as a segmentation task that would need massive annotation of data with a mask on the foreground object which is to be used for supervised segmentation training.\nWe decided to cast this into a regression based problem where we annotated only the four corner points of the foreground object as our training labels and then used them to train a regression model with 8 continuous valued outputs({x,y} coordinates of all four corners). Once we had these points we implemented a perspective transform to warp the object into a rectangular space for the final cropped output.\nFor training we used custom aggregated and crowdsourced dataset of ID cards and other documents in various background settings. We implemented our own annotation tool for the above mentioned ground truth annotation. In order to ensure variance, we used both natural camera taken images as well as synthetically generated data by superimposing the already available cropped samples on random backgrounds at different positions, scales and orientation.\nNot only this, we also implemented massive data augmentation in order to further multiply our training data that worked simultaneously on the image and the annotated keypoints. Some of the augmentation techniques used were blurring, rotation, scaling,grayscale, color adjustments, dropout, adding noise etc. We used the imgaug library for the whole augmentation pipeline.\nAnnotation, synthetic data generation, and augmentation were all done in such a way as to ensure the sequence of the four corner points with respect to the object remained same in order to ensure spatial and rotational invariance during training and prediction. The upper left point of the foreground object was always the first label followed by others in a clockwise manner.\nOnce we had sufficient annotated and augmented data, we trained the regression models. We experimented and benchmarked a number of different algorithms and learning paradigms. We benchmarked ResNet , Squeezenet , VGGNet , Shufflenet in both transfer learning and from scratch settings for a large range of hyperparameter values and benchmarked their performances.\nThe outputs, as mentioned above were eight continuous labels, hence the final layer was always a linear activation layer. The loss functions used were variations of Mean Squared Error values. This whole concept was applied and tested on a number of applications such as cropping ID cards from background for further processing in an Optical Character Recognition system , cropping Cheque MICR stub for MICR extraction for digital processing, Passport MRZ code extraction, scanned document layout detection etc and they all fit in perfectly in the overall pipeline and gave robust performances.\n","date":1523491200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1523491200,"objectID":"b6b6e6e8a325200162724a5a975cf700","permalink":"https://mnishant2.github.io/project/cropnet/","publishdate":"2018-04-12T00:00:00Z","relpermalink":"/project/cropnet/","section":"project","summary":"Regression based deep learning models for automatically cropping document as foreground extraction(segmentation) task","tags":["vision","regression","resnet","squeezenet","augmentation"],"title":"Cropnet","type":"project"},{"authors":["Nishant Mishra"],"categories":["Statistical Machine Learning","Deep Learning","Data Science"],"content":"The NCAA Division I Men\u0026rsquo;s and Womens Basketball Tournament , also known and branded as NCAA March Madness, is a single-elimination tournament played each spring in the United States, currently featuring 68 college basketball teams from the Division I level of the National Collegiate Athletic Association (NCAA), to determine the national championship. Every year Kaggle holds a March Madness data Science competition for ML practitioners to use historical tournament data to build models for forecasting outcomes of all possible matchups of the tournament.\nIn this project I decided to apply Machine Learning to the March Madness data for predicting the outcomes. The project is a highly involved one with extensive data related to every team, their players. The data includes tournament head-to-head, past records, form, player statistics, recent performances and results from previous tournament among others. I worked on extensive exploratory data analysis to both visualize the data as well as identify the most pertinent, discriminative stats.\nThe data analysis and visualization was followed by condensing different correlated data to form more complex yet non redundant stats. I implmented new intuitive yet qualitative feaatures such as current form, crowd support etc as quantitative measures using provided features. I also used sports websites like ESPN and NCAA official website to extract further data, from back in the past. after the EDA and feature engineering I experimented and benchmarked a number of Statistical Machine Learning as well as Deep Learning algorithms for the task of forecasting match and tournament results using the features with varying performances. All the code, steps and results have been lucidly explained in the associated jupyter notebooks for reference.\n","date":1512653053,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1512653053,"objectID":"0c3ff430fe61cd0572a41a32dc9d94db","permalink":"https://mnishant2.github.io/project/march-madness/","publishdate":"2017-12-07T09:24:13-04:00","relpermalink":"/project/march-madness/","section":"project","summary":"Applying Machine Learning to March Madness College Basketball tournament for predicting tournament match results","tags":["nlp","deep learning","data science","kaggle","exploratory data analysis"],"title":"March Madness [Kaggle]","type":"project"},{"authors":["Rohit Mohan","Nishant Mishra"],"categories":["vision","deep learning"],"content":"This work was selected for and presented at the final round of Smart India Hackathon 2017 by government of India. The project involved implementing a proof of concept system to detect anomalous activities from camera feed. For this purpose we used the Database for recognition of human actions from the Computer Science department at KTH Royal institute of technology.\nThe database consists of seven types of human actions (walking, jogging, running, boxing, hand waving, sliding and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors:s1, outdoors with scale variation:s2, outdoors with different clothes:s3 and indoors:s4. All sequences were taken over homogeneous backgrounds with a static camera with 25fps frame rate. The sequences were downsampled to the spatial resolution of 160x120 pixels and have a length of four seconds in average.\nThe SOP of the project was preprocessing and feature extraction from the sequences to be passed on for training. All the frames were smoothed with a gaussian filter. This was followed by contour detection. A novel approach of pooling extracted contours(green boxes in video) after Mixture of Gaussian based Background subtraction to get an aggregate binary boundary image of the foreground(contour of the subjects)(blue bounding box in video) as features was implemented.\nIn order to account for the temporal aspect, these final contour images were aggregated in batches of five consecutive frames to be passed on to the Neural Network for training. Additional quantities such as centroid, median topmost, bottommost coordinates of the contours, and squared differences of consecutive left and right coordinates were also claculated for the batch of five frames and passed on to represent the speed and posture. All of these features were concatenated and Principal Component Analysis was applied to them for reducing the dimensionality with n_components=100 that captured most of the variance in the feature space while minimizing the dimension and hence computation and storage requirements. The features were than stored using cPickle.\nBoth Fully connected neural network and CNN were used for training with comparable performance, with an accuracy of ~96% for classification of activities as anomaalous or normal. From the above mentioned actions, boxing and sliding were grouped as anomalous activities and the rest 5 as non anomalous.\n","date":1508025600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1508025600,"objectID":"2c27cf46762e06a5d720c1bdf91fbb40","permalink":"https://mnishant2.github.io/project/activity_recognition/","publishdate":"2017-10-15T00:00:00Z","relpermalink":"/project/activity_recognition/","section":"project","summary":"Using traditional computer vision with deep learning algorithms for Anomalous activity detection from CCTV camera feed","tags":["vision","deep learning","hackathon","SIH","KTH","PCA","opencv","featured","cv"],"title":"Activity Recognition","type":"project"},{"authors":["Nishant Mishra","Benjamin Bigot","Christine Senac"],"categories":["vision","speech recognition","optical character recognition"],"content":"Automatic Speech Recognition systems, especially those leveraging probabilistic modeling such as Hidden Markov Model based ASR systems rely a lot on the associated data/lexicon for optimum performance. In this project done as part of my undergrad summer research Internship at Institut de Recherche en Informatique de Toulouse (IRIT) , Universite Paul Sabatier, we intended to analyse the possible boost in ASR performance by incorporating output of Optical Character Recognition applied on associated visual components of the speech.\nWe set out to study the impact of populating the lexicon of speech processing system with OCR outputs obtained from their videos. To this end, we used the open source, readily available MOOC data for the experimentation. Performing Automatic Speech recognition on these lectures for transcription and indexing is a bit difficult because different videos have a specific set of words depending on the domain of the video called jargon,which are not present in general lexicons we use to train speech recognition models. But most of these videos also have text as part of slides or handwritten scribbles on screen which if used to populate the lexicon in realtime will benefit the speech recognition system.\nWe set out by creating a corpus of such videos along with their transcripts with timestamps and the slides used in pdf or other file formats. We used apache Tika to extract text from these slides as part of ground truth. We also implemented a semi automatic GUI to annotate the slide transitions with respective timestamps in the video for accurate temporal alignement with ground truth for benchmarking OCR performance.\nFor Video OCR we used the LOOV(Poignant et al.) tool that uses classical Computational techniques such as Sobel filtering, Sauvola Algorithm followed by text tracking over consecutive frame to ensure text persistence for text detection and then tesseract OCR engine for text detection. The text detections are averaged over shifted regions and Viterbi Algorithm applied for modelling the best OCR output using SRILM library . We reimplemented parts of LOOV in python by taking developers version of PyLOOV which had functional issues and optimised it for our own use case.\nWe benchmarked the performance of our video OCR using ground truth annotations obtained from the slides using Recall and precision as metrics. Now we identified some domain specific words that were present in the OCR output but not in the transcript to get a general ballpark of possible improvement. We found out that there were words in range of 2 to 20%(avg ~10%) of the total words in the OCR which were absent in the transcripts on a per slide basis. The HMM based speech Recognition model was trained with the old and updated lexicons using Kaldi toolkit and expectedly we observed a significant improvement of an average of 5% in performance of the ASR for our dataset which were heavily domain oriented course lectures from Online course websites such as Coursera, edX.\nSuch a tool when integrated in ASR systems to update lexicon real time would help tremendously improve the ASR output.\n","date":1496236800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1496236800,"objectID":"7b94549be334b87867e5c6d36b8845a5","permalink":"https://mnishant2.github.io/project/ocr_asr/","publishdate":"2017-05-31T09:20:00-04:00","relpermalink":"/project/ocr_asr/","section":"project","summary":"Optical Character Recognition in Lecture Videos for the enrichment of Automatic Speech Recognition(ASR) system","tags":["vision","speech","OCR","ASR","LOOV","HMM","IRIT","featured","cv"],"title":"OCR to enrich ASR","type":"project"},{"authors":["Zainab Feroz","Prasun Anand","Nishant Mishra"],"categories":["Speech Processing","neural network","speaker recognition","speaker verification"],"content":"At the Signal Processing lab, BIT Mesra, we worked on the automatic speaker recognition project to predict the speaker given a speech utterance. Speaker Recognition is one of the principle problems in Speech processing. The performance of speaker recognition systems can be improved by carefully choosing and calculating suitable features, which is an arduous task.\nThis project was done on a custom dataset containing hindi digit utterances by 50 speakers. The database consisted of 5000 utterances, 100 for each of the 50 different speakers, in both clean and noisy environment, with varying levels of noise from -5dB, 0dB, 5dB, 10dB, 20dB and 30dB.\nThe MFCC (Mel Frequency Cepstral Coefficients) of these utterances were used as features to train and evaluate the neural networks. We performed a comparative analysis of four different neural networks for this task viz. Single Hidden Layer Neural Network, Multi Layer Perceptron(Deep Neural Network), Radial Basis Function Neural Network(RBFNN) and Probabilistic Neural Network(PNN) . MATLAB was used for the implementation and experiments.\nAccuracy of all neural networks was expectedly very high (\u0026gt;90%) for clean data, large variations coming in with introduction and change in the level of noise. RBFNN has been shown to consistently perform well under all conditions. DNN was the other consistent performer and has the potential to outperform other techniques, if trained on more data. Accuracy Vs SNR (in dB) for 50 speakers\n The findings of this project were selected for publication in IEEE Explore and Scopus and presented in the proceedings of 3rd IEEE International Conference on Electrical, Computer and Communication Technologies(ICECCT) , 2019 after peer review.\nWe also worked with the same dataset for analysis of neural networks performance in Speech Recognition task where we compared DNN,RBFNN, PNN, Self Organizing Maps(SOM,unsupervised) for digit recognition, and Speaker Verification task where we compared Regularized RBFNN , Normalized RBFNN , and Deep Neural Networks to verify the identity of a speaker given his new utterance by nearest neighbour prediction on extracted representations.\n","date":1481462569,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1481462569,"objectID":"c527e142281e6d14536a4f6e73abbf68","permalink":"https://mnishant2.github.io/project/speaker_recognition/","publishdate":"2016-12-11T09:22:49-04:00","relpermalink":"/project/speaker_recognition/","section":"project","summary":"Comparative analysis of Neural Network performances for the task of Speaker Recognition using Hindi Digit database in clean and noisy environment","tags":["deep learning","speech","speaker recognition","mfcc","rbfnn","pnn"],"title":"Speaker Recognition","type":"project"},{"authors":["Zainab Feroz","Prasun Anand","Nishant Mishra"],"categories":["Speech Recognition","deep learning"],"content":"In this project we used the same custom database of Hindi Digit utterances by 50 different subjects 10 times each in various noise levels including ideal 0dB lab conditions as used in the speaker recognition project . But here instead of using the data to train and analyse neural network performances for speaker recognition/verification, we trained a speech(here digit) recognition model. Unlike speaker based learning where we had 100 samples per class(50 speakers), here we have 500 samples per class(10 digits), hence intuitively and practically the performance of all the models was better than that in the case of Speaker Reognition.\nWe trained five different models viz. Single Hidden Layer Neural network, Deep Neural Network, Radial Basis Function Neural Network(RBFNN) , Probabilistic Neural Network(PNN) and Self Organizing Maps(SOM,unsupervised) for the same and compared their performances. We used the same MFCC features extracted from each utterance for the training. We introduced an unsupervised paradigm too in the form of Self Organizing Maps that are an unsupervised clustering algorithm for classification.\nAs an extension we also trained a python version of the same model along with the data cleaning and feature extraction pipeline for an opensource noisy english digit data set to test the generalization ability of our approach.\n","date":1476593994,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1476593994,"objectID":"877670a81cf94abaa363d5a8980b77a3","permalink":"https://mnishant2.github.io/project/speech_recognition/","publishdate":"2016-10-16T00:59:54-04:00","relpermalink":"/project/speech_recognition/","section":"project","summary":"Hindi and English digit recognition using MFCC features and five different neural networks and their performance evaluation under different conditions","tags":["speech","digits","mfcc","SOM","RBFNN","MLP","PNN"],"title":"Multilingual Speech Recognition","type":"project"},{"authors":null,"categories":null,"content":"","date":-62135596800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1597802042,"objectID":"3ef1c3ed755398dc4fffccfec12a9a68","permalink":"https://mnishant2.github.io/misc/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/misc/","section":"","summary":"","tags":null,"title":"","type":"widget_page"}]