I chose to create an Optical Character Recognition model using Ridge-regularized logistic regression.
I thought this would be a good fit for an OCR model given that we won't be dealing with sparse data. I was able to find a dataset here that gave a list of 8x16 binary images that encoded letters.
the letter o
The data is already split into folds, so this was a perfect setup for k-fold cross-validation. From my testing, I found that a high initial-step of
The dataset is designed with 26 classes (one for each letter of the alphabet), and 128 features (binary pixel values). For training, we tune the weights for our likelihood using gradient ascent. I used one-hot encoding for the letters, so we have 26 sets of weights to compute the likelihood that a given image is a given letter. For example, if the image is of the letter a
, then we would expect the first probability z
, then we would expect the last probability predict
and validate
functions.
As a demo, I made a quick java program that allows you to draw your own 8x16 letter, and turn it into an array you can paste into the python notebook. I used it to write out "comp five six two" to test my classifier.
We can see that some letters were misclassified, like the e
turning into a z
and the w
turning into a u
.
I believe, given more time, I could create a better model with my current method and better hyperparameters. The issue with increasing the number of iterations, though, is that it could lead to overfitting. Similarly, the dataset I am using is not very large and has some weirdly written letters. For example, the letter f:
I experimented with the emnist-letters dataset, but found that I was not able to get a good result in any reasonable amount of time. So, I stuck with my quirky dataset.
An interesting extension of this project could be to use Principle Component Analysis to analyize the user's handwriting, and create a pdf from plain-text that looks hand-written.
For example, we could get the principle components of a w
from training dataset, get the average handwritten w
of the user from a picture, and then use slight variations of it as a font.