# Clean Dirty Documents w/ Autoencoders

This example model cleans text documents of anything that isn't text (i.e. noise): coffee stains, wear artifacts, etc. You can inspect the notebook used to train the model here.

Here's a collage of input texts and predictions.

*Figure 1 - The dirty documents are on the left side and the cleaned ones are on the right*

## Sample Prediction

Once this model is deployed, get the API endpoint by running `cortex get document-denoiser`.

Now let's take a sample image like this one.


Export the endpoint and the image's URL by running:

```shell
export ENDPOINT=<API endpoint>
export IMAGE_URL=https://i.imgur.com/JJLfFxB.png
```

Then run the following piped commands:

```shell
curl "${ENDPOINT}" -X POST -H "Content-Type: application/json" -d '{"url":"'${IMAGE_URL}'"}' |
sed 's/"//g' |
base64 -d > prediction.png
```

Once this has run, we'll see a `prediction.png` file saved to disk. This is the result:

*(cleaned output image)*

As can be seen, the text document has been cleaned of any noise. Success!
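The same request can be made programmatically. Below is a minimal Python sketch using only the standard library; the function names (`decode_prediction`, `clean_document`) are illustrative, and it assumes the endpoint returns the PNG as a JSON-quoted base64 string, exactly as the shell pipeline above does.

```python
import base64
import json
import urllib.request


def decode_prediction(body: str) -> bytes:
    # The API returns the PNG as a JSON-quoted base64 string;
    # strip the quotes (the `sed` step) and decode (the `base64 -d` step).
    return base64.b64decode(body.strip().strip('"'))


def clean_document(endpoint: str, image_url: str, out_path: str = "prediction.png") -> None:
    # POST the image URL as JSON, mirroring the curl command above.
    payload = json.dumps({"url": image_url}).encode()
    request = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = response.read().decode()
    with open(out_path, "wb") as f:
        f.write(decode_prediction(body))
```

For example, `clean_document(os.environ["ENDPOINT"], "https://i.imgur.com/JJLfFxB.png")` would save the cleaned image to `prediction.png`.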


Here's a short list of URLs of other text documents in image format that can be cleaned using this model. Export any of these links to the `IMAGE_URL` variable: