Advanced Tesseract Configuration

kevincon edited this page Jan 4, 2015 · 2 revisions

Configuring Tesseract parameters

Tesseract exposes many configuration parameters that control different aspects of the recognition process. These parameters are enumerated and described in G8TesseractParameters.h.

Reading parameters

You can read the value of a given parameter using the variableValueForKey: method.

For example, to read the value of the whitelist parameter (the whitelist consists of only the characters Tesseract should recognize):

// Assuming "tesseract" is an already initialized `G8Tesseract` object
NSString *whitelist = [tesseract variableValueForKey:kG8ParamTesseditCharWhitelist];

Setting parameters

You can set the value of a given parameter in one of three ways: individually (after initialization), with a dictionary (during initialization), or with one or more configuration files (during initialization). You can also combine these methods. The parameters are set (and possibly overridden) in the order in which you use each method.

Individually

// Assuming "tesseract" is an already initialized `G8Tesseract` object
// Set the whitelist to recognize only the numbers 0 through 9
[tesseract setVariableValue:@"0123456789" forKey:kG8ParamTesseditCharWhitelist];
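
To confirm that a value took effect, you can read it back with variableValueForKey: right after setting it. The following is a minimal sketch that uses only the two methods described above; the helper name G8ConfigureDigitsOnly and the <TesseractOCR/TesseractOCR.h> import path are assumptions made for illustration.

```objc
#import <Foundation/Foundation.h>
#import <TesseractOCR/TesseractOCR.h> // assumed umbrella header of the framework

// Hypothetical helper: restricts recognition to digits, then logs the stored value.
static void G8ConfigureDigitsOnly(G8Tesseract *tesseract) {
    // Set the whitelist to recognize only the numbers 0 through 9
    [tesseract setVariableValue:@"0123456789" forKey:kG8ParamTesseditCharWhitelist];

    // Read the parameter back to check what Tesseract currently holds
    NSString *whitelist = [tesseract variableValueForKey:kG8ParamTesseditCharWhitelist];
    NSLog(@"Current whitelist: %@", whitelist);
}
```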

Using a dictionary

// During initialization, set the whitelist to recognize only the numbers 0 through 9
// and disable word dictionaries
G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"
                                              configDictionary:@{
                                                  kG8ParamTesseditCharWhitelist: @"0123456789",
                                                  kG8ParamLoadSystemDawg       : @"F",
                                                  kG8ParamLoadFreqDawg         : @"F",
                                              }
                                               configFileNames:nil
                                         cachesRelatedDataPath:nil
                                                    engineMode:G8OCREngineModeTesseractOnly];
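
Because parameters are applied in the order in which you use each method, a value set individually after initialization replaces the value passed in the dictionary. Here is a minimal sketch of that ordering, reusing only the calls shown on this page; the function name G8MakeHexTesseract and the import path are assumptions made for illustration.

```objc
#import <Foundation/Foundation.h>
#import <TesseractOCR/TesseractOCR.h> // assumed umbrella header of the framework

// Hypothetical factory: starts with a digits-only whitelist from the dictionary,
// then widens it to hexadecimal digits with an individual call.
static G8Tesseract *G8MakeHexTesseract(void) {
    G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"
                                                  configDictionary:@{
                                                      kG8ParamTesseditCharWhitelist: @"0123456789",
                                                  }
                                                   configFileNames:nil
                                             cachesRelatedDataPath:nil
                                                        engineMode:G8OCREngineModeTesseractOnly];

    // Applied after initialization, so this value overrides the dictionary entry
    [tesseract setVariableValue:@"0123456789ABCDEF" forKey:kG8ParamTesseditCharWhitelist];
    return tesseract;
}
```

The whitelist is used here because it is the parameter this page already sets both ways; whether every other parameter can be changed after initialization is not covered by this sketch.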

Using one or more configuration files

Let's say you have one or more Tesseract configuration files in your "tessdata" folder. You can have a G8Tesseract object load them during initialization by providing an array of the absolute file paths to the configuration files:

debugConfig.txt

tessdata_manager_debug_level    1

recognitionConfig.txt

load_system_dawg            F
load_freq_dawg              F
user_words_suffix           user-words
user_patterns_suffix        user-patterns
tessedit_char_whitelist     0123456789

Note that the configuration files above use the actual Tesseract parameter key strings rather than the constants defined in G8TesseractParameters.h.

ViewController.m

// Construct the paths to our config files
NSString *resourcePath = [NSBundle bundleForClass:G8Tesseract.class].resourcePath;
NSString *tessdataFolderName = @"tessdata";
NSString *tessdataFolderPathFromTheBundle = [resourcePath stringByAppendingPathComponent:tessdataFolderName];
NSString *debugConfigFileName = @"debugConfig.txt";
NSString *recognitionConfigFileName = @"recognitionConfig.txt";
NSString *debugConfigFilePath = [tessdataFolderPathFromTheBundle stringByAppendingPathComponent:debugConfigFileName];
NSString *recognitionConfigFilePath = [tessdataFolderPathFromTheBundle stringByAppendingPathComponent:recognitionConfigFileName];

// Initialize the `G8Tesseract` object using the config files
G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"
                                              configDictionary:nil
                                               configFileNames:@[debugConfigFilePath, recognitionConfigFilePath]
                                         cachesRelatedDataPath:nil
                                                    engineMode:G8OCREngineModeTesseractOnly];

Using the Caches directory for the "tessdata" folder

What if we want to download language/configuration files at runtime and use them with Tesseract? Since the "tessdata" folder in our application's bundle is read-only, we can't store newly downloaded files there.

The solution is to store the "tessdata" folder at a custom path relative to your app's Caches directory. When you initialize your G8Tesseract object, set the cachesRelatedDataPath argument to a path string relative to the Caches directory.

Note that when you use this option, you must still create a referenced folder called "tessdata" in your Xcode project, even if you don't put any files in it.

For example, let's say we want our "tessdata" folder to be located at "Caches/foo/bar/tessdata":

G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"
                                              configDictionary:nil
                                               configFileNames:nil
                                         cachesRelatedDataPath:@"foo/bar"
                                                    engineMode:G8OCREngineModeTesseractOnly];

Upon executing the code above, the directory "Caches/foo/bar/tessdata" will be created (if it doesn't already exist), and all of the contents of the referenced "tessdata" folder in the Xcode project will be copied there. Finally, Tesseract will be initialized to use "Caches/foo/bar/tessdata" as its tessdata location, and it will search for any language/configuration files there.

So if you later download a new language/configuration file, store it in "Caches/foo/bar/tessdata" and re-initialize Tesseract with the same cachesRelatedDataPath, this time specifying the new language/configuration file in the initWithLanguage: and/or configFileNames: arguments.
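
The store-and-re-initialize step can be sketched as follows. This is a minimal sketch under stated assumptions, not the library's own download support: the helper name G8ReloadWithDownloadedLanguage, the downloadedFileURL argument (a file your app has already downloaded to a temporary location), and the import path are all hypothetical. It also assumes the file follows Tesseract's "<language>.traineddata" naming convention.

```objc
#import <Foundation/Foundation.h>
#import <TesseractOCR/TesseractOCR.h> // assumed umbrella header of the framework

// Hypothetical helper: moves an already-downloaded language file into
// "Caches/foo/bar/tessdata" and re-initializes Tesseract with that language.
static G8Tesseract *G8ReloadWithDownloadedLanguage(NSURL *downloadedFileURL,
                                                   NSString *language,
                                                   NSError **error) {
    // Resolve "Caches/foo/bar/tessdata"
    NSString *cachesPath = NSSearchPathForDirectoriesInDomains(NSCachesDirectory,
                                                               NSUserDomainMask,
                                                               YES).firstObject;
    NSString *tessdataPath = [[cachesPath stringByAppendingPathComponent:@"foo/bar"]
                              stringByAppendingPathComponent:@"tessdata"];

    // Make sure the folder exists before moving the file into it
    NSFileManager *fileManager = [NSFileManager defaultManager];
    if (![fileManager createDirectoryAtPath:tessdataPath
                withIntermediateDirectories:YES
                                 attributes:nil
                                      error:error]) {
        return nil;
    }

    // Tesseract language files are named "<language>.traineddata"
    NSString *fileName = [language stringByAppendingPathExtension:@"traineddata"];
    NSString *destinationPath = [tessdataPath stringByAppendingPathComponent:fileName];

    // Replace any older copy of the same language file
    if ([fileManager fileExistsAtPath:destinationPath] &&
        ![fileManager removeItemAtPath:destinationPath error:error]) {
        return nil;
    }
    if (![fileManager moveItemAtPath:downloadedFileURL.path
                              toPath:destinationPath
                               error:error]) {
        return nil;
    }

    // Re-initialize with the same cachesRelatedDataPath and the new language
    return [[G8Tesseract alloc] initWithLanguage:language
                                configDictionary:nil
                                 configFileNames:nil
                           cachesRelatedDataPath:@"foo/bar"
                                      engineMode:G8OCREngineModeTesseractOnly];
}
```

A caller would pass the temporary file URL from its own download code and a language code such as @"fra", then use the returned object in place of the previous G8Tesseract instance.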