Validation/test set different accuracy results #678
Some minor differences are to be expected, but the overall CER/WER values are calculated with the same implementation and data processing nowadays, so they really should be minor (<0.25 percentage points in my experience). Are we talking 5 percentage points or 5%?
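For concreteness, a quick illustration of the distinction, with made-up numbers:
```
# Hypothetical numbers: dropping from 80% to 75% character accuracy
# is 5 percentage points, but only a 6.25% relative change.
before, after = 0.80, 0.75
print(f"percentage points: {(before - after) * 100:.2f}")    # 5.00
print(f"relative change:   {(before - after) / before:.2%}")  # 6.25%
```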
Other factors that can also cause slight deviations between the metrics are the device (the GPU and CPU torch implementations have slight numerical differences) and, for whatever reason, batch size. But those deltas are usually also well below what you're describing.
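As a rough illustration of that effect (plain PyTorch, not kraken code, with arbitrary layer sizes): the same float32 weights applied to the same input on CPU and GPU typically produce logits that differ only around the 1e-6 level, but that can occasionally flip an argmax for near-tied classes and therefore nudge CER/WER slightly.
```
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(256, 128)   # stand-in for any float32 layer
x = torch.randn(32, 256)

cpu_logits = layer(x)
if torch.cuda.is_available():
    gpu_logits = layer.to("cuda")(x.to("cuda")).cpu()
    # The difference is tiny, but a greedy (argmax) decode can still flip
    # on near-ties, which is where small metric deltas between devices come from.
    print("max abs diff:", (cpu_logits - gpu_logits).abs().max().item())
    print("argmax mismatches:",
          (cpu_logits.argmax(-1) != gpu_logits.argmax(-1)).sum().item())
```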
Are you sure you're testing the same checkpoint that the validation report during training was produced from?
Thanks, that gives me something to work with. I'm quite busy until
February 10 so I probably won't be able to have a deeper look at it
until then but it definitely goes on the list of stuff to investigate
urgently.
On 25/01/20 09:54AM, gregoireproust wrote:
Thank you for your reply and for providing more details!
We are talking 5 percentage points indeed, sorry for the confusion.
Here is a complete and precise breakdown of the tests I performed:
**Training with fixed epochs: 1**
**Command:**
`ketos train -d cuda:0 -q fixed -N 1 -i trans.mlmodel --resize new -e manifest.txt -o trans_model/greg_trans -f xml VMBXCRMZOEYONYJFAPKZFCTU.xml`
**Definitions:**
- trans.mlmodel: the McCATMuS_nfd_nofix_V1.mlmodel
- manifest.txt: contains one file: VMBXCRMZOEYONYJFAPKZFCTU.xml with 49 lines
**Training warnings:**
- The model will be flagged to use new polygon extractor.
- Neural network has been trained on mode L images, training set contains mode 1 data. Consider setting force_binarization.
**Results:**
- Validation run: CER: 0.708 / WER: 0.180
- ketos test (command: ketos test -d cuda:0 -m trans_model/greg_trans_0.mlmodel -f xml VMBXCRMZOEYONYJFAPKZFCTU.xml): CER: 66.85% / WER: 14.29%
- "Manual" results (to my surprise): CER: 70.81% / WER: 18.01%
**Additional info:**
- There are always 49 lines processed for each command as seen in logs (validation run, ketos test, manual).
- The "manual" script performs the kraken ocr command with the same device on the 49 lines and computes CER and WER on the predicted text using the same logic with torch metrics.
- Running ketos test a second time always seems to give different accuracy results, i.e. they are not fixed despite using the exact same command.
**Training with fixed epochs: 2**
**Command:** the exact command above, with 2 epochs
**Results:**
- Validation run and "manual" script: same results for both epochs, for instance epoch 2 => CER: 0.804/80.40% / WER: 0.357/35.71%
- ketos test (with model_1.mlmodel and device cuda:0): CER: 75.51% / WER: 27.02%
**Additional info:**
- Also tried running the "manual" script twice: it produces the exact same results both times
**3. Training until model reaches 100% accuracy:**
**Results (for the epoch when model reaches 100%):**
- Validation run and "manual" script agree: 100% accuracy for both CER and WER
- ketos test: CER: 99.06% / WER: 95.03% / 17 errors => significant difference
**Problem:**
- With these commands used in their simplest form, I can't reproduce the case where validation runs differ from measuring "manually". I will experiment more with the commands to try to identify why I was getting different results (we'll assume for now that I was making mistakes in my earlier tests).
**Conclusion:**
- ketos test always produces different accuracies than the validation runs for me
- The validation runs appear to be correct, because the model outputs do have the same accuracies (using the ocr command and measuring "manually"); the problem therefore seems to lie only with the ketos test command
**Manual Script:**
```
import logging
import subprocess
from pathlib import Path
import xml.etree.ElementTree as ET

# Torch/Data
import torch
from torchmetrics.text import CharErrorRate, WordErrorRate

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def run_kraken_command():
    """
    Calls your exact Kraken command:
        kraken -d cuda:0 -f xml
               -i VMBXCRMZOEYONYJFAPKZFCTU.xml output/output.txt
               ocr -m trans_model/greg_trans_81.mlmodel
    """
    cmd = [
        "kraken",
        "-d", "cuda:0",
        "-f", "xml",
        "-i", "VMBXCRMZOEYONYJFAPKZFCTU.xml",
        "output/output.txt",
        "ocr",
        "-m", "trans_model/greg_trans_81.mlmodel"
    ]
    logger.info(f"Running command: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)
    logger.info("Kraken command finished. The recognized text should be in output/output.txt")


def load_gt_lines_from_xml(xml_file):
    """
    Loads ground-truth lines from a PAGE XML file that looks like this:
        <PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
          <Page>
            <TextRegion>
              <TextLine>
                <TextEquiv>
                  <Unicode>Some text</Unicode>
                </TextEquiv>
              </TextLine>
              ...
            </TextRegion>
            ...
          </Page>
        </PcGts>
    Returns:
        A list of strings, each one the ground-truth text for a line.
    """
    # Parse the XML
    tree = ET.parse(xml_file)
    root = tree.getroot()
    # The default namespace for PAGE 2013-07-15
    NS = {'page': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15'}
    # We'll store one text string per <TextLine>
    lines = []
    # Search for all TextLine elements in that namespace
    # .//page:TextLine means any <TextLine> at any depth within the root
    for textline in root.findall('.//page:TextLine', namespaces=NS):
        # Inside each <TextLine> we look for <TextEquiv><Unicode>...
        # We'll take the *first* <Unicode> if there are multiple
        text_equiv = textline.find('.//page:TextEquiv', namespaces=NS)
        if text_equiv is not None:
            unicode_elem = text_equiv.find('.//page:Unicode', namespaces=NS)
            if unicode_elem is not None and unicode_elem.text:
                lines.append(unicode_elem.text)
            else:
                # if <Unicode> is empty or missing
                lines.append('')
        else:
            # no <TextEquiv> at all
            lines.append('')
    return lines


def compare_lines(pred_lines, gt_lines):
    """
    Computes and logs line-by-line + global CER/WER between
    pred_lines and gt_lines.
    """
    if len(pred_lines) != len(gt_lines):
        logger.warning(f"Line count mismatch: {len(pred_lines)} recognized vs {len(gt_lines)} ground-truth lines!")
    cer_metric = CharErrorRate()
    wer_metric = WordErrorRate()
    total_chars = 0
    line_count = min(len(pred_lines), len(gt_lines))
    for i in range(line_count):
        pred = pred_lines[i]
        gt = gt_lines[i]
        line_chars = len(gt)
        # Update global metrics
        cer_metric.update([pred], [gt])
        wer_metric.update([pred], [gt])
        total_chars += line_chars
        # Compute line-level metrics
        local_cer = CharErrorRate()
        local_cer.update([pred], [gt])
        line_cer = local_cer.compute().item()
        local_wer = WordErrorRate()
        local_wer.update([pred], [gt])
        line_wer = local_wer.compute().item()
        logger.info(f"Line #{i+1}:")
        logger.info(f"  GT   : {gt!r}")
        logger.info(f"  Pred : {pred!r}")
        logger.info(f"  CER  : {100*line_cer:.2f}% => Acc: {100*(1 - line_cer):.2f}%")
        logger.info(f"  WER  : {100*line_wer:.2f}% => Acc: {100*(1 - line_wer):.2f}%")
    # Compute global CER/WER
    global_cer = cer_metric.compute().item()
    global_wer = wer_metric.compute().item()
    char_acc = 1.0 - global_cer
    word_acc = 1.0 - global_wer
    logger.info("-" * 60)
    logger.info(f"GLOBAL RESULTS on {line_count} lines, {total_chars} characters:")
    logger.info(f"  Char Error Rate (CER): {global_cer*100:.2f}% => Char Accuracy: {char_acc*100:.2f}%")
    logger.info(f"  Word Error Rate (WER): {global_wer*100:.2f}% => Word Accuracy: {word_acc*100:.2f}%")
    logger.info("-" * 60)


def main():
    # 1) Run your exact Kraken command
    run_kraken_command()
    # 2) Read recognized text from the output file
    recognized_file = "output/output.txt"
    pred_lines = Path(recognized_file).read_text(encoding="utf-8").splitlines()
    logger.info(f"Read {len(pred_lines)} recognized lines from {recognized_file}")
    # 3) Load ground-truth lines from the same XML
    xml_file = "VMBXCRMZOEYONYJFAPKZFCTU.xml"
    gt_lines = load_gt_lines_from_xml(xml_file)
    # 4) Compare
    compare_lines(pred_lines, gt_lines)


if __name__ == "__main__":
    main()
```
_Note: This script doesn't skip empty lines._
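A minimal sketch of one way to address that caveat, assuming `pred_lines` and `gt_lines` are aligned by index as in the script above (the helper name is purely illustrative):
```
def drop_empty_gt(pred_lines, gt_lines):
    """Remove line pairs whose ground truth is empty before computing global CER/WER."""
    pairs = [(p, g) for p, g in zip(pred_lines, gt_lines) if g.strip()]
    if not pairs:
        return [], []
    preds, gts = zip(*pairs)
    return list(preds), list(gts)

# Usage: filter first, then compare as before.
# pred_lines, gt_lines = drop_empty_gt(pred_lines, gt_lines)
# compare_lines(pred_lines, gt_lines)
```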
No problem, I'll update you in the meantime if I get some time to perform new tests.
Hi,
I hope that my question will be relevant as I'm not an expert.
I noticed that for the same validation and test set, the exact same model produced different accuracy results. I wanted to know whether this was normal behavior and how to obtain the same accuracy results.
The accuracy results differ between: validating a model, testing it with ketos test, and running the ocr command and then computing CER and WER from the ground truth and the predicted text.
The data type is PageXML. I'm using the latest Kraken version on Windows 11 with only Python 3.10. The commands give different accuracies on both CPU and GPU, with --precision 16 disabled and the hyperparameters kept as close to the defaults as possible.
Validation runs give about 5-10% better results than ketos test and than "manually" computing torchmetrics CER and WER on the ocr command's predictions. From testing last week, I think the accuracies from ketos test and from the "manual" ocr measurement are the same, so it really seems to be about validation runs and testing reporting different CER and WER accuracies.
Thank you for your attention!
Greg