Validation/test set different accuracy results #678

gregoireproust opened this issue Jan 18, 2025 · 4 comments

gregoireproust commented Jan 18, 2025

Hi,

I hope that my question will be relevant as I'm not an expert.

I noticed that for the same validation and test set, the exact same model produced different accuracy results. I wanted to know whether this was normal behavior and how to obtain the same accuracy results.

The accuracy results differ when: validating a model, testing it with ketos test, and running the ocr command and then computing CER and WER from the ground truth and the prediction it produces.

The data type is PageXML. I'm using the latest Kraken version on Windows 11, with Python 3.10 only. The commands give different accuracies whether run on CPU or GPU, with --precision 16 disabled and hyperparameters kept at their defaults as much as possible.

Validation runs give results about 5-10% better than ketos test and than "manually" computing torchmetrics CER and WER from the ocr command's predictions. From testing last week, I think the accuracies from ketos test and the ones "manually" retrieved from ocr are the same. So it really seems to come down to validation runs and testing producing different CER and WER accuracies.
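
For reference, the "manual" computation is roughly the following (a minimal sketch with torchmetrics; pred_lines and gt_lines are assumed to be parallel lists of predicted and ground-truth line strings, one per text line):

from torchmetrics.text import CharErrorRate, WordErrorRate

def score(pred_lines, gt_lines):
    # Accumulate all line pairs and return the corpus-level error rates.
    cer = CharErrorRate()
    wer = WordErrorRate()
    cer.update(pred_lines, gt_lines)
    wer.update(pred_lines, gt_lines)
    return cer.compute().item(), wer.compute().item()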

Thank you for your attention!
Greg

mittagessen (Owner) commented Jan 20, 2025 via email

gregoireproust (Author) commented Jan 20, 2025

Thank you for your reply and for providing more details!
We are talking 5 percentage points indeed, sorry for the confusion.

Here is a complete and precise breakdown of the tests I performed:

Training with fixed epochs: 1

Command:

ketos train -d cuda:0 -q fixed -N 1 -i trans.mlmodel --resize new -e manifest.txt -o trans_model/greg_trans -f xml VMBXCRMZOEYONYJFAPKZFCTU.xml

Definitions:

  • trans.mlmodel: the McCATMuS_nfd_nofix_V1.mlmodel
  • manifest.txt: contains one file: VMBXCRMZOEYONYJFAPKZFCTU.xml with 49 lines

Training warnings:

  • The model will be flagged to use new polygon extractor.
  • Neural network has been trained on mode L images, training set contains mode 1 data. Consider setting force_binarization.

Results:

  • Validation run: CER: 0.708 / WER: 0.180
  • ketos test (command: ketos test -d cuda:0 -m trans_model/greg_trans_0.mlmodel -f xml VMBXCRMZOEYONYJFAPKZFCTU.xml): CER: 66.85% / WER: 14.29%
  • "Manual" results (to my surprise): CER: 70.81% / WER: 18.01%

Additional info:

  • There are always 49 lines processed for each command as seen in logs (validation run, ketos test, manual).
  • The "manual" script performs the kraken ocr command with the same device on the 49 lines and computes CER and WER on the predicted text using the same logic with torch metrics.
  • By testing a second time ketos test, accuracy results always seem to differ, not to be fixed in spite of using the exact same command.

Training with fixed epochs: 2

Command: exactly the same as above, with 2 epochs

Results:

  • Validation run and "manual" script: same results for the 2 epochs for instance epoch 2 => CER: 0.804-80.40% / WER: 0.357-35.71%
  • ketos test (with model_1.mlmodel and device cuda:0): CER: 75.51% / WER: 27.02%

Additional info:

  • Also tried running the "manual" script twice: it produces the exact same results both times

Training until the model reaches 100% accuracy:

Results (for the epoch when model reaches 100%):

  • Validation run and "manual" script comply: 100% CER and WER
  • ketos test: CER: 99.06% / WER: 95.03% / 17 errors => significant difference

Problem:

  • With these commands used in their simplest form, I can't reproduce the case where validation runs differ from measuring "manually". I will experiment more with the commands to try to identify why I was getting different results (let's assume I ran the tests and made mistakes).

Conclusion:

  • For me, ketos test always produces accuracies different from the validation runs
  • The validation runs appear to be correct, because the model outputs do have the same accuracies (using the ocr command and measuring "manually"); the problem only appears with the ketos test command

Manual Script:

import logging
import subprocess
from pathlib import Path

# Torch/Data
import torch
from torchmetrics.text import CharErrorRate, WordErrorRate

import xml.etree.ElementTree as ET

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def run_kraken_command():
    """
    Calls your exact Kraken command:

      kraken -f xml ocr
        -i VMBXCRMZOEYONYJFAPKZFCTU.xml
        output/output.txt
        ocr
        -m trans_model/greg_trans_1.mlmodel
    """
    cmd = [
        "kraken",
		"-d", "cuda:0",
        "-f", "xml",
        "-i", "VMBXCRMZOEYONYJFAPKZFCTU.xml",
        "output/output.txt",
        "ocr",
        "-m", "trans_model/greg_trans_81.mlmodel"
    ]
    logger.info(f"Running command: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)
    logger.info("Kraken command finished. The recognized text should be in output/output.txt")

def load_gt_lines_from_xml(xml_file):
    """
    Loads ground-truth lines from a PAGE XML file that looks like this:

    <PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
      <Page>
        <TextRegion>
          <TextLine>
            <TextEquiv>
              <Unicode>Some text</Unicode>
            </TextEquiv>
          </TextLine>
          ...
        </TextRegion>
        ...
      </Page>
    </PcGts>

    Returns:
      A list of strings, each one the ground-truth text for a line.
    """
    # Parse the XML
    tree = ET.parse(xml_file)
    root = tree.getroot()

    # The default namespace for PAGE 2013-07-15
    NS = {'page': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15'}

    # We'll store one text string per <TextLine>
    lines = []

    # Search for all TextLine elements in that namespace
    # .//page:TextLine means any <TextLine> at any depth within the root
    for textline in root.findall('.//page:TextLine', namespaces=NS):
        # Inside each <TextLine> we look for <TextEquiv><Unicode>...
        # We'll take the *first* <Unicode> if there are multiple
        text_equiv = textline.find('.//page:TextEquiv', namespaces=NS)
        if text_equiv is not None:
            unicode_elem = text_equiv.find('.//page:Unicode', namespaces=NS)
            if unicode_elem is not None and unicode_elem.text:
                lines.append(unicode_elem.text)
            else:
                # if <Unicode> is empty or missing
                lines.append('')
        else:
            # no <TextEquiv> at all
            lines.append('')

    return lines

def compare_lines(pred_lines, gt_lines):
    """
    Computes and logs line-by-line + global CER/WER between
    pred_lines and gt_lines.
    """
    if len(pred_lines) != len(gt_lines):
        logger.warning(f"Line count mismatch: {len(pred_lines)} recognized vs {len(gt_lines)} ground-truth lines!")

    cer_metric = CharErrorRate()
    wer_metric = WordErrorRate()

    total_chars = 0
    line_count = min(len(pred_lines), len(gt_lines))

    for i in range(line_count):
        pred = pred_lines[i]
        gt = gt_lines[i]
        line_chars = len(gt)

        # Update global metrics
        cer_metric.update([pred], [gt])
        wer_metric.update([pred], [gt])

        total_chars += line_chars

        # Compute line-level
        local_cer = CharErrorRate()
        local_cer.update([pred], [gt])
        line_cer = local_cer.compute().item()

        local_wer = WordErrorRate()
        local_wer.update([pred], [gt])
        line_wer = local_wer.compute().item()

        logger.info(f"Line #{i+1}:")
        logger.info(f"  GT   : {gt!r}")
        logger.info(f"  Pred : {pred!r}")
        logger.info(f"  CER  : {100*line_cer:.2f}% => Acc: {100*(1 - line_cer):.2f}%")
        logger.info(f"  WER  : {100*line_wer:.2f}% => Acc: {100*(1 - line_wer):.2f}%")

    # Compute global CER/WER
    global_cer = cer_metric.compute().item()
    global_wer = wer_metric.compute().item()
    char_acc = 1.0 - global_cer
    word_acc = 1.0 - global_wer

    logger.info("-" * 60)
    logger.info(f"GLOBAL RESULTS on {line_count} lines, {total_chars} characters:")
    logger.info(f"  Char Error Rate (CER): {global_cer*100:.2f}% => Char Accuracy: {char_acc*100:.2f}%")
    logger.info(f"  Word Error Rate (WER): {global_wer*100:.2f}% => Word Accuracy: {word_acc*100:.2f}%")
    logger.info("-" * 60)


def main():
    # 1) Run your exact Kraken command
    run_kraken_command()

    # 2) Read recognized text from output file
    recognized_file = "output/output.txt"
    pred_lines = Path(recognized_file).read_text(encoding="utf-8").splitlines()
    logger.info(f"Read {len(pred_lines)} recognized lines from {recognized_file}")

    # 3) Load ground-truth lines from the same XML
    xml_file = "VMBXCRMZOEYONYJFAPKZFCTU.xml"
    gt_lines = load_gt_lines_from_xml(xml_file)

    # 4) Compare
    compare_lines(pred_lines, gt_lines)


if __name__ == "__main__":
    main()

Note: this script has minor issues when dealing with empty TextLines.
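
If useful, one possible way to handle that (a sketch, not validated on my data) is to drop the line pairs whose ground truth is empty before computing the metrics, keeping the predicted and ground-truth lists aligned by index:

def drop_empty_gt(pred_lines, gt_lines):
    # Keep only the (prediction, ground truth) pairs whose ground truth is
    # non-empty, preserving the index alignment between the two lists.
    pairs = [(p, g) for p, g in zip(pred_lines, gt_lines) if g.strip()]
    if not pairs:
        return [], []
    preds, gts = zip(*pairs)
    return list(preds), list(gts)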

mittagessen (Owner) commented Jan 20, 2025 via email

gregoireproust (Author) commented:
No problem, I'll update you in the meantime if I've got some time to perform new tests.
