Validation/test set different accuracy results #678
Some minor differences are to be expected, but the overall CER/WER values are calculated with the same implementation and data processing nowadays, so they really should be minor (<0.25 percentage points in my experience). Are we talking 5 percentage points or 5%?
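For concreteness, a quick illustration of the distinction, with made-up numbers:
```
# Hypothetical numbers: dropping from 80% to 75% character accuracy
# is 5 percentage points, but only a 6.25% relative change.
before, after = 0.80, 0.75
print(f"percentage points: {(before - after) * 100:.2f}")    # 5.00
print(f"relative change:   {(before - after) / before:.2%}")  # 6.25%
```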
Other factors that can also cause slight deviations between the metrics are the device (the GPU and CPU torch implementations have slight numerical differences) and, for whatever reason, batch size. But those deltas are usually also well below what you're describing.
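As a rough illustration of that effect (plain PyTorch, not kraken code, with arbitrary layer sizes): the same float32 weights applied to the same input on CPU and GPU typically produce logits that differ only around the 1e-6 level, but that can occasionally flip an argmax for near-tied classes and therefore nudge CER/WER slightly.
```
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(256, 128)   # stand-in for any float32 layer
x = torch.randn(32, 256)

cpu_logits = layer(x)
if torch.cuda.is_available():
    gpu_logits = layer.to("cuda")(x.to("cuda")).cpu()
    # The difference is tiny, but a greedy (argmax) decode can still flip
    # on near-ties, which is where small metric deltas between devices come from.
    print("max abs diff:", (cpu_logits - gpu_logits).abs().max().item())
    print("argmax mismatches:",
          (cpu_logits.argmax(-1) != gpu_logits.argmax(-1)).sum().item())
```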
Are you sure you're testing the same checkpoint that the validation report during training was produced from?
Thanks, that gives me something to work with. I'm quite busy until
February 10 so I probably won't be able to have a deeper look at it
until then but it definitely goes on the list of stuff to investigate
urgently.
On 25/01/20 09:54AM, gregoireproust wrote:
Thank you for your reply and for providing more details!
We are talking 5 percentage points indeed, sorry for the confusion.
Here is a complete and precise breakdown of the tests I performed:
**Training with fixed epochs: 1**
**Command:**
`ketos train -d cuda:0 -q fixed -N 1 -i trans.mlmodel --resize new -e manifest.txt -o trans_model/greg_trans -f xml VMBXCRMZOEYONYJFAPKZFCTU.xml`
**Definitions:**
- trans.mlmodel: the McCATMuS_nfd_nofix_V1.mlmodel
- manifest.txt: contains one file: VMBXCRMZOEYONYJFAPKZFCTU.xml with 49 lines
**Training warnings:**
- The model will be flagged to use new polygon extractor.
- Neural network has been trained on mode L images, training set contains mode 1 data. Consider setting force_binarization.
**Results:**
- Validation run: CER: 0.708 / WER: 0.180
- ketos test (command: ketos test -d cuda:0 -m trans_model/greg_trans_0.mlmodel -f xml VMBXCRMZOEYONYJFAPKZFCTU.xml): CER: 66.85% / WER: 14.29%
- "Manual" results (to my surprise): CER: 70.81% / WER: 18.01%
**Additional info:**
- There are always 49 lines processed for each command as seen in logs (validation run, ketos test, manual).
- The "manual" script performs the kraken ocr command with the same device on the 49 lines and computes CER and WER on the predicted text using the same logic with torch metrics.
- Running ketos test a second time always seems to give different accuracy results, i.e. they are not fixed despite using the exact same command.
**Training with fixed epochs: 2**
**Command:** the exact command above, with 2 epochs
**Results:**
- Validation run and "manual" script: same results for both epochs, for instance epoch 2 => CER: 0.804/80.40% / WER: 0.357/35.71%
- ketos test (with model_1.mlmodel and device cuda:0): CER: 75.51% / WER: 27.02%
**Additional info:**
- Also tried running the "manual" script twice: it produces the exact same results both times
**3. Training until model reaches 100% accuracy:**
**Results (for the epoch when model reaches 100%):**
- Validation run and "manual" script agree: 100% accuracy for both CER and WER
- ketos test: CER: 99.06% / WER: 95.03% / 17 errors => significant difference
**Problem:**
- With these commands used in their simplest form, I can't reproduce the case where validation runs differ from measuring "manually". I will experiment more with the commands to try to identify why I was getting different results (we'll assume for now that I was making mistakes in my earlier tests).
**Conclusion:**
- ketos test always produces different accuracies than the validation runs for me
- The validation runs appear to be correct, because the model outputs do have the same accuracies (using the ocr command and measuring "manually"); the problem therefore seems to lie only with the ketos test command
**Manual Script:**
```
import logging
import subprocess
from pathlib import Path
import xml.etree.ElementTree as ET

# Torch/Data
import torch
from torchmetrics.text import CharErrorRate, WordErrorRate

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def run_kraken_command():
    """
    Calls your exact Kraken command:
        kraken -d cuda:0 -f xml
               -i VMBXCRMZOEYONYJFAPKZFCTU.xml output/output.txt
               ocr -m trans_model/greg_trans_81.mlmodel
    """
    cmd = [
        "kraken",
        "-d", "cuda:0",
        "-f", "xml",
        "-i", "VMBXCRMZOEYONYJFAPKZFCTU.xml",
        "output/output.txt",
        "ocr",
        "-m", "trans_model/greg_trans_81.mlmodel"
    ]
    logger.info(f"Running command: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)
    logger.info("Kraken command finished. The recognized text should be in output/output.txt")


def load_gt_lines_from_xml(xml_file):
    """
    Loads ground-truth lines from a PAGE XML file that looks like this:
        <PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
          <Page>
            <TextRegion>
              <TextLine>
                <TextEquiv>
                  <Unicode>Some text</Unicode>
                </TextEquiv>
              </TextLine>
              ...
            </TextRegion>
            ...
          </Page>
        </PcGts>
    Returns:
        A list of strings, each one the ground-truth text for a line.
    """
    # Parse the XML
    tree = ET.parse(xml_file)
    root = tree.getroot()
    # The default namespace for PAGE 2013-07-15
    NS = {'page': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15'}
    # We'll store one text string per <TextLine>
    lines = []
    # Search for all TextLine elements in that namespace
    # .//page:TextLine means any <TextLine> at any depth within the root
    for textline in root.findall('.//page:TextLine', namespaces=NS):
        # Inside each <TextLine> we look for <TextEquiv><Unicode>...
        # We'll take the *first* <Unicode> if there are multiple
        text_equiv = textline.find('.//page:TextEquiv', namespaces=NS)
        if text_equiv is not None:
            unicode_elem = text_equiv.find('.//page:Unicode', namespaces=NS)
            if unicode_elem is not None and unicode_elem.text:
                lines.append(unicode_elem.text)
            else:
                # if <Unicode> is empty or missing
                lines.append('')
        else:
            # no <TextEquiv> at all
            lines.append('')
    return lines


def compare_lines(pred_lines, gt_lines):
    """
    Computes and logs line-by-line + global CER/WER between
    pred_lines and gt_lines.
    """
    if len(pred_lines) != len(gt_lines):
        logger.warning(f"Line count mismatch: {len(pred_lines)} recognized vs {len(gt_lines)} ground-truth lines!")
    cer_metric = CharErrorRate()
    wer_metric = WordErrorRate()
    total_chars = 0
    line_count = min(len(pred_lines), len(gt_lines))
    for i in range(line_count):
        pred = pred_lines[i]
        gt = gt_lines[i]
        line_chars = len(gt)
        # Update global metrics
        cer_metric.update([pred], [gt])
        wer_metric.update([pred], [gt])
        total_chars += line_chars
        # Compute line-level metrics
        local_cer = CharErrorRate()
        local_cer.update([pred], [gt])
        line_cer = local_cer.compute().item()
        local_wer = WordErrorRate()
        local_wer.update([pred], [gt])
        line_wer = local_wer.compute().item()
        logger.info(f"Line #{i+1}:")
        logger.info(f"  GT   : {gt!r}")
        logger.info(f"  Pred : {pred!r}")
        logger.info(f"  CER  : {100*line_cer:.2f}% => Acc: {100*(1 - line_cer):.2f}%")
        logger.info(f"  WER  : {100*line_wer:.2f}% => Acc: {100*(1 - line_wer):.2f}%")
    # Compute global CER/WER
    global_cer = cer_metric.compute().item()
    global_wer = wer_metric.compute().item()
    char_acc = 1.0 - global_cer
    word_acc = 1.0 - global_wer
    logger.info("-" * 60)
    logger.info(f"GLOBAL RESULTS on {line_count} lines, {total_chars} characters:")
    logger.info(f"  Char Error Rate (CER): {global_cer*100:.2f}% => Char Accuracy: {char_acc*100:.2f}%")
    logger.info(f"  Word Error Rate (WER): {global_wer*100:.2f}% => Word Accuracy: {word_acc*100:.2f}%")
    logger.info("-" * 60)


def main():
    # 1) Run your exact Kraken command
    run_kraken_command()
    # 2) Read recognized text from the output file
    recognized_file = "output/output.txt"
    pred_lines = Path(recognized_file).read_text(encoding="utf-8").splitlines()
    logger.info(f"Read {len(pred_lines)} recognized lines from {recognized_file}")
    # 3) Load ground-truth lines from the same XML
    xml_file = "VMBXCRMZOEYONYJFAPKZFCTU.xml"
    gt_lines = load_gt_lines_from_xml(xml_file)
    # 4) Compare
    compare_lines(pred_lines, gt_lines)


if __name__ == "__main__":
    main()
```
_Note: This script doesn't skip empty lines._
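A minimal sketch of one way to address that caveat, assuming `pred_lines` and `gt_lines` are aligned by index as in the script above (the helper name is purely illustrative):
```
def drop_empty_gt(pred_lines, gt_lines):
    """Remove line pairs whose ground truth is empty before computing global CER/WER."""
    pairs = [(p, g) for p, g in zip(pred_lines, gt_lines) if g.strip()]
    if not pairs:
        return [], []
    preds, gts = zip(*pairs)
    return list(preds), list(gts)

# Usage: filter first, then compare as before.
# pred_lines, gt_lines = drop_empty_gt(pred_lines, gt_lines)
# compare_lines(pred_lines, gt_lines)
```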
No problem, I'll update you in the meantime if I get some time to perform new tests.
Hi,
I hope that my question will be relevant as I'm not an expert.
I noticed that for the same validation and test set, the exact same model produced different accuracy results. I wanted to know whether this was normal behavior and how to obtain the same accuracy results.
The accuracy results differ between: validating a model, testing it with ketos test, and running the ocr command and then computing CER and WER from the ground truth and the predicted text.
The data type is PageXML. I'm using the latest Kraken version on Windows 11 with only Python 3.10. The commands give different accuracies on both CPU and GPU, with --precision 16 disabled and the hyperparameters kept as close to the defaults as possible.
Validation runs give about 5-10% better results than ketos test and than "manually" computing torchmetrics CER and WER on the ocr command's predictions. From testing last week, I think the accuracies from ketos test and from the "manual" ocr measurement are the same, so it really seems to be about validation runs and testing reporting different CER and WER accuracies.
Thank you for your attention!
Greg