Modernize datadir #4372

Open. Wants to merge 5 commits into base: main

Conversation

zdenop
Contributor

@zdenop zdenop commented Dec 18, 2024

  • use std::filesystem::path instead of std::string for datadir
  • add a warning if datadir is not a directory or does not exist
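
A minimal sketch of what the two points above could look like. The helper name `check_datadir` and the use of `std::cerr` are assumptions for illustration; they are not taken from the Tesseract sources or from this PR's diff.

```cpp
#include <filesystem>
#include <iostream>
#include <system_error>

// Hypothetical helper, not part of the Tesseract code base.
// Returns true if datadir names an existing directory; otherwise warns on stderr.
bool check_datadir(const std::filesystem::path &datadir) {
  std::error_code ec;  // non-throwing overloads report failures through ec
  if (!std::filesystem::exists(datadir, ec)) {
    std::cerr << "Warning: datadir " << datadir << " does not exist\n";
    return false;
  }
  if (!std::filesystem::is_directory(datadir, ec)) {
    std::cerr << "Warning: datadir " << datadir << " is not a directory\n";
    return false;
  }
  return true;
}
```

Taking a `std::filesystem::path` means callers can pass a `std::string`, a string literal, or a path without manual conversion.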

src/api/baseapi.cpp (outdated review thread, resolved)
Contributor

@egorpugin egorpugin left a comment

The is_directory() checks are probably all unnecessary. Without them, a user who points datadir at a file instead of a directory will simply get some other error later.

src/api/baseapi.cpp (outdated review thread, resolved)
src/ccmain/tessedit.cpp (outdated review thread, resolved)
src/ccmain/tessedit.cpp (outdated review thread, resolved)
src/ccutil/ccutil.cpp (outdated review thread, resolved)
@zdenop
Contributor Author

zdenop commented Dec 19, 2024

@stweil: Thank you for the review. The unit tests still need to be fixed.

src/ccmain/tessedit.cpp (alert fixed)
src/ccmain/tessedit.cpp (alert fixed)
src/ccutil/ccutil.cpp (alert fixed)
@egorpugin
Contributor

Old tprintfs should be removed slowly in favor of std::format.

src/ccmain/tessedit.cpp (outdated review thread, resolved)
Signed-off-by: Stefan Weil <[email protected]>
@stweil
Member

stweil commented Dec 19, 2024

Old tprintfs should be removed slowly in favor of std::format.

Isn't the tesserr stream even better?

@egorpugin
Contributor

Streams are better than tprintf.
But when you use them a lot, you soon notice that breaking the string at every insertion is quite disruptive.

"we add two numbers: " << num1 << " + " << num2 << " = " << num1+num2;

compared to

std::format("we add two numbers: {} + {} = {}", num1, num2, num1+num2);

With even more insertion points, the stream version becomes hard to read compared to a single string with {} placeholders.

@egorpugin
Contributor

Isn't the tesserr stream even better?

I mean that with std::format() the usage would be something like tesserr << std::format("format string", args...);.
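
A small sketch of that combination, assuming a generic `std::ostream` as a stand-in for tesserr. The function names are hypothetical, and the preprocessor guard only exists so the example still builds where `<format>` is unavailable.

```cpp
#include <ostream>
#include <sstream>
#include <string>

#if defined(__has_include)
#  if __has_include(<format>)
#    include <format>
#  endif
#endif

// Stream style: the message text is split at every insertion point.
std::string sum_message_stream(int num1, int num2) {
  std::ostringstream out;
  out << "we add two numbers: " << num1 << " + " << num2 << " = " << num1 + num2;
  return out.str();
}

// Format style: one readable format string, still written through a stream.
// `out` stands in for a stream such as tesserr.
void log_sum(std::ostream &out, int num1, int num2) {
#if defined(__cpp_lib_format)
  out << std::format("we add two numbers: {} + {} = {}", num1, num2, num1 + num2);
#else
  // Fallback for standard libraries without <format> (or pre-C++20 builds).
  out << sum_message_stream(num1, num2);
#endif
}
```

Both paths produce the same text, so the choice is about readability of the call site, not about output.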

@egorpugin
Contributor

I've slightly updated tprintf() to C++ style.

8cb0418

@stweil
Member

stweil commented Dec 19, 2024

I've slightly updated tprintf() to C++ style.

Now I get numerous compiler warnings:

../../../src/ccutil/tprintf.h:37:33: warning: format string is not a string literal (potentially insecure) [-Wformat-security]

And the code size is larger, too.

@egorpugin
Contributor

  1. The warning should be silenced. The updated code is completely fine and valid. We should remove the -Wformat-security flag if it is set, or set -Wno-format-security.
  2. What is the code size change? In general this should be ignored, but I'm curious what the difference is. Probably 5 bytes for each get_debugfp() call plus some pre- and post-call setup.

@egorpugin
Contributor

Can you try changing it to

template <typename ... Types>
auto tprintf(const char *format, Types && ... args) {
  return fprintf(get_debugfp(), format, std::forward<Types>(args)...);
}

?

@stweil
Member

stweil commented Dec 19, 2024

Can you try changing it to [...]

The suggested change still produces the compiler warnings, but I think it improves the code, because calling tprintf() without any argument would then be caught as an error.
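
For reference, a self-contained version of the suggested wrapper that can be compiled outside Tesseract. `demo_debugfp` and `demo_tprintf` are hypothetical stand-ins for get_debugfp() and tprintf(); the stand-in simply returns stderr.

```cpp
#include <cstdio>
#include <utility>

// Stand-in for get_debugfp(); here it just returns stderr.
inline FILE *demo_debugfp() { return stderr; }

// Same shape as the suggestion: an explicit format parameter followed by
// a perfectly forwarded argument pack handed to fprintf.
template <typename... Types>
int demo_tprintf(const char *format, Types &&... args) {
  return std::fprintf(demo_debugfp(), format, std::forward<Types>(args)...);
}
```

A call such as `demo_tprintf("%d + %d\n", 1, 2)` writes to stderr and returns the number of characters written, while `demo_tprintf()` is rejected at compile time because no format argument is supplied.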

@egorpugin
Contributor

Calling tprintf() with no arguments is an error even without the explicit const char *format parameter:

1>tesseract\src\ccutil\tprintf.h(37,10): error C2660: 'fprintf': function does not take 1 arguments
1>(compiling source file '../../../../../src/ccutil/tprintf.cpp')
1>    C:\Program Files (x86)\Windows Kits\10\Include\10.0.22621.0\ucrt\stdio.h(830,37):
1>    see declaration of 'fprintf'
1>    tesseract\src\ccutil\tprintf.h(37,10):
1>    while trying to match the argument list '(FILE *)'
1>    tesseract\src\ccutil\tprintf.h(37,10):
1>    the template instantiation context (the oldest one first) is
1>        tesseract\src\ccutil\tprintf.cpp(74,3):
1>        see reference to function template instantiation 'auto tesseract::tprintf<>(void)' being compiled

@egorpugin
Contributor

Added format in d95e9f7

I'll check ci for other warnings.

@egorpugin
Contributor

I don't see warnings on gcc-14 (even without format).
Only with clang.

Possible solution is

template <typename ... Types>
auto tprintf(const char *format, Types && ... args) {
  // Silence clang's -Wformat-security only for this forwarding call.
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wformat-security"
  return fprintf(get_debugfp(), format, std::forward<Types>(args)...);
#pragma clang diagnostic pop
}

@zdenop
Contributor Author

zdenop commented Dec 21, 2024

It seems the unit test GitHub Actions are failing on main too, so the failure is not related to this PR:
[image]
Did I understand that correctly?

@zdenop
Contributor Author

zdenop commented Dec 27, 2024

@egorpugin: I cherry-picked your tprintf commits onto my local branch, but it does not work for 'tessdata_path.string()'.
E.g. it produces output like Error opening data file ��]� for tesseract a b -l x (Win11, VS2022, 64-bit).
When I use tessdata_path.string().c_str() the output is correct:
Error opening data file F:\Projects\Community\tessdata\x.traineddata....
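
A printf-style %s expects a const char*, so handing a std::string object straight to the variadic call is not valid, which may explain garbled output of this kind. One possible way for a template wrapper to cope is to convert std::string arguments before forwarding them. The sketch below is an assumption for illustration, not the code from the cherry-picked commits; `printf_arg` and `format_message` are hypothetical names, and the result is formatted into a std::string so it is easy to inspect.

```cpp
#include <cstdio>
#include <string>
#include <type_traits>

// Hypothetical helper: turn std::string into const char*, pass other types through.
template <typename T>
auto printf_arg(const T &value) {
  if constexpr (std::is_same_v<T, std::string>) {
    return value.c_str();  // valid as long as the caller's string is alive
  } else {
    return value;  // arithmetic types are copied, string literals decay to const char*
  }
}

// Hypothetical wrapper: formats into a fixed buffer (longer output is truncated).
template <typename... Types>
std::string format_message(const char *format, const Types &... args) {
  char buffer[512];
  std::snprintf(buffer, sizeof(buffer), format, printf_arg(args)...);
  return buffer;
}
```

Because the arguments are taken by const reference, a temporary such as the result of `path.string()` stays alive until the whole call expression has finished, so the pointer returned by `c_str()` remains valid during formatting.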

@egorpugin
Contributor

How do you use it?

@zdenop
Contributor Author

zdenop commented Dec 28, 2024

I am sorry, it looks like I messed something up... Today I started from a clean setup and it worked for me:

git pull
git fetch origin pull/4372/head:pr4372
git switch pr4372
git cherry-pick 8cb04183c1953f988
git cherry-pick d95e9f7905cc9427d
git cherry-pick 2a944fbe98ed4408a

./autogen.sh && ./configure --prefix=/usr && make -j4 && make -j4 training
sudo make install && sudo make training-install

@zdenop
Contributor Author

zdenop commented Dec 28, 2024

It looks like the unit test fails if api_.Init uses tesseract::OEM_TESSERACT_ONLY, but I am not able to find the reason (well, api_. returns an empty string, but why?).

If I write a simplified test based on unittest/pagesegmode_test.cc, it works for me. So where is the problem? (gtest is installed from unittest/third_party/googletest.)

/* g++ ocr_gtest.cpp -o ocr_gtest -lleptonica -ltesseract -lgtest
*/
#include <gtest/gtest.h>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

TEST(OCRTest, ReadImage) {
    tesseract::TessBaseAPI api_;
    std::string TessdataPath = "/opt/Projects/tessdata";
    std::string filename = "test/testing/segmodeimg.tif";

    Pix *image = pixRead(filename.c_str());
    ASSERT_NE(image, nullptr) << "Failed to read image";

    ASSERT_EQ(api_.Init(TessdataPath.c_str(), "eng"), 
              tesseract::OEM_TESSERACT_ONLY) << "Failed to initialize Tesseract";

    api_.SetPageSegMode(tesseract::PSM_SINGLE_WORD);
    api_.SetImage(image);
    api_.SetRectangle(237, 393, 256, 36);
    char *ocr_text = api_.GetUTF8Text();
    ASSERT_NE(ocr_text, nullptr) << "OCR returned null";
    
    std::string ocr_output(ocr_text);
    printf("OCR output:\n'%s'\n", ocr_output.c_str());

    EXPECT_EQ(ocr_output, "What should\n");

    delete[] ocr_text; // Free OCR text
    api_.End(); // Cleanup Tesseract API
    pixDestroy(&image); // Cleanup image
}

int main(int argc, char** argv) {
    ::testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}

@egorpugin
Contributor

Ignore the unit test errors; they can be fixed afterwards.
