Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix in strdb.cc file for handling large file reads #18

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ajalagam
Copy link

strdb.cc terminates with segmentation fault when run on large data files of say 5 GB in size. This commit has a fix for this issue.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff71293ab in __memcpy_ssse3_back () from /lib64/libc.so.6

0x0000000000422558 in read_text (file=<optimized out>, func=0x409ed0 <read_text_process_word(int)>, db=..., read_cached=<optimized out>,
    write_cached=<optimized out>, incorp_new=true) at basic/strdb.cc:189

Copy link
Owner

@percyliang percyliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This change doesn't preserve the previous functionality, which ignores all whitespace (multiple, spaces, tabs, newlines), whereas the proposed change only breaks on single spaces. Would be good not to change this.

  2. Can you explain why this code works when the previous one doesn't? I don't know how ifstream is implemented, but the memory usage for reading in a string should be constant...

char buf[16384]; int buf_i = 0; // Output buffer
while(in >> s) { // Read a string
for(std::string line; getline( in, line, ' ');) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove extra space

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. To preserve 'breaking on whitespace functionality' the code can be modified as below.
    for(std::string line; getline(std::ws(in), line, ' ');)

  2. I think the reason why the previous code didn't work has got something to do with the internal memory allocation, buffer handling of the inputstream and copying to a char array of [16384] size. The recommended way to read is into a std::string rather than a char array. The stacktrace also seems to indicate that the internal memcpy operation didn't go fine.

I think an understanding beyond this may require me to dig deeper into the workings of the ifstream, limitations of reading via operator>> versus getline, and also into possibly libc library oeprations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants