Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix line ending handling in reverse_readfile/readline across OS, and not skipping empty lines #712

Open
wants to merge 100 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 88 commits
Commits
Show all changes
100 commits
Select commit Hold shift + click to select a range
e2952c4
add l_end arg to reverse_readfile
DanielYang59 Sep 4, 2024
619a38d
rstrip remove hard coded trailing white space
DanielYang59 Sep 4, 2024
6cbea60
add default value for l_end
DanielYang59 Sep 4, 2024
0de9696
add l_end to reverse_readline
DanielYang59 Sep 4, 2024
f114afe
continue CI test jobs upon failure
DanielYang59 Sep 4, 2024
deb1ad7
bump codecov to v4
DanielYang59 Sep 4, 2024
c4a845c
tweak docstring
DanielYang59 Sep 4, 2024
afbe573
var name and docstring tweak
DanielYang59 Sep 4, 2024
a4c4fe3
add helper function to get line ending and unit test
DanielYang59 Sep 5, 2024
86ea01b
add bzip2 and gzip file support and test
DanielYang59 Sep 5, 2024
c7ec2de
sort import and tweak docstring
DanielYang59 Sep 5, 2024
ae125c3
use unix line ending \n as default if empty
DanielYang59 Sep 5, 2024
343e0db
fix docstring
DanielYang59 Sep 5, 2024
d288e0d
update unit test
DanielYang59 Sep 5, 2024
064c064
remove accidental comment sign
DanielYang59 Sep 5, 2024
30624aa
use if after return
DanielYang59 Sep 5, 2024
3084123
tweak docstring
DanielYang59 Sep 5, 2024
1639725
reset pointer to fix test in windows
DanielYang59 Sep 6, 2024
dfe553d
encode in linux
DanielYang59 Sep 6, 2024
3b7a811
reset pointer in win only
DanielYang59 Sep 6, 2024
b291707
yield str in windows
DanielYang59 Sep 6, 2024
aeba5a7
use Iterator as return type
DanielYang59 Sep 6, 2024
1d7adda
assert iterator return str
DanielYang59 Sep 6, 2024
e6312a6
strict usage of rstrp(line_ending)
DanielYang59 Sep 6, 2024
7911e0a
better assert no error
DanielYang59 Sep 6, 2024
6cb5349
decode before strip, but tests are failing
DanielYang59 Sep 6, 2024
79f07f1
add test_file_with_empty_lines for readfile, Win still not working
DanielYang59 Sep 6, 2024
d2ea989
allow zipped file directly into get_line_end, and add test
DanielYang59 Sep 9, 2024
6803c7b
remove seemingly unused file
DanielYang59 Sep 9, 2024
351452a
update test file to include line end for the last line
DanielYang59 Sep 9, 2024
6f8baf9
skip the first match of line ending char
DanielYang59 Sep 9, 2024
74a451f
fix test for zipped format
DanielYang59 Sep 9, 2024
ef454b3
drop test for legacy MacOS \r
DanielYang59 Sep 9, 2024
3516c27
make capture of mmap valueerror more narrow
DanielYang59 Sep 9, 2024
5e1344c
use platform.system for windows check
DanielYang59 Sep 9, 2024
283a383
WIP: test for \r\n is failing for some reason
DanielYang59 Sep 9, 2024
ace9264
fix \r\n location, but it looks silly, need opt
DanielYang59 Sep 9, 2024
9e92bb9
add TODO tag
DanielYang59 Sep 9, 2024
6a8f8b5
finally fixed, avoid newline otherwise \n get overwrite by \r\n causi…
DanielYang59 Sep 9, 2024
f69c205
fix line ending for text mode
DanielYang59 Sep 9, 2024
28e6cc4
drop test of \r altogether
DanielYang59 Sep 9, 2024
bf3b90a
remove manual counter
DanielYang59 Sep 9, 2024
a03b301
copy unit test, to be fixed
DanielYang59 Sep 9, 2024
138b756
update warn msg upon empty file
DanielYang59 Sep 10, 2024
29dc50d
tweak docstring
DanielYang59 Sep 10, 2024
8f94964
add comments
DanielYang59 Sep 10, 2024
3804727
Fix line ending handling in reverse readline
DanielYang59 Sep 13, 2024
76c0243
significantly increase test
DanielYang59 Sep 13, 2024
3b68d14
add warning for overly small mem size
DanielYang59 Sep 13, 2024
cd3bbd1
update unit test
DanielYang59 Sep 13, 2024
36a02f4
add test for illegal usage
DanielYang59 Sep 13, 2024
6457922
use tuple for instance check for now
DanielYang59 Sep 13, 2024
14dad79
avoid repeated len(l_end) call
DanielYang59 Sep 13, 2024
306a8f1
tweak unit test names
DanielYang59 Sep 13, 2024
168bc42
supress newline convert
DanielYang59 Sep 13, 2024
dcdfde5
pre-commit auto-fixes
pre-commit-ci[bot] Sep 13, 2024
f255121
support bufferedreader and clarify m_file type
DanielYang59 Sep 14, 2024
57157ef
clarify and test missing last l_end char
DanielYang59 Sep 14, 2024
4276032
clean up docstring
DanielYang59 Sep 14, 2024
33257c5
clarify myth around blk size and max mem
DanielYang59 Sep 14, 2024
f75abd3
fix skip last l_end (temporarily)
DanielYang59 Sep 14, 2024
709ed17
fix skip first l_end match
DanielYang59 Sep 14, 2024
4799c24
update test big file, but it's failing and need to fix
DanielYang59 Sep 14, 2024
702e6ca
tweak comment
DanielYang59 Sep 14, 2024
d0a0fda
update TODO list
DanielYang59 Sep 15, 2024
5af7145
fix incorrect buffer refill position
DanielYang59 Sep 15, 2024
384c231
suppress OS l_end translate
DanielYang59 Sep 15, 2024
6102263
suppress unrelated warnings
DanielYang59 Sep 15, 2024
e9c22ed
check file pointer reset
DanielYang59 Sep 15, 2024
0be5a85
track benchmark script in case we need it someday
DanielYang59 Sep 15, 2024
edfa0cb
track benchmark script in case we need it someday
DanielYang59 Sep 15, 2024
bdda9f8
Merge branch 'readline-line-ending' of github.com:DanielYang59/monty …
DanielYang59 Sep 15, 2024
8714cb4
simplify test script, make env install manual
DanielYang59 Sep 15, 2024
c236434
Merge branch 'readline-line-ending' of github.com:DanielYang59/monty …
DanielYang59 Sep 15, 2024
174d940
save test log on wsl2
DanielYang59 Sep 15, 2024
1f7476e
pre-commit auto-fixes
pre-commit-ci[bot] Sep 15, 2024
0d2af60
also track test file create time
DanielYang59 Sep 16, 2024
7cddd29
get obj size as getsize is slow in win
DanielYang59 Sep 16, 2024
0b478dd
pre-commit auto-fixes
pre-commit-ci[bot] Sep 16, 2024
3053e87
update builtin readline test not to read entire file
DanielYang59 Sep 16, 2024
097ae65
remove outdated test log
DanielYang59 Sep 16, 2024
cf542a8
update test log on windows
DanielYang59 Sep 16, 2024
8dc894e
pre-commit auto-fixes
pre-commit-ci[bot] Sep 16, 2024
e90da64
test on Ubuntu 22.04 WSL2
DanielYang59 Sep 18, 2024
df0288b
pre-commit auto-fixes
pre-commit-ci[bot] Sep 18, 2024
0ea6828
remove dup test script
DanielYang59 Sep 18, 2024
8db6bb4
Merge branch 'readline-line-ending' of github.com:DanielYang59/monty …
DanielYang59 Sep 18, 2024
f354756
clear finished TODO tag
DanielYang59 Sep 19, 2024
96c9108
tweak var name
DanielYang59 Sep 19, 2024
e1786ef
fix missing newline char in comment
DanielYang59 Sep 19, 2024
f90074c
guard warning filter with context manager
DanielYang59 Sep 19, 2024
27b28a6
add type annotation
DanielYang59 Sep 19, 2024
1cdc75b
put tag into condition branch
DanielYang59 Sep 21, 2024
e4940e0
untrack benchmark script and results
DanielYang59 Sep 22, 2024
279cd4c
Merge branch 'master' into readline-line-ending
shyuep Oct 21, 2024
6613d33
fix typo in test var name
DanielYang59 Oct 22, 2024
3baea3a
remove merge issue
DanielYang59 Oct 22, 2024
61f4aff
Merge branch 'master' into readline-line-ending
DanielYang59 Oct 22, 2024
f1fd669
revise comment
DanielYang59 Oct 22, 2024
d70e80e
suppress mypy errors
DanielYang59 Oct 22, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ jobs:
build:
strategy:
max-parallel: 20
fail-fast: false
matrix:
os: [ubuntu-latest, macos-14, windows-latest]
python-version: ["3.9", "3.x"]
Expand All @@ -29,6 +30,6 @@ jobs:
run: pytest --cov=monty --cov-report html:coverage_reports tests

- name: Upload coverage reports to Codecov
uses: codecov/codecov-action@v3
uses: codecov/codecov-action@v4
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
175 changes: 175 additions & 0 deletions benchmark/benchmark.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,175 @@
"""Utility script for monty reverse reader speed benchmark.
- File of various sizes.
- Find the last line, 75 % line and 50 % line, and compare with reading from forwards.
"""

from __future__ import annotations

import os
import platform
import sys
import time

from monty.io import reverse_readfile, reverse_readline

# Test config
FILE_SIZES_MB = (1, 10, 100, 500, 1000, 5000)


def create_test_file(file_path, target_size_mb):
"""Creates a text file with lines until the target size is reached."""
target_size = target_size_mb * 1024 * 1024 # Convert MB to bytes
line_number = 1

start = time.perf_counter()

# Create a list of lines and concatenate them at the end
lines = []
total_bytes_written = 0

while total_bytes_written < target_size:
line = f"This is line number {line_number}\n"
line_bytes = line.encode("utf-8")

if total_bytes_written + len(line_bytes) > target_size:
break

lines.append(line)
total_bytes_written += len(line_bytes)
line_number += 1

with open(file_path, "wb") as f:
f.write("".join(lines).encode("utf-8"))

DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
last_time = time.perf_counter() - start

total_lines = line_number - 1
print(
f"Test file of size {target_size_mb} MB created with {total_lines} lines, time used {last_time:.2f} seconds."
)
return total_lines

DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved

def print_separator(title: str):
print(f"{title}".center(80, "="))
print("")


def test_builtin_readline(file_path, total_lines):
"""Test built-in readline function."""
start = time.perf_counter()
with open(file_path, "r") as f:
for _ in range(total_lines):
_last_line = f.readline()
last_time = time.perf_counter() - start

line_75_idx = int(0.75 * total_lines)
line_50_idx = int(0.5 * total_lines)

start = time.perf_counter()
with open(file_path, "r") as f:
for _ in range(line_75_idx + 1):
_line_75 = f.readline()
time_75 = time.perf_counter() - start

start = time.perf_counter()
with open(file_path, "r") as f:
for _ in range(line_50_idx + 1):
_line_50 = f.readline()
time_50 = time.perf_counter() - start

print_separator("Built-in readline")
print(f"Last line {total_lines} read, time taken: {last_time:.8f} s.")
print(f"75% line {line_75_idx} read, time taken: {time_75:.8f} s.")
print(f"50% line {line_50_idx} read, time taken: {time_50:.8f} s.")
print_separator("End of Built-in readline")

return last_time, time_75, time_50


def test_reverse_readline(file_path, total_lines):
"""Test reverse_readline function."""
start = time.perf_counter()
with open(file_path, "r") as f:
_last_line = next(reverse_readline(f))
last_time = time.perf_counter() - start

line_75_idx = int(0.75 * total_lines)
line_50_idx = int(0.5 * total_lines)

start = time.perf_counter()
with open(file_path, "r") as f:
for idx, _line in enumerate(reverse_readline(f), 1):
if idx == total_lines - line_75_idx:
break
time_75 = time.perf_counter() - start

start = time.perf_counter()
with open(file_path, "r") as f:
for idx, _line in enumerate(reverse_readline(f), 1):
if idx == total_lines - line_50_idx:
break
time_50 = time.perf_counter() - start

print_separator("reverse_readline")
print(f"Last line {total_lines} read, time taken: {last_time:.8f} s.")
print(f"75% line {line_75_idx} read, time taken: {time_75:.8f} s.")
print(f"50% line {line_50_idx} read, time taken: {time_50:.8f} s.")
print_separator("End of reverse_readline")

return last_time, time_75, time_50


def test_reverse_readfile(file_path, total_lines):
"""Test reverse_readfile function."""
start = time.perf_counter()
_last_line = next(reverse_readfile(file_path))
last_time = time.perf_counter() - start

line_75_idx = int(0.75 * total_lines)
line_50_idx = int(0.5 * total_lines)

start = time.perf_counter()
for idx, _line in enumerate(reverse_readfile(file_path), 1):
if idx == total_lines - line_75_idx:
break
time_75 = time.perf_counter() - start

start = time.perf_counter()
for idx, _line in enumerate(reverse_readfile(file_path), 1):
if idx == total_lines - line_50_idx:
break
time_50 = time.perf_counter() - start

print_separator("reverse_readfile")
print(f"Last line {total_lines} read, time taken: {last_time:.8f} s.")
print(f"75% line {line_75_idx} read, time taken: {time_75:.8f} s.")
print(f"50% line {line_50_idx} read, time taken: {time_50:.8f} s.")
print_separator("End of reverse_readfile")

return last_time, time_75, time_50


def run_benchmark(file_size_mb):
"""Run benchmark for all test functions."""
print_separator(f"Benchmarking file size: {file_size_mb} MB")

test_file = f"test_file_{file_size_mb}MB.txt"
total_lines = create_test_file(test_file, file_size_mb)

test_builtin_readline(test_file, total_lines)
test_reverse_readline(test_file, total_lines)
test_reverse_readfile(test_file, total_lines)

os.remove(test_file)


if __name__ == "__main__":
# Show OS info
os_info = platform.platform()
python_version = sys.version.split()[0]
print(f"\nRunning on OS: {os_info}, Python {python_version}")

# Run benchmark for each file size
for file_size_mb in FILE_SIZES_MB:
run_benchmark(file_size_mb)
145 changes: 145 additions & 0 deletions benchmark/develop-ubuntu2204.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,145 @@

Running on OS: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35, Python 3.12.5
==========================Benchmarking file size: 1 MB==========================

Test file of size 1 MB created with 40757 lines, time used 0.01 seconds.
===============================Built-in readline================================

Last line 40757 read, time taken: 0.00299960 s.
75% line 30567 read, time taken: 0.00225260 s.
50% line 20378 read, time taken: 0.00147340 s.
============================End of Built-in readline============================

================================reverse_readline================================

Last line 40757 read, time taken: 0.00215450 s.
75% line 30567 read, time taken: 0.00299860 s.
50% line 20378 read, time taken: 0.00347230 s.
============================End of reverse_readline=============================

================================reverse_readfile================================

Last line 40757 read, time taken: 0.00007200 s.
75% line 30567 read, time taken: 0.00393440 s.
50% line 20378 read, time taken: 0.00788680 s.
============================End of reverse_readfile=============================

=========================Benchmarking file size: 10 MB==========================

Test file of size 10 MB created with 392476 lines, time used 0.11 seconds.
===============================Built-in readline================================

Last line 392476 read, time taken: 0.02891790 s.
75% line 294357 read, time taken: 0.02203730 s.
50% line 196238 read, time taken: 0.01453720 s.
============================End of Built-in readline============================

================================reverse_readline================================

Last line 392476 read, time taken: 0.00010690 s.
75% line 294357 read, time taken: 0.08467931 s.
50% line 196238 read, time taken: 0.16308102 s.
============================End of reverse_readline=============================

================================reverse_readfile================================

Last line 392476 read, time taken: 0.00007110 s.
75% line 294357 read, time taken: 0.03886231 s.
50% line 196238 read, time taken: 0.07676581 s.
============================End of reverse_readfile=============================

=========================Benchmarking file size: 100 MB=========================

Test file of size 100 MB created with 3784596 lines, time used 1.11 seconds.
===============================Built-in readline================================

Last line 3784596 read, time taken: 0.28840513 s.
75% line 2838447 read, time taken: 0.22593426 s.
50% line 1892298 read, time taken: 0.14234112 s.
============================End of Built-in readline============================

================================reverse_readline================================

Last line 3784596 read, time taken: 0.00010290 s.
75% line 2838447 read, time taken: 0.80831332 s.
50% line 1892298 read, time taken: 1.62029043 s.
============================End of reverse_readline=============================

================================reverse_readfile================================

Last line 3784596 read, time taken: 0.00007960 s.
75% line 2838447 read, time taken: 0.35162211 s.
50% line 1892298 read, time taken: 0.70727881 s.
============================End of reverse_readfile=============================

=========================Benchmarking file size: 500 MB=========================

Test file of size 500 MB created with 18462038 lines, time used 5.24 seconds.
===============================Built-in readline================================

Last line 18462038 read, time taken: 1.38365400 s.
75% line 13846528 read, time taken: 1.11287734 s.
50% line 9231019 read, time taken: 0.69148150 s.
============================End of Built-in readline============================

================================reverse_readline================================

Last line 18462038 read, time taken: 0.00009210 s.
75% line 13846528 read, time taken: 3.98585572 s.
50% line 9231019 read, time taken: 7.93171154 s.
============================End of reverse_readline=============================

================================reverse_readfile================================

Last line 18462038 read, time taken: 0.00007860 s.
75% line 13846528 read, time taken: 1.75920299 s.
50% line 9231019 read, time taken: 3.45924340 s.
============================End of reverse_readfile=============================

========================Benchmarking file size: 1000 MB=========================

Test file of size 1000 MB created with 36540934 lines, time used 10.33 seconds.
===============================Built-in readline================================

Last line 36540934 read, time taken: 2.79645362 s.
75% line 27405700 read, time taken: 2.20113669 s.
50% line 18270467 read, time taken: 1.37504352 s.
============================End of Built-in readline============================

================================reverse_readline================================

Last line 36540934 read, time taken: 0.00009140 s.
75% line 27405700 read, time taken: 8.09242921 s.
50% line 18270467 read, time taken: 15.43773309 s.
============================End of reverse_readline=============================

================================reverse_readfile================================

Last line 36540934 read, time taken: 0.00010630 s.
75% line 27405700 read, time taken: 3.53976401 s.
50% line 18270467 read, time taken: 7.18230309 s.
============================End of reverse_readfile=============================

========================Benchmarking file size: 5000 MB=========================

Test file of size 5000 MB created with 178466370 lines, time used 59.52 seconds.
===============================Built-in readline================================

Last line 178466370 read, time taken: 14.19376238 s.
75% line 133849777 read, time taken: 10.26362936 s.
50% line 89233185 read, time taken: 6.69335968 s.
============================End of Built-in readline============================

================================reverse_readline================================

Last line 178466370 read, time taken: 0.00255530 s.
75% line 133849777 read, time taken: 39.30825986 s.
50% line 89233185 read, time taken: 76.32949390 s.
============================End of reverse_readline=============================

================================reverse_readfile================================

Last line 178466370 read, time taken: 0.00073260 s.
75% line 133849777 read, time taken: 18.99764580 s.
50% line 89233185 read, time taken: 37.87064222 s.
============================End of reverse_readfile=============================
DanielYang59 marked this conversation as resolved.
Show resolved Hide resolved
Loading
Loading