Extra record is read when file has line ending in the end #21

joelverhagen · 2024-01-12T22:08:52Z

It is common for a text file to have a line ending (\n or \r\n) after the last line (e.g. in POSIX, as shared on Stack Overflow). Generally this is not treated as the separator for a new CSV record. In my benchmark (https://github.com/joelverhagen/NCsvPerf), all of the parsers I've tested so far have this property of not yielding an empty record at the end.

Repro of what I am talking about:

using System.Text;
using Addax.Formats.Tabular;

var lines = new[] { "a,b,c", "1,2,3", "x,y,z" };
var file = string.Join("\r\n", lines) + "\r\n"; // line ending at the end
var dialect = new TabularDialect("\r\n", ',', '\"');
var stream = new MemoryStream(Encoding.UTF8.GetBytes(file));
using (var reader = new TabularReader(stream, dialect))
{
    while (reader.TryPickRecord())
    {
        Console.WriteLine("Record:");
        while (reader.TryReadField())
        {
            Console.Write("  Field: ");
            if (reader.TryGetString(out var value))
            {
                Console.WriteLine(value);
            }
            else
            {
                Console.WriteLine("(no value)");
            }
        }
    }
}

Actual output:

Record:
  Field: a
  Field: b
  Field: c
Record:
  Field: 1
  Field: 2
  Field: 3
Record:
  Field: x
  Field: y
  Field: z
Record:
  Field:

Expected output:

Record:
  Field: a
  Field: b
  Field: c
Record:
  Field: 1
  Field: 2
  Field: 3
Record:
  Field: x
  Field: y
  Field: z

I think this can be easily worked around by detecting a single empty string field on a line when more fields are expected. That is what I will do in my benchmark, which will include Addax.
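
A minimal sketch of that check, independent of Addax. It assumes the fields of each record have already been collected into a list; `RecordFilter` and `IsTrailingEmptyRecord` are hypothetical names used only for illustration and are not part of any library.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of Addax.
public static class RecordFilter
{
    // Returns true when a record consists of a single empty field while more
    // than one field is expected, which is how a trailing line ending shows up
    // in the repro above.
    public static bool IsTrailingEmptyRecord(IReadOnlyList<string> fields, int expectedFieldCount)
    {
        return expectedFieldCount > 1
            && fields.Count == 1
            && string.IsNullOrEmpty(fields[0]);
    }
}
```

In the repro loop, the fields of each record would be collected first and the record skipped when the helper returns true. Note that this check cannot tell a trailing line ending from a blank line in the middle of the file; both would be skipped.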

Nice work on the library! Thanks!

Workaround filed here: alexanderkozlenko/addax#21

alexanderkozlenko · 2024-12-04T23:51:50Z

That's a valid point about the line ending at the end of the file. The library provides two types of readers, and the intention behind this API design is to give developers flexibility. TabularReader is a low-level API that exposes the file structure exactly as it is, including all line endings and comments, which may be critical in some use cases. TabularReader<T> is a high-level API that focuses on consuming records in a structured and user-friendly way, ignoring empty lines and the line ending at the end of a file. If we adjust the example to use the latter, we observe the desired behavior:

using (var reader = new TabularReader<MyRecord>(stream, dialect))
{
    while (reader.TryReadRecord())
    {
        Console.WriteLine("Record:");
        Console.WriteLine("  Field: {0}", reader.CurrentRecord.Field0);
        Console.WriteLine("  Field: {0}", reader.CurrentRecord.Field1);
        Console.WriteLine("  Field: {0}", reader.CurrentRecord.Field2);
    }
}

[TabularRecord]
internal class MyRecord
{
    [TabularFieldOrder(0)]
    public string? Field0 { get; set; }
    [TabularFieldOrder(1)]
    public string? Field1 { get; set; }
    [TabularFieldOrder(2)]
    public string? Field2 { get; set; }
}

Output:

Record:
  Field: a
  Field: b
  Field: c
Record:
  Field: 1
  Field: 2
  Field: 3
Record:
  Field: x
  Field: y
  Field: z

In some scenarios, such as the benchmark project, the low-level reader may require additional handling of the trailing line ending. However, unless this behavior proves to be a significant blocker for adoption, I would like to keep the current API shape so that it aligns with the library's initial goals.
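
One possible form of such handling, sketched under the assumption that the whole input is available as a string before parsing: remove a single trailing line terminator so that a reader treating the terminator strictly as a record separator sees no extra empty record. `TrailingLineEnding` and `TrimOne` are hypothetical names for illustration only and are not part of the library.

```csharp
using System;

// Hypothetical helper for illustration; not part of Addax.
public static class TrailingLineEnding
{
    // Removes at most one trailing line terminator from the text.
    // Any further blank lines before it are left untouched.
    public static string TrimOne(string text, string lineTerminator = "\r\n")
    {
        if (lineTerminator.Length > 0
            && text.EndsWith(lineTerminator, StringComparison.Ordinal))
        {
            return text.Substring(0, text.Length - lineTerminator.Length);
        }

        return text;
    }
}
```

In the repro above this would be applied to `file` before it is encoded into the stream. For large or streamed inputs where the text is not held in memory, filtering the final empty record while reading is likely the more practical option.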

Thank you for including the library in the benchmark, I appreciate it!

joelverhagen added a commit to joelverhagen/NCsvPerf that referenced this issue Jan 12, 2024

Add Addax with workaround

45fecf7


alexanderkozlenko self-assigned this Jan 12, 2024
