Process ASCII file with fixed length format #651

jaysara · 2023-12-29T17:07:39Z

Background [Optional]

I have a need to use an ASCII file that has record segments of fixed length defined. There is no LF/CR kind of characters. The new record starts after every 426bytes. I believe, I should be able to ues Cobrix for this. I am looking for a documentation on what "options" should I specify in my spark.read method. I can see cobrix libraries working very well for ebcdic/binary cobol file. The files that I have are simple ASCII files. These files can grow upto 1GB easily. Spark parllel processsing will be very important for me. I do not have copybook for this defined. However, I know that the structure will be like bello2
Header -- size 100 bytes.
Records Segments -- each 426 bytes
Trailer - size 100 bytes (optional)

Question

What spark.read options from cobrix can I use to process larger ASCII files with fixed or variable length record segments.

yruslan · 2023-12-29T21:34:01Z

Hi,

For the example above this might work:

spark.read.format("cobol")
  .option("copybook_contents", copybookContentsStr)
  .option("encoding", "ascii")
  .option("record_format", "F") // Fixed length
  .option("record_length", "426") // Record length encoded in the copybook
  .option("file_start_offset", "100") // skip file header
  .option("file_end_offset", "100") // skip file footer
  .load(pathToData)

But you also need the copybook for your record payload that might look like:

      01 RECORD.
           10 FIELD1    PIC X(10).
           10 FIELD2    PIC X(15).
...

Remember that first characters in each line of the copybook should be spaces.

jaysara · 2024-01-04T02:48:48Z

I just got another requirements added to the above, if this is considered a new question, I can create a new question. There are scenarios where record length can vary and it is presented by the first 4 bytes... Example is below, (there is no LF or CR character to separate the line)
Header -- size 100 bytes.
0426xxxxmyrecorddataupto426bytes
0456xxxxxmyrecorddataupto456bytes
0435xxxxxmyrecorddataupto435bytes
Trailer - size 100 bytes (optional)

Does Cobrix support this kind of variable length format ? What options should I use for that.

yruslan · 2024-01-04T12:17:20Z

With the new requirement the record format is now 'V', which is 'variable length. You can specify the field in the copybook that contains the length:

spark.read.format("cobol")
  .option("copybook_contents", copybookContentsStr)
  .option("encoding", "ascii")
  .option("record_format", "V") // Variable length records
  .option("record_length_field", "RECORD_LENGTH_FIELD")
  .option("file_start_offset", "100") // skip file header
  .option("file_end_offset", "100") // skip file footer
  .load(pathToData)

The copybook should define the first 4 bytes as a numeric field:

      01 RECORD.
           10 RECORD_LENGTH_FIELD  PIC 9(4).
           10 FIELD1    PIC X(10).
           10 FIELD2    PIC X(15).
...

The record length by default is the full record payload. If the value in the field does not match the record length exactly, you can use an arithmetic expression, for instance:

  .option("record_length_field", "RECORD_LENGTH_FIELD + 4")

jaysara · 2024-01-04T17:06:27Z

Ok. Thanks. In the above copybook example you gave, the lenght of field is defined by "RECORD_LENGTH_FIELD PIC 9(4)" The FIELD1 and FIELD2 length will depend upon the value in RECORD_LENGTH_FIELD. For every record it can be different based on the value in RECORD_LENGTH_FIELD. In that case, PIC X(10) and PIC X(15) may not be true all the time.. My structure will be like this,
``

01 RECORD.
           10 RECORD_LENGTH_FIELD  PIC 9(4).
           10 BASE_SEGMENT    PIC X(???). ** The size of this will come from RECORD_LENGTH_FIELD above.

How should I define the contents for copybook for that usecase. I have used any arbitary number in the above number and I was able to process the file successfully using Cobrix datasource. The sample is https://github.com/jaysara/spark-cobol-jay/blob/main/src/main/java/com/test/cobol/FixedWidthApp.java

yruslan · 2024-01-05T07:34:26Z

Since each record type probably has a different schema, your data can be considered multisegment. In this case you can define a redefined group for each segment. So the copybook will look like this:

      01 RECORD.
           10 RECORD_LENGTH_FIELD  PIC 9(4).
             15 SEGMENT1.
                20 SEG1_FIELD1    PIC X(15).
                20 SEG1_FIELD2    PIC X(10).
             15 SEGMENT2 REDEFINES SEGMENT1.
                20 SEG2_FIELD1    PIC X(5).
                20 SEG2_FIELD2    PIC X(11).

(note that SEGMENT2 redefines SEGMENT1)

You can also apply automatic segment filtering based on record length, like this: https://github.com/AbsaOSS/cobrix?tab=readme-ov-file#automatic-segment-redefines-filtering
You can use the record length field as the segment id.

jaysara added the question Further information is requested label Dec 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Process ASCII file with fixed length format #651

Process ASCII file with fixed length format #651

jaysara commented Dec 29, 2023

yruslan commented Dec 29, 2023 •

edited

Loading

jaysara commented Jan 4, 2024

yruslan commented Jan 4, 2024

jaysara commented Jan 4, 2024 •

edited

Loading

yruslan commented Jan 5, 2024

Process ASCII file with fixed length format #651

Process ASCII file with fixed length format #651

Comments

jaysara commented Dec 29, 2023

Background [Optional]

Question

yruslan commented Dec 29, 2023 • edited Loading

jaysara commented Jan 4, 2024

yruslan commented Jan 4, 2024

jaysara commented Jan 4, 2024 • edited Loading

yruslan commented Jan 5, 2024

yruslan commented Dec 29, 2023 •

edited

Loading

jaysara commented Jan 4, 2024 •

edited

Loading