Convert a folder of TXT files into a folder of bigger TXT files.
Each source file is written to the output as "--- {source file stem} --- \n\n{source file body}". The source file's body retains its line breaks.
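As a rough illustration of that layout, here is a minimal sketch in Python. The function names `format_entry` and `format_file` are hypothetical and are not taken from `convert_txt.py`; the template string simply mirrors the description above.

```python
from pathlib import Path


def format_entry(stem: str, body: str) -> str:
    """Render one source file as a header line followed by its body."""
    # Template copied from the description above; the body is inserted
    # unchanged, so its line breaks are preserved.
    return f"--- {stem} --- \n\n{body}"


def format_file(path: Path) -> str:
    """Hypothetical helper: read a TXT file and format it as one entry."""
    return format_entry(path.stem, path.read_text(encoding="utf-8"))
```

For example, `format_entry("notes", "line 1\nline 2")` yields a header line for `notes`, a blank line, and then the two body lines as they were.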
Below are the steps needed to run the conversion process. The paths can be changed by updating the parameters.
- Clone this repository.
- Open a PowerShell window to the ~/src directory.
- Convert a folder of TXT files into a folder of bigger TXT files.
  - The -in/-out parameters control the source and destination folders. If the output folder does not exist, it is created. WARNING: If the output folder exists AND is not empty, new TXT files will overwrite old ones.
  - The -s parameter controls the output file name's stem, i.e. f'./{stem}.{count}.txt'. It defaults to 'stacked'.
  - The -l parameter controls approximately how many lines each new file contains. It defaults to 100k.
  - The optional -spc parameter allows for tuning on multi-core machines. It defaults to 1.
python convert_txt.py -in d:/corpus_in -out d:/corpus_out
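
To make the effect of the -s and -l parameters more concrete, here is a minimal sketch of how entries might be grouped into numbered output files. It is not the code in `convert_txt.py`: the function `stack_entries` is hypothetical, and it assumes that the count starts at 0, that entries are joined with a newline, and that an entry is never split across files (which would explain why the line limit is only approximate).

```python
from pathlib import Path


def stack_entries(entries, out_dir, stem="stacked", max_lines=100_000):
    """Write entries into numbered files of roughly max_lines lines each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing

    written = []
    buffer, line_count, count = [], 0, 0  # assumption: numbering starts at 0

    def flush():
        nonlocal buffer, line_count, count
        if not buffer:
            return
        target = out_dir / f"{stem}.{count}.txt"
        # write_text replaces any existing file with the same name
        target.write_text("\n".join(buffer), encoding="utf-8")
        written.append(target)
        buffer, line_count = [], 0
        count += 1

    for entry in entries:
        buffer.append(entry)
        line_count += entry.count("\n") + 1
        # Assumption: entries are never split, so a file may exceed max_lines.
        if line_count >= max_lines:
            flush()
    flush()  # write whatever remains
    return written
```

With the defaults, the first output file in this sketch would be named `stacked.0.txt`. Files already in the output folder with matching names are replaced, in line with the warning above.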
- The