filter_io is analogous to Java's FilterInputStream in that it allows you to intercept and process data in an IO stream. This is particularly useful when cleaning up bad input data for a CSV or XML parser.
filter_io provides a one-pass approach to filtering data, which can be much faster and more memory efficient than doing two passes (cleaning the source file into a buffer and then calling the original parser).
filter_io has been tested against Ruby 1.8.7 and Ruby 1.9.2.
You can install the gem by running:
gem install filter_io
Basic usage is to wrap an existing stream with FilterIO.new and a filtering block. For example, applying a ROT13 transformation:
io = FilterIO.new io do |data|
  data.tr "A-Za-z", "N-ZA-Mn-za-m"
end
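The filtered object can then be read like any other read-only IO. Here's a minimal sketch, assuming FilterIO accepts any IO-like source such as a StringIO (the sample text is made up):
require 'stringio'
require 'filter_io'

source = StringIO.new("Hello\nWorld\n")
io = FilterIO.new source do |data|
  data.tr "A-Za-z", "N-ZA-Mn-za-m"
end

io.gets   # => "Uryyb\n"
io.read   # => "Jbeyq\n"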
A common usage of filter_io is to normalise line endings before parsing CSV data:
require 'csv'

# open source stream
File.open(filename, external_encoding: 'UTF-8') do |io|
  # apply filter to stream
  io = FilterIO.new(io) do |data, state|
    # grab another chunk if the last character is a delimiter
    raise FilterIO::NeedMoreData if data =~ /[\r\n]\z/ && !state.eof?
    # normalise line endings to LF
    data.gsub(/\r\n|\r|\n/, "\n")
  end
  # process resulting stream normally
  CSV.parse(io, row_sep: "\n") do |row|
    p row
  end
end
Call FilterIO.new with the original IO stream, any options and the filtering block. The returned filter_io object acts like a normal read-only, forward-only IO stream.
An optional second parameter to the block is the state parameter, which contains stream metadata that may be useful when processing the chunk. The methods currently available are:
- bof? : Returns true if this is the first chunk of the stream.
- eof? : Returns true if this is the last chunk of the stream.
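As a small sketch of how bof? can be used (again assuming an IO-like source such as StringIO), the filter below strips a leading UTF-8 byte order mark, which can only ever appear in the first chunk of the stream:
require 'stringio'
require 'filter_io'

source = StringIO.new("\uFEFFname,age\nalice,30\n")
io = FilterIO.new source do |data, state|
  # only the very first chunk can start with a BOM
  data = data.sub(/\A\uFEFF/, '') if state.bof?
  data
end

io.read   # => "name,age\nalice,30\n"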
If the filtering block needs more data to be able to return anything, you can raise a FilterIO::NeedMoreData exception and filter_io will read another block and pass the additional data to you. This can be repeated as necessary until enough data is retrieved. For example usage of NeedMoreData, see the line ending normalisation example above, or the small sketch below.
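Here's another sketch of the same pattern: a hypothetical filter that prefixes each line with "> " and raises NeedMoreData whenever a chunk ends mid-line, so the block only ever sees whole lines:
require 'stringio'
require 'filter_io'

source = StringIO.new("first line\nsecond line\n")
io = FilterIO.new source do |data, state|
  # keep reading until the chunk ends on a line boundary (or we hit EOF)
  raise FilterIO::NeedMoreData if !state.eof? && data !~ /\n\z/
  data.lines.map { |line| "> #{line}" }.join
end

io.read   # => "> first line\n> second line\n"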
If your block is unable to process the whole chunk of data immediately, it can return both the processed chunk and the remainder to be processed later. This is done by returning a two-element array: [processed, unprocessed]. If processed is empty and there is unprocessed data, filter_io will grab another block of data from the source stream and call the block again.
Here's an example which processes whole lines and prepends the line length to the beginning of each line.
io = FilterIO.new io do |data, state|
  output = ''
  # grab complete lines, plus any final partial line at EOF
  while data =~ /(.*)\n/ || (state.eof? && data =~ /(.+)/)
    output << "#{$1.size} #{$1}\n"
    data = $'
  end
  # `output` contains the processed lines, `data` contains any left over partial line
  [output, data]
end
When readline, gets or read(nil) is called, filter_io will process the input stream in 1,024-byte chunks. You can adjust this by passing a :block_size option to new.
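For example, a minimal sketch that reads a (hypothetical) file in 64 KB chunks instead of the default:
require 'filter_io'

# 'data.txt' is a hypothetical input file
File.open('data.txt') do |file|
  io = FilterIO.new(file, block_size: 64 * 1024) do |data|
    data.upcase
  end
  puts io.read
end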
Ruby 1.9 has character encoding support which can convert between UTF-8, ISO-8859-1, ASCII-8BIT, etc. This is triggered in IO by using :external_encoding and :internal_encoding when opening the stream.
filter_io will use the underlying stream's encoding settings when reading and filtering data. The processing block will be passed data in the internal encoding. As per the core IO object, if read is called with a length (in bytes), the data will be returned in ASCII-8BIT.
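For instance, here's a minimal sketch (with a hypothetical file name) that transcodes an ISO-8859-1 file to UTF-8 via the stream's encoding settings; the block, and a read without a length, should then see UTF-8 data:
require 'filter_io'

# 'legacy.csv' is a hypothetical ISO-8859-1 encoded file
File.open('legacy.csv', external_encoding: 'ISO-8859-1', internal_encoding: 'UTF-8') do |file|
  io = FilterIO.new(file) do |data|
    # `data` arrives already transcoded to the internal encoding (UTF-8)
    data.gsub(/\r\n|\r|\n/, "\n")
  end
  text = io.read
  text.encoding   # => #<Encoding:UTF-8>
end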
In summary, everything should Just Work™. If you'd like to contribute a patch or pull request:
- Fork the project.
- Make your feature addition or bug fix.
- Add tests for it. This is important so I don't break it in a future version unintentionally.
- Commit, but do not mess with the Rakefile, VERSION, or history. (If you want to have your own version, that's fine, but bump the version in a commit by itself so I can ignore it when I pull.)
- Send me a pull request. Bonus points for topic branches.
Copyright (c) 2010 Jason Weathered. See LICENSE for details.