
Transform the parser to CPS. Support pull parsing #51

Open
wants to merge 1 commit into master
Conversation

ppolv commented Jan 4, 2016

I wonder whether you would be willing to merge something along the lines of this patch.
(In that case it would need more tests and a performance comparison.)

The basic idea is:

  • Backward compatible: the current API is unchanged
  • Provides a new, small API for pull parsing (erlsom:parse_pull/2)
  • Both use the same underlying parsing code

By pull parsing, I mean:

3> {continue, C1, E1} = erlsom:parse_pull(<<"<s:xstream xmlns:s='http://some-namespace.com' as='234' version='1.0'>">>, [{output_encoding, utf8}]).
{continue,#Fun<erlsom_sax_lib.4.36439255>,
          [startDocument,
           {startPrefixMapping,"s","http://some-namespace.com"},
           {startElement,"http://some-namespace.com","xstream","s",
                         [{attribute,"version",[],[],<<"1.0">>},
                          {attribute,"as",[],[],<<"234">>}]}]}
4> {continue, C2, E2} = C1(<<"<a href='ab">>).
{continue,#Fun<erlsom_sax_lib.0.36439255>,[]}
5> {continue, C3, E3} = C2(<<"c'><b>ss</b></a>">>).
{continue,#Fun<erlsom_sax_lib.4.36439255>,
          [{startElement,[],"a",[],
                         [{attribute,"href",[],[],<<"abc">>}]},
           {startElement,[],"b",[],[]},
           {characters,<<"ss">>},
           {endElement,[],"b",[]},
           {endElement,[],"a",[]}]}
6> {ok, E4, _Rest} = C3(<<"</s:xstream> tail data">>).   
{ok,[{endElement,"http://some-namespace.com","xstream","s"},
     {endPrefixMapping,"s"},
     endDocument],
    " tail data"}

That is, the user of the parser passes the data in chunks, instead of the parser asking for more in a callback.

This allows simpler integration for long-running parsing (think of infinite XML streams coming from the network), and also allows hibernating the parsing process, which is not always possible in the current version (as hibernate discards the stack).

Returning the list of events instead of using sax callbacks is mostly a matter of preference; sax callbacks can still be used. I found returning a list easier to work with (and the size of the list is bounded
if you feed the parser in chunks).
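
For example, a caller could drive the parser over a list of chunks with a small fold like the sketch below (parse_chunks/4, parse_chunks_loop/4 and HandleEvent are illustrative names on top of the proposed parse_pull API, not part of the patch):

%% Sketch: feed the parser chunk by chunk and fold HandleEvent/2 over the
%% events returned for each chunk.
parse_chunks([First | Rest], Opts, HandleEvent, Acc0) ->
    parse_chunks_loop(erlsom:parse_pull(First, Opts), Rest, HandleEvent, Acc0).

parse_chunks_loop({ok, Events, Tail}, _Chunks, HandleEvent, Acc) ->
    {ok, lists:foldl(HandleEvent, Acc, Events), Tail};
parse_chunks_loop({continue, Continue, Events}, [Chunk | Chunks], HandleEvent, Acc) ->
    parse_chunks_loop(Continue(Chunk), Chunks, HandleEvent,
                      lists:foldl(HandleEvent, Acc, Events)).

Here HandleEvent is any fun(Event, Acc) the caller supplies, so the per-chunk event lists never have to be kept around.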

The change was to put every potentially "blocking" parsing operation into continuation-passing style. The continueFun/* functions in erlsom_sax_lib then check whether we are pull parsing or not. If we are, instead of asking the user for more data, they simply return a function from which the caller can continue parsing:

continueFun2K(T, V1, V2, V3, State = #erlsom_sax_state{continuation_fun = undefined}, ParseFun, K) ->
    {continue,
     fun(Data) ->
             ParseFun(<<T/binary, Data/binary>>, V1, V2, V3,
                      State#erlsom_sax_state{user_state = []}, K)
     end,
     lists:reverse(State#erlsom_sax_state.user_state)};

(When pull parsing, the sax callbacks just accumulate the events in user_state, so here we return them in the right order.)
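
Something along these lines (a simplified, hypothetical sketch of that accumulating callback, not the literal code in the patch):

%% Hypothetical sketch: when pull parsing, the internal SAX callback just
%% prepends each event to user_state; continueFun2K above reverses the
%% accumulated list before handing it back to the caller.
accumulate_event(Event, EventsSoFar) ->
    [Event | EventsSoFar].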

Because of the transformation to CPS, the code now allocates more funs while parsing. Initial tests show little performance degradation, and any work done by the user with the parser events would make the difference in parsing cost irrelevant, but more complete tests are needed to confirm that.

I only tested the UTF-8 backend; the others should work the same.

What's your opinion?

The goal was to be able to do:

> {continue, Continue, Events1} = erlsom:parse_pull(InitialData, ParserOpts),
%% Events1 is the list of events produced by erlsom so far.
> {continue, Continue2, Events2} = Continue(MoreData),
...
> {ok, EventsN, RemainingData} = ContinueN(FinalData).

That is, invert the control: the caller passes the data in chunks to the parser, instead of the parser asking for more in a callback.

This allows simpler integration for long-running parsing (think of infinite XML streams coming from the network), and also allows hibernating the parsing process, which is not always possible in the current version (as hibernate discards the stack).

Returning the list of events instead of using sax callbacks is mostly a matter of preference; sax callbacks can still be used. I found returning a list cleaner to work with (and the size of the list is bounded if you feed the parser in chunks).

The change was to make every potentially "blocking" parsing operation CPS. The continueFun/* functions in erlsom_sax_lib then check whether a continuation callback is being used or not. If not, they return a function from which the caller can continue parsing:

continueFun2K(T, V1, V2, V3, State = #erlsom_sax_state{continuation_fun = undefined}, ParseFun, K) ->
    {continue,
     fun(Data) ->
             ParseFun(<<T/binary, Data/binary>>, V1, V2, V3,
                      State#erlsom_sax_state{user_state = []}, K)
     end,
     lists:reverse(State#erlsom_sax_state.user_state)};

Initial tests show little performance degradation, and any work done by the user with the parser events would make the difference in parsing cost irrelevant, but more complete tests are needed to confirm that.
willemdj (Owner) commented Jan 5, 2016

Hello Pablo,

Very interesting.

At first sight I would say that it might be even nicer if you could really
'pull' the next event, instead of getting a list of all events. Do you
think that would be possible?

I don't understand your point about hibernation, can you give an example?

Regards,
Willem


ppolv (Author) commented Jan 5, 2016

Hello, thanks for your time!
IMHO, pulling events one by one would need more intrusive changes, as it would imply that wrapCallback/2 could return control to the caller (instead of only the continueFun* functions in erlsom_sax_lib).
I, lazy as I am, preferred to touch as little as possible and reuse the same code for both parsing styles.
An API for that could easily be provided on top of this one, but of course it would still require all the events of a "chunk" to be built in memory. So to parse a big file with that "pull" interface, feeding the parser in chunks rather than the full file content would be a must.
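
Roughly, such a wrapper could look like this (a hypothetical sketch on top of parse_pull; next/2 and the {events, ...} state tuple are made-up names):

%% Sketch: pull events one by one from the chunked API. The current chunk's
%% events are kept in a list; when it runs empty, GetChunk/0 supplies more
%% input and the continuation refills the list.
next({events, [Event | Rest], Cont}, _GetChunk) ->
    {Event, {events, Rest, Cont}};
next({events, [], done}, _GetChunk) ->
    done;
next({events, [], Cont}, GetChunk) ->
    case Cont(GetChunk()) of
        {continue, Cont2, Events} -> next({events, Events, Cont2}, GetChunk);
        {ok, Events, _Rest}       -> next({events, Events, done}, GetChunk)
    end.

The initial state would be {events, Events0, Cont0}, taken from {continue, Cont0, Events0} = erlsom:parse_pull(FirstChunk, Opts).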

Regarding hibernation, see the phibernation module below.
The scenario I have in mind is parsing over a long period of time, with data read from the network: basically an endless stream (think of XMPP, for example). If the caller is able to feed the parser itself, it has more control over when and how to do that. It is also much easier to embed in a gen_server/gen_fsm (erlsom:parse_sax with a continuation_fun is not, as it is a blocking call that won't return).
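
For example, embedding this in a gen_server could look roughly like the sketch below (illustrative only: the module, function and message names are made up, it assumes chunks arriving as {tcp, Socket, Data} messages from a socket in active mode, and it assumes an endless stream, so the {ok, ...} end-of-document case is not handled):

-module(stream_parser).
-behaviour(gen_server).

-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link(Opts) ->
    gen_server:start_link(?MODULE, Opts, []).

init(Opts) ->
    {ok, #{parser => undefined, opts => Opts}}.

%% Feed each incoming chunk to the parser, handle the events of that chunk,
%% and hibernate until the next chunk arrives.
handle_info({tcp, _Socket, Data}, State = #{parser := Parser, opts := Opts}) ->
    {continue, Next, Events} =
        case Parser of
            undefined -> erlsom:parse_pull(Data, Opts);
            _ -> Parser(Data)
        end,
    lists:foreach(fun handle_event/1, Events),
    {noreply, State#{parser := Next}, hibernate}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

handle_event(Event) ->
    io:format("event: ~p~n", [Event]).

And here is the hibernation example I mentioned: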

-module(phibernation).
-compile(export_all).

send(Pid, []) ->
    Pid ! done;
send(Pid, [D|Rest]) ->
    Pid ! {data, D},
    receive after 100 -> ok end, %% just to simulate a little delay; in some cases minutes pass between chunks coming from the client side
    io:format("\nParser process memory usage: ~p\n", [erlang:process_info(Pid, memory)]),
    send(Pid, Rest).

test() ->
    io:format("\n*** Normal run\n"),
    test(fun() -> loop(false) end),
    io:format("\n*** Hibernate run\n"),
    test(fun() -> loop(true) end),
    io:format("\n*** Sax with continuation_fun\n"),
    test(fun() -> loop_with_callback(false) end),
    io:format("\n*** Sax with continuation_fun, hibernate (fail)\n"),
    test(fun() -> loop_with_callback(true) end),
    ok.

test(LoopFun) ->
    Data = [<<"<s:xstream xmlns:s='http://some-namespace.com' as='234' version='1.0'>">>, 
            <<"<a href='ab">>,
            <<"c'><b>ss</b></a>">>,
            <<"<a href='ab">>,
            <<"c'><b>ss</b></a>">>],
    Pid = spawn(LoopFun),
    send(Pid, Data).

loop(ShouldHibernate) ->
    loop(undefined, ShouldHibernate).
loop(Parser, ShouldHibernate) ->
    receive
        {data, Data} ->
            {continue, Parser2, Events} = case Parser of
                                               undefined -> erlsom:parse_pull(Data, [{output_encoding, utf8}]);
                                               _ -> Parser(Data)
                                          end,
            io:format("Parsed ~p  Events: ~p", [Data, Events]),
            case ShouldHibernate of
                %% Hibernate the process to reduce memory usage, since we expect that further data won't come in for a while
                true ->  erlang:hibernate(?MODULE, loop, [Parser2, ShouldHibernate]);
                false -> loop(Parser2, ShouldHibernate)
            end;
        done ->
            ok
    end.

loop_with_callback(ShouldHibernate) ->
    SaxCallbackState = [],
    F = fun(Ev, _) -> io:format("~p" ,[Ev]) end,
    Continuation = case ShouldHibernate of
                       true -> fun continuation_hibernated/2;
                       false -> fun continuation/2
                   end,
    erlsom:parse_sax(<<>>, SaxCallbackState, F, [{continuation_function, Continuation, []}, {output_encoding, utf8}]).

continuation(Tail, State) ->
    receive
        {data, Data} ->
            {<<Tail/binary, Data/binary>>, State}
    end.

continuation_hibernated(Tail, State) ->
    %% This won't work: erlang:hibernate discards the call stack (no erlsom:parse_sax/4 frame is left),
    %% so when this process wakes up, continuation/2 is called with nothing to return to and the process ends.
    erlang:hibernate(?MODULE, continuation, [Tail, State]).

Results (event prints removed for clarity):

*** Normal run
Parser process memory usage: {memory,8688}
Parser process memory usage: {memory,8688}
Parser process memory usage: {memory,13576}
Parser process memory usage: {memory,13576}
Parser process memory usage: {memory,16592}

*** Hibernate run
Parser process memory usage: {memory,2520}
Parser process memory usage: {memory,2288}
Parser process memory usage: {memory,2560}
Parser process memory usage: {memory,2288}
Parser process memory usage: {memory,2560}

*** Sax with continuation_fun
Parser process memory usage: {memory,10560}
Parser process memory usage: {memory,10560}
Parser process memory usage: {memory,13576}
Parser process memory usage: {memory,13576}
Parser process memory usage: {memory,8688}

*** Sax with continuation_fun, hibernate (fail)
Parser process memory usage: undefined
Parser process memory usage: undefined
Parser process memory usage: undefined
Parser process memory usage: undefined
Parser process memory usage: undefined

One could force a GC instead of hibernating, but it isn't the same.
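
A sketch of that alternative, with the same shape as loop/2 above (the GC compacts the heap in place, but the call stack is kept, so the footprint never gets as small as with erlang:hibernate/3):

%% Sketch: force a GC after each chunk instead of hibernating.
loop_with_gc(Parser) ->
    receive
        {data, Data} ->
            {continue, Parser2, _Events} =
                case Parser of
                    undefined -> erlsom:parse_pull(Data, [{output_encoding, utf8}]);
                    _ -> Parser(Data)
                end,
            erlang:garbage_collect(),
            loop_with_gc(Parser2);
        done ->
            ok
    end.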

willemdj (Owner) commented
Hi Pablo,

Just a short note to tell you that I did not forget about your idea.

I agree that support for hibernation would be nice, and the idea of a continuation is also nice. But I still don't like the idea of getting a list of events; to me it goes against the idea of erlsom, which (in my mind) is to look at sax events one by one.

So I would like to think about this a bit more when I have some more time.

(To get a bit more of a feel for the idea I wrote the small module below: a simple way to use the existing parser as the basis for a pull parser.)

-module(pull_parser).

-export([start/1, pull/1]).

%% export for use by this module only
-export([parse/1]).


%% starts a new process that parses up to the next event and then
%% waits until it receives a pull-request
start(Xml) ->
  spawn(?MODULE, parse, [Xml]).

pull(Parser) ->
  Parser ! {pull, self()},
  receive
    Event ->
      Event
  end.

%% internal functions

parse(Xml) ->
  erlsom:parse_sax(Xml, [], fun callback/2).

callback(endDocument, State) ->
  State;
callback(Event, State) ->
  receive
    {pull, Client} ->
       Client ! Event,
       State
  end.
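
For illustration, using it from the shell might look like this (a hypothetical session; the pid and the exact event terms are only what I would expect from erlsom's SAX output):

1> P = pull_parser:start(<<"<a><b>ss</b></a>">>).
<0.88.0>
2> pull_parser:pull(P).
startDocument
3> pull_parser:pull(P).
{startElement,[],"a",[],[]}
4> pull_parser:pull(P).
{startElement,[],"b",[],[]}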

Regards,
Willem
