Parsing

Chiara Di Pietro edited this page Oct 6, 2020 · 3 revisions

After reading the source file specified in the corresponding configuration parameter, EVT parses the structure of the edition. At the moment, everything is based on pages (this will probably change when support for critical editions and pageless editions is added). A page is identified as the list of XML elements included between one <pb/> and the next or, for the last page, between a <pb/> and the end of the node containing the main text, i.e. the <body>.

Each page is represented in the EVT Model as a Page:

interface Page {
    id: string;
    label: string;
    originalContent: OriginalEncodingNodeType[];
    parsedContent: Array<ParseResult<GenericElement>>;
}

The content of each page is therefore represented as an array of objects obtained by parsing the original XML nodes.
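The page identification described above can be sketched as follows. This is only a minimal illustration, not EVT's actual implementation: it works on a simplified in-memory node type instead of real DOM nodes, it assumes that the <pb/> elements are direct children of <body>, and it assumes that the page id and label come from the xml:id and n attributes. The names SketchNode, SketchPage and splitIntoPages are hypothetical.

```typescript
// Simplified stand-in for an XML node (hypothetical; EVT works on real DOM nodes).
interface SketchNode {
    tagName: string;
    attributes: Record<string, string>;
    text: string;
}

// Reduced version of the Page model: only the fields needed for this sketch.
interface SketchPage {
    id: string;
    label: string;
    originalContent: SketchNode[];
}

// Groups the children of <body> into pages: every <pb/> opens a new page, and each
// following node belongs to that page until the next <pb/> or the end of <body>.
function splitIntoPages(bodyChildren: SketchNode[]): SketchPage[] {
    const pages: SketchPage[] = [];
    let current: SketchPage | undefined;
    for (const node of bodyChildren) {
        if (node.tagName === 'pb') {
            // Assumption: id and label are read from @xml:id and @n.
            const id = node.attributes['xml:id'] ?? `page_${pages.length + 1}`;
            current = { id, label: node.attributes['n'] ?? id, originalContent: [] };
            pages.push(current);
        } else if (current) {
            current.originalContent.push(node);
        }
        // Nodes that precede the first <pb/> are ignored in this sketch.
    }
    return pages;
}

const body: SketchNode[] = [
    { tagName: 'pb', attributes: { 'xml:id': 'fol_1r', n: '1r' }, text: '' },
    { tagName: 'p', attributes: {}, text: 'First paragraph' },
    { tagName: 'p', attributes: {}, text: 'Second paragraph' },
    { tagName: 'pb', attributes: { 'xml:id': 'fol_1v', n: '1v' }, text: '' },
    { tagName: 'p', attributes: {}, text: 'Third paragraph' },
];

// Two pages: the first holds two paragraphs, the second holds one.
const pages = splitIntoPages(body);
```

In this sketch the <pb/> element itself is not stored in originalContent; whether the real parser keeps it is not stated here.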

After parsing the structure, we parse all the child nodes of each identified page by calling the parse method of the GenericParserService.

Parsers are defined in a map that associates a parser with each supported tagName. This map is retrieved by the generic parsing function which chooses the right parser based on the node type and its tagName. If a tag does not match a specific parser, the ElementParser, which does not add any logic to the parsing results, is used.

type SupportedTagNames = 'lb' | 'note' | 'p';

const parseF: { [T in SupportedTagNames]: Parser<XMLElement> } = {
    lb: createParser(LBParser, parse),
    note: createParser(NoteParser, parse),
    p: createParser(ParagraphParser, parse),
}

export function parse(xml: XMLElement): ParseResult<GenericElement> {
    if (!xml) { return { content: [xml] } as HTML; }
    // Text Node
    if (xml.nodeType === 3) { return createParser(TextParser, parse).parse(xml); }
    // Comment
    if (xml.nodeType === 8) { return {} as Comment; }
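    const tagName = xml.tagName.toLowerCase();
    // Look up the specific parser; the cast is needed because tagName is a plain string.
    // Unsupported tags fall back to the ElementParser.
    const parser: Parser<XMLElement> = parseF[tagName as SupportedTagNames] || createParser(ElementParser, parse);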

    return parser.parse(xml);
}

The generic parsing function (parse) to be used to parse the children of a specific node is passed to the parser as a parameter (NB: it is not retrieved as an import to avoid running into circular dependencies if the individual parsers are defined in different files).
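The dispatch and injection mechanism can be reduced to the following self-contained sketch. It is not the EVT code: the node and result types are deliberately simplified, a Map is used in place of the typed object map, and all names prefixed with Mini are hypothetical. What it is meant to show is that each parser receives the generic function through its constructor, so a parser class never needs to import the module that holds the map.

```typescript
// Simplified stand-ins for XMLElement and ParseResult (hypothetical types).
interface MiniNode { tagName: string; text: string; children: MiniNode[]; }
interface MiniResult { type: string; text: string; content: MiniResult[]; }

type MiniParseFn = (node: MiniNode) => MiniResult;
interface MiniParser { parse(node: MiniNode): MiniResult; }

// Fallback used when no specific parser is registered for a tag.
class MiniElementParser implements MiniParser {
    genericParse: MiniParseFn;
    constructor(parseFn: MiniParseFn) { this.genericParse = parseFn; }
    parse(node: MiniNode): MiniResult {
        return {
            type: node.tagName,
            text: node.text,
            content: node.children.map((child) => this.genericParse(child)),
        };
    }
}

// Specific parser for <p>: it receives the generic function instead of importing it.
class MiniParagraphParser implements MiniParser {
    genericParse: MiniParseFn;
    constructor(parseFn: MiniParseFn) { this.genericParse = parseFn; }
    parse(node: MiniNode): MiniResult {
        return {
            type: 'Paragraph',
            text: node.text,
            content: node.children.map((child) => this.genericParse(child)),
        };
    }
}

// Map from tag name to parser. miniParse is a hoisted function declaration,
// so it can already be passed to the constructors here.
const miniParsers = new Map<string, MiniParser>([
    ['p', new MiniParagraphParser(miniParse)],
]);

// Generic entry point: picks the specific parser or falls back to the element parser.
function miniParse(node: MiniNode): MiniResult {
    const parser = miniParsers.get(node.tagName.toLowerCase()) ?? new MiniElementParser(miniParse);
    return parser.parse(node);
}

const tree: MiniNode = {
    tagName: 'p',
    text: '',
    children: [{ tagName: 'hi', text: 'emphasised', children: [] }],
};

// The <p> is handled by MiniParagraphParser, its <hi> child by the fallback.
const result = miniParse(tree);
```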

The return type of each parser is defined as follows:

type ParseResult<T extends GenericElement> = T | HTML | GenericElement | Attributes | Description | AttributesMap;

When handling the content of a node, the basic idea is this: when the type of content is not known in advance, use the generic parser, which holds the complete map of all the parsers and automatically chooses the right one. When the type of content is known, call the specific parser directly to compose the required object.

Each parser implements the interface

interface Parser<T> { parse(data: T): ParseResult<GenericElement>; }

and it must be created using the createParser factory:

function createParser<U, T extends Parser<U>>(c: new (raw: ParseFn) => T, data: ParseFn): T { return new c(data); }

To enable automatic parsing of nodes (i.e. through the generic map shown above), each parser must extend the EmptyParser and implement the parse function according to its needs:

class EmptyParser {
    genericParse: ParseFn;
    constructor(parseFn: ParseFn) { this.genericParse = parseFn; }
}

class ParagraphParser extends EmptyParser implements Parser<XMLElement> {
    attributeParser = createParser(AttributeParser, this.genericParse);
    parse(xml: XMLElement): Paragraph {
        const attributes = this.attributeParser.parse(xml);
        const paragraphComponent: Paragraph = {
            type: Paragraph,
            content: parseChildren(xml, this.genericParse),
            attributes,
            n: getDefaultN(attributes.n),
        };

        return paragraphComponent;
    }
}
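
Following the same pattern, a new parser could look roughly like the sketch below. It is an illustration under stated assumptions rather than code from EVT: HeadingParser, SketchElement, SketchResult and genericParseStub are hypothetical, the result type is a simplified replacement for ParseResult, and the stub stands in for the real generic parse function. EmptyParser and createParser are repeated only to keep the example self-contained.

```typescript
// Simplified stand-ins for EVT's XMLElement and ParseResult (assumptions).
interface SketchElement { tagName: string; textContent: string; children: SketchElement[]; }
interface SketchResult { type: string; text: string; content: SketchResult[]; }

type ParseFn = (xml: SketchElement) => SketchResult;
interface Parser<T> { parse(data: T): SketchResult; }

function createParser<U, T extends Parser<U>>(c: new (raw: ParseFn) => T, data: ParseFn): T {
    return new c(data);
}

class EmptyParser {
    genericParse: ParseFn;
    constructor(parseFn: ParseFn) { this.genericParse = parseFn; }
}

// Hypothetical parser for <head> elements, modelled on ParagraphParser.
class HeadingParser extends EmptyParser implements Parser<SketchElement> {
    parse(xml: SketchElement): SketchResult {
        return {
            type: 'Heading',
            text: xml.textContent,
            // Children of unknown type are delegated to the injected generic function.
            content: xml.children.map((child) => this.genericParse(child)),
        };
    }
}

// Stand-in for the generic parse function: wraps any node without adding logic.
function genericParseStub(xml: SketchElement): SketchResult {
    return {
        type: xml.tagName,
        text: xml.textContent,
        content: xml.children.map((child) => genericParseStub(child)),
    };
}

// The caller knows it is dealing with a <head>, so it uses the specific parser directly.
const headingParser = createParser(HeadingParser, genericParseStub);
const heading = headingParser.parse({
    tagName: 'head',
    textContent: 'Incipit liber',
    children: [{ tagName: 'hi', textContent: 'Incipit', children: [] }],
});
```

To make such a parser available to the generic function as well, its tag name would also have to be added to SupportedTagNames and to the parser map shown earlier.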