
Parsing

Chiara Di Pietro edited this page Jan 30, 2021 · 3 revisions

After reading the source file specified in the corresponding configuration parameter, EVT parses the structure of the edition. At the moment, everything is based on pages (this will probably change once support for critical editions and pageless editions is added). A page is identified as the list of XML elements between a <pb/> and the next one (or, for the last page, between a <pb/> and the end of the node containing the main text, i.e. the <body>).
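The page-splitting rule can be sketched as follows. This is only an illustration, not EVT's actual implementation: it assumes the body has already been flattened into a document-ordered list of nodes, and SimpleNode, PageSlice and splitIntoPages are hypothetical names introduced here.

```typescript
// Minimal stand-in for an XML node; EVT works on real DOM nodes instead.
interface SimpleNode {
  tagName: string;
  attributes?: { [key: string]: string };
  text?: string;
}

// Hypothetical result shape: one entry per page.
interface PageSlice {
  id: string;
  nodes: SimpleNode[];
}

// Groups a flat, document-ordered list of body nodes into pages.
// Every <pb/> opens a new page; the last page runs to the end of the list.
// Nodes that precede the first <pb/> are skipped in this sketch.
function splitIntoPages(bodyNodes: SimpleNode[]): PageSlice[] {
  const result: PageSlice[] = [];
  let current: PageSlice | undefined;
  for (const node of bodyNodes) {
    if (node.tagName.toLowerCase() === 'pb') {
      const xmlId = node.attributes ? node.attributes['xml:id'] : undefined;
      current = { id: xmlId || `page_${result.length + 1}`, nodes: [] };
      result.push(current);
    } else if (current) {
      current.nodes.push(node);
    }
  }
  return result;
}

const samplePages = splitIntoPages([
  { tagName: 'pb', attributes: { 'xml:id': 'fol_1r' } },
  { tagName: 'p', text: 'First paragraph' },
  { tagName: 'lb' },
  { tagName: 'pb' },
  { tagName: 'p', text: 'Second page' },
]);
// samplePages[0] has id 'fol_1r' and two nodes; samplePages[1] has id 'page_2' and one node.
```

In a real TEI file a <pb/> can also be nested inside other elements, which a flat list does not capture; the sketch only shows the "from one <pb/> to the next" grouping.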

Each page is represented in the EVT Model as a Page:

interface Page {
    id: string;
    label: string;
    originalContent: OriginalEncodingNodeType[];
    parsedContent: Array<ParseResult<GenericElement>>;
}

The content of each page is therefore represented as an array of objects obtained by parsing the original XML nodes.

After parsing the structure, for each page identified, we then proceed to parse all the child nodes, by calling the parse method of the GenericParserService.

Parsers are defined in a map that associates a parser with each supported tagName. This map is used by the generic parsing function, which chooses the right parser based on the node type and its tagName. If a tag does not match a specific parser, the ElementParser, which does not add any logic to the parsing results, is used. Tags and parsers are grouped by the TEI module they belong to.

type AnalysisTags = 'w';
type CoreTags = 'lb' | 'note' | 'p';
type SupportedTagNames = AnalysisTags | CoreTags;

const analysisParseF: { [T in AnalysisTags]: Parser<XMLElement> } = {
    w: createParser(WordParser, parse),
};

const coreParseF: { [T in CoreTags]: Parser<XMLElement> } = {
    lb: createParser(LBParser, parse),
    note: createParser(NoteParser, parse),
    p: createParser(ParagraphParser, parse),
};

const parseF: { [T in SupportedTagNames]: Parser<XMLElement> } = {
    ...analysisParseF,
    ...coreParseF,
};

export function parse(xml: XMLElement): ParseResult<GenericElement> {
    if (!xml) { return { content: [xml] } as HTML; }
    // Text Node
    if (xml.nodeType === 3) { return createParser(TextParser, parse).parse(xml); }
    // Comment
    if (xml.nodeType === 8) { return {} as Comment; }
    const tagName = xml.tagName.toLowerCase();
    const parser: Parser<XMLElement> = parseF[tagName] || createParser(ElementParser, parse);

    return parser.parse(xml);
}

The generic parsing function (parse) to be used to parse the children of a specific node is passed to the parser as a parameter (NB: it is not retrieved as an import to avoid running into circular dependencies if the individual parsers are defined in different files).
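A self-contained miniature of this arrangement is sketched below, assuming heavily simplified stand-ins for the EVT types (SketchNode, SketchResult and the *Sketch* classes are hypothetical names, not part of EVT). It shows the two ideas together: the generic function is handed to each parser through its constructor, and unknown tags fall back to a default parser.

```typescript
// Simplified stand-ins for XML nodes and parse results (illustration only).
interface SketchNode { nodeType: number; tagName?: string; text?: string; children?: SketchNode[]; }
interface SketchResult { kind: string; content: Array<SketchResult | string>; }
type SketchParseFn = (node: SketchNode) => SketchResult;
interface SketchParser { parse(node: SketchNode): SketchResult; }

// Base class: stores the injected generic parse function instead of importing it.
class BaseSketchParser {
  constructor(protected genericParse: SketchParseFn) {}
  protected parseChildren(node: SketchNode): SketchResult[] {
    return (node.children || []).map((child) => this.genericParse(child));
  }
}

class TextSketchParser extends BaseSketchParser implements SketchParser {
  parse(node: SketchNode): SketchResult { return { kind: 'text', content: [node.text || ''] }; }
}

class ParagraphSketchParser extends BaseSketchParser implements SketchParser {
  parse(node: SketchNode): SketchResult { return { kind: 'paragraph', content: this.parseChildren(node) }; }
}

// Fallback used when no specific parser is registered for a tag.
class ElementSketchParser extends BaseSketchParser implements SketchParser {
  parse(node: SketchNode): SketchResult { return { kind: 'element', content: this.parseChildren(node) }; }
}

function createSketchParser<T extends SketchParser>(c: new (fn: SketchParseFn) => T, fn: SketchParseFn): T {
  return new c(fn);
}

// parseNode is a hoisted function declaration, so it can be referenced here.
const sketchParsers: { [tag: string]: SketchParser } = {
  p: createSketchParser(ParagraphSketchParser, parseNode),
};

function parseNode(node: SketchNode): SketchResult {
  // Text node
  if (node.nodeType === 3) { return createSketchParser(TextSketchParser, parseNode).parse(node); }
  const tag = (node.tagName || '').toLowerCase();
  const parser = sketchParsers[tag] || createSketchParser(ElementSketchParser, parseNode);
  return parser.parse(node);
}

const sampleTree: SketchNode = {
  nodeType: 1, tagName: 'p', children: [
    { nodeType: 3, text: 'Hello ' },
    { nodeType: 1, tagName: 'unknown', children: [{ nodeType: 3, text: 'world' }] },
  ],
};
const parsedTree = parseNode(sampleTree);
// parsedTree.kind is 'paragraph'; its second child was handled by the fallback parser.
```

Because each parser receives the generic function as an argument, a parser class defined in its own file would not need to import the module that holds the map.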

The return type of each parser is defined as follows:

type ParseResult<T extends GenericElement> = T | HTML | GenericElement | Attributes | Description | AttributesMap;

When handling the content of a node, the basic idea is this: when the content to parse is not known in advance, use the generic parser, which has the complete map of all the parsers and automatically performs the checks needed to choose the right one. When the content is known, call the specific parser directly to compose the required object.

Each parser implements the interface

interface Parser<T> { parse(data: T): ParseResult<GenericElement>; }

and it must be created using the createParser factory:

function createParser<U, T extends Parser<U>>(c: new (raw: ParseFn) => T, data: ParseFn): T { return new c(data); }

In order to set up the possibility of automatically parsing nodes (i.e. using the generic map indicated above), each parser must extend the EmptyParser and implement the parse function according to its needs:

class EmptyParser {
    genericParse: ParseFn;
    constructor(parseFn: ParseFn) { this.genericParse = parseFn; }
}

class ParagraphParser extends EmptyParser implements Parser<XMLElement> {
    attributeParser = createParser(AttributeParser, this.genericParse);
    parse(xml: XMLElement): Paragraph {
        const attributes = this.attributeParser.parse(xml);
        const paragraphComponent: Paragraph = {
            type: Paragraph,
            content: parseChildren(xml, this.genericParse),
            attributes,
            n: getDefaultN(attributes.n),
        };

        return paragraphComponent;
    }
}

Add a new parser

In order to add a new parser you need to follow some important steps:

  • Analyze the element you want to parse and define a data model that will represent it.
    • This data model should be defined as an interface that extends the GenericElement.
    • Try to be as specific as possible and define elements with their own type and interface, but also stay focused on the ultimate goal of your contribution: if an element is not strictly within the scope of your goal, you can define it as Array<ParseResult<GenericElement>> or ParseResult<GenericElement> and add a TODO comment reminding future contributors to add a specific type when that element is handled.
  • Once you have the interface, you can implement the parser for that specific element.
    • As indicated above, it should be defined as an extension of the EmptyParser and an implementation of Parser<XMLElement> and should return an element of the type you defined before.
    • Since the interface you defined before is an extension of the GenericElement, the element returned by the parser should also have:
      • a property type with the interface itself as value,
      • a property attributes containing a map of all the attributes of the node (check existing functions to see how to parse them without rewriting everything),
      • a property content containing all children properly parsed (check existing functions to see how to retrieve this list without rewriting everything),
      • possibly a property class containing the tagName of the node (check existing functions to see how to retrieve it without rewriting everything),
      • possibly a property path containing the xpath of the node (check existing functions to see how to retrieve it without rewriting everything).
  • Lastly you have to add your parser to the parsers map:
    • check which TEI module the element belongs to and add the tag name to the specific list and the parser in the specific map in xml-parsers/index.ts
    • if there is no map for the TEI module your element belongs to, please implement it both as a list for tags and a map for parsers, and add it to the main parsers map, similarly to what has been done for other modules.
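The steps above can be sketched end to end as follows, taking the TEI <hi> element purely as an illustration. Everything here is an assumption made for the sake of a self-contained example: XmlLike, GenericElementSketch, HighlightSketch, HighlightParserSketch, the rend property and the map name are hypothetical stand-ins, and the models are written as classes so that they can also be used as the runtime value of the type property.

```typescript
// Stand-ins for EVT's own types (illustration only).
interface XmlLike { tagName: string; attributes: { [key: string]: string }; children: XmlLike[]; }
type ParseFnSketch = (xml: XmlLike) => GenericElementSketch;

class GenericElementSketch {
  type: Function = GenericElementSketch; // constructor used as a type marker
  attributes: { [key: string]: string } = {};
  content: GenericElementSketch[] = [];
  class?: string;
}

// Step 1: data model for the new element, extending the generic one.
class HighlightSketch extends GenericElementSketch {
  rend = '';
}

class EmptyParserSketch {
  constructor(protected genericParse: ParseFnSketch) {}
}

// Step 2: parser that extends the empty parser and returns the new model.
class HighlightParserSketch extends EmptyParserSketch {
  parse(xml: XmlLike): HighlightSketch {
    return {
      type: HighlightSketch,
      attributes: { ...xml.attributes },
      content: xml.children.map((child) => this.genericParse(child)),
      class: xml.tagName.toLowerCase(),
      rend: xml.attributes['rend'] || '',
    };
  }
}

function createParserSketch<T>(c: new (fn: ParseFnSketch) => T, fn: ParseFnSketch): T {
  return new c(fn);
}

// Step 3: register the tag name and its parser in the map of the relevant module.
const coreParsersSketch: { [tag: string]: { parse(xml: XmlLike): GenericElementSketch } } = {
  hi: createParserSketch(HighlightParserSketch, parseSketch),
};

function parseSketch(xml: XmlLike): GenericElementSketch {
  const tag = xml.tagName.toLowerCase();
  const parser = coreParsersSketch[tag];
  if (parser) { return parser.parse(xml); }
  // Fallback: generic element with no extra logic.
  return {
    type: GenericElementSketch,
    attributes: { ...xml.attributes },
    content: xml.children.map(parseSketch),
    class: tag,
  };
}

const hiNode: XmlLike = {
  tagName: 'hi',
  attributes: { rend: 'italic' },
  children: [{ tagName: 'w', attributes: {}, children: [] }],
};
const parsedHi = parseSketch(hiNode) as HighlightSketch;
// parsedHi.type is HighlightSketch, parsedHi.rend is 'italic', and the <w> child is a generic element.
```

In EVT itself, attributes, children and the class name would be obtained through the existing helper functions mentioned in the list rather than by hand as in this sketch.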

ContentViewerComponent

This is a dynamic component that takes a ParsedElement as input and establishes which component to use for displaying this data based on the type indicated in the type property.

This type is used to manage the component register, to be accessed for dynamic compilation, and also the type of data that the component in question receives as input:

/* component-register.service.ts */
const COMPONENT_MAP: Map<Type<any>> = {};

export function register(dataType: Type<any>) {
  return (cls: Type<any>) => {
      COMPONENT_MAP[dataType.name] = cls;
  };
}

@Injectable({
  providedIn: 'root',
})
export class ComponentRegisterService {

  getComponent(dataType: Type<any>) {
    return COMPONENT_MAP[dataType.name];
  }
}

/* paragraph.component.ts */

@Component({
  selector: 'evt-paragraph',
  templateUrl: './paragraph.component.html',
  styleUrls: ['./paragraph.component.scss'],
})

@register(Paragraph)
export class ParagraphComponent {
  @Input() data: Paragraph;
}
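
The registration mechanism can be tried outside Angular with the following minimal sketch. It assumes plain classes in place of Angular components and calls the decorator factory by hand instead of using the @ syntax; registrySketch, registerSketch, getComponentSketch and the sample classes are hypothetical names.

```typescript
// Any class constructor; stands in for Angular's Type<any>.
type Ctor = new (...args: any[]) => object;

// Plain object keyed by the name of the data type.
const registrySketch: { [dataTypeName: string]: Ctor } = {};

// Decorator factory: associates a view class with the data type it displays.
function registerSketch(dataType: Ctor) {
  return (cls: Ctor): void => {
    registrySketch[dataType.name] = cls;
  };
}

// Lookup used by a dynamic viewer to pick the class for a parsed element.
function getComponentSketch(dataType: Ctor): Ctor | undefined {
  return registrySketch[dataType.name];
}

// Sample data types and one view class.
class ParagraphData {}
class NoteData {}
class ParagraphView { label = 'paragraph view'; }

// Equivalent to decorating ParagraphView with @registerSketch(ParagraphData).
registerSketch(ParagraphData)(ParagraphView);

const foundView = getComponentSketch(ParagraphData);  // ParagraphView
const missingView = getComponentSketch(NoteData);     // undefined, nothing registered
```

Since the registry is keyed on the class name, two data types with the same name would overwrite each other, so unique names for data types are assumed here.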