Parsing
After reading the source file indicated by the corresponding configuration parameter, EVT parses the structure of the edition. At the moment, everything is based on pages (this will probably change once support for critical editions and pageless editions is added).
A page is identified as the list of XML elements included between a <pb/> and the next one (or, for the last page, between a <pb/> and the end of the node containing the main text, which is the <body>).
Each page is represented in the EVT Model as a Page:
interface Page {
  id: string;
  label: string;
  originalContent: OriginalEncodingNodeType[];
  parsedContent: Array<ParseResult<GenericElement>>;
}
The content of each page is therefore represented as an array of objects obtained by parsing the original XML nodes.
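The page identification described above can be sketched as follows. This is only an illustration under simplifying assumptions: SimpleNode, PageSketch and splitIntoPages are hypothetical names, nodes are plain objects rather than DOM nodes, every <pb/> is assumed to be a direct child of the <body>, and the way id and label are derived from the <pb/> attributes is a guess, not necessarily what EVT does.

```typescript
// Simplified stand-in for an XML node (assumption: the real code works on DOM nodes).
interface SimpleNode {
  tagName: string;
  attributes: { [name: string]: string | undefined };
  text?: string;
}

// Reduced version of the Page interface, holding only what this sketch needs.
interface PageSketch {
  id: string;
  label: string;
  originalContent: SimpleNode[];
}

// Groups the flat list of <body> children into pages delimited by <pb/>.
// Nodes that appear before the first <pb/> are ignored in this sketch.
function splitIntoPages(bodyChildren: SimpleNode[]): PageSketch[] {
  const pages: PageSketch[] = [];
  let current: PageSketch | undefined;
  for (const node of bodyChildren) {
    if (node.tagName.toLowerCase() === 'pb') {
      // A new <pb/> closes the previous page and opens the next one.
      const id = node.attributes['xml:id'] || `page_${pages.length + 1}`;
      current = { id, label: node.attributes['n'] || id, originalContent: [] };
      pages.push(current);
    } else if (current) {
      current.originalContent.push(node);
    }
  }
  return pages;
}

const samplePages = splitIntoPages([
  { tagName: 'pb', attributes: { 'xml:id': 'fol_1r', n: '1r' } },
  { tagName: 'p', attributes: {}, text: 'First page text' },
  { tagName: 'pb', attributes: { n: '1v' } },
  { tagName: 'p', attributes: {}, text: 'Second page text' },
  { tagName: 'lb', attributes: {} },
]);
```

The last page simply collects everything up to the end of the list, which mirrors the "between a <pb/> and the end of the <body>" case.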
After parsing the structure, we proceed, for each page identified, to parse all of its child nodes by calling the parse method of the GenericParserService.
Parsers are defined in a map that associates a parser with each supported tagName. This map is read by the generic parsing function, which chooses the right parser based on the node type and its tagName. If a tag does not match a specific parser, the ElementParser is used, which does not add any logic to the parsing results. Tags and parsers are grouped by the TEI module they belong to.
type AnalysisTags = 'w';
type CoreTags = 'lb' | 'note' | 'p';
type SupportedTagNames = AnalysisTags | CoreTags;

const analysisParseF: { [T in AnalysisTags]: Parser<XMLElement> } = {
  w: createParser(WordParser, parse),
};

const coreParseF: { [T in CoreTags]: Parser<XMLElement> } = {
  lb: createParser(LBParser, parse),
  note: createParser(NoteParser, parse),
  p: createParser(ParagraphParser, parse),
};

const parseF: { [T in SupportedTagNames]: Parser<XMLElement> } = {
  ...analysisParseF,
  ...coreParseF,
};

export function parse(xml: XMLElement): ParseResult<GenericElement> {
  if (!xml) { return { content: [xml] } as HTML; }
  // Text Node
  if (xml.nodeType === 3) { return createParser(TextParser, parse).parse(xml); }
  // Comment
  if (xml.nodeType === 8) { return {} as Comment; }
  const tagName = xml.tagName.toLowerCase();
  const parser: Parser<XMLElement> = parseF[tagName] || createParser(ElementParser, parse);
  return parser.parse(xml);
}
The generic parsing function (parse) used to parse the children of a specific node is passed to each parser as a parameter. NB: it is not retrieved as an import, to avoid running into circular dependencies if the individual parsers are defined in different files.
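To make the dispatch and the injection of the generic function concrete, here is a self-contained sketch that runs outside EVT. SketchNode, Parsed and the sample input are simplified stand-ins introduced for illustration only; the class names mirror the ones mentioned above, but their bodies are assumptions and not the actual EVT implementations.

```typescript
// Simplified stand-ins for the XML node and the parse result (assumptions).
interface SketchNode { nodeType: number; tagName?: string; text?: string; children?: SketchNode[]; }
interface Parsed { type: string; content: Parsed[]; text?: string; }
type ParseFn = (node: SketchNode) => Parsed;
interface Parser { parse(node: SketchNode): Parsed; }

// Every parser receives the generic parse function instead of importing it.
class EmptyParser {
  genericParse: ParseFn;
  constructor(parseFn: ParseFn) { this.genericParse = parseFn; }
}

// Fallback parser: no element-specific logic, it only recurses into children.
class ElementParser extends EmptyParser implements Parser {
  parse(node: SketchNode): Parsed {
    return { type: 'element', content: (node.children || []).map((c) => this.genericParse(c)) };
  }
}

class ParagraphParser extends EmptyParser implements Parser {
  parse(node: SketchNode): Parsed {
    return { type: 'paragraph', content: (node.children || []).map((c) => this.genericParse(c)) };
  }
}

function createParser<T extends Parser>(c: new (fn: ParseFn) => T, fn: ParseFn): T {
  return new c(fn);
}

// Map from tag name to parser; `parse` is a hoisted function declaration,
// so it can be referenced here before its definition appears in the file.
const parseF: { [tagName: string]: Parser | undefined } = {
  p: createParser(ParagraphParser, parse),
};

function parse(node: SketchNode): Parsed {
  // Text nodes (nodeType 3) are handled before the tag name lookup.
  if (node.nodeType === 3) { return { type: 'text', content: [], text: node.text || '' }; }
  const tagName = (node.tagName || '').toLowerCase();
  // Unknown tags fall back to the ElementParser.
  const parser = parseF[tagName] || createParser(ElementParser, parse);
  return parser.parse(node);
}

const parsedParagraph = parse({
  nodeType: 1,
  tagName: 'p',
  children: [
    { nodeType: 3, text: 'Hello ' },
    { nodeType: 1, tagName: 'unknownTag', children: [{ nodeType: 3, text: 'world' }] },
  ],
});
```

Because each parser only holds a reference to the function it was given, a parser file never needs to import the module that defines the map, which is what keeps the dependency graph acyclic.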
The return type of each parser is defined as follows:
type ParseResult<T extends GenericElement> = T | HTML | GenericElement | Attributes | Description | AttributesMap;
When handling the content of a node, the basic idea is this: when it is not known in advance what needs to be parsed, use the generic parser, which holds the complete map of all the parsers and automatically performs the checks needed to choose the right one. When the content to parse is known, call the specific parser directly to compose the required object.
Each parser implements the interface
interface Parser<T> { parse(data: T): ParseResult<GenericElement>; }
and must be created using the createParser factory:
function createParser<U, T extends Parser<U>>(c: new (raw: ParseFn) => T, data: ParseFn): T { return new c(data); }
To enable automatic parsing of nodes (i.e. using the generic map indicated above), each parser must extend the EmptyParser and implement the parse function according to its needs:
class EmptyParser {
  genericParse: ParseFn;
  constructor(parseFn: ParseFn) { this.genericParse = parseFn; }
}

class ParagraphParser extends EmptyParser implements Parser<XMLElement> {
  attributeParser = createParser(AttributeParser, this.genericParse);
  parse(xml: XMLElement): Paragraph {
    const attributes = this.attributeParser.parse(xml);
    const paragraphComponent: Paragraph = {
      type: Paragraph,
      content: parseChildren(xml, this.genericParse),
      attributes,
      n: getDefaultN(attributes.n),
    };
    return paragraphComponent;
  }
}
In order to add a new parser, follow these steps:
- Analyze the element you want to parse and define a data model that will represent it.
  - This data model should be defined as an interface that extends the GenericElement.
  - Try to be as specific as possible and define elements with their own type and interface, but also stay focused on the ultimate goal of your contribution: if an element is not strictly within the scope of your goal, you can define it as Array<ParseResult<GenericElement>> or ParseResult<GenericElement> and add a TODO comment reminding future contributors to add a specific type when that particular element is handled.
- Once you have the interface, you can implement the parser for that specific element.
  - As indicated above, it should be defined as an extension of the EmptyParser and an implementation of Parser<XMLElement>, and it should return an element of the type you defined before.
  - Since the interface you defined before is an extension of the GenericElement, the element returned by the parser should also have:
    - a property type with the interface itself as value,
    - a property attributes containing a map of all the attributes of the node (check existing functions to see how to parse them without rewriting everything),
    - a property content containing all children properly parsed (check existing functions to see how to retrieve this list without rewriting everything),
    - possibly a property class containing the tagName of the node (check existing functions to see how to retrieve it without rewriting everything),
    - possibly a property path containing the XPath of the node (check existing functions to see how to retrieve it without rewriting everything).
- Lastly, add your parser to the parsers map:
  - check which TEI module the element belongs to, then add the tag name to the corresponding list and the parser to the corresponding map in xml-parsers/index.ts
  - if there is no map for the TEI module your element belongs to, please implement it both as a list for tags and a map for parsers, and add it to the main parsers map, similarly to what has been done for other modules.
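The three steps above can be sketched for the TEI <hi> element. Everything in this block is illustrative: Highlight, HighlightParser, SketchXml and the reduced GenericElement are hypothetical, the data model is declared as a class so that it can be used as the value of type (an assumption about how the real models are declared), and attributes and children are read from plain objects instead of going through the existing EVT helper functions.

```typescript
type Attributes = { [name: string]: string };
// Simplified stand-in for an XML node (assumption).
interface SketchXml { nodeType: number; tagName?: string; attributes?: Attributes; children?: SketchXml[]; }

// Reduced version of GenericElement with the properties listed above.
interface GenericElement { type: unknown; attributes: Attributes; content: GenericElement[]; class?: string; }

// Step 1: data model for the new element, extending the generic one.
class Highlight implements GenericElement {
  type: unknown;
  attributes: Attributes = {};
  content: GenericElement[] = [];
  class?: string;
  rend = '';
}

type ParseFn = (xml: SketchXml) => GenericElement;
interface Parser { parse(xml: SketchXml): GenericElement; }
class EmptyParser {
  genericParse: ParseFn;
  constructor(parseFn: ParseFn) { this.genericParse = parseFn; }
}
function createParser<T extends Parser>(c: new (fn: ParseFn) => T, fn: ParseFn): T { return new c(fn); }

// Step 2: parser extending EmptyParser and returning the new type.
class HighlightParser extends EmptyParser implements Parser {
  parse(xml: SketchXml): Highlight {
    const attributes = xml.attributes || {};
    return {
      type: Highlight,
      attributes,
      // Children are of unknown kind, so they go through the generic parser.
      content: (xml.children || []).map((child) => this.genericParse(child)),
      class: (xml.tagName || '').toLowerCase(),
      rend: attributes['rend'] || '',
    };
  }
}

// Step 3: add the tag to its TEI module list and the parser to the module map.
type CoreTags = 'hi';
const coreParseF: { [T in CoreTags]: Parser } = {
  hi: createParser(HighlightParser, parse),
};
const parseF: { [tagName: string]: Parser | undefined } = { ...coreParseF };

function parse(xml: SketchXml): GenericElement {
  if (xml.nodeType === 3) { return { type: 'text', attributes: {}, content: [] }; }
  const tagName = (xml.tagName || '').toLowerCase();
  const parser = parseF[tagName];
  if (parser) { return parser.parse(xml); }
  // Generic fallback for tags without a dedicated parser.
  return { type: 'generic', attributes: xml.attributes || {}, content: (xml.children || []).map(parse), class: tagName };
}

const parsedHi = parse({ nodeType: 1, tagName: 'hi', attributes: { rend: 'italic' }, children: [{ nodeType: 3 }] });
```

In the real code base the attribute map and the child list would more likely come from the existing helpers mentioned in the steps, rather than being read by hand as done here.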
This is a dynamic component that takes a ParsedElement as input and determines which component to use for displaying this data, based on the type indicated in the type property.
This type is used as the key of the component register, which is accessed for dynamic compilation, and it also determines the type of data that the component in question receives as input:
/* component-register.service.ts */
const COMPONENT_MAP: Map<Type<any>> = {};

export function register(dataType: Type<any>) {
  return (cls: Type<any>) => {
    COMPONENT_MAP[dataType.name] = cls;
  };
}

@Injectable({
  providedIn: 'root',
})
export class ComponentRegisterService {
  getComponent(dataType: Type<any>) {
    return COMPONENT_MAP[dataType.name];
  }
}

/* paragraph.component.ts */
@Component({
  selector: 'evt-paragraph',
  templateUrl: './paragraph.component.html',
  styleUrls: ['./paragraph.component.scss'],
})
@register(Paragraph)
export class ParagraphComponent {
  @Input() data: Paragraph;
}
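The lookup performed through the register can be tried outside Angular with the following sketch. It is a simplified assumption of the mechanism: Constructor, getComponent, resolveComponent and the Note class are illustrative names, the decorator is applied as a plain function call so that the example does not depend on decorator compiler settings, and no Angular dependency injection is involved.

```typescript
// Any class constructor; stands in for Angular's Type<any> in this sketch.
type Constructor = new (...args: any[]) => object;

// Register keyed by the name of the data type class.
const COMPONENT_MAP: { [dataTypeName: string]: Constructor | undefined } = {};

// Returns a function that stores `cls` as the component for `dataType`.
function register(dataType: Constructor) {
  return (cls: Constructor): void => {
    COMPONENT_MAP[dataType.name] = cls;
  };
}

function getComponent(dataType: Constructor): Constructor | undefined {
  return COMPONENT_MAP[dataType.name];
}

// Data models (declared as classes so they exist at runtime).
class Paragraph { n = ''; }
class Note { noteType = ''; }

// Component that displays a Paragraph.
class ParagraphComponent { data?: Paragraph; }

// Equivalent to decorating ParagraphComponent with @register(Paragraph).
register(Paragraph)(ParagraphComponent);

// What the dynamic component does conceptually: read the `type` property
// of the parsed data and look up the component registered for it.
function resolveComponent(parsed: { type: Constructor }): Constructor | undefined {
  return getComponent(parsed.type);
}

const resolvedComponent = resolveComponent({ type: Paragraph });
const missingComponent = resolveComponent({ type: Note });
```

Since the key is the class name, two data types with the same name would overwrite each other, and name-mangling minifiers can change the keys; both points are worth keeping in mind when adding new data types.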
User feedback is very much appreciated: please send all comments, suggestions, bug reports, etc. to [email protected]. More details about the project are available on our website http://evt.labcd.unipi.it/.