Namu Wiki Extractor

This library strips all namu marks from a namu wiki document and extracts its plain text only.

Requirement

Python 3

Installation

pip install namu-wiki-extractor

Usage

Basic

import json
from namuwiki.extractor import extract_text

with open('namu_wiki.json', 'r', encoding='utf-8') as input_file:
    namu_wiki = json.load(input_file)

item = namu_wiki[1]
plain_text = extract_text(item['text'])
print(plain_text)

Extract deletions and footnotes separately

import json
from namuwiki.extractor import extract_text

with open('namu_wiki.json', 'r', encoding='utf-8') as input_file:
    namu_wiki = json.load(input_file)

item = namu_wiki[1]
document = extract_text(item['text'], separate_deletions=True, separate_footnotes=True)
print(document.text)
print(document.deletions)
print(document.footnotes)

Multiprocessing

import json
from multiprocessing import Pool

from namuwiki.extractor import extract_text

def work(document):
    return {
        'title': document['title'],
        'content': extract_text(document['text'])
    }

with open('namu_wiki.json', 'r', encoding='utf-8') as input_file:
    namu_wiki = json.load(input_file)

with Pool() as pool:
    items = pool.map(work, namu_wiki)

API

namuwiki.extractor.extract_text(source: str, separate_deletions: bool = False, separate_footnotes: bool = False) -> Union[str, Document]

This function strips all namu marks from source and extracts its plain text. If either separate_deletions or separate_footnotes is True, this returns extracted plain text as str. Otherwise, this returns extracted plain text, deletions and footnotes as Document

Parameter

source: Text from a namu wiki document
separate_deletions: Whether deletions should be separately extracted from the source
separate_footnotes: Whether footnotes should be separately extracted from the source

namuwiki.extractor.Document(text: str, deletions: List[str], footnotes: List[str])

text: Plain text with all namu marks removed from the given source
deletions: Separately extracted deletions from the given source
footnotes: Separately extracted footnotes from the given source

Note

A JSON dump file of namu wiki can be downloaded from here

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Namu Wiki Extractor

Requirement

Installation

Usage

Basic

Extract deletions and footnotes separately

Multiprocessing

API

namuwiki.extractor.extract_text(source: str, separate_deletions: bool = False, separate_footnotes: bool = False) -> Union[str, Document]

Parameter

namuwiki.extractor.Document(text: str, deletions: List[str], footnotes: List[str])

Note

Files

README.md

Latest commit

History

README.md

File metadata and controls

Namu Wiki Extractor

Requirement

Installation

Usage

Basic

Extract deletions and footnotes separately

Multiprocessing

API

namuwiki.extractor.extract_text(source: str, separate_deletions: bool = False, separate_footnotes: bool = False) -> Union[str, Document]

Parameter

namuwiki.extractor.Document(text: str, deletions: List[str], footnotes: List[str])

Note