This package cannot be used alone: the ezs framework has to be installed.

```js
import ezs from 'ezs';
import teeftfr from 'ezs-teeftfr';

ezs.use(teeftfr);

process.stdin
    .pipe(ezs('STATEMENT_NAME', { STATEMENT_PARAMETERS }))
    .pipe(process.stdout);
```
The sequence of statements is determined by the structure of the expected input.
```
[ "/path/to/a/directory/of/documents" ] -->

[ListFiles]
pattern = *.txt

--> [ "/path1", "/path2", ... ] -->

[GetFilesContent]

--> [ { path, content }, ... ] -->

[TEEFTSentenceTokenize]

--> [ { path, sentences: [ "sentence", ... ] }, ... ] -->

[TEEFTTokenize]

--> [ { path, sentences: [ [ "token", ... ], ... ] }, ... ] -->

[TEEFTNaturalTag]

--> [ { path, sentences: [ [ { token: "token", tag: [ "tag", ... ] }, ... ], ... ] }, ... ] -->

[TEEFTExtractTerms]
nounTag = NOM
adjTag = ADJ

--> [ { path, terms: [
        { term: "monoterm", tag: [ "tag", ... ], frequency, length },
        { term: "multiterm", frequency, length },
        ...
    ] }, ... ] -->

[TEEFTFilterTags]
tags = NOM
tags = ADJ

--> (same structure as above) -->

[TEEFTStopWords]

--> (same structure as above) -->

[TEEFTSumUpFrequencies]

--> (same structure as above) -->

[TEEFTSpecificity]
sort = true

--> [ { path, terms: [
        { term: "monoterm", tag: [ "tag", ... ], frequency, length, specificity },
        { term: "multiterm", frequency, length, specificity },
        ...
    ] }, ... ] -->

[TEEFTFilterMonoFreq]

--> (same structure as above) -->

[TEEFTFilterMultiSpec]

--> (same structure as above) -->

[JSONString]
wrap = true
indent = true
```
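For reference, the pipeline above can be expressed as an `.ezs` script using the statement names from this package (the actual contents of `examples/teeftfr.ezs` may differ in details):

```ini
[ListFiles]
pattern = *.txt

[GetFilesContent]

[TEEFTSentenceTokenize]

[TEEFTTokenize]

[TEEFTNaturalTag]

[TEEFTExtractTerms]
nounTag = NOM
adjTag = ADJ

[TEEFTFilterTags]
tags = NOM
tags = ADJ

[TEEFTStopWords]

[TEEFTSumUpFrequencies]

[TEEFTSpecificity]
sort = true

[TEEFTFilterMonoFreq]

[TEEFTFilterMultiSpec]

[JSONString]
wrap = true
indent = true
```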
To run the example `examples/teeftfr.ezs`, you have to:

- install ezs
- install ezs-teeftfr
- install ezs-basics
- run the script

That is to say:

```bash
npm i ezs
npm i ezs-teeftfr
npm i ezs-basics
echo examples/data/fr-articles | npx ezs ./examples/teeftfr.ezs
```

You can even use jq to beautify the JSON in the output.
- TEEFTExtractTerms
- TEEFTFilterMonoFreq
- TEEFTFilterMultiSpec
- TEEFTFilterTags
- GetFilesContent
- ListFiles
- natural-tag
- profile
- TEEFTSentenceTokenize
- TEEFTSpecificity
- TEEFTStopWords
- TEEFTSumUpFrequencies
- ToLowerCase
- TEEFTTokenize
### TEEFTExtractTerms

Take an array of objects `{ path, sentences: [[{ token, tag: ["tag"] }, ...]] }`, regroup the multiterms when possible (noun + noun, adjective + noun, etc.), and compute statistics (frequency, etc.).

Parameters:

- `data` (Stream): array of documents containing sentences of tagged tokens
- `feed` (Array&lt;Object&gt;): same as `data`, with `term` replacing `token`, plus `length` and `frequency`
- `nounTag` (string): noun tag (optional, default `'NOM'`)
- `adjTag` (string): adjective tag (optional, default `'ADJ'`)

Example input:

```js
[{
    path: '/path/1',
    sentences:
    [[
        { token: 'elle', tag: ['PRO:per'] },
        { token: 'semble', tag: ['VER'] },
        { token: 'se', tag: ['PRO:per'] },
        { token: 'nourrir', tag: ['VER'] },
        { token: 'essentiellement', tag: ['ADV'] },
        { token: 'de', tag: ['PRE', 'ART:def'] },
        { token: 'plancton', tag: ['NOM'] },
        { token: 'frais', tag: ['ADJ'] },
        { token: 'et', tag: ['CON'] },
        { token: 'de', tag: ['PRE', 'ART:def'] },
        { token: 'hotdog', tag: ['UNK'] }
    ]]
}]
```
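To make the regrouping concrete, here is a simplified, dependency-free sketch (not the package's actual implementation): consecutive noun/adjective tokens form a multiterm, every noun/adjective is also counted as a monoterm, and frequencies are accumulated.

```javascript
const NOUN = 'NOM';
const ADJ = 'ADJ';

// Return terms (mono- and multiterms) with their frequency and length.
function extractTerms(sentences) {
    const freq = new Map();
    const count = (term, length) => {
        const entry = freq.get(term) || { term, frequency: 0, length };
        entry.frequency += 1;
        freq.set(term, entry);
    };
    for (const sentence of sentences) {
        let group = [];
        const flush = () => {
            if (group.length > 1) count(group.join(' '), group.length);
            group = [];
        };
        for (const { token, tag } of sentence) {
            if (tag.includes(NOUN) || tag.includes(ADJ)) {
                count(token, 1);   // every noun/adjective is a monoterm...
                group.push(token); // ...and may extend a multiterm
            } else {
                flush();
            }
        }
        flush();
    }
    return [...freq.values()];
}

const terms = extractTerms([[
    { token: 'plancton', tag: ['NOM'] },
    { token: 'frais', tag: ['ADJ'] },
    { token: 'et', tag: ['CON'] },
    { token: 'plancton', tag: ['NOM'] },
]]);
// terms: plancton (frequency 2), frais (frequency 1),
//        "plancton frais" (frequency 1, length 2)
```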
### TEEFTFilterMonoFreq

Filter the data, keeping only the multiterms and the frequent monoterms.

Parameters:

- `data` (Stream)
- `feed` (Array&lt;Object&gt;)
- `multiLimit` (Number): threshold, in number of tokens, from which a term is a multiterm (optional, default `2`)
- `minFrequency` (Number): minimal frequency for a monoterm to be considered frequent (optional, default `7`)
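The filtering rule can be sketched as follows (using the default parameter values listed above; this is an illustration, not the package's code):

```javascript
// Keep every multiterm, and only monoterms that are frequent enough.
function filterMonoFreq(terms, multiLimit = 2, minFrequency = 7) {
    return terms.filter(
        (t) => t.length >= multiLimit || t.frequency >= minFrequency
    );
}

const kept = filterMonoFreq([
    { term: 'plancton frais', length: 2, frequency: 1 },
    { term: 'plancton', length: 1, frequency: 9 },
    { term: 'frais', length: 1, frequency: 1 },
]);
// kept: 'plancton frais' (multiterm) and 'plancton' (frequent monoterm)
```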
### TEEFTFilterMultiSpec

Filter the multiterms, keeping only those whose specificity is higher than the average specificity of the multiterms.

Parameters:

- `data` (any)
- `feed` (any)
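A sketch of this filter (assumption: monoterms pass through untouched, which the real statement may handle differently):

```javascript
// Keep multiterms whose specificity beats the multiterm average.
function filterMultiSpec(terms) {
    const multi = terms.filter((t) => t.length > 1);
    if (multi.length === 0) return terms;
    const avg = multi.reduce((sum, t) => sum + t.specificity, 0) / multi.length;
    return terms.filter((t) => t.length === 1 || t.specificity > avg);
}

const kept = filterMultiSpec([
    { term: 'plancton frais', length: 2, specificity: 0.9 },
    { term: 'hotdog frais', length: 2, specificity: 0.1 },
    { term: 'plancton', length: 1, specificity: 0.2 },
]);
// average multiterm specificity is 0.5, so 'hotdog frais' is dropped
```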
### TEEFTFilterTags

Filter the input, keeping only the tokens tagged as nouns and adjectives.
### GetFilesContent

Take an array of file paths as input, and return a list of objects containing the `path` and the `content` of each file.
### ListFiles

Take an array of directory paths and a pattern as input, and return the list of the file paths matching the pattern within those directories.

Parameters:

- `pattern` (String): pattern for file names (ex: `"*.txt"`) (optional, default `"*"`)
### natural-tag

POS tagger from natural: French POS tagging using natural (and LEFFF resources).

Take an array of documents (objects `{ path, sentences: [[]] }`).

Yield an array of documents (objects `{ path, sentences: [[{ token: "token", tag: ["tag", ...] }, ...]] }`).

```js
[{
    path: "/path/1",
    sentences: [[
        { "token": "dans", "tag": ["prep"] },
        { "token": "le", "tag": ["det"] },
        { "token": "cadre", "tag": ["nc"] },
        { "token": "du", "tag": ["det"] },
        { "token": "programme", "tag": ["nc"] }
    ]]
}]
```
### profile

Profile the time a statement takes to execute. You have to place one instance to initialize the timer, and a second one to display the elapsed time.

Parameters:

- `data` (any)
- `feed` (any)
### TEEFTSentenceTokenize

Segment the content of each input document (objects `{ path, content }`) into sentences.

Yield an array of documents (objects `{ path, sentences: [] }`).
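A naive illustration of the segmentation (the statement itself relies on a proper tokenizer; this regex split is only an approximation):

```javascript
// Split content into sentences at whitespace following ., ! or ?.
function sentenceTokenize(doc) {
    return {
        path: doc.path,
        sentences: doc.content
            .split(/(?<=[.!?])\s+/)
            .filter((s) => s.length > 0),
    };
}

const doc = sentenceTokenize({ path: '/path/1', content: 'Il pleut. Elle dort.' });
// doc.sentences: ['Il pleut.', 'Elle dort.']
```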
### TEEFTSpecificity

Take documents (with a `path` and an array of `terms`, each term being an object `{ term, frequency, length[, tag] }`).

Process the objects containing a frequency, add a specificity to each of them, and remove all objects with a specificity below the average specificity (except when `filter` is `false`). Can also sort the objects according to their specificity, when `sort` is `true`.

Parameters:

- `data` (any)
- `feed` (any)
- `weightedDictionary` (string): name of the weighted dictionary (optional, default `"Ress_Frantext"`)
- `filter` (Boolean): filter below average specificity (optional, default `true`)
- `sort` (Boolean): sort objects according to their specificity (optional, default `false`)
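The overall shape of the step can be sketched as below. The scoring formula here is an assumption for illustration (frequency divided by a dictionary weight, normalized by the maximum); the statement's actual computation with the weighted dictionary may differ.

```javascript
// Add a specificity to each term, then optionally filter below the
// average and sort by descending specificity.
function addSpecificity(terms, weights, { filter = true, sort = false } = {}) {
    const scored = terms.map((t) => ({
        ...t,
        specificity: t.frequency / (weights[t.term] || 1),
    }));
    const max = Math.max(...scored.map((t) => t.specificity));
    scored.forEach((t) => { t.specificity /= max; });
    const avg = scored.reduce((s, t) => s + t.specificity, 0) / scored.length;
    let result = filter ? scored.filter((t) => t.specificity >= avg) : scored;
    if (sort) result = [...result].sort((a, b) => b.specificity - a.specificity);
    return result;
}

const out = addSpecificity(
    [{ term: 'plancton', frequency: 4 }, { term: 'frais', frequency: 1 }],
    {}
);
// only 'plancton' survives the below-average filter, with specificity 1
```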
### TEEFTStopWords

Filter the input, removing the stopwords in `token`.

Parameters:

- `data` (Stream)
- `feed` (Array&lt;Object&gt;)
- `stopwords` (string): name of the stopwords file to use (optional, default `'StopwFrench'`)
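The filter itself amounts to a set lookup; a sketch with an inline stopword list (the statement loads its list from a file, `'StopwFrench'` by default):

```javascript
// Hypothetical, abbreviated stopword list for illustration.
const stopwords = new Set(['le', 'la', 'de', 'et']);

// Drop every token present in the stopword list.
function removeStopwords(tokens) {
    return tokens.filter((t) => !stopwords.has(t.token));
}

const kept = removeStopwords([
    { token: 'de', tag: ['PRE'] },
    { token: 'plancton', tag: ['NOM'] },
]);
// kept: only 'plancton'
```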
### TEEFTSumUpFrequencies

Sum up the frequencies of identical lemmas coming from different chunks.
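The merge can be sketched as follows (an illustration, not the package's code):

```javascript
// Merge entries sharing the same term, adding their frequencies.
function sumUpFrequencies(terms) {
    const merged = new Map();
    for (const t of terms) {
        const entry = merged.get(t.term);
        if (entry) {
            entry.frequency += t.frequency;
        } else {
            merged.set(t.term, { ...t });
        }
    }
    return [...merged.values()];
}

const summed = sumUpFrequencies([
    { term: 'plancton', frequency: 2 },
    { term: 'plancton', frequency: 3 },
    { term: 'frais', frequency: 1 },
]);
// summed: plancton (frequency 5), frais (frequency 1)
```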
### ToLowerCase

Transform strings to lower case.
### TEEFTTokenize

Extract tokens from an array of documents (objects `{ path, sentences: [] }`).

Yield an array of documents (objects `{ path, sentences: [[]] }`).

Warning: results can be surprising on uppercase sentences.
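A naive sketch of the tokenization, splitting each sentence on runs of non-letter characters (the package's tokenizer is more careful, notably with case and punctuation):

```javascript
// Turn each sentence string into an array of word tokens.
function tokenize(doc) {
    return {
        path: doc.path,
        sentences: doc.sentences.map((s) =>
            s.split(/[^A-Za-zÀ-ÖØ-öø-ÿ-]+/).filter(Boolean)
        ),
    };
}

const doc = tokenize({ path: '/path/1', sentences: ['Il pleut fort.'] });
// doc.sentences[0]: ['Il', 'pleut', 'fort']
```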