-
Notifications
You must be signed in to change notification settings - Fork 45
/
Copy pathdoc.go
32 lines (22 loc) · 2.79 KB
/
doc.go
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
/*
Package nlp provides implementations of selected machine learning algorithms for natural language processing of text corpora. The primary focus is the statistical semantics of plain-text documents supporting semantic analysis and retrieval of semantically similar documents.
The package makes use of the Gonum (http://http//www.gonum.org/) library for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn (http://scikit-learn.org/stable/) and Gensim(https://radimrehurek.com/gensim/)
Overview
The primary intended use case is to support document input as text strings encoded as a matrix of numerical feature vectors called a `term document matrix`. Each column in the matrix corresponds to a document in the corpus and each row corresponds to a unique term occurring in the corpus. The individual elements within the matrix contain the frequency with which each term occurs within each document (referred to as `term frequency`). Whilst textual data from document corpora are the primary intended use case, the algorithms can be used with other types of data from other sources once encoded (vectorised) into a suitable matrix e.g. image data, sound data, users/products, etc.
These matrices can be processed and manipulated through the application of additional transformations for weighting features, identifying relationships or optimising the data for analysis, information retrieval and/or predictions.
Typically the algorithms in this package implement one of three primary interfaces:
Vectoriser - Taking document input as strings and outputting matrices of numerical features e.g. term frequency.
Transformer - Takes matrices of numerical features and applies some logic/transformation to output a new matrix.
Comparer - Functions taking two vectors (columns from a matrix) and outputting a distance/similarity measure.
One of the implementations of Vectoriser is Pipeline which can be used to wire together pipelines composed of a Vectoriser and one or more Transformers arranged in serial so that the output from each stage forms the input of the next. This can be used to construct a classic LSI (Latent Semantic Indexing) pipeline (vectoriser -> TF.IDF weighting -> Truncated SVD):
pipeline := nlp.NewPipeline(
nlp.NewCountVectoriser(true),
nlp.NewTFIDFTransformer(),
nlp.NewTruncatedSVD(100),
)
Whilst they take different inputs, both Vectorisers and Transformers have 3 primary methods:
Fit() - Trains the model based upon the supplied, input training data.
Transform() - Transforms the input into the output matrix (requires the model to be already fitted by a previous call to Fit() or FitTransform()).
FitTransform() - Convenience method combining Fit() and Transform() methods to transform input data, fitting the model to the input data in the process.
*/
package nlp