A Python module that converts natural language numerics into ints and floats. It is a port of the Ruby gem numerizer.
The numerizer library can be installed from PyPI as follows:
$ pip install numerizer
or from source as follows:
$ git clone https://github.com/jaidevd/numerizer.git
$ cd numerizer
$ pip install -e .
>>> from numerizer import numerize
>>> numerize('forty two')
'42'
>>> numerize('forty-two')
'42'
>>> numerize('four hundred and sixty two')
'462'
>>> numerize('one fifty')
'150'
>>> numerize('twelve hundred')
'1200'
>>> numerize('twenty one thousand four hundred and seventy three')
'21473'
>>> numerize('one million two hundred and fifty thousand and seven')
'1250007'
>>> numerize('one billion and one')
'1000000001'
>>> numerize('nine and three quarters')
'9.75'
>>> numerize('platform nine and three quarters')
'platform 9.75'
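The examples above all return strings, so a caller who needs actual int or float values has to convert them. The following is a minimal sketch of one way to do that, not part of numerizer itself: `extract_numbers` is a hypothetical helper name, and it assumes the numerized text contains plain unsigned digits with an optional decimal part, as in the outputs shown above.

```python
import re

# Unsigned digits with an optional decimal part, e.g. "42" or "9.75".
_NUMBER = re.compile(r'\d+(?:\.\d+)?')


def extract_numbers(numerized):
    """Return the numeric values found in an already-numerized string.

    Values without a decimal point come back as int, the rest as float.
    """
    values = []
    for match in _NUMBER.findall(numerized):
        values.append(float(match) if '.' in match else int(match))
    return values


# Input strings are copied from the examples above; numerize is not called here.
print(extract_numbers('42'))             # [42]
print(extract_numbers('platform 9.75'))  # [9.75]
print(extract_numbers('1250007'))        # [1250007]
```

Because the pattern only looks for digits, it leaves any surrounding words alone and returns an empty list when the text contains no number.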
Since version 0.2, numerizer has been available as a spaCy extension.
Any named entities of a quantitative nature within a spaCy document can be numerized as follows:
>>> from spacy import load
>>> import numerizer # importing numerizer makes the extension available
>>> nlp = load('en_core_web_sm') # or load any other spaCy model
>>> doc = nlp('The projected revenue for the next quarter is over two million dollars.')
>>> doc._.numerize()
{the next quarter: 'the next 1/4', over two million dollars: 'over 2000000 dollars'}
To restrict numerization to specific entity types, pass the labels argument to the extension function:
>>> doc._.numerize(labels=['MONEY']) # only numerize entities of type 'MONEY'
{over two million dollars: 'over 2000000 dollars'}
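The extension is available for spans and tokens as well. In the examples below, a span is numerized through the numerize() method, while a token exposes the result through the numerized attribute: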
>>> two_million = doc[-4:-2] # span corresponding to "two million"
>>> two_million._.numerize()
'2000000'
>>> quarter = doc[6] # token corresponding to "quarter"
>>> quarter._.numerized
'1/4'
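Fractional results such as '1/4' are also strings. As a hedged sketch (the helper name `as_fraction` is hypothetical and not part of numerizer), the standard library's `fractions.Fraction` can parse such a string into an exact value, assuming the string holds nothing but a number or a fraction:

```python
from fractions import Fraction


def as_fraction(numerized):
    """Parse a numerized string such as '1/4' or '2000000' into a Fraction.

    Returns None when the string is not a plain number or fraction.
    """
    try:
        return Fraction(numerized)
    except (ValueError, ZeroDivisionError):
        return None


# Input strings are copied from the examples above; spaCy is not needed here.
quarter = as_fraction('1/4')
print(quarter)                      # 1/4
print(float(quarter))               # 0.25
print(as_fraction('2000000'))       # 2000000
print(as_fraction('the next 1/4'))  # None
```

Strings that still contain words, such as the entity-level result 'the next 1/4', are not parsed and yield None.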
For R users, a wrapper library has been developed by @amrrs. Try it out here.