Sparv Pipeline v5.0.0
This version contains a major overhaul of the API, making the CLI much faster! There are lots of general improvements and, of course, a load of bug fixes!
Please read the documentation: https://spraakbanken.gu.se/sparv
Added
- Added a quick start guide in the documentation.
- Added importers for more file formats: docx and odt.
- Added support for language varieties.
- Re-introduced analyses for Old Swedish and Swedish from the 1800s.
- Added a more flexible stats export which lets you choose which annotations to include in the frequency list.
- Added installer for stats export.
- Added Stanza support for English.
- Added better install and uninstall instructions for plugins.
- Added support for XML namespaces.
- Added explicit `ref` annotations (indexing tokens within sentences) for Stanza, Malt and Stanford.
- Added a `--reset` flag to the `sparv setup` command for resetting the data directory setting.
- Added a separate installer for installing scrambled CWB files.
- A warning message is printed when Sparv discovers source files that don't match the file extension in the corpus config.
- An error message is shown if unknown exporters are listed under `export.default`.
- Allow source annotations named "not".
- Added a source filename annotator.
- Show an error message if the user specifies an invalid installation.
- Added a `--stats` flag to several commands, showing a summary after completion of time spent per annotator.
- Added `stanza.max_token_length` option.
- Added Hunpos-backoff annotation for Stanza msd and pos.
- Added `--force` flag to `run-rule` and `create-file` commands to force recreation of the listed targets.
- Added a new exporter which produces a YAML file with info about the Sparv version and annotation date. This info is also added to the combined XML exports.
- Exit with an error message if a required executable is missing.
- Show a warning if an installed plugin is incompatible with Sparv.
- Introduced compression of annotation files in sparv-workdir. The type of compression can be configured (or disabled) by using the `sparv.compression` variable. `gzip` is used by default.
- Added flags `--rerun-incomplete` and `--mark-complete` to the `sparv run` command for handling incomplete output files.
- Several exporters now show a warning if a token annotation isn't included in the list of export annotations.
- Added `get_size()` to the `Annotation` and `AnnotationAllSourceFiles` classes, to get the size (number of values) for an annotation.
- Added support for individual progress bars for annotators.
- Added `SourceAnnotationsAllSourceFiles` class.
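
To illustrate the annotation file compression mentioned in this list, here is a minimal sketch of how a compression setting could select the function used to open a file. It is not Sparv's implementation: the function names, the one-value-per-line file layout, and the setting values `gzip` and `none` are assumptions made for this example only.

```python
import gzip
import tempfile
from pathlib import Path

# Hypothetical mapping from a compression setting to an "open" function.
# The keys are illustrative and not taken from the Sparv documentation.
OPENERS = {
    "gzip": gzip.open,
    "none": open,
}


def write_values(path, values, compression="gzip"):
    """Write one annotation value per line, compressed as configured."""
    opener = OPENERS[compression]
    with opener(path, "wt", encoding="utf-8") as f:
        for value in values:
            f.write(value + "\n")


def read_values(path, compression="gzip"):
    """Read the values back, one per line."""
    opener = OPENERS[compression]
    with opener(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "word"
        write_values(target, ["Hej", "världen"])
        print(read_values(target))
```

Because both openers accept the same text-mode arguments, the calling code stays identical whether compression is enabled or not.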
Changed
- Significantly improved the CLI startup time.
- Replaced the `--verbose` flag with `--simple` and made verbose the default mode.
- Everything needed by Sparv modules (including `utils`) is now available through the `sparv.api` package.
- Empty corpus config files are treated as missing config files.
- Moved CWB corpus installer from `korp` module into `cwb` module. This led to some name changes of variables used in the corpus config:
  - `korp.remote_cwb_datadir` is now called `cwb.remote_data_dir`
  - `korp.remote_cwb_registry` is now called `cwb.remote_registry_dir`
  - `korp.remote_host` has been split into `korp.remote_host` (host for SQL files) and `cwb.remote_host` (host for CWB files)
  - install target `korp:install_corpus` has been renamed and split into `cwb:install_corpus` and `cwb:install_corpus_scrambled`
- Renamed the following stats exports:
  - `stats_export:freq_list` is now called `stats_export:sbx_freq_list`
  - `stats_export:freq_list_simple` is now called `stats_export:sbx_freq_list_simple`
  - `stats_export:install_freq_list` is now called `stats_export:install_sbx_freq_list`
  - `stats_export:freq_list_fsv` is now called `stats_export:sbx_freq_list_fsv`
- Now incrementally compresses bz2 files in compressed XML export to avoid memory problems with large files.
- Corpus source files are now called "source files" instead of "documents". Consequently, the `--doc/-d` flag has been renamed to `--file/-f`. `import.document_annotation` has been renamed to `import.text_annotation`, and all references to "document" as a text unit have been changed to "text".
- Minimum Python version is now 3.6.2.
- Removed Python 2 dependency for hfst-SweNER.
- Tweaked compound analysis to make it less slow and added option to disable using source text as lexicon.
- `cwb` module now exports to regular export directory instead of CWB's own directories.
- Removed ability to use absolute path for exports.
- Renamed the installer `xml_export:install_original` to `xml_export:install`. The configuration variables `xml_export.export_original_host` and `xml_export.export_original_path` have been changed to `xml_export.export_host` and `xml_export.export_path` respectively. The configuration variables for the scrambled installer have been changed from `xml_export.export_host` and `xml_export.export_path` to `xml_export.export_scrambled_host` and `xml_export.export_scrambled_path` respectively.
- Removed `header_annotations` configuration variable from `export` (it is still available as `xml_export.header_annotations`).
- All export files must now be written to subdirectories, and each subdirectory must use the exporter's module name as prefix (or be equal to the module name).
- Empty attributes are no longer included in the csv export.
- When Sparv crashes due to unexpected errors, the traceback is now hidden from the user unless the `--log debug` argument is used.
- If the `-j`/`--cores` option is used without an argument, all available CPU cores are used.
- Importers are now required to write a source structure file.
- CWB installation now also works locally.
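
For corpus configs written for earlier versions, the variable renames in this list can be applied mechanically. The following is a minimal sketch and not part of Sparv: `migrate_config` and the flat dictionary of dotted keys are hypothetical conveniences for illustration (a real corpus config is a nested YAML file). Every new value is read from the old config, because `xml_export.export_host` and `xml_export.export_path` change meaning: they used to belong to the scrambled installer and now belong to the regular one.

```python
# Hypothetical helper, not part of Sparv: maps pre-5.0.0 config variable
# names (as flat dotted keys) to the names listed in this changelog.
RENAMED = {
    "korp.remote_cwb_datadir": "cwb.remote_data_dir",
    "korp.remote_cwb_registry": "cwb.remote_registry_dir",
    "import.document_annotation": "import.text_annotation",
    # The old scrambled-installer settings get new names ...
    "xml_export.export_host": "xml_export.export_scrambled_host",
    "xml_export.export_path": "xml_export.export_scrambled_path",
    # ... and the old "original" settings take over the freed names.
    "xml_export.export_original_host": "xml_export.export_host",
    "xml_export.export_original_path": "xml_export.export_path",
}


def migrate_config(old: dict) -> dict:
    """Return a new dict with renamed keys; the input is left untouched."""
    new = {}
    for key, value in old.items():
        # Each key is looked up against the OLD names only, so a key that
        # has just been renamed is never renamed a second time.
        new[RENAMED.get(key, key)] = value
    # korp.remote_host was split in two. Assumption: the old host served
    # both SQL and CWB files, so its value is copied to cwb.remote_host.
    if "korp.remote_host" in old and "cwb.remote_host" not in old:
        new["cwb.remote_host"] = old["korp.remote_host"]
    return new


if __name__ == "__main__":
    old_config = {
        "korp.remote_host": "example-host",
        "korp.remote_cwb_datadir": "/cwb/data",
        "xml_export.export_host": "scrambled-host",
        "xml_export.export_original_host": "original-host",
    }
    for name, value in sorted(migrate_config(old_config).items()):
        print(f"{name}: {value}")
```

Renaming keys in place, one after another, would send the old `xml_export.export_original_host` value on to `xml_export.export_scrambled_host`; building a fresh dictionary from the old one avoids that.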
Fixed
- Fixed rule ambiguity problems (functions with an order higher than 1 were not accessible).
- Automatically download correct Hunpos model depending on the Hunpos version installed.
- Stanza can now handle tokens containing whitespace.
- Fixed a bug which led to computing the source file list multiple times.
- Fixed a few date-related crashes in the `cwb` module.
- Fixed installation of compressed, scrambled XML export.
- Fixed bug in PunctuationTokenizer leading to orphaned tokens.
- Fixed crash when scrambling nested spans by only scrambling the outermost ones.
- Fixed crash in xml_import when no elements are imported.
- Fixed crash on empty sentences in Stanza.
- Better handling of empty XML elements in XML export.
- Faulty custom modules now result in a warning instead of a crash.
- Notify user when SweNER crashes.
- Fixed crash when config file can't be read due to file permissions.
- Fixed bug where `geo:contextual` would only work for sentences.
- Fixed crash on systems with encodings other than UTF-8.