---
title: bibliographica vignette
author: Leo Lahti, Niko Ilomaki, Mikko Tolonen
date: 2016-11-01
bibliography:
- bibliography.bib
- references.bib
output: html_document
---

R tools for bibliographic data analysis

The bibliographica R package provides tools for automated extraction and analysis of bibliographic metadata collections (library catalogues) such as the ESTC. The toolkit includes functions to extract and clean up information from plain text library catalogue annotation fields following common standards such as the Bibliographic Processing Cataloging Rules. This package can be used in combination with any library catalogue that follows these common standards.

Your contributions, bug reports and suggestions are welcome!

Installation

We assume you have installed R. If you use RStudio, change the default encoding to UTF-8.

Installing the stable release version in R:

library(devtools)
install_github("ropengov/bibliographica")

Load the tools:

library(bibliographica)
library(knitr)
Sys.setlocale(locale="UTF-8") 
## [1] ""

Harmonizing textual annotation fields in library catalogues

Below, you will find simple examples of the package functionality. In real studies, the tools can be used to preprocess collections with millions of documents.

Page information

Estimate the total page count for two documents:

# The page count estimates are stored in the 'pagecount' column
unlist(polish_physical_extent(c("50 p.", "[6],viii,386p. ;"))$pagecount)
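To illustrate what estimating a page count from such a field can involve, here is a minimal base R sketch. It is not the package's implementation: the helper name `estimate_pages_sketch` is hypothetical, and the rule it applies (sum every Arabic number and every stand-alone lower-case Roman numeral) is a simplifying assumption that will misread tokens such as "cm".

```r
# Hypothetical helper, not part of bibliographica: a rough page count
# obtained by summing the Arabic and Roman numerals found in each field.
estimate_pages_sketch <- function(x) {
  # Digits, or Roman numeral letters that are not attached to other letters
  pattern <- "[0-9]+|(?<![a-z])[ivxlcdm]+(?![a-z])"
  vapply(x, function(field) {
    tokens <- regmatches(field, gregexpr(pattern, field, perl = TRUE))[[1]]
    if (length(tokens) == 0) return(NA_integer_)
    arabic <- grepl("^[0-9]+$", tokens)
    values <- integer(length(tokens))
    values[arabic] <- as.integer(tokens[arabic])
    if (any(!arabic)) {
      # utils::as.roman converts Roman numerals; invalid ones become NA
      values[!arabic] <- as.integer(utils::as.roman(toupper(tokens[!arabic])))
    }
    sum(values)
  }, integer(1), USE.NAMES = FALSE)
}

pages <- estimate_pages_sketch(c("50 p.", "[6],viii,386p. ;"))
print(pages)
```

With this rule the second field is read as 6 + 8 + 386 pages. The package applies more elaborate cleaning rules, so its estimates may differ.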

Dimension information

Extract and print document dimension information in a harmonized format:

res <- polish_dimensions("1/2fo (37 cm)")
knitr::kable(res)
| gatherings.original | width.original | height.original | obl.original | original | gatherings | width | height | obl | area |
|---|---|---|---|---|---|---|---|---|---|
| 2fo | NA | 37 | FALSE | 2fo 37 cm | 2fo | 25 | 37 | 0 | 925 |

The missing fields can be estimated with the 'fill' argument:

res <- polish_dimensions("1/2fo (37 cm)", fill = TRUE)
knitr::kable(res)
| gatherings.original | width.original | height.original | obl.original | original | gatherings | width | height | obl | area |
|---|---|---|---|---|---|---|---|---|---|
| 2fo | NA | 37 | FALSE | 2fo 37 cm | 2fo | 25 | 37 | 0 | 925 |

Estimate the total page count:

# The page count estimates are stored in the 'pagecount' column
unlist(polish_physical_extent(c("50 p.", "[6],viii,386p. ;"))$pagecount)

Estimation of the missing information (gatherings, width, and/or height) is based on a ready-made dimension mapping table. This table can be changed by the user if necessary (see function arguments). The default table can be retrieved in R with:

dtab <- dimension_table()
kable(head(dtab)) # just show the first rows
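| height | NA | 1to | 2long | 2fo | 2small | 4long | 4to | 4small | 8long | 8vo | 8small | 12long | 12mo | 16long | 16mo | 18mo | 24long | 24mo | 32mo | 40mo | 48mo | 64mo | 80mo | 84mo | 1to |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90 | 60 | 60 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x |
| 110 | 66 | 73 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x |
| 109 | 66 | 72 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x |
| 108 | 65 | 72 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x |
| 107 | 64 | 71 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x |
| 106 | 64 | 70 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x |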
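The idea of filling in a missing dimension from a mapping table can be sketched in base R as follows. This is only an illustration under stated assumptions: `lookup` is a toy table built from three gatherings/height/width combinations that appear in the outputs of this tutorial, and `fill_width_sketch` is a hypothetical helper, not a bibliographica function.

```r
# Toy lookup table (assumption): a few gatherings/height/width triples
# copied from outputs shown in this tutorial.
lookup <- data.frame(
  gatherings = c("2fo", "2fo", "4to"),
  height     = c(37, 45, 28),
  width      = c(25, 30, 22),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the known width, or estimate a missing one
# from the first table row that matches both gatherings and height.
fill_width_sketch <- function(gatherings, height, width, table = lookup) {
  if (!is.na(width)) return(width)        # nothing to estimate
  hit <- table$gatherings == gatherings & table$height == height
  if (!any(hit)) return(NA_real_)         # no matching entry in the table
  table$width[hit][1]
}

filled_width <- fill_width_sketch("2fo", height = 37, width = NA)
filled_area  <- filled_width * 37         # area in cm2
print(c(width = filled_width, area = filled_area))
```

Keeping the table as an argument mirrors the point made above: a user can supply a different mapping without touching the lookup logic.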

Conversions between different versions of the gatherings names:

res <- gatherings_table()
knitr::kable(res)
| Alternate | Standard | Symbol | Name |
|---|---|---|---|
| 1to | 1to | | sheet |
| bs | bs | bs | broadside |
| 2long | 2long | | folio |
| 2to | 2fo | | folio |
| 2small | 2small | | folio |
| 4long | 4long | | quarto |
| 4to | 4to | | quarto |
| 4small | 4small | | quarto |
| 8long | 8long | | octavo |
| 8to | 8vo | | octavo |
| 8small | 8small | | octavo |
| 12long | 12long | 12° | duodecimo |
| 12to | 12mo | 12° | duodecimo |
| 16to | 16mo | 16° | sextodecimo |
| 16long | 16long | 16° | sextodecimo |
| 18to | 18mo | 18° | octodecimo |
| 24to | 24mo | 24° | vigesimo-quarto |
| 24long | 24long | 24° | vigesimo-quarto |
| 32to | 32mo | 32° | trigesimo-segundo |
| 40to | 40mo | 40° | quadraquinto |
| 48to | 48mo | 48° | quadragesimo-octavo |
| 64to | 64mo | 64° | sexagesimo-quarto |
| 80to | 80mo | 80° | octogentesimo |
| 84to | 84mo | 84° | octoginta-quarto |
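A conversion of this kind amounts to a lookup from alternate names to standard names. The base R sketch below shows the idea with a named character vector. The four pairs are copied from the table above, while the helper name `harmonize_gatherings_sketch` and the choice to pass unknown values through unchanged are assumptions made for illustration.

```r
# A few Alternate -> Standard pairs copied from the gatherings table
alt_to_standard <- c("2to" = "2fo", "4to" = "4to", "8to" = "8vo", "12to" = "12mo")

# Hypothetical helper: map alternate names to the standard ones
harmonize_gatherings_sketch <- function(x, map = alt_to_standard) {
  out <- unname(map[x])
  # Keep the original value when no mapping is known
  unknown <- is.na(out)
  out[unknown] <- x[unknown]
  out
}

harmonized <- harmonize_gatherings_sketch(c("2to", "8to", "bs"))
print(harmonized)
```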

Volume information

Pick information on the volume numbers:

# Volume number 3 from multi-volume document
unname(polish_physical_extent("v.3, 50 p"))
##             
## 1 50 3 NA NA

Pick information on the total volume count:

# Document with 4 volumes and missing page information
unlist(polish_physical_extent("4v.")) 
## pagecount volnumber  volcount     parts 
##        NA        NA         4        NA
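The two notations used above ("v.3" for a volume number, "4v." for a volume count) can be told apart with two regular expressions. The following base R sketch shows one way to do so. The helper `parse_volumes_sketch` is hypothetical and handles only these two simple patterns, whereas the package covers many more variants.

```r
# Hypothetical helper: extract volume number and volume count from one field
parse_volumes_sketch <- function(x) {
  # "v.3": the letter v, a full stop, then the volume number
  number_hit <- regmatches(x, regexec("v\\.[[:space:]]*([0-9]+)", x))[[1]]
  # "4v.": the volume count, then the letter v and a full stop
  count_hit <- regmatches(x, regexec("([0-9]+)[[:space:]]*v\\.", x))[[1]]
  c(
    volnumber = if (length(number_hit) == 2) as.integer(number_hit[2]) else NA_integer_,
    volcount  = if (length(count_hit) == 2) as.integer(count_hit[2]) else NA_integer_
  )
}

single_volume <- parse_volumes_sketch("v.3, 50 p")
multi_volume  <- parse_volumes_sketch("4v.")
print(single_volume)
print(multi_volume)
```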

Sheet size table

Check the approximate sheet area for folio. The area is given in cm². Other sheet types are also available.

sheet_area("folio")
## [1] 1350

The sheet sizes are calculated in the above example according to the sheet size table. The table can be changed by the user (see the function arguments) but by default the functions use this mapping:

kable(sheet_area()) 
| format | gatherings | width | height | area |
|---|---|---|---|---|
| sheet | 1to | 60.0 | 90.0 | 5760.00 |
| broadside | bs | 60.0 | 64.0 | 3840.00 |
| folio-large | 2long | 30.0 | 53.0 | 1749.00 |
| folio | 2fo | 30.0 | 45.0 | 1350.00 |
| folio-small | 2small | 25.0 | 38.0 | 950.00 |
| quarto-long | 4long | 27.0 | 35.0 | 945.00 |
| quarto | 4to | 22.0 | 28.0 | 616.00 |
| quarto-small | 4small | 17.0 | 20.0 | 340.00 |
| octavo-small | 8small | 11.0 | 17.0 | 187.00 |
| octavo | 8vo | 13.0 | 19.0 | 247.00 |
| octavo-long | 8long | 13.0 | 22.0 | 286.00 |
| duodecimo | 12mo | 12.5 | 19.0 | 237.50 |
| duodecimo-long | 12long | 13.0 | 20.0 | 260.00 |
| sextodecimo | 16mo | 12.0 | 15.0 | 180.00 |
| sextodecimo-long | 16long | 14.0 | 18.0 | 252.00 |
| octodecimo | 18mo | 10.0 | 16.0 | 160.00 |
| vigesimo-quarto | 24mo | 8.5 | 12.5 | 106.25 |
| vigesimo-quarto-long | 24long | 9.5 | 14.0 | 133.00 |
| trigesimo-secundo | 32mo | 7.0 | 12.0 | 84.00 |
| quadraquinto | 40mo | 6.0 | 11.0 | 66.00 |
| quadragesimo-octavo | 48mo | 6.5 | 10.0 | 65.00 |
| sexagesimo-quarto | 64mo | 5.0 | 7.0 | 35.00 |
| octogentesimo | 80mo | 4.0 | 6.0 | 24.00 |
| octoginta-quarto | 84mo | 3.5 | 5.0 | 17.50 |
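For readers who want to see how such a mapping can be queried, here is a small base R sketch. It rebuilds three rows of the table above (folio, quarto, octavo), computes the area as width times height, and looks the area up by format name. The object `sheets` and the helper `sheet_area_sketch` are hypothetical names used only for this illustration.

```r
# Three rows copied from the sheet size table (width and height in cm)
sheets <- data.frame(
  format     = c("folio", "quarto", "octavo"),
  gatherings = c("2fo", "4to", "8vo"),
  width      = c(30, 22, 13),
  height     = c(45, 28, 19),
  stringsAsFactors = FALSE
)
sheets$area <- sheets$width * sheets$height   # area in cm2

# Hypothetical helper: look up the area by format name (NA if unknown)
sheet_area_sketch <- function(format, table = sheets) {
  table$area[match(format, table$format)]
}

folio_area <- sheet_area_sketch("folio")
print(folio_area)
```

For the three rows used here, the computed areas agree with the area column of the table.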

Stopwords

Removing stopwords is often necessary in text analysis. Stopwords fall into several categories, such as individual letters, conjunctions, special characters, or particular expressions. The definition of a stopword may also depend on the context.

The following example removes the term "well" from the input vector. The matching element is replaced by NA, and the other elements, including the single letter "a", are returned unchanged:

remove_terms(c("a", "well", "james", "30 year war"), terms = "well")
## [1] "a"           NA            "james"       "30 year war"
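The behaviour shown above, where a removed term leaves an NA in place, can be imitated in a few lines of base R. The sketch below is not the package's `remove_terms` implementation: `remove_terms_sketch` is a hypothetical helper that only matches whole elements, and its `drop_single_letters` argument is an assumption added to show how a second stopword category could be handled.

```r
# Hypothetical helper: blank out elements that are stopwords
remove_terms_sketch <- function(x, terms, drop_single_letters = FALSE) {
  out <- x
  # Elements that exactly match a listed term become NA
  out[out %in% terms] <- NA
  if (drop_single_letters) {
    # Optionally treat single letters as stopwords as well
    out[!is.na(out) & grepl("^[[:alpha:]]$", out)] <- NA
  }
  out
}

input <- c("a", "well", "james", "30 year war")
terms_only   <- remove_terms_sketch(input, terms = "well")
with_letters <- remove_terms_sketch(input, terms = "well", drop_single_letters = TRUE)
print(terms_only)
print(with_letters)
```

Returning a vector of the same length as the input keeps the result aligned with the original catalogue rows, which matters when the cleaned field is written back into a table.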

For a full list of stopword-related functions, see the function documentation. We also provide some ready-made stopword lists that can be read into R with the 'read.csv' function.

Person names

Several person name lists, some of which include gender information, are available for validation purposes. These can be read with the functions firstnames, lastnames, and notnames (pseudonyms are also coming). The name lists were compiled by combining manually constructed lists with openly available material from various sources, which are listed in the respective data folders.

Licensing and Citations

This work can be freely used, modified and distributed under the Two-clause BSD license.

citation("bibliographica")
## 
## Kindly cite this R package as follows:
## 
##   (C) Leo Lahti, Hege Roivainen, Niko Ilomaki, Mikko Tolonen
##   (rOpenGov 2015-2016).  bibliographica R package URL:
##   http://github.com/ropengov/bibliographica
## 
## A BibTeX entry for LaTeX users is
## 
##   @Misc{,
##     title = {bibliographica: tools for bibliographic analysis},
##     author = {Leo Lahti and Hege Roivainen and Niko Ilomaki and Mikko Tolonen},
##     year = {2015-2016},
##   }
## 
## Many thanks for all contributors! See: http://ropengov.github.io

Related work

The generic tools of this package can be used in combination with packages that provide more specific tools for targeted data collections such as the ESTC or Fennica.

Session info

This vignette was created with

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rmarkdown_1.0.9016    bibliographica_0.2.30 knitr_1.14           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7        magrittr_1.5       munsell_0.4.3     
##  [4] tm_0.6-2           colorspace_1.2-7   R6_2.2.0          
##  [7] highr_0.6          stringr_1.1.0      plyr_1.8.4        
## [10] dplyr_0.5.0        tools_3.3.1        babynames_0.2.1   
## [13] parallel_3.3.1     grid_3.3.1         data.table_1.9.6  
## [16] gtable_0.2.0       genderdata_0.5.0   DBI_0.5-1         
## [19] htmltools_0.3.5    yaml_2.1.13        digest_0.6.10     
## [22] lazyeval_0.2.0     assertthat_0.1     tibble_1.2        
## [25] NLP_0.1-9          tidyr_0.6.0        reshape2_1.4.1    
## [28] ggplot2_2.1.0      formatR_1.4        stringdist_0.9.4.2
## [31] slam_0.1-38        evaluate_0.10      stringi_1.1.2     
## [34] sorvi_0.7.47       gender_0.5.1.9000  scales_0.4.0      
## [37] chron_2.3-47