This is the legacy version of lucidum. For the latest version please visit https://github.com/SpeckledJim2/lucidum
lucidum is an open source R Shiny app to help users build and communicate GLMs and GBMs without writing code.
lucidum works with standard R data.frames and data.tables and is designed to make model building more interactive, visual and insightful.
I originally wrote lucidum to automate the repetitive tasks involved when building the types of regression models common to UK personal lines insurers. More recently, I have used it as a tool to help insurers move from GLMs to GBMs (specifically LightGBM), using SHAP values to understand and communicate model features and interaction effects.
lucidum's functionality includes:
- Collection of metadata to support a modelling exercise
  - define modelling KPIs (e.g. frequency by claim peril)
  - filters to apply to charts and maps (e.g. new business vs renewals)
  - base levels and bandings to apply when tabulating a GLM
  - set up feature scenarios for inclusion in a GBM
- Interactive charting
  - actual vs expected charting by rating factor, with easy access to filters and training vs test views
  - plot several models' predictions (GLMs and GBMs) simultaneously
  - user-defined banding for continuous features - no "pre-banding" required
  - overlay "single profile" lines for GLMs to understand the underlying model effect
  - overlay SHAP value ribbons for GBMs to understand the underlying model effect
  - automated "residual error" analysis to identify poorly fitting features
- Interactive mapping of data at UK Postcode Area, Sector and Unit resolution
  - uses the leaflet library to draw choropleth maps for Postcode Area and Sector
  - uses open source shapefiles for Area and Sector, rendered down to a lower resolution to work well in the browser
  - Unit level plots at the centroid of individual Postcode Units
- Support a GLM build
  - "formula helper" to make the job of building an R GLM formula much faster
  - convert GLMs to tabular format ("ratebooks") with user-defined bandings and base levels
  - export tabulated GLMs as Excel workbooks
- Support a GBM build
  - provide a simple user interface for GBM feature selection
  - provide a simple user interface to the most common LightGBM parameters
  - build GA2M models (1D+2D GBMs) to support interaction detection
  - use feature interaction constraints to build indices for high cardinality features like postcode
  - 1D SHAP plots to interpret the model's main effects
  - 2D SHAP plots to interpret interaction effects
You can install the development version of lucidum from GitHub with:
# install.packages("devtools")
devtools::install_github("SpeckledJim2/lucidum.legacy")
Option 1: use the dropdown menu to choose a data.frame or data.table to load into lucidum
library(lucidum.legacy)
lucidum()
Option 2: supply the data.frame or data.table name as an argument to lucidum
library(lucidum.legacy)
lucidum(your_dataframe_name)
To separate training and test rows in your dataset, include a numerical column called "train_test" with value 0 for training and 1 for test.
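As a minimal sketch, assuming a simple random split suits your data, the "train_test" column could be created as below. The 80/20 proportion, the seed and the example data.frame `my_df` are illustrative assumptions, not lucidum defaults.

```r
# Hypothetical example data; replace with your own dataset
set.seed(42)  # make the random split reproducible
my_df <- data.frame(exposure = runif(1000), claims = rpois(1000, 0.1))

# 0 = training row, 1 = test row (roughly 80% / 20%)
my_df$train_test <- as.integer(runif(nrow(my_df)) >= 0.8)

# Check how many rows fell into each group
table(my_df$train_test)
```

A purely random split is not always sensible (for example with time-dependent data), so adapt the rule that assigns 0 and 1 to your own modelling exercise.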
Specification files make lucidum more useful by specifying metadata to make model building faster.
Specification files are .csv files which can be created within lucidum itself or in a text editor. You don’t have to use specification files, but they make life easier if you are going to be working with a dataset on a regular basis.
The specification file formats are described in the app, for example on the KPI specification screen.
There are three types of specification files:
- KPI specification: the metrics you want to access quickly in the app’s sidebar
- Filter specification: formulae that define filters you want to apply to charts and maps
- Feature specification: quicker access to features in ChartaR and feature scenarios that you want to use in your models
Save the specification files in folders called: “kpi_specifications”, “filter_specifications” and “feature_specifications”.
Specification files should have the same name as your dataset, i.e. if the dataset is called “football” the specification files should be called “football.csv”
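The folder and file naming rules above can be sketched as follows. This assumes the three folders sit directly under the specification path, which is my reading of the layout rather than something stated explicitly; `spec_root` is a hypothetical location and "football" is the example dataset name used above.

```r
# Hypothetical root folder for specification files (assumption)
spec_root <- file.path(tempdir(), "lucidum_specs")

# The three folder names lucidum expects
spec_dirs <- c("kpi_specifications",
               "filter_specifications",
               "feature_specifications")

# Create the folders if they do not already exist
for (d in spec_dirs) {
  dir.create(file.path(spec_root, d), recursive = TRUE, showWarnings = FALSE)
}

# Each folder holds one .csv named after the dataset
dataset_name <- "football"
expected_files <- file.path(spec_root, spec_dirs, paste0(dataset_name, ".csv"))
print(expected_files)
```

The sketch only builds the folder structure and the expected file paths; the contents of each .csv should follow the formats described in the app.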
Set the path to the specification files before running library(lucidum.legacy) or it won’t pick up the path:
options(lucidum=list(specification_path="my_path"))
library(lucidum.legacy)
lucidum(my_dt)
If you want to change the path without restarting R, you need to detach lucidum and reload the library:
detach("package:lucidum.legacy", unload = TRUE)
options(lucidum=list(specification_path="my_different_path"))
library(lucidum.legacy)
- Dataset
  - data.tables will retain any new columns created in lucidum after you quit the app - useful if you build a model and want to use the predictions elsewhere
  - data.frames will not retain any new columns built in lucidum.
  - The first numerical dataset column will be set as the response when you first load lucidum.
  - Create a column “train_test” to distinguish training rows (value 0) from test rows (value 1). lucidum won’t generate this column for you: you need to supply it and make sure it’s a sensible split.
  - Identifier columns with unique values will be slow if selected in ChartaR on larger datasets (>100k rows), as ChartaR first checks how many levels the feature has before plotting. I use the feature specification to put identifier-type columns into a feature grouping, so I don’t include them in a model by mistake.
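The different behaviour of data.tables and data.frames is consistent with data.table's modify-by-reference semantics. The sketch below illustrates that general R behaviour only; `add_prediction` is a hypothetical stand-in and is not lucidum's own code.

```r
library(data.table)

# Hypothetical stand-in for "something that adds a prediction column"
add_prediction <- function(d) {
  if (is.data.table(d)) {
    d[, prediction := 0.5]   # := modifies the caller's object by reference
  } else {
    d$prediction <- 0.5      # changes a local copy only
  }
  invisible(NULL)
}

my_dt <- data.table(x = 1:3)
my_df <- data.frame(x = 1:3)
add_prediction(my_dt)
add_prediction(my_df)

"prediction" %in% names(my_dt)  # TRUE: the data.table kept the new column
"prediction" %in% names(my_df)  # FALSE: the data.frame is unchanged
```

If you want to keep predictions for use elsewhere, converting your dataset with `data.table::as.data.table()` before loading it is one option.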
- How to build models on a subset of the data
  - Use a weight column that is zero on the rows you want to exclude from the training data.
  - You must ensure that the corresponding response is also zero for those rows (e.g. you would not expect non-zero incurred claims on a row with zero claims).
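A minimal sketch of the zero-weight approach, using hypothetical column names (`region`, `exposure`, `claims`) and a made-up subset rule; none of these names are required by lucidum.

```r
# Hypothetical dataset
my_df <- data.frame(
  region   = c("North", "South", "North", "South"),
  exposure = c(1, 1, 0.5, 1),
  claims   = c(0, 1, 1, 0)
)

# Rows to keep in the model build (illustrative rule)
keep <- my_df$region == "North"

# Weight is zero on excluded rows...
my_df$exposure_north <- ifelse(keep, my_df$exposure, 0)
# ...and the response is zeroed on the same rows
my_df$claims_north <- ifelse(keep, my_df$claims, 0)

my_df
```

You would then select the new weight and response columns in the app instead of the originals.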
- User interface
  - You can change the browser zoom level in lucidum as for any other webpage.
  - MappaR will sometimes not refresh if you have changed the zoom level while on another tab. If this happens, zoom in and back out and the map will reappear.
- Operations that can be slow running
  - SHAP values on deep trees built on large datasets can be very slow to generate. Turn off the option to generate SHAP values if you don’t need them, or use fewer trees, a higher learning rate and fewer leaves when first exploring a GBM build with SHAP values.
  - Summarising by character columns is slower than summarising by factor columns, so convert character columns to factors before loading into lucidum.
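Converting character columns to factors before loading can be done in base R along these lines. The data.frame `my_df` and its columns are hypothetical.

```r
# Hypothetical dataset with one character column
my_df <- data.frame(
  vehicle_group = c("A", "B", "A"),
  claims        = c(0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor
char_cols <- vapply(my_df, is.character, logical(1))
my_df[char_cols] <- lapply(my_df[char_cols], factor)

str(my_df)
```

Non-character columns are left untouched, so the conversion is safe to run on a mixed dataset.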
- Bugs
  - You can change the displayed dataset freely (using the top right control) before building any models, but once you have built a model avoid changing the dataset, as you can’t in general apply a model built on dataset A to dataset B.
  - There is no error checking for invalid monotonicity parameters or additional parameters in BoostaR – if you use something incorrect LightGBM will throw an error and the app will stop.
  - shinyAce doesn’t start automatically when first loading lucidum. A workaround is to run the command shinyAce::aceEditor(outputId = NULL) in the console before library(lucidum.legacy).
  - The 10k option for SHAP values generates the SHAP values correctly for a random 10k sample of the data, but the SHAP ribbons don’t yet display correctly in ChartaR.
  - R GLMs store the entire modelling dataset by default as part of the model object. When you save a model out from GlimmaR, lucidum strips the dataset from the .RDS glm object to make the file a sensible size. However, this stripping doesn’t occur until the model is saved, so if several GLMs are built within GlimmaR on very large datasets RAM usage will increase.