This project is my collection of notes and customised software tools for data management, manipulation and analysis.
################################################################
# devtools is recommended for installing from github
require(devtools)
install_github("ivanhanigan/disentangle")
require(disentangle)
- The Morpho/Metacat system is great for a data repository
- Morpho also claims to be suitable for Ecologists to document their data
- But in my experience it leaves a little to be desired in ease of use for both purposes
- Specifically, entering documentation into Morpho is slow
- This post is a first attempt to create some boilerplate code to quickly generate EML metadata using REML.
As I noted in a previous post, there are [two types of data documentation workflow](http://ivanhanigan.github.io/2013/10/two-main-types-of-data-documentation-workflow/).
- GUI
- Programmatic
I also think there are two types of users with different motivations and constraints:
- 1) Data Analysts
- 2) Data Librarians
In my view the Analysts need a tool that will document their data and workflow steps very rapidly, and they can live with a little less rigour in the quality of the documentation. This is obviously not ideal, but it seems an inevitable trade-off, needed so that analysts can keep up the momentum of data processing and modelling without getting distracted by tedious (and potentially unnecessary) data documentation tasks.
The role of the Librarians, on the other hand, is to document, to the best level possible given time and resource constraints, the datasets and the methodologies that led to their creation. For that group rigour takes precedence, and the trade-off is in the amount of time needed to produce the documentation.
As an example of the two different groups, an analyst working with weather data in Australia may want to specify that their variable “temperature” is the average of the daily maxima and minima, but might not need to specify that the observations were taken inside a Stevenson Screen, or even whether they are in Celsius, Fahrenheit or Kelvin. They will be keen to start the analysis to identify any associations between the weather variables and the response variable they are investigating. The data librarian, on the other hand, is more likely to need to include this information so that users of the temperature data do not misinterpret it.
- I’ve been talking about this for a while and was referred to this document by Ben Davies at the ANUSF
- It has this bit:
- Roughly speaking, a full EML document produced by Morpho contains a whole bunch of cruft that isn’t needed and gets in the way (and is more confusing)
- Whereas the minimal version I’m thinking of covers almost all the generic entries, providing the “minimum amount of stuff to make it work right”
- This experiment aims to speed up the creation of a minimal “skeleton” of metadata to a level that both the groups above can be comfortable with AS A FIRST STEP.
- It is assumed that additional steps will then need to be taken to complete the documentation, but the automation of the first part of the process should shave off enough time to suit the purposes of both groups
- It is imperative that the quick-start creation of the metadata does not end up costing the documenter more time later down the track if they need to go back to many of the elements for additional editing.
I’ve been using a [fictitious dataset from a Statistics Methodology paper by Ritschard 2006](http://ivanhanigan.github.io/2013/10/test-data-for-classification-trees/). It will do as a first cut but when it comes to actually test this out it would be good to have something that would take a bit longer (so that the frustrations of using Morpho become very apparent).
- the package REML will create an EML metadata document quite easily
- I will assume that a lot of the data elements are self explanatory and take column names and factor levels as the descriptions
- Notably unable to import the data file
![morpho-reml-boilerplate.png](/images/morpho-reml-boilerplate.png)
- Also “the saved document is not valid for some reason”
- This needs testing
- A failure mode would be if, even though it is quicker to get started, the result takes a long time and is difficult to fix up later; that would increase the risk of misunderstandings.
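As a sketch of the default-description idea above, column names and factor levels can be pulled out of a data.frame with base R alone (the small data.frame below is a hypothetical stand-in for the real dataset):

```r
# hypothetical stand-in for the dataset to be documented
d <- data.frame(civil_status = factor(c("married", "single")),
                income       = c(50000, 42000))
# column names become the default attribute descriptions
attribute_names <- names(d)
# factor levels become the default code definitions
code_definitions <- lapply(Filter(is.factor, d), levels)
attribute_names   # "civil_status" "income"
code_definitions  # $civil_status: "married" "single"
```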
################################################################
# func
if(!require(reporttools)) install.packages("reporttools")
require(reporttools)
require(devtools)
install_github("ivanhanigan/disentangle")
require(disentangle)
# load
fpath <- system.file(file.path("extdata", "civst_gend_sector_full.csv"), package = "disentangle")
analyte <- read.csv(fpath)
analyte$random <- rnorm(nrow(analyte), 0 , 1)
summary(analyte)
# create a large number of random variables
for(i in 1:75)
{
analyte[,ncol(analyte) + 1] <- rnorm(nrow(analyte), 10 , 20)
}
names(analyte)
str(analyte)
# index the numeric columns
data_continuous <- which(sapply(analyte, is.numeric))
# clean
str(analyte[,data_continuous])
str(analyte[,-data_continuous])
# do
sink('inst/doc/tabContinuous.tex')
tableContinuous(vars = analyte[,data_continuous],
stats = c("n", "min", "mean", "median",
"max", "iqr", "na"),
cap = "Table of continuous variables.", lab = "tab:table4",
caption.placement = "top",
longtable = TRUE, add.to.row = list(pos = list(0),
command = "\\hline \\endhead "))
sink()
x.big <- analyte[,-data_continuous]
sink('inst/doc/tabNominal.tex')
tableNominal(vars = x.big, cap = "Table of nominal variables",
vertical = FALSE,
lab = "tab:table5", longtable = TRUE,
caption.placement = "top")
sink()
I’m reading Ritschard, G. (2006). Computing and using the deviance with classification trees. In Compstat 2006 - Proceedings in Computational Statistics 17th Symposium Held in Rome, Italy, 2006. Retrieved from http://link.springer.com/chapter/10.1007%2F978-3-7908-1709-6_5
This is implemented in SPSS code. I’ll try to develop R code to do these tests.
First I’ll get the data out of their paper and fit the tree in figure 1
The figure in the paper can be checked against our results (and also the improved plot from the party package might be used).
Using case weights as above is convenient, especially when datasets are very large, but it caused problems in model fitting for me (tree failed to compute a deviance when done this way, but succeeded with an expanded dataset in which the data.frame is transformed into one where each row is a single observation).
################################################################
# name:reassurance-re-weights
# just to reassure myself I understand what case weights do, I'll make
# this into a survey dataset with one row per respondent
df <- as.data.frame(matrix(NA, nrow = 0, ncol = 3))
for(i in 1:nrow(civst_gend_sector))
{
# i <- 1
n <- civst_gend_sector$number_of_cases[i]
if(n == 0) next
for(j in 1:n)
{
df <- rbind(df, civst_gend_sector[i,1:3])
}
}
# save this for use later
write.csv(df, "inst/extdata/civst_gend_sector_full.csv", row.names = FALSE)
# clean
nrow(df)
str(df)
require(rpart)
fit1 <- rpart(civil_status ~ gender + activity_sector, data = df)
summary(fit1)
# report
par(mfrow = c(1, 2), xpd = NA)
plot(fit)   # fit was fitted earlier on the weighted data (one row per profile)
text(fit, use.n = TRUE)
title("fit")
plot(fit1)  # fit1 was fitted on the expanded data (one row per respondent)
text(fit1, use.n = TRUE)
title("fit1")
# great these are the same which is what we'd hoped to see
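The row-by-row `rbind` above is fine for a small table, but the same expansion can be done in one step with `rep()` on the row indices; a sketch with a small hypothetical stand-in for `civst_gend_sector`:

```r
# hypothetical stand-in: three profile rows with case counts
cgs <- data.frame(civil_status    = c("married", "single", "divorced"),
                  gender          = c("male", "female", "male"),
                  activity_sector = c("primary", "secondary", "tertiary"),
                  number_of_cases = c(2, 0, 3))
# repeat each profile row according to its weight; zero-count rows drop out
idx <- rep(seq_len(nrow(cgs)), cgs$number_of_cases)
df2 <- cgs[idx, 1:3]
nrow(df2)  # 5, one row per individual observation
```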
The Ritschard (2006) paper (with SPSS code) describes a complicated method that requires retrieving, for each case:
- leaf number and
- profile number
I really want to use the deviance as well as the misclassification error rate for measuring the descriptive power of the tree. Ripley’s tree package is the only one I found to give me deviance for classification trees.
The Ritschard papers suggest nice methods to test differences between nested trees, i.e. testing the difference from the root-node model with a chi-square statistic (the equivalent of the usual method used in logistic regression).
Is this method employed widely in analysing survey data? I haven’t turned up many references to Ritschard since he wrote these.
So let’s start simple first. The following code follows the simpler approach:
- Take the difference in the deviance for the models (less complex model minus more complex model)
- Take the difference in degrees of freedom for the models
- difference between less complex and more complex model follows chi-square distribution
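With hypothetical deviance and tree-size numbers, the arithmetic of those steps is just:

```r
# hypothetical values: deviance of the root-node (null) model and of the fitted tree
null_dev <- 381.6
fit_dev  <- 351.2
# hypothetical difference in the number of terminal nodes
df_diff  <- 3
dev_diff <- null_dev - fit_dev             # reduction in deviance
p_value  <- 1 - pchisq(dev_diff, df_diff)  # chi-square test of the reduction
p_value
```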
################################################################
# name:tree.chisq
tree.chisq <- function(null_model, fitted_model)
{
  # TODO check that these are tree model objects
  fit_dev  <- summary(fitted_model)$dev
  null_dev <- summary(null_model)$dev
  dev      <- null_dev - fit_dev
  df       <- summary(fitted_model)$size - summary(null_model)$size
  sig      <- 1 - pchisq(dev, df)
  sprintf("Reduction in deviance is %s percent, p-value is %s (based on a chi-squared test)",
          round((dev / null_dev) * 100, 2), round(sig, 4))
}
# func
require(tree)
require(devtools)
install_github("ivanhanigan/TransformSurveyTools")
require(TransformSurveyTools)
# load locally
# fpath <- "inst/extdata/civst_gend_sector_full.csv"
# or via package
fpath <- system.file("extdata", "civst_gend_sector_full.csv", package="TransformSurveyTools")
civst_gend_sector <- read.csv(fpath)
# clean
str(civst_gend_sector)
# do
variables <- names(civst_gend_sector)
y_variable <- variables[1]
x_variables <- variables[-1]
# NULL
form0 <- reformulate("1",
response = y_variable)
form0
model0 <- tree(form0, data = civst_gend_sector, method = "class")
print(model0)
# FIT
form1 <- reformulate(x_variables,
response = y_variable)
form1
model1 <- tree(form1, data = civst_gend_sector, method = "class")
print(model1)
summary(model1)
plot(model1)
text(model1,pretty = 0)
tree.chisq(null_model = model0, fitted_model = model1)
source("tests/test-tree.chisq.r")
- I value precise language very highly
- this is because in multi-disciplinary teams it is easy to use the same words and mean different things
- in a recent discussion about Distributed Lag Non-linear Models I started to reflect on something that has bothered me for a while
- back in 2005 my old mate Prof Keith Dear picked me up on using the term “non-linear model” incorrectly and explained the maths…
- I kind of understood but promptly forgot, and found a lot of people use the term non-linear model a bit carelessly
- yesterday I was in a discussion about comparing non-linear relationships between different studies in a meta-analysis
- I immediately felt uncomfortable when we started to discuss these as “non-linear models”
- so here is a quick bit of google fu (with a session at the coffee shop with Steve and Mishka) to remind me of the difference between the two
- the following comes from
- http://www.ats.ucla.edu/stat/sas/library/SASNLin_os.htm
- verbatim except for my attempt at MathJax notation in LaTeX
A regression model is called nonlinear if the derivatives of the model with respect to the model parameters depend on one or more parameters. This definition is essential to distinguish nonlinear from curvilinear regression. A regression model is not necessarily nonlinear just because the graphed regression trend is curved. Consider a polynomial model such as:
$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$
- this appears curved when $Y$ is plotted against $X$. It is, however, not a nonlinear model. To see this, take derivatives of $Y$ with respect to the parameters $\beta_0$, $\beta_1$ and $\beta_2$:
- $\partial Y / \partial \beta_0 = 1$
- $\partial Y / \partial \beta_1 = X$
- $\partial Y / \partial \beta_2 = X^2$
- none of these derivatives depends on a model parameter, so the model is linear. In contrast, consider the log-logistic model
$Y_i = d + (a - d) / (1 + e^{b \log(X_i/g)}) + \varepsilon_i$
- take derivatives with respect to $d$, for example:
$\partial Y / \partial d = 1 - 1 / (1 + e^{b \log(X_i/g)})$
- the derivative involves other parameters, hence the model is nonlinear.
- It is probably best to refer to the polynomial as a “non-linear relationship” in a linear model
- reserving “non-linear model” for things like Model 2
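The distinction shows up directly in R: the curved polynomial still fits with `lm()` because it is linear in its parameters, while a model that is nonlinear in its parameters needs `nls()`. A sketch with simulated data (the exponential-decay model is my own illustrative choice, not one from the source above):

```r
set.seed(42)
x <- seq(1, 10, length.out = 50)
# curved relationship, but linear in the parameters: lm() handles it
y <- 2 + 3 * x - 0.2 * x^2 + rnorm(50, sd = 0.5)
fit_poly <- lm(y ~ x + I(x^2))
coef(fit_poly)
# nonlinear in the parameters a and b: needs nls()
y2 <- 5 * exp(-0.3 * x) + rnorm(50, sd = 0.05)
fit_nl <- nls(y2 ~ a * exp(-b * x), start = list(a = 4, b = 0.2))
coef(fit_nl)
```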
U-Manitoba Centre for Health Policy Guidelines
These guidelines come from:
http://umanitoba.ca/faculties/medicine/units/mchp/protocol/media/manage_guidelines.pdf
Most of the material below is taken verbatim from the original. Unfortunately many of the items described below have links to internal MCHP documents that we cannot access. Nonetheless the structure of the guidelines provides a useful skeleton to frame our thinking.
The following areas should be reviewed with project team members near the beginning of the study and throughout the project as needed:
- Confidentiality
- Project team
- File organization and documentation development
- Communication
- Administrative
- Report Preparation
- Project Completion
Maintaining data access
Roles and contact information should be documented on the project website for the following, where applicable (information may also be included on level of access approved for each team member).
This is the lead person on the project, who assumes responsibility for delivering the project. The PI makes decisions on project direction and analysis requirements, with input from programmers and the research coordinator (an iterative process). If there is more than one PI (e.g., multi-site studies), overall responsibility for the study needs to be determined, and how the required work will be allocated and coordinated among the co-investigators. Researcher Workgroup website (internal link)
The Research Coordinator (RC) is always assigned to deliverables and is usually brought in on other types of projects involving multiple sites, investigators and/or programmers. Responsibilities include project documentation, project management (e.g., ensuring that timelines are met, ensuring that project specifications are being followed), and working with both investigator(s) and the Programmer Coordinator throughout the project to coordinate project requirements.
The PC is a central management role who facilitates assignment of programming resources to projects, ensuring the best possible match among programmers and investigators. Research Coordinator Workgroup website (internal link)
The Programmer Analyst is primarily responsible for programming and related programming documentation (such that the purpose of the program and how results were derived can be understood by others). However, a major role may also be taken in the analyses of the project, and this will characteristically vary with the project. Programmer Analyst Workgroup website (internal link)
Research Support is primarily responsible for preparing the final product (i.e., the report), including editing and formatting the final graphs and manuscript and using Reference Manager to set up the references. Research Support also normally sets up and attends working group meetings. All requests for research support go through the Office Manager.
It is important to clarify everyone’s roles at the beginning of the project; for example, whether the investigator routinely expects basic graphs and/or programming logs from the programmer.
It is highly desirable to keep the same personnel, from the start of the project, where possible. It can take some time to develop a cohesive working relationship, particularly if work styles are not initially compatible. Furthermore, requesting others to temporarily fill in for team absences is generally best avoided, particularly for programming tasks (unless there is an extended period of absence). The original programmer will know best the potential impact of any changes that may need to be made to programming code.
Access to MCHP internal resources (e.g., Windows, Unix) need to be assessed for all team members and set up as appropriate to their roles on the project.
A WG is always set up for deliverables (and frequently for other projects): Terms of Reference for working group (internal)
All project-related documentation, including key e-mails used to update project methodology, should be saved within the project directory. Resources for directory setup and file development include:
This includes various process documents as well as an overview of the documentation process for incorporating research carried out by MCHP into online resources: Documentation Management Guide (internal)
A detailed outline of how the Windows environment is structured at MCHP
How files and sub-directories should be organized and named as per the MCHP Guide to Managing Project Files (internal pdf). Information that may be suitable for incorporating into MCHP online resources should be identified; for example, a Concept Development section for subsequent integration of a new concept(s) into the MCHP Concept Dictionary. The deliverable glossary is another resource typically integrated into the MCHP Glossary.
NOTE this is a diversion from the MCHP guidelines. These recommended directories are from a combination of sources that we have synthesised.
Background: concise summaries, possibly many documents for the main project and any main analyses, based on the 1:3:25 paradigm: one page of main messages, a three-page executive summary, and 25 pages of detailed findings.
Analysis (also see http://projecttemplate.net for a programmer oriented template)
Versions: folders named by date - dump entire copies of the project at certain milestones/change points
Completion: checklists to make sure project completion is systematic. Factor in a critical reflection of lessons learnt.
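A minimal sketch of such a skeleton, created from R; the directory names are my own synthesis of the list above, not MCHP's:

```r
# create a throwaway project skeleton under tempdir()
proj <- file.path(tempdir(), "example_project")
dirs <- c("background", "data", "analysis", "versions", "completion")
for (d in dirs) {
  dir.create(file.path(proj, d), recursive = TRUE, showWarnings = FALSE)
}
list.files(proj)
```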
Project communication should be in written form, wherever possible, to serve as reference for project documentation. Access and confidentiality clearance levels for all involved in the project will determine whether separate communication plans need to be considered for confidential information.
E-mail provides opportunities for feedback/discussion from everyone and for documenting key project decisions. Responses on any given issue would normally be copied to every project member, with the expectation of receiving feedback within a reasonable period of time (e.g., a few days). The Research Coordinator should be copied on ALL project correspondence in order to keep the information on the project website up to date.
Regularly-scheduled meetings or conference calls should include all project members where possible. Research Coordinators typically arrange project team meetings and take meeting minutes, while Research Support typically arranges the Working Group meetings.
The shared calendar, used for booking rooms, displays information on room availability and may include schedules of team members.
Time spent on projects should be entered by all MCHP employees who are members of the project team.
This includes:
- Policies - e.g., Dissemination of Research Findings
- Standards - e.g., deliverable production, use of logos, web publishing
- Guidelines - e.g., producing PDFs, powerpoint, and Reference Manager files
- Other resources - e.g., e-mail etiquette, technical resources, photos.
Making sure the numbers “make sense”. Carrying out these checks requires spelling out who will do which checks.
A variety of things to check for at various stages of the study. Programming can be reviewed, for example, by checking to ensure all programs have used the right exclusions, the correct definitions, etc., and that output has been accurately transferred to graphs, tables, and maps for the report.
In this case it is MCHP and Manitoba Health Reports - an example of cross-checking against another source of data.
Several steps need to take place to “finish” the project:
Wind-up or debriefing meetings are held shortly after public release of a deliverable. Such meetings provide all team members with an opportunity to communicate what worked/did not work in bringing the project to completion, providing lessons learned for future deliverables.
Findings from the wind-up meeting should be used to update and finalize the project website (including entering the date of release of report/paper). Both Windows and Unix project directories should be reviewed to ensure that only those SAS programs relevant to project analyses are kept (and well-documented) for future reference. Any related files which may be stored in a user directory should be moved to the project directory.
When the project is complete, the Systems Administrator should be informed. Project directories, including program files and output data sets, will be archived to tape or CD. Tape backups are retained for a 5-year period before being destroyed so any project may be restored up to five years after completion.
This sits with the MCHP resource repository; a general overview of this process is described in General Documentation Process (internal).