This project is my collection of notes and customised software tools for data management, manipulation and analysis.
################################################################
# devtools is recommended for installing from github
require(devtools)
install_github("ivanhanigan/disentangle")
require(disentangle)
- The Morpho/Metacat system is great for a data repository
- Morpho also claims to be suitable for Ecologists to document their data
- But in my experience it leaves a little to be desired in ease of use for both purposes
- Specifically, entering documentation into Morpho is slow
- This post is a first attempt to create some boilerplate code to quickly generate EML metadata using REML.
As I noted in a previous post, there are [two types of data documentation workflow](http://ivanhanigan.github.io/2013/10/two-main-types-of-data-documentation-workflow/).
- GUI
- Programmatic
I also think there are two types of users with different motivations and constraints:
- 1) Data Analysts
- 2) Data Librarians
In my view the Analysts need a tool that will document their data and workflow steps very rapidly, and they can live with a little less rigour in the quality of the documentation. This is obviously not ideal, but it seems an inevitable trade-off, needed so that analysts can keep up the momentum of data processing and modelling without getting distracted by tedious (and potentially unnecessary) data documentation tasks.
The role of the Librarians, on the other hand, is to document, to the best level possible given time and resource constraints, the datasets and the methodologies that led to their creation. For that group rigour takes precedence, and the trade-off is in the amount of time needed to produce the documentation.
As an example of the two different groups, an analyst working with weather data in Australia may want to specify that their variable “temperature” is the average of the daily maxima and minima, but might not need to specify that the observations were taken inside a Stevenson Screen, or even whether they are in Celsius, Fahrenheit or Kelvin. They will be keen to start the analysis to identify any associations between the weather variables and the response variable they are investigating. The data librarian, on the other hand, is more likely to need to include this information so that users of the temperature data do not misinterpret it.
- I’ve been talking about this for a while and was referred to this document by Ben Davies at the ANUSF
- It has this bit:
- Roughly speaking, a full EML document produced by Morpho contains a whole bunch of cruft that isn’t needed and gets in the way (and is more confusing)
- Whereas the minimal version I’m thinking of covers almost all the generic entries, providing the “minimum amount of stuff to make it work right”
- This experiment aims to speed up the creation of a minimal “skeleton” of metadata to a level that both the groups above can be comfortable with AS A FIRST STEP.
- It is assumed that additional steps will then need to be taken to complete the documentation, but the automation of the first part of the process should shave off enough time to suit the purposes of both groups
- It is imperative that the quick-start creation of the metadata does not end up costing the documenter more time later down the track if they need to go back to many of the elements for additional editing.
I’ve been using a [fictitious dataset from a Statistics Methodology paper by Ritschard 2006](http://ivanhanigan.github.io/2013/10/test-data-for-classification-trees/). It will do as a first cut but when it comes to actually test this out it would be good to have something that would take a bit longer (so that the frustrations of using Morpho become very apparent).
- the package REML will create an EML metadata document quite easily
- I will assume that a lot of the data elements are self explanatory and take column names and factor levels as the descriptions
- Notably unable to import the data file
![morpho-reml-boilerplate.png](/images/morpho-reml-boilerplate.png)
- Also “the saved document is not valid for some reason”
- This needs testing
- A failure mode would be if, even though it is quicker to get started, the result takes a long time and is difficult to fix up later; that would increase the risk of misunderstandings.
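As a sketch of the default-description idea above, column names and factor levels can be pulled out of a data.frame with base R alone (the small data.frame below is a hypothetical stand-in for the real dataset):

```r
# hypothetical stand-in for the dataset to be documented
d <- data.frame(civil_status = factor(c("married", "single")),
                income       = c(50000, 42000))
# column names become the default attribute descriptions
attribute_names <- names(d)
# factor levels become the default code definitions
code_definitions <- lapply(Filter(is.factor, d), levels)
attribute_names   # "civil_status" "income"
code_definitions  # $civil_status: "married" "single"
```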
################################################################
# func
if(!require(reporttools)) install.packages("reporttools")
require(reporttools)
require(devtools)
install_github("ivanhanigan/disentangle")
require(disentangle)
# load
fpath <- system.file(file.path("extdata", "civst_gend_sector_full.csv"), package = "disentangle")
analyte <- read.csv(fpath)
analyte$random <- rnorm(nrow(analyte), 0 , 1)
summary(analyte)
# create a large number of random variables
for(i in 1:75)
{
analyte[,ncol(analyte) + 1] <- rnorm(nrow(analyte), 10 , 20)
}
names(analyte)
str(analyte)
# index the numeric columns
data_continuous <- which(sapply(analyte, is.numeric))
# clean
str(analyte[,data_continuous])
str(analyte[,-data_continuous])
# do
sink('inst/doc/tabContinuous.tex')
tableContinuous(vars = analyte[,data_continuous],
stats = c("n", "min", "mean", "median",
"max", "iqr", "na"),
cap = "Table of continuous variables.", lab = "tab:table4",
caption.placement = "top",
longtable = TRUE, add.to.row = list(pos = list(0),
command = "\\hline \\endhead "))
sink()
x.big <- analyte[,-data_continuous]
sink('inst/doc/tabNominal.tex')
tableNominal(vars = x.big, cap = "Table of nominal variables",
vertical = FALSE,
lab = "tab:table5", longtable = TRUE,
caption.placement = "top")
sink()
I’m reading Ritschard, G. (2006). Computing and using the deviance with classification trees. In Compstat 2006 - Proceedings in Computational Statistics 17th Symposium Held in Rome, Italy, 2006. Retrieved from http://link.springer.com/chapter/10.1007%2F978-3-7908-1709-6_5
This is implemented in SPSS code. I’ll try to develop R code to do these tests.
First I’ll get the data out of their paper and fit the tree in figure 1
The figure in the paper can be checked against our results (and also the improved plot from the party package might be used).
Using case weights as above is convenient, especially when datasets are very large, but it caused problems in model fitting for me (tree failed to compute a deviance when done this way, but succeeded with an expanded dataset in which the data.frame is transformed into one where each row is a single observation).
################################################################
# name:reassurance-re-weights
# just to reassure myself I understand what case weights do, I'll make
# this into a survey dataset with one row per respondent
df <- as.data.frame(matrix(NA, nrow = 0, ncol = 3))
for(i in 1:nrow(civst_gend_sector))
{
# i <- 1
n <- civst_gend_sector$number_of_cases[i]
if(n == 0) next
for(j in 1:n)
{
df <- rbind(df, civst_gend_sector[i,1:3])
}
}
# save this for use later
write.csv(df, "inst/extdata/civst_gend_sector_full.csv", row.names = FALSE)
# clean
nrow(df)
str(df)
require(rpart)
fit1 <- rpart(civil_status ~ gender + activity_sector, data = df)
summary(fit1)
# report
par(mfrow = c(1, 2), xpd = NA)
plot(fit)   # fit was fitted earlier on the weighted data (one row per profile)
text(fit, use.n = TRUE)
title("fit")
plot(fit1)  # fit1 was fitted on the expanded data (one row per respondent)
text(fit1, use.n = TRUE)
title("fit1")
# great these are the same which is what we'd hoped to see
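The row-by-row `rbind` above is fine for a small table, but the same expansion can be done in one step with `rep()` on the row indices; a sketch with a small hypothetical stand-in for `civst_gend_sector`:

```r
# hypothetical stand-in: three profile rows with case counts
cgs <- data.frame(civil_status    = c("married", "single", "divorced"),
                  gender          = c("male", "female", "male"),
                  activity_sector = c("primary", "secondary", "tertiary"),
                  number_of_cases = c(2, 0, 3))
# repeat each profile row according to its weight; zero-count rows drop out
idx <- rep(seq_len(nrow(cgs)), cgs$number_of_cases)
df2 <- cgs[idx, 1:3]
nrow(df2)  # 5, one row per individual observation
```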
The Ritschard (2006) paper (with SPSS code) describes a complicated method that requires retrieving, for each case:
- leaf number and
- profile number
I really want to use the deviance as well as the misclassification error rate for measuring the descriptive power of the tree. Ripley’s tree package is the only one I found to give me deviance for classification trees.
The Ritschard papers suggest nice methods to test differences between nested trees, i.e. testing the difference from the root-node model with a chi-square statistic (the equivalent of the usual method used in logistic regression).
Is this method employed widely in analysing survey data? I haven’t turned up many references to Ritschard since he wrote these.
So let’s start simple first. The following code follows the simpler approach:
- Take the difference in the deviance for the models (less complex model minus more complex model)
- Take the difference in degrees of freedom for the models
- difference between less complex and more complex model follows chi-square distribution
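With hypothetical deviance and tree-size numbers, the arithmetic of those steps is just:

```r
# hypothetical values: deviance of the root-node (null) model and of the fitted tree
null_dev <- 381.6
fit_dev  <- 351.2
# hypothetical difference in the number of terminal nodes
df_diff  <- 3
dev_diff <- null_dev - fit_dev             # reduction in deviance
p_value  <- 1 - pchisq(dev_diff, df_diff)  # chi-square test of the reduction
p_value
```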
################################################################
# name:tree.chisq
tree.chisq <- function(null_model, fitted_model)
{
  # TODO check that these are tree model objects
  fit_dev  <- summary(fitted_model)$dev
  null_dev <- summary(null_model)$dev
  dev      <- null_dev - fit_dev
  df       <- summary(fitted_model)$size - summary(null_model)$size
  sig      <- 1 - pchisq(dev, df)
  sprintf("Reduction in deviance is %s percent, p-value is %s (based on a chi-squared test)",
          round((dev / null_dev) * 100, 2), round(sig, 4))
}
# func
require(tree)
require(devtools)
install_github("ivanhanigan/TransformSurveyTools")
require(TransformSurveyTools)
# load locally
# fpath <- "inst/extdata/civst_gend_sector_full.csv"
# or via package
fpath <- system.file("extdata", "civst_gend_sector_full.csv", package="TransformSurveyTools")
civst_gend_sector <- read.csv(fpath)
# clean
str(civst_gend_sector)
# do
variables <- names(civst_gend_sector)
y_variable <- variables[1]
x_variables <- variables[-1]
# NULL
form0 <- reformulate("1",
response = y_variable)
form0
model0 <- tree(form0, data = civst_gend_sector, method = "class")
print(model0)
# FIT
form1 <- reformulate(x_variables,
response = y_variable)
form1
model1 <- tree(form1, data = civst_gend_sector, method = "class")
print(model1)
summary(model1)
plot(model1)
text(model1,pretty = 0)
tree.chisq(null_model = model0, fitted_model = model1)
source("tests/test-tree.chisq.r")
- I value precise language very highly
- this is because in multi-disciplinary teams it is easy to use the same words and mean different things
- in a recent discussion about Distributed Lag Non-linear Models I started to reflect on something that has bothered me for a while
- back in 2005 my old mate Prof Keith Dear picked me up on using the term “non-linear model” incorrectly and explained the maths…
- I kind of understood but promptly forgot, and found a lot of people use the term non-linear model a bit carelessly
- yesterday I was in a discussion about comparing non-linear relationships between different studies in a meta-analysis
- I immediately felt uncomfortable when we started to discuss these as “non-linear models”
- so here is a quick bit of google fu (with a session at the coffee shop with Steve and Mishka) to remind me of the difference between the two
- the following comes from
- http://www.ats.ucla.edu/stat/sas/library/SASNLin_os.htm
- verbatim except for my attempt at MathJax notation in LaTeX
A regression model is called nonlinear if the derivatives of the model with respect to the model parameters depend on one or more parameters. This definition is essential to distinguish nonlinear from curvilinear regression. A regression model is not necessarily nonlinear just because the graphed regression trend is curved. Consider a polynomial model such as:
$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$
- this appears curved when $Y$ is plotted against $X$. It is, however, not a nonlinear model. To see this, take derivatives of $Y$ with respect to the parameters $\beta_0$, $\beta_1$ and $\beta_2$:
- $\partial Y / \partial \beta_0 = 1$
- $\partial Y / \partial \beta_1 = X$
- $\partial Y / \partial \beta_2 = X^2$
- none of these derivatives depends on a model parameter, so the model is linear. In contrast, consider the log-logistic model
$Y_i = d + (a - d) / (1 + e^{b \log(X_i/g)}) + \varepsilon_i$
- take derivatives with respect to $d$, for example:
$\partial Y / \partial d = 1 - 1 / (1 + e^{b \log(X_i/g)})$
- the derivative involves other parameters, hence the model is nonlinear.
- It is probably best to refer to the polynomial as a “non-linear relationship” in a linear model
- reserving “non-linear model” for things like Model 2
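The distinction shows up directly in R: the curved polynomial still fits with `lm()` because it is linear in its parameters, while a model that is nonlinear in its parameters needs `nls()`. A sketch with simulated data (the exponential-decay model is my own illustrative choice, not one from the source above):

```r
set.seed(42)
x <- seq(1, 10, length.out = 50)
# curved relationship, but linear in the parameters: lm() handles it
y <- 2 + 3 * x - 0.2 * x^2 + rnorm(50, sd = 0.5)
fit_poly <- lm(y ~ x + I(x^2))
coef(fit_poly)
# nonlinear in the parameters a and b: needs nls()
y2 <- 5 * exp(-0.3 * x) + rnorm(50, sd = 0.05)
fit_nl <- nls(y2 ~ a * exp(-b * x), start = list(a = 4, b = 0.2))
coef(fit_nl)
```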
U-Manitoba Centre for Health Policy Guidelines
These guidelines come from:
http://umanitoba.ca/faculties/medicine/units/mchp/protocol/media/manage_guidelines.pdf
Most of the material below is taken verbatim from the original. Unfortunately many of the items described below have links to internal MCHP documents that we cannot access. Nonetheless the structure of the guidelines provides a useful skeleton to frame our thinking.
The following areas should be reviewed with project team members near the beginning of the study and throughout the project as needed:
- Confidentiality
- Project team
- File organization and documentation development
- Communication
- Administrative
- Report Preparation
- Project Completion
Maintaining data access
Roles and contact information should be documented on the project website for the following, where applicable (information may also be included on level of access approved for each team member).
This is the lead person on the project, who assumes responsibility for delivering the project. The PI makes decisions on project direction and analysis requirements, with input from programmers and the research coordinator (an iterative process). If there is more than one PI (e.g., multi-site studies), overall responsibility for the study needs to be determined, and how the required work will be allocated and coordinated among the co-investigators. Researcher Workgroup website (internal link)
The Research Coordinator (RC) is always assigned to deliverables and is usually brought in on other types of projects involving multiple sites, investigators and/or programmers. Responsibilities include project documentation, project management (e.g., ensuring that timelines are met, ensuring that project specifications are being followed), and working with both investigator(s) and the Programmer Coordinator throughout the project to coordinate project requirements.
The PC is a central management role who facilitates assignment of programming resources to projects, ensuring the best possible match among programmers and investigators. Research Coordinator Workgroup website (internal link)
The Programmer Analyst is primarily responsible for programming and related programming documentation (such that the purpose of the program and how results were derived can be understood by others). However, a major role may also be taken in the analyses of the project, and this will characteristically vary with the project. Programmer Analyst Workgroup website (internal link)
Research Support is primarily responsible for preparing the final product (i.e., the report), including editing and formatting the final graphs and manuscript and using Reference Manager to set up the references. Research Support also normally sets up and attends working group meetings. All requests for research support go through the Office Manager.
It is important to clarify everyone’s roles at the beginning of the project; for example, whether the investigator routinely expects basic graphs and/or programming logs from the programmer.
It is highly desirable to keep the same personnel, from the start of the project, where possible. It can take some time to develop a cohesive working relationship, particularly if work styles are not initially compatible. Furthermore, requesting others to temporarily fill in for team absences is generally best avoided, particularly for programming tasks (unless there is an extended period of absence). The original programmer will know best the potential impact of any changes that may need to be made to programming code.
Access to MCHP internal resources (e.g., Windows, Unix) need to be assessed for all team members and set up as appropriate to their roles on the project.
A WG is always set up for deliverables (and frequently for other projects): Terms of Reference for working group (internal)
All project-related documentation, including key e-mails used to update project methodology, should be saved within the project directory. Resources for directory setup and file development include:
This includes various process documents as well as an overview of the documentation process for incorporating research carried out by MCHP into online resources: Documentation Management Guide (internal)
A detailed outline of how the Windows environment is structured at MCHP
How files and sub-directories should be organized and named as per the MCHP Guide to Managing Project Files (internal pdf). Information that may be suitable for incorporating into MCHP online resources should be identified; for example, a Concept Development section for subsequent integration of a new concept(s) into the MCHP Concept Dictionary. The deliverable glossary is another resource typically integrated into the MCHP Glossary.
NOTE this is a diversion from the MCHP guidelines. These recommended directories are from a combination of sources that we have synthesised.
Background: concise summaries, possibly many documents for the main project and any main analyses, based on the 1:3:25 paradigm: one page of main messages, a three-page executive summary, and 25 pages of detailed findings.
Analysis (also see http://projecttemplate.net for a programmer oriented template)
Versions: folders named by date - dump entire copies of the project at certain milestones/change points
Completion: checklists to make sure project completion is systematic. Factor in a critical reflection of lessons learnt.
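A minimal sketch of such a skeleton, created from R; the directory names are my own synthesis of the list above, not MCHP's:

```r
# create a throwaway project skeleton under tempdir()
proj <- file.path(tempdir(), "example_project")
dirs <- c("background", "data", "analysis", "versions", "completion")
for (d in dirs) {
  dir.create(file.path(proj, d), recursive = TRUE, showWarnings = FALSE)
}
list.files(proj)
```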
Project communication should be in written form, wherever possible, to serve as reference for project documentation. Access and confidentiality clearance levels for all involved in the project will determine whether separate communication plans need to be considered for confidential information.
E-mail provides opportunities for feedback/discussion from everyone and for documenting key project decisions. Responses on any given issue would normally be copied to every project member, with the expectation of receiving feedback within a reasonable period of time (e.g., a few days). The Research Coordinator should be copied on ALL project correspondence in order to keep the information on the project website up to date.
Regularly-scheduled meetings or conference calls should include all project members where possible. Research Coordinators typically arrange project team meetings and take meeting minutes, while Research Support typically arranges the Working Group meetings.
The shared calendar, used for booking rooms, displays information on room availability and may include schedules of team members.
Time spent on projects should be entered by all MCHP employees who are members of the project team.
This includes:
- Policies - e.g., Dissemination of Research Findings
- Standards - e.g., deliverable production, use of logos, web publishing
- Guidelines - e.g., producing PDFs, powerpoint, and Reference Manager files
- Other resources - e.g., e-mail etiquette, technical resources, photos.
Making sure the numbers “make sense”. Carrying out these checks requires spelling out who will do which checks.
A variety of things to check for at various stages of the study. Programming can be reviewed, for example, by checking to ensure all programs have used the right exclusions, the correct definitions, etc., and that output has been accurately transferred to graphs, tables, and maps for the report.
In this case it is MCHP and Manitoba Health Reports - an example of cross-checking against another source of data.
Several steps need to take place to “finish” the project:
Wind-up or debriefing meetings are held shortly after public release of a deliverable. Such meetings provide all team members with an opportunity to communicate what worked/did not work in bringing the project to completion, providing lessons learned for future deliverables.
Findings from the wind-up meeting should be used to update and finalize the project website (including entering the date of release of report/paper). Both Windows and Unix project directories should be reviewed to ensure that only those SAS programs relevant to project analyses are kept (and well-documented) for future reference. Any related files which may be stored in a user directory should be moved to the project directory.
When the project is complete, the Systems Administrator should be informed. Project directories, including program files and output data sets, will be archived to tape or CD. Tape backups are retained for a 5-year period before being destroyed so any project may be restored up to five years after completion.
This sits with the MCHP resource repository; a general overview of this process is described in General Documentation Process (internal).