Commit

Merge branch 'modularize_cleanepi' of https://github.com/epiverse-tra…
…ce/cleanepi into modularize_cleanepi
Karim-Mane committed Feb 5, 2024
2 parents 0c37781 + 09f1d58 commit c6ad63b
Showing 2 changed files with 61 additions and 13 deletions.
50 changes: 49 additions & 1 deletion R/clean_data.R
@@ -1,11 +1,58 @@
#' Clean data
#'
#' @description this function cleans up messy data frames by performing several operations. These include
#' @description Cleans up messy data frames by performing several operations. These include
#' cleaning of column names, detecting and removing
#' duplicates, empty records and columns, constant columns, replacing missing
#' values by NA, converting character columns into dates when they contain a
#' certain number of date values, and detecting subject IDs with wrong format
#'
<<<<<<< HEAD
#' @param data A data frame
#' @param params A list of parameters that define what cleaning operations will
#' be applied on the input data. Possible parameters are:
#' \enumerate{
#' \item `remove_duplicates`: A logical variable that indicates whether to remove duplicated records. If
#' set to `TRUE`, it calls the `remove_duplicates()` function
#' with parameter `remove` set to `NULL`, i.e., it keeps only the first instance
#' of duplicated rows. If you only want to detect duplicated rows in the dataset, use
#' the `find_duplicates()` function.
#' \item `target_columns`: A vector of column names or indices to consider
#' when looking for duplicates. When the input data is a `linelist` object, this
#' parameter can be set to `tags` to search for duplicates across the tagged variables only.
#' \item `replace_missing`: A logical variable that indicates whether to replace missing value characters
#' with NA or not. The default value is `FALSE`.
#' \item `na_comes_as`: A string that represents the missing values in
#' the data frame. This is only required when `replace_missing=TRUE`.
#' \item `check_timeframe`: A logical variable that determines whether to check if the
#' dates fall within the given time frame or not. The default value is `FALSE`.
#' \item `timeframe`: A vector of 2 dates that specifies the
#' first and last date. If provided, all dates in the data frame must be
#' within this range or will be set to NA during the cleaning.
#' \item `error_tolerance`: A number between 0 and 1 indicating the proportion
#' of entries which cannot be identified as dates to be tolerated; if
#' this proportion is exceeded, the original vector is returned, and a
#' message is issued; defaults to 0.1 (10 percent).
#' \item `subject_id_col_name`: The name of the column that contains the subject IDs.
#' \item `subject_id_format`: The expected subject ID format.
#' \item `prefix`: A prefix used in the subject IDs.
#' \item `suffix`: A suffix used in the subject IDs.
#' \item `range`: A vector with the range of numbers in the subject IDs.
#' \item `dictionary`: A data frame containing the data
#' dictionary that will be used to clean the specified columns. Use
#' `?clean_using_dictionary` for more details.
#' \item `range`: A vector with the range of numbers in the sample IDs.
#' \item `keep`: A vector of column names to be kept as they appear
#' in the original data. The default is `NULL`.
#' }
#'
#' @return a list of the following 2 elements:
#' \enumerate{
#' \item `data`: A clean data frame according to the user-specified
#' parameters.
#' \item `report`: A list with details from each
#' cleaning operation considered.
#' }
=======
#' @param data the input data frame
#' @param params a list of parameters that define what cleaning operations will
#' be applied on the input data. Possible values are:
@@ -34,6 +81,7 @@
#' }
#'
#' @return the cleaned data frame according to the user-specified parameters
>>>>>>> 66c2fcf5d40a69a7f0be86afa08fa94a89e0dcad
#' @export
#'
#' @examples
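
The `params` entries documented in this hunk can be illustrated with a small base R sketch. The parameter names are taken from the roxygen text; the values, the `toy` data frame, and the two cleaning steps are assumptions made for illustration only and are not the package's implementation.

```r
# Hypothetical parameter list: names follow the documented items,
# values are invented for this example.
params <- list(
  remove_duplicates = TRUE,
  target_columns    = NULL,
  replace_missing   = TRUE,
  na_comes_as       = "-99",
  check_timeframe   = TRUE,
  timeframe         = as.Date(c("2020-01-01", "2020-12-31")),
  error_tolerance   = 0.1
)

# Toy input (assumed, not from the package).
toy <- data.frame(
  id    = c("A1", "A2", "A3"),
  age   = c("34", "-99", "51"),
  onset = as.Date(c("2020-03-01", "2019-11-15", "2020-07-20")),
  stringsAsFactors = FALSE
)

# replace_missing: turn the missing-value string into NA in every column.
if (params$replace_missing) {
  toy[] <- lapply(toy, function(col) {
    col[!is.na(col) & as.character(col) == params$na_comes_as] <- NA
    col
  })
}

# check_timeframe: set dates outside the first/last date to NA.
if (params$check_timeframe) {
  outside <- !is.na(toy$onset) &
    (toy$onset < params$timeframe[1] | toy$onset > params$timeframe[2])
  toy$onset[outside] <- NA
}
```

In this toy run the second `age` value and the second `onset` date both become `NA`, which is one reading of the `replace_missing` and `timeframe` descriptions.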
24 changes: 12 additions & 12 deletions R/find_and_remove_duplicates.R
@@ -1,15 +1,15 @@
#' Remove duplicates and constant rows and columns
#'
#' @description
#' Remove duplicates and noise such as empty rows and
#' Removes duplicates and noise such as empty rows and
#' columns, and constant columns. These operations are
#' automatically performed by default unless specified otherwise.
#' Users can specify a set of columns to consider when removing
#' duplicates with the 'target_columns' argument.
#'
#' @param data The input data frame or linelist.
#' @param data An input data frame or linelist.
#' @param target_columns A vector of column names to use when looking for
#' duplicates. When the input data is a `linelist` object, this
#' duplicates. When the input data is a `linelist`, this
#' parameter can be set to `tags` to search for duplicates across the tagged variables only.
#' Its default value is `NULL`, which considers duplicates across all columns.
#' @param remove A vector of duplicate indices to be removed. Duplicate indices
@@ -24,7 +24,7 @@
#' @param rm_constant_cols A logical variable that is used to specify whether to remove
#' constant columns or not. Its default value is `TRUE`.
#'
#' @return The input data without the duplicate values.
#' @return A data frame or linelist without the duplicated values or constant columns.
#' @export
#'
#' @examples
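
The "noise" removal described in this roxygen block (empty rows, empty columns, constant columns) can be sketched in base R. This is a minimal illustration under assumptions: the `toy` data frame and the order of the steps are invented here and may differ from what `remove_duplicates()` actually does.

```r
# Toy input (assumed): one empty row, one empty column, one constant column.
toy <- data.frame(
  id    = c(1, 2, NA),
  site  = c("a", "a", NA),
  empty = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Drop rows in which every value is NA.
empty_rows <- rowSums(!is.na(toy)) == 0
toy <- toy[!empty_rows, , drop = FALSE]

# Drop columns in which every value is NA.
empty_cols <- colSums(!is.na(toy)) == 0
toy <- toy[, !empty_cols, drop = FALSE]

# Drop columns that hold a single distinct value.
is_constant <- vapply(toy, function(col) length(unique(col)) == 1, logical(1))
toy <- toy[, !is_constant, drop = FALSE]
```

After these three steps only the `id` column and the first two rows remain in the toy data.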
@@ -97,7 +97,7 @@ remove_duplicates <- function(data,

  # remove duplicates
  if (is.null(remove)) {
    # remove duplicates by keeping the first instance of the duplicate in each
    # remove duplicates and keep the first instance of the duplicate in each
    # duplicate group
    dat <- dat %>%
      dplyr::distinct_at({{ target_columns }}, .keep_all = TRUE)
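
The "keep the first instance in each duplicate group" behaviour can be reproduced with base R's `duplicated()`. This is a rough analogue of the `dplyr::distinct_at(..., .keep_all = TRUE)` call above rather than the package code; the `toy` data and the choice of `target_columns` are assumptions.

```r
# Toy input (assumed): rows 2-3 and rows 4-5 share the same `id`.
toy <- data.frame(
  id   = c(1, 2, 2, 3, 3),
  site = c("a", "b", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Hypothetical choice of columns to compare.
target_columns <- c("id")

# duplicated() flags every repeat after the first occurrence, so negating it
# keeps the first row of each duplicate group along with all other columns.
keep <- !duplicated(toy[, target_columns, drop = FALSE])
deduplicated <- toy[keep, , drop = FALSE]
```

Because only `id` is compared, the row with `site = "d"` is dropped even though its `site` value is unique. Comparing across all columns instead would keep it.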
@@ -126,11 +126,11 @@ remove_duplicates <- function(data,

#' Identify and return duplicated rows in a data frame or linelist.
#'
#' @param data The input data frame or linelist.
#' @param data A data frame or linelist.
#' @param target_columns A vector of columns names or indices to consider when
#' looking for duplicates. When the input data is a `linelist` object,
#' this parameter can be set to `tags` if you wish to look for
#' duplicates across the tagged variables only.
#' looking for duplicates. When the input data is a `linelist` object, this
#' parameter can be set to `tags` to search for duplicates across the tagged variables only.
#' Its default value is `NULL`, which considers duplicates across all columns.
#'
#' @return Data frame or linelist of all duplicated rows with the following 2
#' additional columns:
@@ -178,11 +178,11 @@ find_duplicates <- function(data, target_columns = NULL) {

#' Get the names of the columns from which duplicates will be found
#'
#' @param data the input dataset
#' @param target_columns the user specified target column name
#' @param data A data frame or linelist
#' @param target_columns A vector of column names
#' @param cols a vector of empty and constant columns
#'
#' @return a `vector` with the target column names or indexes
#' @return a vector with the target column names or indexes
#'
#' @keywords internal
#'
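
One way to read the internal helper documented here, together with `find_duplicates()`, is sketched below. The helper name `resolve_target_columns`, the rule of excluding the columns listed in `cols`, and the `toy` data are all assumptions for illustration; the real function body is not shown in this diff.

```r
# Hypothetical helper: decide which columns to compare.
# Assumption: NULL means "all columns", and columns listed in `cols`
# (empty or constant ones) are excluded from the comparison.
resolve_target_columns <- function(data, target_columns = NULL, cols = NULL) {
  if (is.null(target_columns)) {
    target_columns <- names(data)
  }
  setdiff(target_columns, cols)
}

# Toy input (assumed): rows 2 and 3 are identical, `note` is constant.
toy <- data.frame(
  id   = c(1, 2, 2, 3),
  site = c("a", "b", "b", "c"),
  note = c("x", "x", "x", "x"),
  stringsAsFactors = FALSE
)

target <- resolve_target_columns(toy, target_columns = NULL, cols = "note")

# Flag every member of a duplicate group, not only the repeats, by
# scanning from both ends.
subset_cols <- toy[, target, drop = FALSE]
is_dup <- duplicated(subset_cols) | duplicated(subset_cols, fromLast = TRUE)
all_duplicates <- toy[is_dup, , drop = FALSE]
```

Scanning from both ends returns all rows of each duplicate group, which matches the description of `find_duplicates()` as returning all duplicated rows.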