du4tnr
du4tnr.Rmd
Idea
Through working with different cohort of student researchers in NAARE program at CSUF, I recognized that while leaving students to do data cleaning and pre-processing is a valuable part of their training, but it becomes a nightmare when it comes to validating their final work products.
To maintain some consistency and make their preprocessing choices
transparent, I created an R-package and named it NACCdata
,
after NACC, which is the
organization that provided us with the data. Since I did not have the
permission to share the data, I created a new, data-less package:
du4tnr
(Data Utility for Transparency and Reproduciblity).
I also added a new feature for recording datasets/variables in a SQLite
database to facilitate sharing verified datasets among a large group of
collaborators.
Features
cleanDS
: Clean Dataset using instructions in a
Data.Fream
The main feature in the original package was to reduce selection,
renaming, setting missing values to NA, and redefining levels of
categorical variables. This was an important step as NACC
data had a large number of variables (1936) and it used numerical values
for presenting categorical variables and for all missing values.
Let’s use mtcars
as an example:
-
cleanMTcars
contains the instruction for cleaningmtcars
:- The first two rows show how categorical variables are turned into factors with meaningful levels.
- The last row shows how records with
gear==5
are considered asNA
.
cleanMTcars <- data.frame(
Name=c('vs', 'am', 'gear'),
CurrentVal=c('0,1','0,1',''),
NewLevel=c('V-shape,Straight','Manual,Automatic',''),
NA_Vals=c('','','5'),
Rename=c('Engine.Shape','Transmission', 'Forward.Gears'))
cleanMTcars
#> Name CurrentVal NewLevel NA_Vals Rename
#> 1 vs 0,1 V-shape,Straight Engine.Shape
#> 2 am 0,1 Manual,Automatic Transmission
#> 3 gear 5 Forward.Gears
-
cleanDS
uses thecleanMTcars
to select, clean and rename three variables:
cleanDS(mtcars, cleanDF=cleanMTcars, Rename=TRUE) |> tail(10)
#> Engine.Shape Transmission Forward.Gears
#> AMC Javelin V-shape Manual 3
#> Camaro Z28 V-shape Manual 3
#> Pontiac Firebird V-shape Manual 3
#> Fiat X1-9 Straight Automatic 4
#> Porsche 914-2 V-shape Automatic NA
#> Lotus Europa Straight Automatic NA
#> Ford Pantera L V-shape Automatic NA
#> Ferrari Dino V-shape Automatic NA
#> Maserati Bora V-shape Automatic NA
#> Volvo 142E Straight Automatic 4
-
CleanDS
can pass through other variables usingaddCols
parameter, and can skip renaming if needed (Rename=FALSE
):
processedMTcars <- cleanDS(mtcars, cleanDF=cleanMTcars,
Rename=FALSE, addCols=colnames(mtcars))
processedMTcars |> head(10)
#> vs am gear mpg cyl disp hp drat wt qsec
#> Mazda RX4 V-shape Automatic 4 21.0 6 160.0 110 3.90 2.620 16.46
#> Mazda RX4 Wag V-shape Automatic 4 21.0 6 160.0 110 3.90 2.875 17.02
#> Datsun 710 Straight Automatic 4 22.8 4 108.0 93 3.85 2.320 18.61
#> Hornet 4 Drive Straight Manual 3 21.4 6 258.0 110 3.08 3.215 19.44
#> Hornet Sportabout V-shape Manual 3 18.7 8 360.0 175 3.15 3.440 17.02
#> Valiant Straight Manual 3 18.1 6 225.0 105 2.76 3.460 20.22
#> Duster 360 V-shape Manual 3 14.3 8 360.0 245 3.21 3.570 15.84
#> Merc 240D Straight Manual 4 24.4 4 146.7 62 3.69 3.190 20.00
#> Merc 230 Straight Manual 4 22.8 4 140.8 95 3.92 3.150 22.90
#> Merc 280 Straight Manual 4 19.2 6 167.6 123 3.92 3.440 18.30
#> carb
#> Mazda RX4 4
#> Mazda RX4 Wag 4
#> Datsun 710 1
#> Hornet 4 Drive 1
#> Hornet Sportabout 2
#> Valiant 1
#> Duster 360 4
#> Merc 240D 2
#> Merc 230 2
#> Merc 280 4
SQLite Storage
This feature is added to create a repository that can be shared with other collaborator and save time in recreating commonly used datasets. It can also increase the integrity of the data compared to the case that CSV files are shared between colleagues.
Let’s start from scracth:
dbPath <- 'myProject.db'
res <- NULL
if(file.exists(dbPath)){
if(!file.remove(dbPath)){
closeAllConnections()
unlink(dbPath, force = TRUE)
}
}
if(file.exists(dbPath)) stop('Cannot remove ', dbPath, '!\n',
'Close it by `db_close()` or restart R.')
# Create a new SQLite database or open an existing one:
projDB <- dbObj(dp_path = dbPath)
print(projDB)
#> SQLite Variable Storage Object
#> Database path: myProject.db
#> Connection status: Open
#> Stored items: 0
#> Registered files: 0
# Add mtcars, cleanMTcars, and processedMTcars to the database
projDB$add_var(vname = 'MTCars0', value = mtcars)
#> Saved variable: MTCars0
projDB$add_var(vname = 'cleaningInstructions', value = cleanMTcars)
#> Saved variable: cleaningInstructions
projDB$add_var(vname = 'MTcars-Processed', value = processedMTcars,
steps = c("rename and relabel `vs` and `am`", "set `gear==5` to `NA`"))
#> Saved variable: MTcars-Processed
print(projDB)
#> SQLite Variable Storage Object
#> Database path: myProject.db
#> Connection status: Open
#> Stored items: 3
#> Registered files: 0
#>
#> items:
#> name type source_file created_at
#> 1 MTCars0 data.frame <NA> 2025-06-30 16:10:18.710079
#> 2 MTcars-Processed data.frame <NA> 2025-06-30 16:10:18.714412
#> 3 cleaningInstructions data.frame <NA> 2025-06-30 16:10:18.712596
#> updated_at
#> 1 2025-06-30 16:10:18.710079
#> 2 2025-06-30 16:10:18.714412
#> 3 2025-06-30 16:10:18.712596
Let’s close the database:
# Close the database
projDB$close_db()
#> Database connection closed
print(projDB)
#> SQLite Variable Storage Object
#> Database path: myProject.db
#> Connection status: Closed
Let’s reuse it:
projDB$reconnect_db() # No need to reintroduce the path
print(projDB$list_vars()) # Just listing variables
#> name type source_file created_at
#> 1 MTCars0 data.frame <NA> 2025-06-30 16:10:18.710079
#> 2 MTcars-Processed data.frame <NA> 2025-06-30 16:10:18.714412
#> 3 cleaningInstructions data.frame <NA> 2025-06-30 16:10:18.712596
#> updated_at
#> 1 2025-06-30 16:10:18.710079
#> 2 2025-06-30 16:10:18.714412
#> 3 2025-06-30 16:10:18.712596
Now, let’s access the database using a second object:
myDB <- dbObj(dp_path = dbPath)
print(myDB$list_vars())
#> name type source_file created_at
#> 1 MTCars0 data.frame <NA> 2025-06-30 16:10:18.710079
#> 2 MTcars-Processed data.frame <NA> 2025-06-30 16:10:18.714412
#> 3 cleaningInstructions data.frame <NA> 2025-06-30 16:10:18.712596
#> updated_at
#> 1 2025-06-30 16:10:18.710079
#> 2 2025-06-30 16:10:18.714412
#> 3 2025-06-30 16:10:18.712596
And let’s create a source file to remove NAs and add the result as a new variable:
# Create an R Script
fConn <- file("cleanMtcars.R")
writeLines(con = fConn, text = '
# Removing NAs from processedMTcars
cleansedMTcatrs <- processedMTcars[complete.cases(processedMTcars),]
')
close(fConn)
# Run the R Script
source("cleanMtcars.R")
# Add the new variable and its source file to the dataset
myDB$add_var(vname = "MTcars-cleaned", cleansedMTcatrs,
steps = "Remove NAs")
#> Saved variable: MTcars-cleaned
print(myDB)
#> SQLite Variable Storage Object
#> Database path: myProject.db
#> Connection status: Open
#> Stored items: 4
#> Registered files: 0
#>
#> items:
#> name type source_file created_at
#> 1 MTCars0 data.frame <NA> 2025-06-30 16:10:18.710079
#> 2 MTcars-Processed data.frame <NA> 2025-06-30 16:10:18.714412
#> 3 MTcars-cleaned data.frame <NA> 2025-06-30 16:10:19.004587
#> 4 cleaningInstructions data.frame <NA> 2025-06-30 16:10:18.712596
#> updated_at
#> 1 2025-06-30 16:10:18.710079
#> 2 2025-06-30 16:10:18.714412
#> 3 2025-06-30 16:10:19.004587
#> 4 2025-06-30 16:10:18.712596
Let’s check the first instance of the database:
projDB$list_vars()
#> name type source_file created_at
#> 1 MTCars0 data.frame <NA> 2025-06-30 16:10:18.710079
#> 2 MTcars-Processed data.frame <NA> 2025-06-30 16:10:18.714412
#> 3 MTcars-cleaned data.frame <NA> 2025-06-30 16:10:19.004587
#> 4 cleaningInstructions data.frame <NA> 2025-06-30 16:10:18.712596
#> updated_at
#> 1 2025-06-30 16:10:18.710079
#> 2 2025-06-30 16:10:18.714412
#> 3 2025-06-30 16:10:19.004587
#> 4 2025-06-30 16:10:18.712596
myDB$close_db()
#> Database connection closed
Descriptive Functions
The package offers a few simple functions that we have found useful in summarizing and describing data.
con_table
: Creating contingency tables
The following creates a contingency table between vs
and
gear
:
x <- projDB$get_var("MTcars-cleaned")
cont_table(x, vs~gear)
#> gear
#> vs 3 4
#> V-shape 12 2
#> Straight 3 10
con_table()
can use more than one variable for rows or
columns:
cont_table(x, vs~gear+am)
#> gear+am
#> vs 3-Manual 4-Automatic 4-Manual
#> V-shape 12 2 0
#> Straight 3 6 4
This function can also count NAs, when needed:
y <- projDB$get_var("MTcars-Processed")
cont_table(y, vs+am~gear, useNA = 'ifany')
#> gear
#> vs+am 3 4 <NA>
#> Straight-Automatic 0 6 1
#> Straight-Manual 3 4 0
#> V-shape-Automatic 0 2 4
#> V-shape-Manual 12 0 0
total_col()
and total_row()
: Adding
Marginal Sums to Tables and Data.frames
These functions can add total column and row to a count table and data.frame:
cont_table(y, vs+am~gear, useNA = 'ifany') |>
total_row() |> total_col()
#> 3 4 <NA> Total.Col
#> Straight-Automatic 0 6 1 7
#> Straight-Manual 3 4 0 7
#> V-shape-Automatic 0 2 4 6
#> V-shape-Manual 12 0 0 12
#> Total.Row 15 12 5 32
y |> total_row() |> tail(10)
#> vs am gear mpg cyl disp hp drat wt
#> Camaro Z28 V-shape Manual 3 13.3 8 350.0 245 3.73 3.840
#> Pontiac Firebird V-shape Manual 3 19.2 8 400.0 175 3.08 3.845
#> Fiat X1-9 Straight Automatic 4 27.3 4 79.0 66 4.08 1.935
#> Porsche 914-2 V-shape Automatic NA 26.0 4 120.3 91 4.43 2.140
#> Lotus Europa Straight Automatic NA 30.4 4 95.1 113 3.77 1.513
#> Ford Pantera L V-shape Automatic NA 15.8 8 351.0 264 4.22 3.170
#> Ferrari Dino V-shape Automatic NA 19.7 6 145.0 175 3.62 2.770
#> Maserati Bora V-shape Automatic NA 15.0 8 301.0 335 3.54 3.570
#> Volvo 142E Straight Automatic 4 21.4 4 121.0 109 4.11 2.780
#> Total <NA> <NA> NA 642.9 198 7383.1 4694 115.09 102.952
#> qsec carb
#> Camaro Z28 15.41 4
#> Pontiac Firebird 17.05 2
#> Fiat X1-9 18.90 1
#> Porsche 914-2 16.70 2
#> Lotus Europa 16.90 2
#> Ford Pantera L 14.50 4
#> Ferrari Dino 15.50 6
#> Maserati Bora 14.60 8
#> Volvo 142E 18.60 2
#> Total 571.16 90
describe_ds()
: Generates a Summary of Variables
This function creates a summary of each variable, including the counts of NAs, if any.
describe_ds(y, maxLevels = 3, var2row = FALSE)
#> $Numerical
#> gear mpg cyl disp
#> Min. :3.000 Min. :10.40 Min. :4.000 Min. : 71.1
#> 1st Qu.:3.000 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8
#> Median :3.000 Median :19.20 Median :6.000 Median :196.3
#> Mean :3.444 Mean :20.09 Mean :6.188 Mean :230.7
#> 3rd Qu.:4.000 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0
#> Max. :4.000 Max. :33.90 Max. :8.000 Max. :472.0
#> NA's :5
#> hp drat wt qsec
#> Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50
#> 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89
#> Median :123.0 Median :3.695 Median :3.325 Median :17.71
#> Mean :146.7 Mean :3.597 Mean :3.217 Mean :17.85
#> 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
#> Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90
#>
#> carb
#> Min. :1.000
#> 1st Qu.:2.000
#> Median :2.000
#> Mean :2.812
#> 3rd Qu.:4.000
#> Max. :8.000
#>
#>
#> $Categorical
#> vs am
#> [1,] V-shape: 18 Manual: 19
#> [2,] Straight: 14 Automatic: 13
#> [3,]
which_ds()
: Searching datasets for variables
When dealing with multiple large datasets, we can use this functions to see which one has the variable we have in mind:
which_ds(c('Species', 'am', 'mpg', 'gpm'),
iris=iris, x=x, y=y)
#> iris x y
#> Species TRUE FALSE FALSE
#> am FALSE TRUE TRUE
#> mpg FALSE TRUE TRUE
#> gpm FALSE FALSE FALSE
A more visible format: