Last updated: 2025-03-09
Checks: 6 1
Knit directory: SAPPHIRE/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of
the R Markdown file created these results, you’ll want to first commit
it to the Git repo. If you’re still working on the analysis, you can
ignore this warning. When you’re finished, you can run
wflow_publish to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20240923) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 9217a86. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/.DS_Store
Ignored: output/.DS_Store
Unstaged changes:
Modified: analysis/pigmentationdata.Rmd
Modified: analysis/summer-winter.Rmd
Modified: output/coeff_plots_summer-winter/coeff_Forehead_B.png
Modified: output/coeff_plots_summer-winter/coeff_Forehead_CIE_L.png
Modified: output/coeff_plots_summer-winter/coeff_Forehead_CIE_a.png
Modified: output/coeff_plots_summer-winter/coeff_Forehead_CIE_b.png
Modified: output/coeff_plots_summer-winter/coeff_Forehead_E.png
Modified: output/coeff_plots_summer-winter/coeff_Forehead_G.png
Modified: output/coeff_plots_summer-winter/coeff_Forehead_M.png
Modified: output/coeff_plots_summer-winter/coeff_Forehead_R.png
Modified: output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_B.png
Modified: output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_CIE_L.png
Modified: output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_CIE_a.png
Modified: output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_CIE_b.png
Modified: output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_E.png
Modified: output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_G.png
Modified: output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_M.png
Modified: output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_R.png
Modified: output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_B.png
Modified: output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_CIE_L.png
Modified: output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_CIE_a.png
Modified: output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_CIE_b.png
Modified: output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_E.png
Modified: output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_G.png
Modified: output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_M.png
Modified: output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_R.png
Modified: output/withinstats_filtered/Forehead_B.png
Modified: output/withinstats_filtered/Forehead_CIE_L.png
Modified: output/withinstats_filtered/Forehead_CIE_a.png
Modified: output/withinstats_filtered/Forehead_CIE_b.png
Modified: output/withinstats_filtered/Forehead_E.png
Modified: output/withinstats_filtered/Forehead_G.png
Modified: output/withinstats_filtered/Forehead_M.png
Modified: output/withinstats_filtered/Forehead_R.png
Modified: output/withinstats_filtered/LeftUpperInnerArm_B.png
Modified: output/withinstats_filtered/LeftUpperInnerArm_CIE_L.png
Modified: output/withinstats_filtered/LeftUpperInnerArm_CIE_a.png
Modified: output/withinstats_filtered/LeftUpperInnerArm_CIE_b.png
Modified: output/withinstats_filtered/LeftUpperInnerArm_E.png
Modified: output/withinstats_filtered/LeftUpperInnerArm_G.png
Modified: output/withinstats_filtered/LeftUpperInnerArm_M.png
Modified: output/withinstats_filtered/LeftUpperInnerArm_R.png
Modified: output/withinstats_filtered/RightUpperInnerArm_B.png
Modified: output/withinstats_filtered/RightUpperInnerArm_CIE_L.png
Modified: output/withinstats_filtered/RightUpperInnerArm_CIE_a.png
Modified: output/withinstats_filtered/RightUpperInnerArm_CIE_b.png
Modified: output/withinstats_filtered/RightUpperInnerArm_E.png
Modified: output/withinstats_filtered/RightUpperInnerArm_G.png
Modified: output/withinstats_filtered/RightUpperInnerArm_M.png
Modified: output/withinstats_filtered/RightUpperInnerArm_R.png
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/pigmentationdata.Rmd) and
HTML (docs/pigmentationdata.html) files. If you’ve
configured a remote Git repository (see ?wflow_git_remote),
click on the hyperlinks in the table below to view the files as they
were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 9217a86 | Tina Lasisi | 2025-03-07 | Cleaned data and redid analyses |
# Install packages if needed:
# install.packages("readxl")
#
library(readxl)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Path to your Excel file
workbook_path <- "data/serum_vit_D_study_with_lab_results.xlsx"
# Get the list of all sheet names in the workbook
sheets <- excel_sheets(workbook_path)
# Loop through each sheet, read it, and write it to a CSV
for (sheet_name in sheets) {
df <- read_excel(workbook_path, sheet = sheet_name)
# Replace problematic characters in sheet name
safe_sheet_name <- gsub("[^A-Za-z0-9\\-_]+", "_", sheet_name)
# Create file name and save
output_path <- file.path("data", paste0(safe_sheet_name, ".csv"))
write.csv(df, output_path, row.names = FALSE)
message("Created CSV file for sheet '", sheet_name, "': ", output_path)
}
Created CSV file for sheet 'ScreeningDataCollectionSummer': data/ScreeningDataCollectionSummer.csv
Created CSV file for sheet 'ScreeningDataCollectionWinter': data/ScreeningDataCollectionWinter.csv
Created CSV file for sheet 'ScreeningDataCollection6Weeks': data/ScreeningDataCollection6Weeks.csv
Created CSV file for sheet 'FoodFrequencySummer': data/FoodFrequencySummer.csv
Created CSV file for sheet 'FoodFrequencyWinter': data/FoodFrequencyWinter.csv
Created CSV file for sheet 'FoodFrequency6Weeks': data/FoodFrequency6Weeks.csv
Created CSV file for sheet 'SunExposureSummer': data/SunExposureSummer.csv
Created CSV file for sheet 'SunExposureWinter': data/SunExposureWinter.csv
Created CSV file for sheet 'SunExposure6Weeks': data/SunExposure6Weeks.csv
# (Optional) Install packages if needed
# install.packages("dplyr")
# Read each CSV and add a "Season" column
summer_data <- read.csv("data/ScreeningDataCollectionSummer.csv") %>%
mutate(Season = "Summer")
winter_data <- read.csv("data/ScreeningDataCollectionWinter.csv") %>%
mutate(Season = "Winter")
sixweek_data <- read.csv("data/ScreeningDataCollection6Weeks.csv") %>%
mutate(Season = "6Weeks")
# Combine into one data frame
combined_data <- bind_rows(summer_data, winter_data, sixweek_data)
# Write the combined data to a single CSV
write.csv(combined_data, "data/ScreeningDataCollection.csv", row.names = FALSE)
# Step 1: Read & Clean the Data
# Load required libraries
library(tidyverse)
library(readr)
# Read the original data
df_original <- read.csv("data/ScreeningDataCollection.csv")
# Clean and rename variables
df_cleaned <- df_original |>
mutate(
# Convert EthnicityColoured -> Ethnicity (adapt if it's character/string)
Ethnicity = if_else(EthnicityColoured == TRUE, "CapeMixed", "Xhosa")
) |>
select(
ParticipantCentreID,
TodayDate,
Gender,
Ethnicity,
Season,
starts_with("SkinReflectance"),
VitDResult
)
# Get all column names
all_columns <- colnames(df_cleaned)
# Find columns with L., a., and b. and rename them
renamed_columns <- all_columns
# Create a function to rename CIE variables
rename_cie_vars <- function(column_name) {
# Replace L. with CIE_L
column_name <- gsub("SkinReflectanceForehead[L]\\.([1-3])", "SkinReflectanceForeheadCIE_L\\1", column_name)
column_name <- gsub("SkinReflectanceRightUpperInnerArm[L]\\.([1-3])", "SkinReflectanceRightUpperInnerArmCIE_L\\1", column_name)
column_name <- gsub("SkinReflectanceLeftUpperInnerArm[L]\\.([1-3])", "SkinReflectanceLeftUpperInnerArmCIE_L\\1", column_name)
# Replace a. with CIE_a
column_name <- gsub("SkinReflectanceForehead[a]\\.([1-3])", "SkinReflectanceForeheadCIE_a\\1", column_name)
column_name <- gsub("SkinReflectanceRightUpperInnerArm[a]\\.([1-3])", "SkinReflectanceRightUpperInnerArmCIE_a\\1", column_name)
column_name <- gsub("SkinReflectanceLeftUpperInnerArm[a]\\.([1-3])", "SkinReflectanceLeftUpperInnerArmCIE_a\\1", column_name)
# Replace b. with CIE_b
column_name <- gsub("SkinReflectanceForehead[b]\\.([1-3])", "SkinReflectanceForeheadCIE_b\\1", column_name)
column_name <- gsub("SkinReflectanceRightUpperInnerArm[b]\\.([1-3])", "SkinReflectanceRightUpperInnerArmCIE_b\\1", column_name)
column_name <- gsub("SkinReflectanceLeftUpperInnerArm[b]\\.([1-3])", "SkinReflectanceLeftUpperInnerArmCIE_b\\1", column_name)
return(column_name)
}
# Apply the renaming function to each column
for (i in 1:length(all_columns)) {
renamed_columns[i] <- rename_cie_vars(all_columns[i])
}
# Create a rename map
rename_map <- setNames(all_columns, renamed_columns)
# Check if any columns were changed
changed_columns <- which(renamed_columns != all_columns)
if (length(changed_columns) > 0) {
cat("Renamed columns:\n")
for (i in changed_columns) {
cat(paste(" ", all_columns[i], "->", renamed_columns[i], "\n"))
}
} else {
cat("No columns were renamed.\n")
}
Renamed columns:
SkinReflectanceForeheadL.1 -> SkinReflectanceForeheadCIE_L1
SkinReflectanceForeheadL.2 -> SkinReflectanceForeheadCIE_L2
SkinReflectanceForeheadL.3 -> SkinReflectanceForeheadCIE_L3
SkinReflectanceForeheada.1 -> SkinReflectanceForeheadCIE_a1
SkinReflectanceForeheada.2 -> SkinReflectanceForeheadCIE_a2
SkinReflectanceForeheada.3 -> SkinReflectanceForeheadCIE_a3
SkinReflectanceForeheadb.1 -> SkinReflectanceForeheadCIE_b1
SkinReflectanceForeheadb.2 -> SkinReflectanceForeheadCIE_b2
SkinReflectanceForeheadb.3 -> SkinReflectanceForeheadCIE_b3
SkinReflectanceRightUpperInnerArmL.1 -> SkinReflectanceRightUpperInnerArmCIE_L1
SkinReflectanceRightUpperInnerArmL.2 -> SkinReflectanceRightUpperInnerArmCIE_L2
SkinReflectanceRightUpperInnerArmL.3 -> SkinReflectanceRightUpperInnerArmCIE_L3
SkinReflectanceRightUpperInnerArma.1 -> SkinReflectanceRightUpperInnerArmCIE_a1
SkinReflectanceRightUpperInnerArma.2 -> SkinReflectanceRightUpperInnerArmCIE_a2
SkinReflectanceRightUpperInnerArma.3 -> SkinReflectanceRightUpperInnerArmCIE_a3
SkinReflectanceRightUpperInnerArmb.1 -> SkinReflectanceRightUpperInnerArmCIE_b1
SkinReflectanceRightUpperInnerArmb.2 -> SkinReflectanceRightUpperInnerArmCIE_b2
SkinReflectanceRightUpperInnerArmb.3 -> SkinReflectanceRightUpperInnerArmCIE_b3
SkinReflectanceLeftUpperInnerArmL.1 -> SkinReflectanceLeftUpperInnerArmCIE_L1
SkinReflectanceLeftUpperInnerArmL.2 -> SkinReflectanceLeftUpperInnerArmCIE_L2
SkinReflectanceLeftUpperInnerArmL.3 -> SkinReflectanceLeftUpperInnerArmCIE_L3
SkinReflectanceLeftUpperInnerArma.1 -> SkinReflectanceLeftUpperInnerArmCIE_a1
SkinReflectanceLeftUpperInnerArma.2 -> SkinReflectanceLeftUpperInnerArmCIE_a2
SkinReflectanceLeftUpperInnerArma.3 -> SkinReflectanceLeftUpperInnerArmCIE_a3
SkinReflectanceLeftUpperInnerArmb.1 -> SkinReflectanceLeftUpperInnerArmCIE_b1
SkinReflectanceLeftUpperInnerArmb.2 -> SkinReflectanceLeftUpperInnerArmCIE_b2
SkinReflectanceLeftUpperInnerArmb.3 -> SkinReflectanceLeftUpperInnerArmCIE_b3
# Now let's try a different approach since the regex might not be capturing correctly
# Directly identify and rename columns with specific patterns
# Create a new renamed dataframe
df_renamed <- df_cleaned
# Iterate through each column and apply renaming
rename_columns <- function(df) {
col_names <- colnames(df)
new_names <- col_names
for (i in seq_along(col_names)) {
# Check for L. pattern
if (grepl("L\\.", col_names[i])) {
new_names[i] <- gsub("L\\.", "CIE_L", col_names[i])
}
# Check for a. pattern
else if (grepl("a\\.", col_names[i])) {
new_names[i] <- gsub("a\\.", "CIE_a", col_names[i])
}
# Check for b. pattern
else if (grepl("b\\.", col_names[i])) {
new_names[i] <- gsub("b\\.", "CIE_b", col_names[i])
}
}
# Show which columns were renamed
changed <- which(new_names != col_names)
if (length(changed) > 0) {
cat("Renamed columns:\n")
for (i in changed) {
cat(paste(" ", col_names[i], "->", new_names[i], "\n"))
}
}
# Rename the columns
colnames(df) <- new_names
return(df)
}
# Apply the renaming
df_renamed <- rename_columns(df_cleaned)
Renamed columns:
SkinReflectanceForeheadL.1 -> SkinReflectanceForeheadCIE_L1
SkinReflectanceForeheadL.2 -> SkinReflectanceForeheadCIE_L2
SkinReflectanceForeheadL.3 -> SkinReflectanceForeheadCIE_L3
SkinReflectanceForeheada.1 -> SkinReflectanceForeheadCIE_a1
SkinReflectanceForeheada.2 -> SkinReflectanceForeheadCIE_a2
SkinReflectanceForeheada.3 -> SkinReflectanceForeheadCIE_a3
SkinReflectanceForeheadb.1 -> SkinReflectanceForeheadCIE_b1
SkinReflectanceForeheadb.2 -> SkinReflectanceForeheadCIE_b2
SkinReflectanceForeheadb.3 -> SkinReflectanceForeheadCIE_b3
SkinReflectanceRightUpperInnerArmL.1 -> SkinReflectanceRightUpperInnerArmCIE_L1
SkinReflectanceRightUpperInnerArmL.2 -> SkinReflectanceRightUpperInnerArmCIE_L2
SkinReflectanceRightUpperInnerArmL.3 -> SkinReflectanceRightUpperInnerArmCIE_L3
SkinReflectanceRightUpperInnerArma.1 -> SkinReflectanceRightUpperInnerArmCIE_a1
SkinReflectanceRightUpperInnerArma.2 -> SkinReflectanceRightUpperInnerArmCIE_a2
SkinReflectanceRightUpperInnerArma.3 -> SkinReflectanceRightUpperInnerArmCIE_a3
SkinReflectanceRightUpperInnerArmb.1 -> SkinReflectanceRightUpperInnerArmCIE_b1
SkinReflectanceRightUpperInnerArmb.2 -> SkinReflectanceRightUpperInnerArmCIE_b2
SkinReflectanceRightUpperInnerArmb.3 -> SkinReflectanceRightUpperInnerArmCIE_b3
SkinReflectanceLeftUpperInnerArmL.1 -> SkinReflectanceLeftUpperInnerArmCIE_L1
SkinReflectanceLeftUpperInnerArmL.2 -> SkinReflectanceLeftUpperInnerArmCIE_L2
SkinReflectanceLeftUpperInnerArmL.3 -> SkinReflectanceLeftUpperInnerArmCIE_L3
SkinReflectanceLeftUpperInnerArma.1 -> SkinReflectanceLeftUpperInnerArmCIE_a1
SkinReflectanceLeftUpperInnerArma.2 -> SkinReflectanceLeftUpperInnerArmCIE_a2
SkinReflectanceLeftUpperInnerArma.3 -> SkinReflectanceLeftUpperInnerArmCIE_a3
SkinReflectanceLeftUpperInnerArmb.1 -> SkinReflectanceLeftUpperInnerArmCIE_b1
SkinReflectanceLeftUpperInnerArmb.2 -> SkinReflectanceLeftUpperInnerArmCIE_b2
SkinReflectanceLeftUpperInnerArmb.3 -> SkinReflectanceLeftUpperInnerArmCIE_b3
# Write to csv
write.csv(df_renamed, "data/ScreeningDataCollection_cleaned.csv", row.names = FALSE)
# Verify the renamed file
df_check <- read.csv("data/ScreeningDataCollection_cleaned.csv")
cat("\nVerification of renamed columns in saved file:\n")
Verification of renamed columns in saved file:
cie_columns <- grep("CIE_[Lab]", colnames(df_check), value = TRUE)
if (length(cie_columns) > 0) {
cat("CIE columns found in the renamed file:\n")
print(cie_columns)
} else {
cat("No CIE columns found in the renamed file.\n")
}
CIE columns found in the renamed file:
[1] "SkinReflectanceForeheadCIE_L1"
[2] "SkinReflectanceForeheadCIE_L2"
[3] "SkinReflectanceForeheadCIE_L3"
[4] "SkinReflectanceForeheadCIE_a1"
[5] "SkinReflectanceForeheadCIE_a2"
[6] "SkinReflectanceForeheadCIE_a3"
[7] "SkinReflectanceForeheadCIE_b1"
[8] "SkinReflectanceForeheadCIE_b2"
[9] "SkinReflectanceForeheadCIE_b3"
[10] "SkinReflectanceRightUpperInnerArmCIE_L1"
[11] "SkinReflectanceRightUpperInnerArmCIE_L2"
[12] "SkinReflectanceRightUpperInnerArmCIE_L3"
[13] "SkinReflectanceRightUpperInnerArmCIE_a1"
[14] "SkinReflectanceRightUpperInnerArmCIE_a2"
[15] "SkinReflectanceRightUpperInnerArmCIE_a3"
[16] "SkinReflectanceRightUpperInnerArmCIE_b1"
[17] "SkinReflectanceRightUpperInnerArmCIE_b2"
[18] "SkinReflectanceRightUpperInnerArmCIE_b3"
[19] "SkinReflectanceLeftUpperInnerArmCIE_L1"
[20] "SkinReflectanceLeftUpperInnerArmCIE_L2"
[21] "SkinReflectanceLeftUpperInnerArmCIE_L3"
[22] "SkinReflectanceLeftUpperInnerArmCIE_a1"
[23] "SkinReflectanceLeftUpperInnerArmCIE_a2"
[24] "SkinReflectanceLeftUpperInnerArmCIE_a3"
[25] "SkinReflectanceLeftUpperInnerArmCIE_b1"
[26] "SkinReflectanceLeftUpperInnerArmCIE_b2"
[27] "SkinReflectanceLeftUpperInnerArmCIE_b3"
# Print a summary of all column names for verification
cat("\nAll column names in final dataset:\n")
All column names in final dataset:
all_final_columns <- colnames(df_check)
print(all_final_columns)
[1] "ParticipantCentreID"
[2] "TodayDate"
[3] "Gender"
[4] "Ethnicity"
[5] "Season"
[6] "SkinReflectanceForeheadE1"
[7] "SkinReflectanceForeheadE2"
[8] "SkinReflectanceForeheadE3"
[9] "SkinReflectanceForeheadM1"
[10] "SkinReflectanceForeheadM2"
[11] "SkinReflectanceForeheadM3"
[12] "SkinReflectanceForeheadR1"
[13] "SkinReflectanceForeheadR2"
[14] "SkinReflectanceForeheadR3"
[15] "SkinReflectanceForeheadG1"
[16] "SkinReflectanceForeheadG2"
[17] "SkinReflectanceForeheadG3"
[18] "SkinReflectanceForeheadB1"
[19] "SkinReflectanceForeheadB2"
[20] "SkinReflectanceForeheadB3"
[21] "SkinReflectanceForeheadCIE_L1"
[22] "SkinReflectanceForeheadCIE_L2"
[23] "SkinReflectanceForeheadCIE_L3"
[24] "SkinReflectanceForeheadCIE_a1"
[25] "SkinReflectanceForeheadCIE_a2"
[26] "SkinReflectanceForeheadCIE_a3"
[27] "SkinReflectanceForeheadCIE_b1"
[28] "SkinReflectanceForeheadCIE_b2"
[29] "SkinReflectanceForeheadCIE_b3"
[30] "SkinReflectanceRightUpperInnerArmE1"
[31] "SkinReflectanceRightUpperInnerArmE2"
[32] "SkinReflectanceRightUpperInnerArmE3"
[33] "SkinReflectanceRightUpperInnerArmM1"
[34] "SkinReflectanceRightUpperInnerArmM2"
[35] "SkinReflectanceRightUpperInnerArmM3"
[36] "SkinReflectanceRightUpperInnerArmR1"
[37] "SkinReflectanceRightUpperInnerArmR2"
[38] "SkinReflectanceRightUpperInnerArmR3"
[39] "SkinReflectanceRightUpperInnerArmG1"
[40] "SkinReflectanceRightUpperInnerArmG2"
[41] "SkinReflectanceRightUpperInnerArmG3"
[42] "SkinReflectanceRightUpperInnerArmB1"
[43] "SkinReflectanceRightUpperInnerArmB2"
[44] "SkinReflectanceRightUpperInnerArmB3"
[45] "SkinReflectanceRightUpperInnerArmCIE_L1"
[46] "SkinReflectanceRightUpperInnerArmCIE_L2"
[47] "SkinReflectanceRightUpperInnerArmCIE_L3"
[48] "SkinReflectanceRightUpperInnerArmCIE_a1"
[49] "SkinReflectanceRightUpperInnerArmCIE_a2"
[50] "SkinReflectanceRightUpperInnerArmCIE_a3"
[51] "SkinReflectanceRightUpperInnerArmCIE_b1"
[52] "SkinReflectanceRightUpperInnerArmCIE_b2"
[53] "SkinReflectanceRightUpperInnerArmCIE_b3"
[54] "SkinReflectanceLeftUpperInnerArmE1"
[55] "SkinReflectanceLeftUpperInnerArmE2"
[56] "SkinReflectanceLeftUpperInnerArmE3"
[57] "SkinReflectanceLeftUpperInnerArmM1"
[58] "SkinReflectanceLeftUpperInnerArmM2"
[59] "SkinReflectanceLeftUpperInnerArmM3"
[60] "SkinReflectanceLeftUpperInnerArmR1"
[61] "SkinReflectanceLeftUpperInnerArmR2"
[62] "SkinReflectanceLeftUpperInnerArmR3"
[63] "SkinReflectanceLeftUpperInnerArmG1"
[64] "SkinReflectanceLeftUpperInnerArmG2"
[65] "SkinReflectanceLeftUpperInnerArmG3"
[66] "SkinReflectanceLeftUpperInnerArmB1"
[67] "SkinReflectanceLeftUpperInnerArmB2"
[68] "SkinReflectanceLeftUpperInnerArmB3"
[69] "SkinReflectanceLeftUpperInnerArmCIE_L1"
[70] "SkinReflectanceLeftUpperInnerArmCIE_L2"
[71] "SkinReflectanceLeftUpperInnerArmCIE_L3"
[72] "SkinReflectanceLeftUpperInnerArmCIE_a1"
[73] "SkinReflectanceLeftUpperInnerArmCIE_a2"
[74] "SkinReflectanceLeftUpperInnerArmCIE_a3"
[75] "SkinReflectanceLeftUpperInnerArmCIE_b1"
[76] "SkinReflectanceLeftUpperInnerArmCIE_b2"
[77] "SkinReflectanceLeftUpperInnerArmCIE_b3"
[78] "VitDResult"
# Count of each type of column
cat("\nCount of each type of reflectance measurement:\n")
Count of each type of reflectance measurement:
count_pattern <- function(pattern) {
sum(grepl(pattern, all_final_columns))
}
cat("E columns:", count_pattern("E[1-3]$"), "\n")
E columns: 9
cat("M columns:", count_pattern("M[1-3]$"), "\n")
M columns: 9
cat("R columns:", count_pattern("R[1-3]$"), "\n")
R columns: 9
cat("G columns:", count_pattern("G[1-3]$"), "\n")
G columns: 9
cat("B columns:", count_pattern("B[1-3]$"), "\n")
B columns: 9
cat("CIE_L columns:", count_pattern("CIE_L[1-3]$"), "\n")
CIE_L columns: 9
cat("CIE_a columns:", count_pattern("CIE_a[1-3]$"), "\n")
CIE_a columns: 9
cat("CIE_b columns:", count_pattern("CIE_b[1-3]$"), "\n")
CIE_b columns: 9
library(tidyverse)
library(readr)
# 1) Read your dataset (wide format).
skin_data <- read_csv("data/ScreeningDataCollection_cleaned.csv")
Rows: 220 Columns: 78
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ParticipantCentreID, Ethnicity, Season
dbl (74): Gender, SkinReflectanceForeheadE1, SkinReflectanceForeheadE2, Ski...
date (1): TodayDate
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# 2) Convert to long + calculate CV.
# This function is essentially your original pivot plus CV calculation.
prepare_triplicates_for_plotting_with_cv <- function(data) {
body_sites <- c("Forehead", "RightUpperInnerArm", "LeftUpperInnerArm")
measure_types <- c("E", "M", "R", "G", "B", "CIE_L", "CIE_a", "CIE_b")
all_triplicates <- tibble()
for (body_site in body_sites) {
for (measure_type in measure_types) {
# Identify columns like SkinReflectanceForeheadE1..3
col_pattern <- paste0("SkinReflectance", body_site, measure_type, "[1-3]$")
triplicate_cols <- names(data)[grepl(col_pattern, names(data))]
if (length(triplicate_cols) != 3) {
cat("Warning: Did not find exactly 3 columns for", body_site, "-", measure_type,
". Found", length(triplicate_cols), "columns.\n")
next
}
# Pivot these columns to long
trip_long <- data %>%
select(ParticipantCentreID, Gender, Ethnicity, Season, all_of(triplicate_cols)) %>%
pivot_longer(
cols = all_of(triplicate_cols),
names_to = "measurement",
values_to = "value"
) %>%
mutate(
replicate = str_extract(measurement, "\\d$"),
body_site = body_site,
measurement_type = measure_type
)
# Compute CV info for each participant-season-bodySite-metric group
cv_info <- trip_long %>%
group_by(ParticipantCentreID, Season, body_site, measurement_type) %>%
summarize(
mean_val = mean(value, na.rm = TRUE),
sd_val = sd(value, na.rm = TRUE),
n_valid = sum(!is.na(value)),
cv = ifelse(n_valid >= 2, (sd_val / mean_val) * 100, NA_real_),
.groups = "drop"
) %>%
mutate(
cv_category = case_when(
is.na(cv) ~ NA_character_,
cv <= 5 ~ "Low",
cv <= 15 ~ "Medium",
TRUE ~ "High"
)
)
# Join the CV info back to each replicate row
trip_cv <- trip_long %>%
left_join(cv_info,
by = c("ParticipantCentreID","Season","body_site","measurement_type"))
all_triplicates <- bind_rows(all_triplicates, trip_cv)
}
}
return(all_triplicates)
}
library(dplyr)
calculate_triplicate_cv <- function(long_data) {
library(dplyr)
cv_info <- long_data %>%
group_by(ParticipantCentreID, Season, body_site, measurement_type) %>%
summarize(
mean_val = mean(value, na.rm = TRUE),
sd_val = sd(value, na.rm = TRUE),
n_valid = sum(!is.na(value)),
cv = ifelse(n_valid >= 2 & mean_val != 0 & !is.na(mean_val),
(sd_val / mean_val)*100, NA_real_),
.groups = "drop"
) %>%
mutate(
cv_category = case_when(
is.na(cv) ~ NA_character_,
cv <= 5 ~ "Low",
cv <= 15 ~ "Medium",
TRUE ~ "High"
)
)
# Remove old CV columns if they exist, then join new ones
long_data_stripped <- long_data %>%
select(-any_of(c("mean_val","sd_val","n_valid","cv","cv_category")))
long_data_cv <- long_data_stripped %>%
left_join(cv_info, by=c("ParticipantCentreID","Season","body_site","measurement_type"))
return(long_data_cv)
}
library(ggplot2)
plot_triplicate_cv_lines <- function(long_data_cv, output_dir = "output/cv_line_plots") {
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
# Unique combos of body_site & measurement_type
combos <- long_data_cv %>%
select(body_site, measurement_type) %>%
distinct()
all_plots <- list()
for (i in seq_len(nrow(combos))) {
bs <- combos$body_site[i]
mt <- combos$measurement_type[i]
plot_data <- long_data_cv %>%
filter(body_site == bs, measurement_type == mt)
p <- ggplot(plot_data, aes(
x = replicate,
y = value,
group = interaction(ParticipantCentreID, Season),
color = cv_category
)) +
geom_line(alpha = 0.6, na.rm = TRUE) +
geom_point(alpha = 0.6, na.rm = TRUE) +
facet_wrap(~ Season) +
scale_color_manual(
values = c("Low" = "green", "Medium" = "orange", "High" = "red"),
na.value = "grey50",
name = "CV Category"
) +
labs(
title = paste("Triplicate Measurements for", bs, "-", mt),
subtitle = "Colored by CV: ≤5% (Green), 5–15% (Orange), >15% (Red)",
x = "Replicate",
y = "Value"
) +
theme_minimal()
filename <- file.path(output_dir, paste0("cv_", bs, "_", mt, ".png"))
ggsave(filename, p, width = 12, height = 8)
all_plots[[paste(bs, mt, sep="_")]] <- p
}
return(all_plots)
}
## ---- pivot-long-original ----
library(tidyverse)
library(readr)
# 1) Read the wide dataset
skin_data <- read_csv("data/ScreeningDataCollection_cleaned.csv")
Rows: 220 Columns: 78
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ParticipantCentreID, Ethnicity, Season
dbl (74): Gender, SkinReflectanceForeheadE1, SkinReflectanceForeheadE2, Ski...
date (1): TodayDate
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# 2) Convert to long + calculate initial CV
all_triplicates_cv <- prepare_triplicates_for_plotting_with_cv(skin_data)
# 3) Plot lines color-coded by the initial CV
cv_plots_original <- plot_triplicate_cv_lines(
all_triplicates_cv,
output_dir = "output/cv_line_plots_original"
)
detect_distribution_outliers_by_metric <- function(long_data, iqr_factor = 1.5) {
# Summarize distribution stats by measurement_type ONLY
distribution_stats <- long_data %>%
group_by(measurement_type) %>%
summarize(
n = n(),
min_value = min(value, na.rm = TRUE),
q1 = quantile(value, 0.25, na.rm = TRUE),
median_val = median(value, na.rm = TRUE),
q3 = quantile(value, 0.75, na.rm = TRUE),
max_value = max(value, na.rm = TRUE),
iqr = q3 - q1,
lower_thresh= q1 - iqr_factor * iqr,
upper_thresh= q3 + iqr_factor * iqr,
.groups = "drop"
)
# Join thresholds back by measurement_type only
data_with_thresh <- long_data %>%
left_join(distribution_stats, by = "measurement_type")
# Mark outliers
data_out <- data_with_thresh %>%
mutate(is_outlier_allSites = value < lower_thresh | value > upper_thresh)
list(
stats = distribution_stats,
data = data_out
)
}
plot_distribution_outliers_by_metric <- function(data_with_outliers, dist_stats,
output_dir = "output/dist_plots_by_metric") {
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
# Unique measurement_types
metrics <- dist_stats$measurement_type
all_plots <- list()
for (mt in metrics) {
# Pull the row with thresholds for this metric
thresh <- dist_stats %>%
filter(measurement_type == mt)
# If there's no data for that metric, skip
if (nrow(thresh) == 0) next
# Filter the main data to just this metric
plot_data <- data_with_outliers %>%
filter(measurement_type == mt)
# Build a histogram ignoring site
p <- ggplot(plot_data, aes(x = value)) +
geom_histogram(
binwidth = (thresh$max_value - thresh$min_value) / 30,
fill = "skyblue", alpha = 0.7
) +
geom_vline(xintercept = thresh$lower_thresh, color = "red", linetype = "dashed") +
geom_vline(xintercept = thresh$upper_thresh, color = "red", linetype = "dashed") +
labs(
title = paste("Distribution of", mt, "(All Body Sites)"),
subtitle = paste0("Red dashed lines = outlier cutoffs (",
round(thresh$lower_thresh, 2), ", ",
round(thresh$upper_thresh, 2), ")"),
x = paste(mt, "Value"),
y = "Count"
) +
theme_minimal()
# Save the plot
filename <- file.path(output_dir, paste0("dist_", mt, "_allSites.png"))
ggsave(filename, p, width = 10, height = 6)
all_plots[[mt]] <- p
}
return(all_plots)
}
## ---- distribution-detection ----
# 1) Detect outliers ignoring body site
res <- detect_distribution_outliers_by_metric(all_triplicates_cv)
dist_stats <- res$stats
dist_data <- res$data # 'dist_data' has is_outlier_allSites = TRUE/FALSE
# 2) Plot distribution ignoring site
dist_plots_allSites <- plot_distribution_outliers_by_metric(
data_with_outliers = dist_data,
dist_stats = dist_stats,
output_dir = "output/dist_plots_by_metric"
)
Warning: Removed 76 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 18 rows containing non-finite outside the scale range (`stat_bin()`).
Removed 18 rows containing non-finite outside the scale range (`stat_bin()`).
Warning: Removed 19 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 18 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 76 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 18 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 76 rows containing non-finite outside the scale range
(`stat_bin()`).
## ---- distribution-removal-stage1 ----
# 1) Remove distribution outliers => set 'value' to NA if is_outlier_allSites == TRUE
dist_data_clean <- dist_data %>%
mutate(value = ifelse(is_outlier_allSites, NA_real_, value))
# 2) Recompute CV on this newly "cleaned" data
dist_data_clean_cv <- calculate_triplicate_cv(dist_data_clean)
# 3) Plot lines with updated CV categories
cv_plots_stage1 <- plot_triplicate_cv_lines(
dist_data_clean_cv,
output_dir = "output/cv_line_plots_stage1"
)
## ---- distribution-removal-stage1 ----
# 1) Remove distribution outliers => set 'value' to NA if is_outlierAllSites == TRUE
dist_data_clean <- dist_data %>%
mutate(value = ifelse(is_outlier_allSites, NA_real_, value))
# 2) (Optional) Save or examine replicate-level data after Stage 1
write_csv(dist_data_clean, "data/ScreeningDataCollection_stage1.csv")
names(dist_data_clean)
[1] "ParticipantCentreID" "Gender" "Ethnicity"
[4] "Season" "measurement" "value"
[7] "replicate" "body_site" "measurement_type"
[10] "mean_val" "sd_val" "n_valid"
[13] "cv" "cv_category" "n"
[16] "min_value" "q1" "median_val"
[19] "q3" "max_value" "iqr"
[22] "lower_thresh" "upper_thresh" "is_outlier_allSites"
head(dist_data_clean)
# A tibble: 6 × 24
ParticipantCentreID Gender Ethnicity Season measurement value replicate
<chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 VDKH001 1 Xhosa Summer SkinReflectanceFo… 17.0 1
2 VDKH001 1 Xhosa Summer SkinReflectanceFo… 18.7 2
3 VDKH001 1 Xhosa Summer SkinReflectanceFo… 18.6 3
4 VDKH002 1 Xhosa Summer SkinReflectanceFo… 17.2 1
5 VDKH002 1 Xhosa Summer SkinReflectanceFo… 19.8 2
6 VDKH002 1 Xhosa Summer SkinReflectanceFo… 17.4 3
# ℹ 17 more variables: body_site <chr>, measurement_type <chr>, mean_val <dbl>,
# sd_val <dbl>, n_valid <int>, cv <dbl>, cv_category <chr>, n <int>,
# min_value <dbl>, q1 <dbl>, median_val <dbl>, q3 <dbl>, max_value <dbl>,
# iqr <dbl>, lower_thresh <dbl>, upper_thresh <dbl>,
# is_outlier_allSites <lgl>
# 2) Recompute CV on this newly "cleaned" data
dist_data_clean_cv <- calculate_triplicate_cv(dist_data_clean)
names(dist_data_clean_cv)
[1] "ParticipantCentreID" "Gender" "Ethnicity"
[4] "Season" "measurement" "value"
[7] "replicate" "body_site" "measurement_type"
[10] "n" "min_value" "q1"
[13] "median_val" "q3" "max_value"
[16] "iqr" "lower_thresh" "upper_thresh"
[19] "is_outlier_allSites" "mean_val" "sd_val"
[22] "n_valid" "cv" "cv_category"
head(dist_data_clean_cv)
# A tibble: 6 × 24
ParticipantCentreID Gender Ethnicity Season measurement value replicate
<chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 VDKH001 1 Xhosa Summer SkinReflectanceFo… 17.0 1
2 VDKH001 1 Xhosa Summer SkinReflectanceFo… 18.7 2
3 VDKH001 1 Xhosa Summer SkinReflectanceFo… 18.6 3
4 VDKH002 1 Xhosa Summer SkinReflectanceFo… 17.2 1
5 VDKH002 1 Xhosa Summer SkinReflectanceFo… 19.8 2
6 VDKH002 1 Xhosa Summer SkinReflectanceFo… 17.4 3
# ℹ 17 more variables: body_site <chr>, measurement_type <chr>, n <int>,
# min_value <dbl>, q1 <dbl>, median_val <dbl>, q3 <dbl>, max_value <dbl>,
# iqr <dbl>, lower_thresh <dbl>, upper_thresh <dbl>,
# is_outlier_allSites <lgl>, mean_val <dbl>, sd_val <dbl>, n_valid <int>,
# cv <dbl>, cv_category <chr>
# 2) Remove high CV replicate
remove_high_cv_replicate <- function(data, cv_threshold = 15) {
library(dplyr)
data_new <- data %>%
group_by(ParticipantCentreID, Season, body_site, measurement_type) %>%
group_modify(~ {
.x <- .x
# "Classic" CV with mean
mean_val <- mean(.x$value, na.rm=TRUE)
sd_val <- sd(.x$value, na.rm=TRUE)
n_valid <- sum(!is.na(.x$value))
cv_val <- if (n_valid >= 2 && !is.na(mean_val) && mean_val != 0) {
(sd_val / mean_val) * 100
} else {
NA_real_
}
# Use median to pick the farthest replicate if CV>threshold
median_val <- median(.x$value, na.rm=TRUE)
if (!is.na(cv_val) && cv_val > cv_threshold) {
idx_farthest <- which.max(abs(.x$value - median_val))
.x$value[idx_farthest] <- NA
}
.x
}) %>%
ungroup()
return(data_new)
}
## ---- stage2-highcv-removal ----
# 5) Remove high CV replicate from Stage 1–cleaned data
data_stage2 <- remove_high_cv_replicate(dist_data_clean, cv_threshold = 15)
# 6) Recompute CV again with final replicate set
data_stage2_cv <- calculate_triplicate_cv(data_stage2)
# 7) Plot final lines color-coded by updated CV for Stage 2
cv_plots_stage2 <- plot_triplicate_cv_lines(
data_stage2_cv,
output_dir = "output/cv_line_plots_stage2"
)
# 8) (Optional) Save replicate-level final dataset (Stage 2 cleaned)
write_csv(data_stage2, "data/ScreeningDataCollection_stage2.csv")
## ---- stage2-final-medians ----
# 9) Create final medians from data_stage2
df_final_medians <- data_stage2 %>%
group_by(ParticipantCentreID, Ethnicity, Gender, Season, body_site, measurement_type) %>%
summarize(
final_median = median(value, na.rm = TRUE),
n_valid = sum(!is.na(value)),
.groups = "drop"
)
# 10) Write final medians
write_csv(df_final_medians, "data/ScreeningDataCollection_medians.csv")
sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Detroit
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] purrr_1.0.4 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[9] ggplot2_3.5.1 tidyverse_2.0.0 readxl_1.4.3 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.50 bslib_0.9.0 processx_3.8.5
[5] callr_3.7.6 tzdb_0.4.0 vctrs_0.6.5 tools_4.4.3
[9] ps_1.8.1 generics_0.1.3 parallel_4.4.3 pkgconfig_2.0.3
[13] lifecycle_1.0.4 compiler_4.4.3 farver_2.1.2 git2r_0.35.0
[17] textshaping_1.0.0 munsell_0.5.1 getPass_0.2-4 httpuv_1.6.15
[21] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10 later_1.4.1
[25] pillar_1.10.1 crayon_1.5.3 jquerylib_0.1.4 whisker_0.4.1
[29] cachem_1.1.0 tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4
[33] labeling_0.4.3 rprojroot_2.0.4 fastmap_1.2.0 grid_4.4.3
[37] colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3 utf8_1.2.4
[41] withr_3.0.2 scales_1.3.0 promises_1.3.2 bit64_4.6.0-1
[45] timechange_0.3.0 rmarkdown_2.29 httr_1.4.7 bit_4.5.0.1
[49] cellranger_1.1.0 ragg_1.3.3 hms_1.1.3 evaluate_1.0.3
[53] knitr_1.49 rlang_1.1.5 Rcpp_1.0.14 glue_1.8.0
[57] rstudioapi_0.17.1 vroom_1.6.5 jsonlite_1.8.9 R6_2.5.1
[61] systemfonts_1.2.1 fs_1.6.5