SAPPHIRE Project Overview

Summary

SAPPHIRE – South African Pigmentation and Physiological Health in Relation to Solar Irradiation and Reflectance of the Epidermis

Study Background

Southern Africa has been a locus of interaction for modern people for more than 2000 years. Cultural contacts and genetic admixture between KhoeSan and Bantu-language-speaking peoples, followed by interactions with populations from western Europe, South and East Asia, and Madagascar, have made the region one of the most diverse and complexly admixed in the world.

Skin color is remarkably diverse in the region, owing to the effects of selection and migration. Much remains to be understood about the genetic basis of constitutive pigmentation and tanning potential among southern Africans, and it is toward this end that we undertook this study.

The people we studied had been the focus of an earlier investigation into the factors determining the levels of serum vitamin D in healthy young adults in South Africa. The groups we studied, designated as Cape Mixed and Xhosa, exhibited great variation in constitutive skin color and tanning potential (as measured by melanin index [MI]), and in this study we sought to achieve a better understanding of the genetic basis of this variation.

Study Objectives

Primary Objective: Achieve a better understanding of the genetic basis for variation in constitutive skin pigmentation and tanning potential.
Manuscript Goals:
- Describe the novel Xhosa and Cape Mixed sampling across seasons, including population descriptions and pigmentation phenotypes
- Discuss admixture in both the Cape Mixed and Xhosa study populations
- Identify genes/SNPs contributing to pigmentation phenotypes (constitutive and facultative in summer) in Xhosa and Cape Mixed populations separately
- Discuss the potential contribution of ancestry and the genome to variation in measured vitamin D levels

Key Terminology

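Constitutive: In gene expression, genes that are continuously transcribed, or always “on”; in pigmentation, the baseline skin color measured at sites with little sun exposure
Facultative: In gene expression, genes that are expressed only under certain conditions or in certain cell types; in pigmentation, the reversible darkening (tanning) induced by sun exposure
Polymorphic markers: DNA markers whose sequence differs between individuals, used to reveal genetic differences between them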
SNP: Single nucleotide polymorphism, genomic variant at a single base position in the DNA
PLINK: Genome association toolset
Epistatic interactions: Interactions among genetic variants at different loci which cause nonlinear effects on the phenotype
LD (Linkage Disequilibrium): Nonrandom association of alleles at different loci, typically loci that are close to each other on a chromosome

Data Overview

This project examines seasonal variation in skin pigmentation across 103 participants measured at three distinct seasonal timepoints. The study design captures within-subject changes in pigmentation from summer baseline through winter conditions and subsequent recovery. Data collection occurred between February 2013 (summer baseline in the Southern Hemisphere) and September 2013, spanning an initial summer measurement, a winter follow-up approximately six months later, and a 6-week post-winter assessment.

Data Structure

Pigmentation (Clinical) Measurements

Summer baseline: 103 participants, 123 attributes per participant
Winter follow-up (~6 months): 88 participants with repeated measurements
Post-winter assessment (6 weeks): 33 participants in final follow-up wave

Behavioral Data

Food Frequency Questionnaire: Approximately 200 variables quantifying dietary vitamin D intake and related nutritional factors
Sun Exposure Surveys: Approximately 30 variables documenting outdoor exposure patterns, time spent in direct sunlight, and protective behaviors

Progress

Database Setup

An SQL database architecture was established for the original dataset, enabling structured querying and efficient data management across the multiple study waves and data types (Database/load_data_to_mysql.py).

Data Integration

Merged datasets were generated by integrating measurements across the three study waves: summer, winter, and 6-week follow-up. This harmonization process aligned participant identifiers and consolidated visit-level measurements to enable longitudinal analyses (preprocessing/step1_data_merging.ipynb; derived output: derived_files/merged_data.xlsx).
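The merging step can be pictured with a minimal sketch. It assumes each wave is a list of per-participant records sharing a hypothetical `participant_id` field; the notebook itself may use different column names and a dataframe library, so the function and field names below are illustrative only.

```python
def merge_waves(waves, id_key="participant_id"):
    """Outer-join per-wave records into one wide row per participant.

    `waves` maps a wave label (e.g. "summer") to a list of dict records.
    Measurement columns are suffixed with the wave label so that repeated
    measurements stay distinguishable. All names here are illustrative.
    """
    merged = {}
    for wave, records in waves.items():
        for record in records:
            pid = str(record[id_key]).strip()  # harmonize identifier format
            row = merged.setdefault(pid, {id_key: pid})
            for column, value in record.items():
                if column != id_key:
                    row[f"{column}_{wave}"] = value
    return [merged[pid] for pid in sorted(merged)]


# Made-up records (not study data); note the stray space in the winter ID.
waves = {
    "summer": [{"participant_id": "P01", "M": 45.2}, {"participant_id": "P02", "M": 61.0}],
    "winter": [{"participant_id": " P01", "M": 43.9}],
}
rows = merge_waves(waves)
```

Participants missing from a later wave simply lack that wave's columns, which mirrors the drop in sample size across the three waves.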

Data Cleaning

Data cleaning operations were performed to consolidate redundant columns and remove invariant features. These procedures eliminated constant-valued variables and merged duplicate measurements, improving the signal-to-noise ratio for subsequent statistical modeling (preprocessing/step2_data_cleaning.ipynb; derived output: derived_files/cleaned_data.xlsx).
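As a rough illustration of the two cleaning operations, the sketch below merges redundant columns and then drops constant-valued ones. The column names and the "first non-missing value wins" rule are assumptions made for the example, not a description of the notebook.

```python
def coalesce_columns(rows, target, sources):
    """Merge redundant columns into `target`, taking the first non-missing value."""
    cleaned = []
    for row in rows:
        value = next((row[s] for s in sources if row.get(s) is not None), None)
        new_row = {c: v for c, v in row.items() if c not in sources}
        new_row[target] = value
        cleaned.append(new_row)
    return cleaned


def drop_constant_columns(rows):
    """Remove columns whose value is identical in every row."""
    columns = {c for row in rows for c in row}
    constant = {
        c for c in columns
        if len({repr(row.get(c)) for row in rows}) <= 1
    }
    return [{c: v for c, v in row.items() if c not in constant} for row in rows]


# Made-up rows (not study data): "weight" and "weight_kg" hold the same
# measurement, and "site" never varies.
rows = [
    {"id": "P01", "site": "S1", "weight": 70.0, "weight_kg": None},
    {"id": "P02", "site": "S1", "weight": None, "weight_kg": 82.5},
]
rows = coalesce_columns(rows, "weight_kg", ["weight", "weight_kg"])
rows = drop_constant_columns(rows)
```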

Replicate Readings Analysis

To ensure measurement reliability, three replicate readings were collected for each measurement type (E, M, RGB, CIE Lab) at each body location (Forehead, Right Upper Inner Arm, Left Upper Inner Arm). The triplicate analysis assesses within-subject measurement variability and validates the averaging methodology.

CV-Based Filtering and Replicate Averaging

A two-stage filtering approach was implemented to remove outlier replicate values and improve data quality:

Global CV Filter: Removes individual replicate outliers across all participants for each measurement type using coefficient of variation (CV) thresholds. This eliminates extreme values that fall outside acceptable ranges based on global statistics.
Per-Participant Filtering: Uses pair-based CV analysis to identify and remove outlier replicates within each participant’s set of three measurements. The method tests all possible pairs and removes values that create unstable pairs using a CV Reduction Factor, ensuring only consistent measurements remain.

After filtering, remaining valid replicates are averaged to produce final measurement values.
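The per-participant stage could look roughly like the sketch below. The exact rule, the 5% CV threshold, and the 0.5 reduction factor are assumptions chosen for illustration; the text above does not specify them.

```python
from itertools import combinations
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation (sample SD / mean), as a percentage."""
    return 100.0 * stdev(values) / abs(mean(values))


def filter_and_average(replicates, cv_threshold=5.0, reduction_factor=0.5):
    """Drop one outlying replicate if that makes the set much more consistent.

    Hypothetical rule: when the CV of all three readings exceeds
    `cv_threshold`, find the most consistent pair and keep only that pair
    if its CV is below `reduction_factor` times the original CV.
    Returns (average, kept_values).
    """
    values = [v for v in replicates if v is not None]
    if not values:
        return None, []
    if len(values) < 3 or cv(values) <= cv_threshold:
        return mean(values), values
    best_pair = min(combinations(values, 2), key=cv)
    if cv(best_pair) < reduction_factor * cv(values):
        values = list(best_pair)
    return mean(values), values


# Made-up melanin index triplicates (not study data).
average, kept = filter_and_average([41.8, 42.1, 55.0])
stable_average, stable_kept = filter_and_average([50.0, 50.5, 49.8])
```

In the first triplicate the reading of 55.0 is discarded and the remaining pair is averaged; the second triplicate is already consistent, so all three readings are kept.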

Time Analysis and Melanin Index Comparison

Date information was extracted from the original dataset and merged with filtered data to enable temporal analyses. Missing dates were imputed using the mode (most frequent survey date) for each timepoint. A melanin index (M) analysis was conducted to compare pigmentation patterns between sun-exposed sites (forehead) and protected sites (inner arms) across seasonal timepoints. Statistical comparisons using paired t-tests revealed seasonal differences in pigmentation, with the forehead showing higher melanin values than inner arms in both summer and winter, though the difference was only statistically significant in winter (preprocessing/step5_melanin_analysis.ipynb; analysis summary: derived_files/melanin_analysis_summary.txt).
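For reference, the paired t statistic behind the forehead versus inner arm comparison can be computed as in this minimal sketch. It uses only the standard library, so it returns the statistic and degrees of freedom without a p-value; the values are invented.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(site_a, site_b):
    """Return (t, degrees of freedom) for a paired t-test on two equal-length lists.

    Pairs with a missing value (None) in either list are skipped.
    """
    diffs = [a - b for a, b in zip(site_a, site_b) if a is not None and b is not None]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Made-up melanin index values for four participants (not study data).
forehead = [48.0, 52.5, 61.0, 45.5]
inner_arm = [44.0, 50.0, 57.5, None]
t, df = paired_t_statistic(forehead, inner_arm)
```

A library routine such as `scipy.stats.ttest_rel` computes the same statistic together with a p-value; whether the notebook uses it is not stated here.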

Sun Exposure Variable Encoding

Sun exposure survey data, originally collected in natural language and Yes/No formats, was systematically encoded into quantitative variables suitable for statistical analysis. Time spent outdoors was converted from categorical responses (e.g., “Between 2 and 5 hours”) to numeric hours using midpoint estimates. Time of day responses were encoded as numeric hours (0-24 format) with an additional high UV exposure indicator flagging exposure during peak UV hours (10am-4pm). Composite exposure metrics were created including: (1) Total Weekly Hours combining weekday and weekend exposure, (2) Body Site Exposure Score summing exposed anatomical sites (0-9 range), (3) Sun Protection Score combining hat use, sunscreen use, SPF level, and site-specific protection, and (4) Net Exposure Index quantifying unprotected exposure by accounting for both exposure duration/area and protection level. Original natural language and Yes/No columns were removed after encoding to streamline the dataset (preprocessing/step6_sun_exposure_analysis.ipynb; derived output: derived_files/sun_exposure_encoded.xlsx; encoding documentation: derived_files/sun_exposure_encoding_documentation.txt).
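The encoding can be sketched as follows. Only the label “Between 2 and 5 hours” comes from the survey description above; the other category labels, the assumption that answers are hours per day, and the variable names are hypothetical, and the encoding documentation file remains the authoritative mapping.

```python
# Hypothetical category labels and midpoints; the study's encoding
# documentation defines the actual mapping.
HOURS_MIDPOINT = {
    "Less than 1 hour": 0.5,
    "Between 1 and 2 hours": 1.5,
    "Between 2 and 5 hours": 3.5,
}
PEAK_UV_START, PEAK_UV_END = 10, 16  # 10am-4pm on a 24-hour clock


def encode_exposure(weekday_answer, weekend_answer, hour_of_day, exposed_sites):
    """Turn survey answers into numeric exposure variables."""
    weekday_hours = HOURS_MIDPOINT[weekday_answer]
    weekend_hours = HOURS_MIDPOINT[weekend_answer]
    return {
        # Assumes answers are hours per day: 5 weekdays plus 2 weekend days.
        "total_weekly_hours": 5 * weekday_hours + 2 * weekend_hours,
        # 1 if the reported hour falls inside the peak UV window, else 0.
        "high_uv_exposure": int(PEAK_UV_START <= hour_of_day < PEAK_UV_END),
        # Count of body sites reported as exposed (Yes/No answers as booleans).
        "body_site_exposure_score": sum(bool(v) for v in exposed_sites.values()),
    }


encoded = encode_exposure(
    "Between 2 and 5 hours",
    "Less than 1 hour",
    13,
    {"face": True, "arms": True, "legs": False},
)
```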

Sun Exposure and Melanin Change Analysis

Relationships between sun exposure variables and melanin changes between summer and winter timepoints were analyzed to understand how behavioral factors influence pigmentation. Simple correlation analyses (Pearson) were conducted between exposure metrics (total weekly hours, body site exposure, sun protection scores) and melanin change variables (forehead and inner arm). Stratified analyses examined the effect of sunscreen use on melanin change. Regression models were built to predict melanin change: Model 1 predicted forehead melanin change using time outdoors, forehead exposure status, sunscreen use, SPF level, and baseline melanin; Model 2 predicted inner arm melanin change using time outdoors, arm exposure status, sunscreen use, and baseline melanin. These models tested whether increased sun exposure predicts greater melanin increase, whether sunscreen protects against melanin increase, and whether baseline melanin level affects the magnitude of change (preprocessing/step7_exposure_vs_melanin.ipynb).
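A minimal sketch of the correlation step, using invented values. The sign convention for melanin change (winter minus summer) is an assumption, and the toy data are perfectly linear only to keep the result easy to check.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def melanin_change(summer, winter):
    """Winter minus summer melanin index (sign convention is an assumption)."""
    return [w - s for s, w in zip(summer, winter)]


# Made-up values for five participants (not study data).
weekly_hours = [5.0, 10.0, 15.0, 20.0, 25.0]
summer_m = [50.0, 52.0, 55.0, 58.0, 60.0]
winter_m = [49.0, 50.0, 52.0, 54.0, 55.0]
change = melanin_change(summer_m, winter_m)
r = pearson_r(weekly_hours, change)
```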

Machine Learning Models for Melanin Change Prediction

Machine learning approaches were applied to predict melanin change from sun exposure variables, providing non-linear modeling capabilities and feature importance rankings. Three model types were implemented: (1) Decision Tree Regressor for interpretable rule-based predictions, (2) Random Forest Regressor as an ensemble method combining multiple trees for improved accuracy, and (3) Gradient Boosting Regressor, which fits trees sequentially so that each tree corrects the errors of the previous ones. Models were evaluated using cross-validation, R² scores, RMSE, and mean absolute error metrics. Feature importance analysis identified which exposure variables (e.g., total weekly hours, body site exposure, protection scores) were most predictive of melanin change. Model performance was compared across approaches, and separate models were built for forehead and inner arm melanin change. Visualizations included feature importance plots and predicted-versus-actual plots to assess model fit (preprocessing/step8_exposure_ML_modelling.ipynb).
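The evaluation metrics named above can be written out in a few lines. The sketch below shows R², RMSE, and MAE plus a simple contiguous k-fold split; it illustrates the definitions and is not the notebook's code, which may rely on a library such as scikit-learn (an assumption).

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return R², RMSE, and MAE for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Made-up values (not study data).
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
folds = list(k_fold_indices(5, 2))
```

With a sample of roughly 100 participants, cross-validated metrics are a more honest guide to model fit than scores computed on the training data.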

Vitamin D Analysis

Vitamin D levels were analyzed across seasons, together with their relationships to melanin index, sun exposure, and dietary intake. A significant seasonal drop in vitamin D was observed from summer to winter, with a mean decrease of 9.79 ng/mL (p<0.0001), representing a clinically meaningful decline. The proportion of participants below the deficiency threshold (<20 ng/mL) increased from 11.9% in summer to 67.9% in winter.
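The two summary quantities reported above, the share of participants below 20 ng/mL and the mean paired change, can be computed as in this sketch. The values are invented and the function names are illustrative.

```python
DEFICIENCY_THRESHOLD = 20.0  # ng/mL, as stated in the text


def percent_below(levels, threshold=DEFICIENCY_THRESHOLD):
    """Percentage of non-missing values strictly below the threshold."""
    valid = [v for v in levels if v is not None]
    return 100.0 * sum(v < threshold for v in valid) / len(valid)


def mean_paired_change(summer, winter):
    """Mean of winter minus summer over participants measured in both seasons."""
    diffs = [w - s for s, w in zip(summer, winter) if s is not None and w is not None]
    return sum(diffs) / len(diffs)


# Made-up serum vitamin D values in ng/mL (not study data).
summer = [32.0, 25.0, 18.0, 40.0]
winter = [21.0, 15.0, 12.0, None]
summer_deficient = percent_below(summer)
winter_deficient = percent_below(winter)
change = mean_paired_change(summer, winter)
```

Restricting the change to participants measured in both seasons keeps the comparison within-subject, which matters because fewer participants returned for the winter wave.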

Baseline melanin index showed no significant correlation with summer vitamin D levels, indicating that constitutive pigmentation does not predict baseline vitamin D status. Summer sun exposure variables showed significant positive correlations with summer vitamin D levels, with Net Exposure Index (ρ=0.312, p=0.004) and Total Weekly Hours (ρ=0.250, p=0.022) demonstrating that greater sun exposure was associated with higher vitamin D levels. Dietary intake showed no significant relationship with vitamin D levels, suggesting that UVB exposure is the primary determinant of vitamin D status.

An integrated model predicting vitamin D change achieved moderate fit (R²=0.52) and identified baseline vitamin D as the strongest predictor (β=-0.572, p<0.0001), consistent with regression to the mean. Reading the change as winter minus summer, the negative coefficient means that higher baseline vitamin D levels were associated with a larger seasonal decline, likely because individuals with high summer levels have more room to drop (preprocessing/step9_vitamin_analysis.ipynb).

Time Point Analysis: Days Between Timepoints and Melanin Index Correlation

Temporal analysis examined whether the interval between summer and winter measurements influenced the magnitude of melanin index changes. Date information was extracted from the original dataset and used to calculate the number of days between timepoints for each participant. Analysis included 77 participants with complete date and melanin index data for both summer and winter timepoints. Correlation analyses revealed no significant relationship between the number of days between timepoints and forehead melanin change (r=-0.075, p=0.519), suggesting that the timing of winter measurements relative to summer baseline did not systematically affect changes in sun-exposed sites. However, a significant but weak negative correlation was observed between the number of days between timepoints and inner arm melanin change (r=-0.236, p=0.039), indicating that longer intervals between measurements were associated with lower values of melanin change (smaller increases or larger decreases) in protected sites. This finding may reflect natural variation in measurement timing or suggest that longer intervals allow for more complete seasonal transitions in protected anatomical sites. Visualizations included scatter plots with regression lines, correlation matrices, and distribution histograms to characterize the temporal patterns (preprocessing/step10_Time_Point_analysis.ipynb).
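The interval calculation and the complete-case restriction can be sketched as below. The dates and melanin changes are invented, and the variable names are illustrative.

```python
from datetime import date


def days_between(summer_date, winter_date):
    """Days from the summer visit to the winter visit, or None if either is missing."""
    if summer_date is None or winter_date is None:
        return None
    return (winter_date - summer_date).days


# Made-up visit dates and melanin changes (not study data).
summer_dates = [date(2013, 2, 11), date(2013, 2, 18), None]
winter_dates = [date(2013, 8, 12), date(2013, 8, 12), date(2013, 8, 19)]
arm_change = [-1.2, -0.4, -2.0]

intervals = [days_between(s, w) for s, w in zip(summer_dates, winter_dates)]

# Keep only participants with both an interval and a melanin change value,
# mirroring the complete-case analysis described above.
complete = [(d, c) for d, c in zip(intervals, arm_change) if d is not None and c is not None]
```

The resulting pairs are what a correlation routine would consume, so participants with a missing date drop out before any statistic is computed.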