This section outlines the preprocessing steps applied to the raw NDIS data before analysis.
First, to ensure consistency across the analysis, we fixed jurisdiction names that were not scraped correctly, as the raw values below show:
Alabama, Alabama Alabama, Alabama Stats Alabama, Alaska, Alaska Alaska, Alaska Stats Alaska, and Legal profiles at NDIS. Statistics as of April 2025 Alabama, and Legal profiles at NDIS. Statistics as of June 2025 Alabama, and Legal profiles at NDIS. Statistics as of March 2025 Alabama, Arizona, Arizona Arizona, Arizona Stats Arizona, Arkansas, Arkansas Arkansas, Arkansas Stats Arkansas, Army, California, California California, California Stats California, Colorado, Colorado Colorado, Colorado Stats Colorado, Connecticut, Connecticut Connecticut, Connecticut Stats Connecticut, DC, DC map pin.) Statistics as of August 2023 Alabama, DC map pin.) Statistics as of February 2024 Alabama, DC map pin.) Statistics as of January 2025 Alabama, DC map pin.) Statistics as of November 2024 Alabama, DC map pin.) Statistics as of October 2024 Alabama, DC map pin). Statistics as of November 2022 Alabama, DC/FBI Lab, DC/Metro PD, Delaware, Delaware Delaware, Delaware Stats Delaware, Florida, Florida Florida, Florida Stats Florida, Georgia, Georgia Georgia, Georgia Stats Georgia, Hawaii, Hawaii Hawaii, Hawaii Stats Hawaii, Idaho, Idaho Idaho, Illinois, Illinois Illinois, Illinois Stats Illinois, Indiana, Indiana Indiana, Indiana Stats Indiana, Iowa, Iowa Iowa, Iowa Stats Iowa, Kansas, Kansas Kansas, Kansas Stats Kansas, Kentucky, Kentucky Kentucky, Kentucky Stats Kentucky, Lab, Louisiana, Louisiana Louisiana, Louisiana Stats Louisiana, Maine, Maine Maine, Maine Stats Maine, Maryland, Maryland Maryland, Maryland Stats Maryland, Massachusetts, Massachusetts Massachusetts, Massachusetts Stats Massachusetts, Mexico Stats New Mexico, Michigan, Michigan Michigan, Michigan Stats Idaho, Michigan Stats Michigan, Michigan Stats Utah, Minnesota, Minnesota Minnesota, Minnesota Stats Minnesota, Mississippi, Mississippi Mississippi, Mississippi Stats Mississippi, Missouri, Missouri Missouri, Missouri Stats Missouri, Montana, Montana Montana, Montana Stats Montana, Nebraska, Nebraska Nebraska, Nebraska Stats Nebraska, Nevada, Nevada Nevada, Nevada Stats Nevada, New Hampshire, New Hampshire New Hampshire, New Hampshire Stats New Hampshire, New Jersey, New Jersey New Jersey, New Jersey Stats New Jersey, New Mexico, New Mexico New Mexico, New York, New York New York, New York Stats New York, North Carolina, North Carolina North Carolina, North Carolina Stats North Carolina, North Dakota, North Dakota North Dakota, North Dakota Stats North Dakota, Ohio, Ohio Ohio, Ohio Stats Ohio, Oklahoma, Oklahoma Oklahoma, Oklahoma Stats Oklahoma, Oregon, Oregon Oregon, Oregon Stats Oregon, Participant Alabama, Pennsylvania, Pennsylvania Pennsylvania, Pennsylvania Stats Pennsylvania, PR, Puerto Rico, Rhode Island, Rhode Island Rhode Island, Rhode Island Stats Rhode Island, South Carolina, South Carolina South Carolina, South Carolina Stats South Carolina, South Dakota, South Dakota South Dakota, South Dakota Stats South Dakota, Tables by NDIS Participant Alabama, Tennessee, Tennessee Stats Tennessee, Tennessee Tennessee, Texas, Texas Stats Texas, Texas Texas, U.S. Army, Utah, Utah Utah, Vermont, Vermont Stats Vermont, Vermont Vermont, Virginia, Virginia Stats Virginia, Virginia Virginia, Washington, Washington State Stats Washington, Washington State Washington, West Virginia, West Virginia Stats West Virginia, West Virginia Stats Wyoming, West Virginia West Virginia, Wisconsin, Wisconsin Stats Wisconsin, Wisconsin Wisconsin, Wyoming, Wyoming Wyoming
Show cleaning code (jurisdiction)
# Clean jurisdiction names with Alabama-specific patterns
ndis_data_jurisdiction <- ndis_data %>%
  mutate(
    jurisdiction = case_when(
      # Standard state names
      str_detect(jurisdiction, "Alabama$|Alabama Stats") ~ "Alabama",
      str_detect(jurisdiction, "Alaska$|Alaska Stats") ~ "Alaska",
      str_detect(jurisdiction, "Arizona$|Arizona Stats") ~ "Arizona",
      str_detect(jurisdiction, "Arkansas$|Arkansas Stats") ~ "Arkansas",
      str_detect(jurisdiction, "California$|California Stats") ~ "California",
      str_detect(jurisdiction, "Colorado$|Colorado Stats") ~ "Colorado",
      str_detect(jurisdiction, "Connecticut$|Connecticut Stats") ~ "Connecticut",
      str_detect(jurisdiction, "Delaware$|Delaware Stats") ~ "Delaware",
      str_detect(jurisdiction, "Florida$|Florida Stats") ~ "Florida",
      str_detect(jurisdiction, "Georgia$|Georgia Stats") ~ "Georgia",
      str_detect(jurisdiction, "Hawaii$|Hawaii Stats") ~ "Hawaii",
      str_detect(jurisdiction, "Idaho$|Idaho Stats") ~ "Idaho",
      str_detect(jurisdiction, "Illinois$|Illinois Stats") ~ "Illinois",
      str_detect(jurisdiction, "Indiana$|Indiana Stats") ~ "Indiana",
      str_detect(jurisdiction, "Iowa$|Iowa Stats") ~ "Iowa",
      str_detect(jurisdiction, "Kansas$|Kansas Stats") ~ "Kansas",
      str_detect(jurisdiction, "Kentucky$|Kentucky Stats") ~ "Kentucky",
      str_detect(jurisdiction, "Louisiana$|Louisiana Stats") ~ "Louisiana",
      str_detect(jurisdiction, "Maine$|Maine Stats") ~ "Maine",
      str_detect(jurisdiction, "Maryland$|Maryland Stats") ~ "Maryland",
      str_detect(jurisdiction, "Massachusetts$|Massachusetts Stats") ~ "Massachusetts",
      str_detect(jurisdiction, "Michigan$|Michigan Stats") ~ "Michigan",
      str_detect(jurisdiction, "Minnesota$|Minnesota Stats") ~ "Minnesota",
      str_detect(jurisdiction, "Mississippi$|Mississippi Stats") ~ "Mississippi",
      str_detect(jurisdiction, "Missouri$|Missouri Stats") ~ "Missouri",
      str_detect(jurisdiction, "Montana$|Montana Stats") ~ "Montana",
      str_detect(jurisdiction, "Nebraska$|Nebraska Stats") ~ "Nebraska",
      str_detect(jurisdiction, "Nevada$|Nevada Stats") ~ "Nevada",
      str_detect(jurisdiction, "New Hampshire$|New Hampshire Stats") ~ "New Hampshire",
      str_detect(jurisdiction, "New Jersey$|New Jersey Stats") ~ "New Jersey",
      str_detect(jurisdiction, "New Mexico$|New Mexico Stats|Mexico Stats") ~ "New Mexico",
      str_detect(jurisdiction, "New York$|New York Stats") ~ "New York",
      str_detect(jurisdiction, "North Carolina$|North Carolina Stats") ~ "North Carolina",
      str_detect(jurisdiction, "North Dakota$|North Dakota Stats") ~ "North Dakota",
      str_detect(jurisdiction, "Ohio$|Ohio Stats") ~ "Ohio",
      str_detect(jurisdiction, "Oklahoma$|Oklahoma Stats") ~ "Oklahoma",
      str_detect(jurisdiction, "Oregon$|Oregon Stats") ~ "Oregon",
      str_detect(jurisdiction, "Pennsylvania$|Pennsylvania Stats") ~ "Pennsylvania",
      str_detect(jurisdiction, "Rhode Island$|Rhode Island Stats") ~ "Rhode Island",
      str_detect(jurisdiction, "South Carolina$|South Carolina Stats") ~ "South Carolina",
      str_detect(jurisdiction, "South Dakota$|South Dakota Stats") ~ "South Dakota",
      str_detect(jurisdiction, "Tennessee$|Tennessee Stats") ~ "Tennessee",
      str_detect(jurisdiction, "Texas$|Texas Stats") ~ "Texas",
      str_detect(jurisdiction, "Utah$|Utah Stats") ~ "Utah",
      str_detect(jurisdiction, "Vermont$|Vermont Stats") ~ "Vermont",
      str_detect(jurisdiction, "West Virginia$|West Virginia Stats") ~ "West Virginia",
      str_detect(jurisdiction, "Virginia$|Virginia Stats") ~ "Virginia",
      str_detect(jurisdiction, "Washington$|Washington State Stats") ~ "Washington",
      str_detect(jurisdiction, "Wisconsin$|Wisconsin Stats") ~ "Wisconsin",
      str_detect(jurisdiction, "Wyoming$|Wyoming Stats") ~ "Wyoming",
      # Special jurisdictions
      str_detect(jurisdiction, "DC/FBI|Washington DC Stats|Lab") ~ "DC/FBI Lab",
      str_detect(jurisdiction, "DC/Metro|DC") ~ "DC/Metro PD",
      str_detect(jurisdiction, "U.S. Army$|U.S. Army Stats") ~ "U.S. Army",
      str_detect(jurisdiction, "Puerto Rico$|Puerto Rico Stats") ~ "Puerto Rico",
      str_detect(jurisdiction, "Tables by NDIS Participant") ~ "Alabama",  # Default to Alabama
      TRUE ~ jurisdiction
    ),
    # Clean up any remaining whitespace
    jurisdiction = str_trim(jurisdiction)
  ) %>%
  # Convert to factor with the 54 levels you want
  mutate(jurisdiction = factor(
    jurisdiction,
    levels = c(sort(state.name), "Puerto Rico", "DC/FBI Lab", "DC/Metro PD", "U.S. Army")
  )) %>%
  # Filter out NA jurisdictions
  filter(!is.na(jurisdiction))
Updated jurisdiction names:
Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming, Puerto Rico, DC/FBI Lab, DC/Metro PD, U.S. Army
Variables were reformatted into consistent date and time structures.
Key profile counts were combined into a total_profiles measure, and missing reporting periods were filled using available capture information.
Finally, year and month variables were standardized, and the dataset was reordered to ensure a clean, consistent structure for validation and analysis.
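As a minimal sketch of those two steps (column names follow the data dictionary at the end of this section; the exact imputation logic in the full pipeline may differ):

library(dplyr)
library(lubridate)

# Sketch: derive total_profiles and fill missing reporting periods from
# the capture timestamp (assumed fallback, not the pipeline's exact rule)
ndis_intermediate <- ndis_data_jurisdiction %>%
  mutate(
    total_profiles = coalesce(offender_profiles, 0) +
      coalesce(arrestee, 0) +
      coalesce(forensic_profiles, 0),
    asof_year  = coalesce(asof_year, year(capture_datetime)),
    asof_month = coalesce(asof_month, month(capture_datetime)),
    asof_date  = make_date(asof_year, asof_month, 1)
  ) %>%
  arrange(jurisdiction, capture_datetime)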
The cleaned dataset preserves the core NDIS metrics while standardizing temporal and jurisdictional dimensions for consistent analysis. Key structural improvements include:
· Temporal Standardization: Unified date handling with capture_datetime for data extraction timing and asof_month/asof_year for reported periods
· Jurisdictional Harmonization: Normalized 54 jurisdiction names (50 states + Puerto Rico, DC/FBI Lab, DC/Metro PD, U.S. Army) using consistent naming conventions
· Derived Metrics: Added total_profiles as the sum of offender, arrestee, and forensic profiles for comprehensive trend analysis
· Data Integrity: Removed ambiguous records and ensured proper typing for analytical operations
# Save cleaned data to CSV
write_csv(ndis_intermediate,
          here::here("data", "ndis", "intermediate", "ndis_intermediate.csv"))
message("✅ Intermediate dataset saved to 'data/ndis/intermediate' folder")
The National DNA Index System (NDIS) data for each jurisdiction is expected to show consistent growth over time. However, reporting issues create anomalies that require systematic detection and correction. This section documents the validation framework, with specific rules tailored to each metric following visual verification and analysis of the raw data.
Detection Rules
1. Spike-Dip Detection
A point is flagged as spike_dip if it deviates significantly from adjacent observations:
\(N_{j,t}^{(x)}\) is flagged if any of the following holds:
\(N_{j,t}^{(x)} > 2 \times N_{j,t-1}^{(x)}\) (more than double the previous value)
\(N_{j,t}^{(x)} < 0.5 \times N_{j,t-1}^{(x)}\) (less than half the previous value)
\(N_{j,t}^{(x)} > 2 \times N_{j,t+1}^{(x)}\) (more than double the next value)
\(N_{j,t}^{(x)} < 0.5 \times N_{j,t+1}^{(x)}\) (less than half the next value)
A continuation of spike-dip is flagged as cont_spike_dip when the previous point was flagged as spike_dip AND the current point shows recovery:
\(N_{j,t-1}^{(x)} \text{ is flagged as spike\_dip, AND } \left( 0.5 \times N_{j,t-1}^{(x)} < N_{j,t}^{(x)} < 2 \times N_{j,t-1}^{(x)} \text{ OR } N_{j,t}^{(x)} = N_{j,t-1}^{(x)} \right)\) (the current value is within a factor of two of, or equal to, the flagged value)
Description: These flags capture temporary data surges or unexplained dips. A spike followed by recovery to near-normal levels, or isolated low values surrounded by higher values, indicate reporting anomalies rather than real changes in profiles.
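For intuition, a minimal sketch of the spike-dip rule applied to a toy series (illustrative values, not NDIS data):

library(dplyr)
library(tibble)

# Toy series with a single dip
toy <- tibble(value = c(100, 105, 40, 110, 115))

toy %>%
  mutate(
    prev = lag(value),
    nxt  = lead(value),
    flag_spike_dip =
      (!is.na(prev) & (value > 2 * prev | value < 0.5 * prev)) |
      (!is.na(nxt)  & (value > 2 * nxt  | value < 0.5 * nxt))
  )
# The dip (40) is flagged by both neighbour comparisons; the ratio test
# also flags its immediate neighbours (105 > 2*40 and 110 > 2*40), which
# is why the metric-specific rules below narrow these conditions
# (e.g., the Offender rule uses only the two dip conditions).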
2. Zero Error Detection
A point is flagged as zero_error if a zero appears after positive values:
\(N_{j,t}^{(x)} = 0 \text{ AND } N_{j,t-1}^{(x)} > 0\)
Description: Legitimate DNA profile data cannot drop from positive to zero. When this occurs, it represents a reporting system error. All subsequent zeros until the data recovers to positive values are propagations of the same error and should be marked together.
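A minimal sketch of this rule and its continuation flag on a toy series (illustrative values only):

library(dplyr)

zeros <- tibble::tibble(value = c(500, 520, 0, 0, 540))

zeros %>%
  mutate(
    prev = lag(value),
    # First zero after a positive value is the reporting error
    flag_zero_error      = value == 0 & !is.na(prev) & prev > 0,
    # A zero immediately following a flagged zero is its continuation
    flag_cont_zero_error = value == 0 & coalesce(lag(flag_zero_error), FALSE)
  )
# The first zero is flagged as zero_error and the second as its continuation.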
3. Update Lag Detection
A point is flagged as osc_lag when values oscillate between similar levels in a compressed timeframe:
\[\text{If } \left[ \left( N_{j,t}^{(x)} < N_{j,t-1}^{(x)} \text{ AND } N_{j,t}^{(x)} < N_{j,t+1}^{(x)} \right) \text{ OR } \left( N_{j,t}^{(x)} = N_{j,t-1}^{(x)} \text{ AND } N_{j,t}^{(x)} < N_{j,t+1}^{(x)} \right) \right] \text{ AND } \left[ \Delta t_{\text{prev}} \leq 2 \text{ days} \right] \text{ AND } \left[ \Delta t_{\text{next}} \leq 2 \text{ days} \right]\]
where \(\Delta t_{\text{prev}}\) and \(\Delta t_{\text{next}}\) are the elapsed times (in days) to the previous and next observations.
Description: When sequential reports within a 48-hour window show values that decrease or remain flat relative to neighbors, this indicates system synchronization delays where data is updating across multiple databases at different times. The same profile count is being reported inconsistently during the synchronization process.
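A minimal sketch of the formula above on a toy three-point oscillation (illustrative timestamps and values):

library(dplyr)
library(lubridate)

osc <- tibble::tibble(
  t     = ymd_hms("2024-01-01 00:00:00") + days(c(0, 1, 2)),
  value = c(980, 950, 980)
)

osc %>%
  mutate(
    prev      = lag(value),
    nxt       = lead(value),
    days_prev = as.numeric(difftime(t, lag(t), units = "days")),
    days_next = as.numeric(difftime(lead(t), t, units = "days")),
    flag_osc_lag =
      !is.na(prev) & !is.na(nxt) &
      ((value < prev & value < nxt) | (value == prev & value < nxt)) &
      coalesce(days_prev <= 2, FALSE) & coalesce(days_next <= 2, FALSE)
  )
# The middle point (950) is flagged: it dips below both neighbours
# within a 48-hour window on each side.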
Correction Rules
For spike_dip and cont_spike_dip Flags
Action: Remove the flagged point from the dataset.
Reason: Temporary data surges or isolated dips do not represent actual growth in profiles. Removing these points preserves the genuine underlying trend while eliminating reporting artifacts.
For zero_error and cont_zero_error Flags
Action: Remove all consecutive zero values starting from the first zero that follows a positive value, continuing until the data recovers to positive numbers.
Reason: Zeros appearing after positive counts are reporting failures, not real data. Removing the entire sequence of consecutive zeros eliminates the error cascade while preserving the valid trajectory before and after the error window.
For osc_lag Flags
Action: Within each oscillation cluster, retain only the highest value and remove all other points in the sequence.
Reason: The highest value represents the true data point; lower values in the cluster are transient states during system synchronization. Keeping the maximum preserves the actual profile count while removing the synchronization noise.
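In the pipeline this rule is implemented implicitly: the detection formula only fires on points at or below their neighbours, so the cluster maximum is never flagged, and dropping flagged rows retains it. A minimal self-contained sketch (toy values reuse the oscillation example above):

library(dplyr)

validated <- tibble::tibble(
  value        = c(980, 950, 980),
  flag_osc_lag = c(FALSE, TRUE, FALSE)
)

# Dropping flagged rows keeps the cluster maximum, because the
# oscillation flag never fires on the highest point of a cluster
osc_corrected <- validated %>%
  filter(!flag_osc_lag)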
For Legitimate Decreases
Action: Preserve all decreases that do not match the patterns above.
Reason: Not all decreases are errors. Some reflect genuine profile removals due to expungements, legal stays, or case dismissals. Decreases outside the detection rules represent real changes in the database and should be retained.
Metric-Specific Validation Rules
Offender Profiles
Following visual verification and analysis of the raw data, the following rules were applied to the Offender Profiles metric:
Spike-Dip Detection: Flags points where values drop below half the previous value or fall below half the next value
Continuation Spike-Dip: Detects recovery points following flagged anomalies
Zero Error Detection: Flags any zero value appearing after positive values
Continuation Zero Error: Tracks consecutive zeros following the initial error
Update Lag Detection: Identifies oscillations where the value is lower than the previous value and at or below the next value, with the next observation within 5 days
Value Propagation: Additionally flags any data point with the same value as previously flagged anomalies within the jurisdiction
All flagged points are removed to produce the cleaned dataset.
Forensic Profiles
Following visual verification and analysis of the raw data, the following rules were applied to the Forensic Profiles metric:
Spike-Dip Detection: Flags points where values drop below half the previous or next value, or exceed 2.5 times the next value
Continuation Spike-Dip: Detects recovery points following flagged anomalies
Zero Error Detection: Flags any zero value appearing after positive values
Continuation Zero Error: Tracks consecutive zeros following the initial error
Update Lag Detection: Identifies oscillations where the value is lower than the previous value and at or below the next value, with the next observation within 2 days
Value Propagation: Additionally flags any data point with the same value as previously flagged anomalies within the jurisdiction
All flagged points are removed to produce the cleaned dataset.
Arrestee Profiles
Following visual verification and analysis of the raw data (filtered to January 1, 2012 onwards), a focused rule set was applied to the Arrestee Profiles metric:
Zero Error Detection: Flags zero values appearing after positive values (applied to the California jurisdiction, where this error was observed)
No value propagation or other rules were applied based on the observed data patterns
All flagged zero error points are removed to produce the cleaned dataset.
Investigations Aided
Following visual verification and analysis of the raw data, the following rules were applied to the Investigations Aided metric:
Spike-Dip Detection: Flags points where values increase more than 10-fold relative to the previous value
Continuation Spike-Dip: Detects recovery points following flagged anomalies
Zero Error Detection: Flags any zero value appearing after positive values
Continuation Zero Error: Tracks consecutive zeros following the initial error
Value Propagation: Additionally flags any data point with the same value as previously flagged anomalies within the jurisdiction
All flagged points are removed to produce the cleaned dataset.
Participating Laboratories (NDIS Labs)
Following visual verification and analysis of the raw data, jurisdiction-specific rules were applied to the Participating Laboratories metric:
Spike-Dip Detection (Oklahoma): Flags points where the value increases more than 3-fold relative to the previous value
Spike-Dip Detection (Michigan): Flags points where the value drops to 25% or less of the previous value
Continuation Spike-Dip: Detects recovery points following flagged anomalies
No value propagation rule applied for this metric
All flagged points are removed to produce the cleaned dataset.
Note: All metrics employ a data deduplication step that retains only the first observation for each jurisdiction within the same capture datetime (rounded to seconds). Yearly summaries report the maximum value per jurisdiction per year, then aggregate across jurisdictions.
Offender Profiles Correction
Show Offender profiles visualization and correction code
# Filtering for ndis_labs > 0 and deduplication for same jurisdiction in the same capture_datetime
ndis_intermediate <- ndis_intermediate %>%
  mutate(capture_datetime = lubridate::round_date(capture_datetime, "second")) %>%
  group_by(jurisdiction, capture_datetime) %>%
  slice(1) %>%
  ungroup()

#### Raw Offender profiles plot ####
# Flag anomalies for offender profiles using formal detection rules
offender_validation <- ndis_intermediate %>%
  arrange(jurisdiction, capture_datetime) %>%
  group_by(jurisdiction) %>%
  mutate(
    prev_value = lag(offender_profiles),
    next_value = lead(offender_profiles),
    # Time between observations (in days)
    days_prev = as.numeric(difftime(capture_datetime, lag(capture_datetime), units = "days")),
    days_next = as.numeric(difftime(lead(capture_datetime), capture_datetime, units = "days")),
    # Rule 1: Spike-Dip Detection
    # Flag if: (N_t > 2*N_{t-1}) OR (N_t < 0.5*N_{t-1}) OR (N_t > 2*N_{t+1}) OR (N_t < 0.5*N_{t+1})
    flag_spike_dip = (
      (!is.na(prev_value) & offender_profiles < 0.5 * prev_value) |
        (!is.na(next_value) & offender_profiles < 0.5 * next_value)
    ),
    # Continuation of spike-dip: previous was flagged AND current shows recovery
    prev_was_spike_dip = lag(flag_spike_dip),
    flag_cont_spike_dip = (
      !is.na(prev_was_spike_dip) & prev_was_spike_dip &
        ((!is.na(prev_value) & offender_profiles > 0.5 * prev_value & offender_profiles < 2 * prev_value) |
           (!is.na(prev_value) & offender_profiles == prev_value))
    ),
    # Rule 2: Zero Error Detection
    # Flag if: (N_t == 0 AND N_{t-1} > 0)
    flag_zero_error = (
      offender_profiles == 0 & !is.na(prev_value) & prev_value > 0
    ),
    # Continuation of zero error: previous was flagged zero error AND current is zero
    prev_was_zero_error = lag(flag_zero_error),
    flag_cont_zero_error = (
      offender_profiles == 0 & !is.na(prev_was_zero_error) & prev_was_zero_error
    ),
    # Rule 3: Update Lag Detection (oscillation)
    # Flag if: [(N_t < N_{t-1} AND N_t < N_{t+1}) OR (N_t == N_{t-1} AND N_t < N_{t+1}) OR (N_t < N_{t-1} AND N_t == N_{t+1})]
    # AND [days_prev <= 2 AND days_next <= 2]
    flag_osc_lag = (
      !is.na(prev_value) & !is.na(next_value) &
        ((offender_profiles < prev_value & offender_profiles < next_value) |
           (offender_profiles < prev_value & offender_profiles == next_value)) &
        (!is.na(days_next) & days_next <= 5)
    ),
    prev_was_osc_lag = lag(flag_osc_lag),
    flag_cont_osc_lag = (
      !is.na(prev_was_osc_lag) & prev_was_osc_lag &
        !is.na(prev_value) & offender_profiles == prev_value
    ),
    # Combine all anomaly flags
    flag_any = flag_spike_dip | flag_cont_spike_dip | flag_zero_error |
      flag_cont_zero_error | flag_osc_lag,
    # Replace NA with FALSE
    across(starts_with("flag_"), ~ ifelse(is.na(.), FALSE, .)),
    # --- New rule: propagate by metric value within jurisdiction ---
    # TRUE if this offender_profiles value appears among the flagged values in this jurisdiction
    flag_same_value_propagate = ifelse(
      is.na(offender_profiles), FALSE,
      offender_profiles %in% offender_profiles[flag_any]
    ),
    # Update final flag_any to include this propagated-by-value flag
    flag_any = flag_any | flag_same_value_propagate
  ) %>%
  ungroup()

# Create initial interactive plot for offender profiles with flagged points by type
p_offender_raw <- offender_validation %>%
  plot_ly(x = ~capture_datetime, y = ~offender_profiles,
          color = ~jurisdiction, type = 'scatter',
          mode = 'lines+markers', alpha = 0.7, name = ~jurisdiction) %>%
  add_markers(data = offender_validation %>% filter(flag_any),
              x = ~capture_datetime, y = ~offender_profiles,
              marker = list(size = 12, symbol = 'x', color = "red",
                            line = list(width = 3, color = 'red'))) %>%
  layout(title = "Convicted Offender Profiles - Raw Data (Flagged Points Marked)",
         xaxis = list(title = "Date and Time", tickformat = "%Y-%m-%d %H:%M"),
         yaxis = list(title = "Offender Profiles"))
p_offender_raw
Show Offender profiles visualization and correction code
#### Offender Profiles Correction ####
# Correction for spike_dip and cont_spike_dip: Remove flagged points
offender_clean <- offender_validation %>%
  filter(!flag_any) %>%
  select(-starts_with("flag_"), -starts_with("prev_"),
         -starts_with("days_"), -next_value)

# Plot cleaned data
p_offender_clean <- offender_clean %>%
  plot_ly(x = ~capture_datetime, y = ~offender_profiles,
          color = ~jurisdiction, type = 'scatter',
          mode = 'lines+markers', alpha = 0.7) %>%
  layout(title = "Convicted Offender Profiles - Cleaned Data",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Offender Profiles"))
p_offender_clean
Show Offender profiles visualization and correction code
# Summarise highest offender profile per jurisdiction per year
offender_yearly <- offender_clean %>%
  mutate(year = year(capture_datetime)) %>%
  group_by(jurisdiction, year) %>%
  summarise(max_offender = max(offender_profiles, na.rm = TRUE), .groups = "drop") %>%
  group_by(year) %>%
  summarise(total_max_offender = sum(max_offender, na.rm = TRUE), .groups = "drop")

# Plot yearly sums
p_offender_yearly <- offender_yearly %>%
  plot_ly(x = ~year, y = ~total_max_offender,
          type = 'scatter', mode = 'lines+markers',
          line = list(color = "steelblue", width = 3),
          marker = list(size = 8, color = "darkred")) %>%
  layout(title = "Yearly Sum of Max Offender Profiles per Jurisdiction",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Total Max Offender Profiles"))
p_offender_yearly
Forensic Profiles Correction
Show Forensic profiles visualization and correction code
#### Raw Forensic profiles plot ####
# Flag anomalies for forensic profiles using formal detection rules
forensic_validation <- ndis_intermediate %>%
  arrange(jurisdiction, capture_datetime) %>%
  group_by(jurisdiction) %>%
  mutate(
    prev_value = lag(forensic_profiles),
    next_value = lead(forensic_profiles),
    # Time between observations (in days)
    days_prev = as.numeric(difftime(capture_datetime, lag(capture_datetime), units = "days")),
    days_next = as.numeric(difftime(lead(capture_datetime), capture_datetime, units = "days")),
    # Rule 1: Spike-Dip Detection
    # Flag if: (N_t > 2*N_{t-1}) OR (N_t < 0.5*N_{t-1}) OR (N_t > 2*N_{t+1}) OR (N_t < 0.5*N_{t+1})
    flag_spike_dip = (
      (!is.na(prev_value) & forensic_profiles < 0.5 * prev_value) |
        (!is.na(next_value) & forensic_profiles < 0.5 * next_value) |
        (!is.na(next_value) & forensic_profiles > 2.5 * next_value)
    ),
    # Continuation of spike-dip: previous was flagged AND current shows recovery
    prev_was_spike_dip = lag(flag_spike_dip),
    flag_cont_spike_dip = (
      !is.na(prev_was_spike_dip) & prev_was_spike_dip &
        ((!is.na(prev_value) & forensic_profiles > 0.5 * prev_value & forensic_profiles < 2 * prev_value) |
           (!is.na(prev_value) & forensic_profiles == prev_value))
    ),
    # Rule 2: Zero Error Detection
    # Flag if: (N_t == 0 AND N_{t-1} > 0)
    flag_zero_error = (
      forensic_profiles == 0 & !is.na(prev_value) & prev_value > 0
    ),
    # Continuation of zero error: previous was flagged zero error AND current is zero
    prev_was_zero_error = lag(flag_zero_error),
    flag_cont_zero_error = (
      forensic_profiles == 0 & !is.na(prev_was_zero_error) & prev_was_zero_error
    ),
    # Rule 3: Update Lag Detection (oscillation)
    # Flag if: [(N_t < N_{t-1} AND N_t < N_{t+1}) OR (N_t == N_{t-1} AND N_t < N_{t+1}) OR (N_t < N_{t-1} AND N_t == N_{t+1})]
    # AND [days_prev <= 2 AND days_next <= 2]
    flag_osc_lag = (
      !is.na(prev_value) & !is.na(next_value) &
        ((forensic_profiles < prev_value & forensic_profiles < next_value) |
           (forensic_profiles < prev_value & forensic_profiles == next_value)) &
        (!is.na(days_next) & days_next <= 2)
    ),
    prev_was_osc_lag = lag(flag_osc_lag),
    flag_cont_osc_lag = (
      !is.na(prev_was_osc_lag) & prev_was_osc_lag &
        !is.na(prev_value) & forensic_profiles == prev_value
    ),
    # Combine all anomaly flags
    flag_any = flag_spike_dip | flag_cont_spike_dip | flag_zero_error |
      flag_cont_zero_error | flag_osc_lag,
    # Replace NA with FALSE
    across(starts_with("flag_"), ~ ifelse(is.na(.), FALSE, .)),
    # --- New rule: propagate by metric value within jurisdiction ---
    # TRUE if this forensic_profiles value appears among the flagged values in this jurisdiction
    flag_same_value_propagate = ifelse(
      is.na(forensic_profiles), FALSE,
      forensic_profiles %in% forensic_profiles[flag_any]
    ),
    # Update final flag_any to include this propagated-by-value flag
    flag_any = flag_any | flag_same_value_propagate
  ) %>%
  ungroup()

# Create initial interactive plot for forensic profiles with flagged points by type
p_forensic_raw <- forensic_validation %>%
  plot_ly(x = ~capture_datetime, y = ~forensic_profiles,
          color = ~jurisdiction, type = 'scatter',
          mode = 'lines+markers', alpha = 0.7, name = ~jurisdiction) %>%
  add_markers(data = forensic_validation %>% filter(flag_any),
              x = ~capture_datetime, y = ~forensic_profiles,
              marker = list(size = 12, symbol = 'x', color = "red",
                            line = list(width = 3, color = 'red'))) %>%
  layout(title = "Forensic Profiles - Raw Data (Flagged Points Marked)",
         xaxis = list(title = "Date and Time", tickformat = "%Y-%m-%d %H:%M"),
         yaxis = list(title = "Forensic Profiles"))
p_forensic_raw
Show Forensic profiles visualization and correction code
#### Forensic Profiles Correction ####
# Correction for spike_dip and cont_spike_dip: Remove flagged points
forensic_clean <- forensic_validation %>%
  filter(!flag_any) %>%
  select(-starts_with("flag_"), -starts_with("prev_"),
         -starts_with("days_"), -next_value)

# Plot cleaned data
p_forensic_clean <- forensic_clean %>%
  plot_ly(x = ~capture_datetime, y = ~forensic_profiles,
          color = ~jurisdiction, type = 'scatter',
          mode = 'lines+markers', alpha = 0.7) %>%
  layout(title = "Forensic Profiles - Cleaned Data",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Forensic Profiles"))
p_forensic_clean
Show Forensic profiles visualization and correction code
# Summarise highest forensic profile per jurisdiction per year
forensic_yearly <- forensic_clean %>%
  mutate(year = year(capture_datetime)) %>%
  group_by(jurisdiction, year) %>%
  summarise(max_forensic = max(forensic_profiles, na.rm = TRUE), .groups = "drop") %>%
  group_by(year) %>%
  summarise(total_max_forensic = sum(max_forensic, na.rm = TRUE), .groups = "drop")

# Plot yearly sums
p_forensic_yearly <- forensic_yearly %>%
  plot_ly(x = ~year, y = ~total_max_forensic,
          type = 'scatter', mode = 'lines+markers',
          line = list(color = "steelblue", width = 3),
          marker = list(size = 8, color = "darkred")) %>%
  layout(title = "Yearly Sum of Max Forensic Profiles per Jurisdiction",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Total Max Forensic Profiles"))
p_forensic_yearly
Arrestee Profiles Correction
Show Arrestee profiles visualization and correction code
#### Raw Arrestee profiles plot ####
# Flag anomalies for arrestee profiles using formal detection rules
arrestee_validation <- ndis_intermediate %>%
  filter(capture_datetime >= as.Date("2012-01-01")) %>%
  arrange(jurisdiction, capture_datetime) %>%
  group_by(jurisdiction) %>%
  mutate(
    prev_value = lag(arrestee),
    next_value = lead(arrestee),
    # Time between observations (in days)
    days_prev = as.numeric(difftime(capture_datetime, lag(capture_datetime), units = "days")),
    days_next = as.numeric(difftime(lead(capture_datetime), capture_datetime, units = "days")),
    # Rule 2: Zero Error Detection
    # Flag if: (N_t == 0 AND N_{t-1} > 0)
    flag_zero_error = (
      arrestee == 0 & !is.na(prev_value) & jurisdiction == "California"
    ),
    # Combine all anomaly flags
    flag_any = flag_zero_error,
    # Replace NA with FALSE
    across(starts_with("flag_"), ~ ifelse(is.na(.), FALSE, .))
  ) %>%
  ungroup()

# Create initial interactive plot for arrestee profiles with flagged points
p_arrestee_raw <- arrestee_validation %>%
  plot_ly(x = ~capture_datetime, y = ~arrestee,
          color = ~jurisdiction, type = 'scatter',
          mode = 'lines+markers', alpha = 0.7, name = ~jurisdiction) %>%
  add_markers(data = arrestee_validation %>% filter(flag_any),
              x = ~capture_datetime, y = ~arrestee, color = ~jurisdiction,
              marker = list(size = 12, symbol = 'x',
                            line = list(width = 3, color = 'red')),
              name = ~paste0(jurisdiction, " - Flagged"),
              showlegend = FALSE) %>%
  layout(title = "Arrestee Profiles - Raw Data (Flagged Points Marked)",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Arrestee Profiles"))
p_arrestee_raw
Show Arrestee profiles visualization and correction code
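A reconstruction of the correction step, mirroring the offender and forensic corrections above (the original chunk is not shown in this extract; arrestee_clean is the object the yearly summary below expects):

#### Arrestee Profiles Correction ####
# Remove flagged points (reconstruction; mirrors the other metrics)
arrestee_clean <- arrestee_validation %>%
  filter(!flag_any) %>%
  select(-starts_with("flag_"), -starts_with("prev_"),
         -starts_with("days_"), -next_value)

# Plot cleaned data
p_arrestee_clean <- arrestee_clean %>%
  plot_ly(x = ~capture_datetime, y = ~arrestee,
          color = ~jurisdiction, type = 'scatter',
          mode = 'lines+markers', alpha = 0.7) %>%
  layout(title = "Arrestee Profiles - Cleaned Data",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Arrestee Profiles"))
p_arrestee_clean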
Show Arrestee profiles visualization and correction code
# Summarise highest arrestee profile per jurisdiction per year
arrestee_yearly <- arrestee_clean %>%
  mutate(year = year(capture_datetime)) %>%
  group_by(jurisdiction, year) %>%
  summarise(max_arrestee = max(arrestee, na.rm = TRUE), .groups = "drop") %>%
  group_by(year) %>%
  summarise(total_max_arrestee = sum(max_arrestee, na.rm = TRUE), .groups = "drop")

# Plot yearly sums
p_arrestee_yearly <- arrestee_yearly %>%
  plot_ly(x = ~year, y = ~total_max_arrestee,
          type = 'scatter', mode = 'lines+markers',
          line = list(color = "purple", width = 3),
          marker = list(size = 8, color = "magenta")) %>%
  layout(title = "Yearly Sum of Max Arrestee Profiles per Jurisdiction",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Total Max Arrestee Profiles"))
p_arrestee_yearly
Investigations Aided Correction
Show Investigations Aided visualization and correction code
#### Raw Investigations Aided plot ####
# Flag anomalies for investigations aided using formal detection rules
investigations_validation <- ndis_intermediate %>%
  arrange(jurisdiction, capture_datetime) %>%
  group_by(jurisdiction) %>%
  mutate(
    prev_value = lag(investigations_aided),
    next_value = lead(investigations_aided),
    # Time between observations (in days)
    days_prev = as.numeric(difftime(capture_datetime, lag(capture_datetime), units = "days")),
    days_next = as.numeric(difftime(lead(capture_datetime), capture_datetime, units = "days")),
    # Rule 1: Spike-Dip Detection
    # Flag if: (N_t > 2*N_{t-1}) OR (N_t < 0.5*N_{t-1}) OR (N_t > 2*N_{t+1}) OR (N_t < 0.5*N_{t+1})
    flag_spike_dip = (
      !is.na(prev_value) & investigations_aided > 10 * prev_value
    ),
    # Continuation of spike-dip: previous was flagged AND current shows recovery
    prev_was_spike_dip = lag(flag_spike_dip),
    flag_cont_spike_dip = (
      !is.na(prev_was_spike_dip) & prev_was_spike_dip &
        ((!is.na(prev_value) & investigations_aided > 0.5 * prev_value & investigations_aided < 2 * prev_value) |
           (!is.na(prev_value) & investigations_aided == prev_value))
    ),
    # Rule 2: Zero Error Detection
    # Flag if: (N_t == 0 AND N_{t-1} > 0)
    flag_zero_error = (
      investigations_aided == 0 & !is.na(prev_value) & prev_value > 0
    ),
    # Continuation of zero error: previous was flagged zero error AND current is zero
    prev_was_zero_error = lag(flag_zero_error),
    flag_cont_zero_error = (
      investigations_aided == 0 & !is.na(prev_was_zero_error) & prev_was_zero_error
    ),
    # Combine all anomaly flags
    flag_any = flag_spike_dip | flag_cont_spike_dip | flag_zero_error | flag_cont_zero_error,
    # Replace NA with FALSE
    across(starts_with("flag_"), ~ ifelse(is.na(.), FALSE, .)),
    # --- New rule: propagate by metric value within jurisdiction ---
    # TRUE if this investigations_aided value appears among the flagged values in this jurisdiction
    flag_same_value_propagate = ifelse(
      is.na(investigations_aided), FALSE,
      investigations_aided %in% investigations_aided[flag_any]
    ),
    # Update final flag_any to include this propagated-by-value flag
    flag_any = flag_any | flag_same_value_propagate
  ) %>%
  ungroup()

# Create initial interactive plot for investigations aided with flagged points
p_investigations_raw <- investigations_validation %>%
  plot_ly(x = ~capture_datetime, y = ~investigations_aided,
          color = ~jurisdiction, type = 'scatter',
          mode = 'lines+markers', alpha = 0.7, name = ~jurisdiction) %>%
  add_markers(data = investigations_validation %>% filter(flag_any),
              x = ~capture_datetime, y = ~investigations_aided, color = ~jurisdiction,
              marker = list(size = 12, symbol = 'x',
                            line = list(width = 3, color = 'red')),
              name = ~paste0(jurisdiction, " - Flagged"),
              showlegend = FALSE) %>%
  layout(title = "Investigations Aided - Raw Data (Flagged Points Marked)",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Investigations Aided"))
p_investigations_raw
Show Investigations Aided visualization and correction code
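A reconstruction of the correction step, mirroring the corrections applied to the other metrics (the original chunk is not shown in this extract):

#### Investigations Aided Correction ####
# Remove flagged points (reconstruction; mirrors the other metrics)
investigations_clean <- investigations_validation %>%
  filter(!flag_any) %>%
  select(-starts_with("flag_"), -starts_with("prev_"),
         -starts_with("days_"), -next_value)

# Plot cleaned data
p_investigations_clean <- investigations_clean %>%
  plot_ly(x = ~capture_datetime, y = ~investigations_aided,
          color = ~jurisdiction, type = 'scatter',
          mode = 'lines+markers', alpha = 0.7) %>%
  layout(title = "Investigations Aided - Cleaned Data",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Investigations Aided"))
p_investigations_clean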
The total profiles metric aggregates all DNA profile types (Offender + Arrestee + Forensic) to provide a comprehensive view of the NDIS database size.
The analysis tracks cumulative growth per jurisdiction, shows individual jurisdiction contributions, and reveals relative database sizes across jurisdictions.
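The objects used by the plotting code below (growth_data_yearly, dna_data, investigations_data, max_dna, scale_factor, max_investigations) are assembled upstream and are not shown in this extract. As an illustration, an equivalent yearly total can be built from the per-metric summaries above (a sketch, not the pipeline's actual construction):

library(dplyr)

# Sketch: combine the per-metric yearly maxima into a yearly total
total_yearly <- offender_yearly %>%
  left_join(arrestee_yearly, by = "year") %>%
  left_join(forensic_yearly, by = "year") %>%
  mutate(
    total_max_profiles = coalesce(total_max_offender, 0) +
      coalesce(total_max_arrestee, 0) +
      coalesce(total_max_forensic, 0)
  )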
Show compiled data visualization and correction code
#### Publication-Ready Static Plot ####
# Get the actual date range for proper x-axis limits
date_range <- range(growth_data_yearly$date)
extended_date_range <- c(min(date_range) - years(1), max(date_range))
legend_start_date <- extended_date_range[1]
y_upper_limit <- max_dna * 1.05
y_lower_limit <- 0

p_static <- ggplot() +
  geom_line(data = dna_data, aes(x = date, y = count_scaled, color = variable), linewidth = 1.2) +
  geom_point(data = dna_data, aes(x = date, y = count_scaled, color = variable), size = 2) +
  geom_line(data = investigations_data, aes(x = date, y = count_scaled, color = variable), linewidth = 1.2) +
  geom_point(data = investigations_data, aes(x = date, y = count_scaled, color = variable), size = 2) +
  scale_x_date(
    date_breaks = "1 years",
    date_labels = "%Y",
    limits = extended_date_range,
    expand = expansion(mult = 0.02)
  ) +
  scale_y_continuous(
    name = "DNA Profiles",
    labels = function(x) {
      ifelse(x >= 1e6, paste0(x / 1e6, "M"),
             ifelse(x >= 1e3, paste0(x / 1e3, "K"), x))
    },
    breaks = seq(0, max_dna, by = 2e6),
    limits = c(y_lower_limit, y_upper_limit),
    sec.axis = sec_axis(~ . / scale_factor,
                        name = "Investigations Aided",
                        labels = function(x) {
                          ifelse(x >= 1e6, paste0(x / 1e6, "M"),
                                 ifelse(x >= 1e3, paste0(x / 1e3, "K"), x))
                        },
                        breaks = seq(0, max_investigations, by = 100000))
  ) +
  scale_color_manual(
    name = NULL,
    values = c("Offender" = "#1f4e79", "Arrestee" = "#2e75b6",
               "Forensic" = "#5b9bd5", "Investigations" = "#c00000")
  ) +
  theme_ndis(base_size = 12) +
  theme(
    panel.grid = element_blank(),
    axis.line = element_line(color = "black", linewidth = 0.5),
    axis.ticks = element_line(color = "black", linewidth = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title.x = element_text(color = "black", margin = margin(t = 10)),
    axis.title.y.left = element_text(color = "#1f4e79", margin = margin(r = 10)),
    axis.title.y.right = element_text(color = "#c00000", margin = margin(l = 10)),
    legend.position = "none",
    plot.margin = margin(5, 10, 5, 10),
    aspect.ratio = 0.6
  ) +
  labs(x = "Year", title = " ") +
  # DNA Profiles legend box
  annotate("rect", xmin = legend_start_date, xmax = legend_start_date + years(6),
           ymin = max_dna * 0.86, ymax = max_dna,
           fill = "white", color = "black", alpha = 0.9, linewidth = 0.3) +
  # Investigations legend box
  annotate("rect", xmin = legend_start_date, xmax = legend_start_date + years(7),
           ymin = max_dna * 0.74, ymax = max_dna * 0.80,
           fill = "white", color = "black", alpha = 0.9, linewidth = 0.3) +
  # DNA Profiles legend items
  annotate("point", x = legend_start_date + years(0) + months(6),
           y = c(max_dna * 0.97, max_dna * 0.93, max_dna * 0.89),
           color = c("#1f4e79", "#2e75b6", "#5b9bd5"), size = 1.5) +
  annotate("text", x = legend_start_date + years(1),
           y = c(max_dna * 0.97, max_dna * 0.93, max_dna * 0.89),
           label = c("Offender Profiles", "Arrestee Profiles", "Forensic Profiles"),
           hjust = 0, size = 6) +
  annotate("text", x = legend_start_date + years(0), y = max_dna * 1.01,
           label = "DNA Profiles (Millions)", fontface = "bold",
           hjust = 0, size = 6, vjust = 0) +
  # Investigations Aided legend
  annotate("point", x = legend_start_date + years(0) + months(6),
           y = max_dna * 0.77, color = "#c00000", size = 1.5) +
  annotate("text", x = legend_start_date + years(1), y = max_dna * 0.77,
           label = "Investigations Aided", hjust = 0, size = 6) +
  annotate("text", x = legend_start_date + years(0), y = max_dna * 0.81,
           label = "Investigations (Thousands)", fontface = "bold",
           hjust = 0, size = 6, vjust = 0)

p_static
Temporal Coverage
The heat map visualizes the temporal coverage of NDIS data submissions across jurisdictions over the years, for both the intermediate CSV file (which still contains outliers and reporting errors) and the cleaned dataset. It highlights periods of active reporting and gaps in data submission.
Show heatmap code
# Prepare data for heatmap - CLEANED DATASET
temporal_coverage_clean <- ndis_clean %>%
  mutate(year = year(capture_datetime)) %>%
  count(jurisdiction, year) %>%
  complete(jurisdiction, year = 2001:2025, fill = list(n = 0)) %>%
  filter(!is.na(jurisdiction)) %>%
  mutate(jurisdiction = factor(jurisdiction, levels = rev(sort(unique(jurisdiction)))))

# Create the heatmap for cleaned data
heatmap_after_clean <- ggplot(temporal_coverage_clean, aes(x = year, y = jurisdiction, fill = n)) +
  geom_tile(color = "white", linewidth = 0.3) +
  scale_fill_viridis(
    name = "Snapshots\nper Year",
    option = "plasma",
    direction = -1,
    breaks = c(0, 3, 6, 10),
    labels = c("0", "3", "6", "10+")
  ) +
  scale_x_continuous(
    breaks = seq(2001, 2025, by = 1),
    expand = expansion(mult = 0.01)
  ) +
  labs(x = "Year", y = "Jurisdiction", title = " ") +
  theme_ndis(base_size = 11) +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "right",
    legend.key.height = unit(0.6, "cm"),
    legend.key.width = unit(0.2, "cm")
  )

heatmap_after_clean
Comparison with peer-reviewed papers
As an additional check, we compared corrected national aggregates against published NDIS totals from FBI press releases and peer-reviewed articles. As shown in Figure 6, the reconstructed dataset aligns closely with these independent milestones, supporting the technical quality of the NDIS time series.
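A sketch of that comparison, joining the reconstructed yearly totals to a milestone table (the published_milestones values below are placeholders to be transcribed from the cited sources, not actual figures):

library(dplyr)

# Hypothetical milestone table: fill in totals from FBI press releases
# and peer-reviewed articles before use
published_milestones <- tibble::tibble(
  year               = c(2015, 2020),          # example years only
  published_offender = c(NA_real_, NA_real_)   # transcribe published totals here
)

comparison <- offender_yearly %>%
  inner_join(published_milestones, by = "year") %>%
  mutate(pct_diff = 100 * (total_max_offender - published_offender) / published_offender)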
The NDIS_time_series.csv dataset retains key temporal, jurisdictional, and operational metrics that can be used for further analysis and visualization.
| Column | Type | Description |
|---|---|---|
| capture_datetime | POSIXct | Full timestamp of data capture, parsed from the raw timestamp field (YYYY-MM-DD HH:MM:SS). |
| asof_date | Date | Standardized date representing the reporting period (asof_year + asof_month, first day of month). |
| jurisdiction | Character | Name or code of the reporting jurisdiction (e.g., "California", "Texas"). |
| offender_profiles | Numeric | Number of DNA profiles from known offenders in the jurisdiction. |
| arrestee | Numeric | Number of DNA profiles collected from arrestees. |
| forensic_profiles | Numeric | Number of DNA profiles developed from forensic (crime scene) samples. |
| total_profiles | Numeric | Sum of offender, arrestee, and forensic profiles for each record. |
| ndis_labs | Integer | Count of laboratories actively participating in NDIS for the given jurisdiction and month. |
| investigations_aided | Numeric | Number of investigations aided by NDIS matches in the reporting period. |
After cleaning and processing the NDIS data, the final dataset is exported as a CSV file for further analysis or sharing. The file is saved to data/ndis/final/ so it can be referenced by other analyses. Use analysis/version_freeze.qmd whenever you need a versioned snapshot.
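For reference, a sketch of that export step (assuming the compiled object is named ndis_clean, as in the heatmap code above, and the file name from the data dictionary):

library(readr)

# Save final cleaned data to CSV
write_csv(ndis_clean,
          here::here("data", "ndis", "final", "NDIS_time_series.csv"))
message("✅ Final dataset saved to 'data/ndis/final' folder")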