NDIS Database

Analyzing FBI National DNA Index System Statistics

Author

Tina Lasisi | Edited: João P. Donadio

Published

November 14, 2025

Objectives

Analyze patterns of state and federal participation in NDIS

  • Track which jurisdictions are actively contributing data and identify any geographic disparities in participation levels.

Identify periods of rapid growth or stagnation in DNA profile submissions

  • Detect acceleration points and plateaus in the expansion of the national DNA database to understand adoption trends.

Document the expansion of DNA profiles (offender, arrestee, forensic) over time

  • Monitor the growth trajectory of different profile categories to assess program effectiveness and resource allocation.

Setup and Configuration

This section prepares the environment for analysis by:

  • Ensuring all required R packages are available

  • Loading the NDIS dataset with proper type specifications

  • Providing basic data validation checks

Show setup code
# List of required packages
required_packages <- c(
  "tidyverse",    # Data manipulation and visualization
  "lubridate",    # Date-time manipulation
  "DT",           # Interactive tables
  "plotly",       # Interactive visualizations
  "leaflet",      # Geospatial mapping
  "kableExtra",   # Enhanced table formatting
  "scales",       # Axis scaling and formatting
  "dlookr",       # Data validation and diagnostics
  "gt",           # Table generation
  "assertr",      # Data validation and assertions
  "flextable",    # Enhanced table visualization
  "ggridges",     # Ridge plots
  "here",         # File path management
  "patchwork",    # Combining plots
  "viridis",      # Color palettes for plots
  "ggrepel"       # Repel overlapping text labels
  )

# Function to install missing packages
install_missing <- function(packages) {
  for (pkg in packages) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      message(paste("Installing missing package:", pkg))
      install.packages(pkg, dependencies = TRUE)
    }
  }
}

# Install any missing packages
install_missing(required_packages)

# Load all packages
suppressPackageStartupMessages({
  library(tidyverse)
  library(lubridate)
  library(DT)
  library(plotly)
  library(leaflet)
  library(kableExtra)
  library(scales)
  library(dlookr)
  library(gt)
  library(assertr)
  library(flextable)
  library(ggridges)
  library(here)
  library(patchwork)
  library(viridis)
  library(ggrepel)
})

# Verify all packages loaded successfully
loaded_packages <- sapply(required_packages, require, character.only = TRUE)

if (all(loaded_packages)) {
  message("📚 All packages loaded successfully!")
} else {
  warning("The following packages failed to load: ", 
          paste(names(loaded_packages)[!loaded_packages], collapse = ", "))
}

# Define custom theme for consistent styling across all plots
theme_ndis <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      # Axis styling
      axis.text = element_text(color = "black", size = base_size),
      axis.title = element_text(size = base_size + 1, face = "bold"),
      
      # Plot title and subtitle
      plot.title = element_text(face = "bold", size = base_size + 3, hjust = 0),
      plot.subtitle = element_text(size = base_size + 2, hjust = 0),
      
      # Legend styling
      legend.text = element_text(size = base_size - 1),
      legend.title = element_text(size = base_size, face = "bold"),
      legend.key.size = unit(0.5, "cm"),
      
      # Grid and panel
      panel.grid.major = element_line(color = "gray90", linewidth = 0.3),
      panel.grid.minor = element_blank()
    )
}
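
As a quick preview of the theme, the snippet below applies theme_ndis() to a throwaway plot; the built-in iris data set is a stand-in used only to inspect the styling and is not part of the NDIS analysis.

# Preview theme_ndis() on stand-in data (iris); illustrative only
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "theme_ndis() preview",
    subtitle = "Stand-in data used to check fonts, grid, and legend styling",
    x = "Sepal length",
    y = "Petal length"
  ) +
  theme_ndis(base_size = 11)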

Data Import and Validation

Show data import code
# Define expected column structure
expected_cols <- cols(
  timestamp = col_character(),
  report_month = col_character(),
  report_year = col_character(),
  jurisdiction = col_character(),
  offender_profiles = col_double(),
  arrestee = col_double(),
  forensic_profiles = col_double(),
  ndis_labs = col_double(),
  investigations_aided = col_double()
)

# Read data with validation
ndis_data <- read_csv(
  here::here("data", "ndis", "raw", "ndis_data_raw.csv"),
  col_types = expected_cols
)
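
Because assertr is loaded for validation but not exercised in the import step above, a minimal post-import check could look like the sketch below; the specific assertions (required columns present, non-negative counts, non-missing jurisdiction) are illustrative assumptions rather than rules taken from the NDIS documentation.

# Minimal validation sketch (assumed checks; adjust to the real requirements)
ndis_data %>%
  verify(has_all_names("jurisdiction", "offender_profiles",
                       "arrestee", "forensic_profiles")) %>%   # required columns exist
  assert(within_bounds(0, Inf), offender_profiles,
         arrestee, forensic_profiles) %>%                      # counts are non-negative
  assert(not_na, jurisdiction) %>%                             # jurisdiction never missing
  invisible()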

Dataset Cleaning

This section outlines the preprocessing steps applied to the raw NDIS data before analysis.

First, to ensure consistency across reporting periods, we corrected jurisdiction names that were scraped incorrectly; the raw values included entries such as:

Alabama, Alabama Alabama, Alabama Stats Alabama, Alaska, Alaska Alaska, Alaska Stats Alaska, and Legal profiles at NDIS. Statistics as of April 2025 Alabama, and Legal profiles at NDIS. Statistics as of June 2025 Alabama, and Legal profiles at NDIS. Statistics as of March 2025 Alabama, Arizona, Arizona Arizona, Arizona Stats Arizona, Arkansas, Arkansas Arkansas, Arkansas Stats Arkansas, Army, California, California California, California Stats California, Colorado, Colorado Colorado, Colorado Stats Colorado, Connecticut, Connecticut Connecticut, Connecticut Stats Connecticut, DC, DC map pin.) Statistics as of August 2023 Alabama, DC map pin.) Statistics as of February 2024 Alabama, DC map pin.) Statistics as of January 2025 Alabama, DC map pin.) Statistics as of November 2024 Alabama, DC map pin.) Statistics as of October 2024 Alabama, DC map pin). Statistics as of November 2022 Alabama, DC/FBI Lab, DC/Metro PD, Delaware, Delaware Delaware, Delaware Stats Delaware, Florida, Florida Florida, Florida Stats Florida, Georgia, Georgia Georgia, Georgia Stats Georgia, Hawaii, Hawaii Hawaii, Hawaii Stats Hawaii, Idaho, Idaho Idaho, Illinois, Illinois Illinois, Illinois Stats Illinois, Indiana, Indiana Indiana, Indiana Stats Indiana, Iowa, Iowa Iowa, Iowa Stats Iowa, Kansas, Kansas Kansas, Kansas Stats Kansas, Kentucky, Kentucky Kentucky, Kentucky Stats Kentucky, Lab, Louisiana, Louisiana Louisiana, Louisiana Stats Louisiana, Maine, Maine Maine, Maine Stats Maine, Maryland, Maryland Maryland, Maryland Stats Maryland, Massachusetts, Massachusetts Massachusetts, Massachusetts Stats Massachusetts, Mexico Stats New Mexico, Michigan, Michigan Michigan, Michigan Stats Idaho, Michigan Stats Michigan, Michigan Stats Utah, Minnesota, Minnesota Minnesota, Minnesota Stats Minnesota, Mississippi, Mississippi Mississippi, Mississippi Stats Mississippi, Missouri, Missouri Missouri, Missouri Stats Missouri, Montana, Montana Montana, Montana Stats Montana, Nebraska, Nebraska Nebraska, Nebraska Stats Nebraska, Nevada, Nevada Nevada, Nevada Stats Nevada, New Hampshire, New Hampshire New Hampshire, New Hampshire Stats New Hampshire, New Jersey, New Jersey New Jersey, New Jersey Stats New Jersey, New Mexico, New Mexico New Mexico, New York, New York New York, New York Stats New York, North Carolina, North Carolina North Carolina, North Carolina Stats North Carolina, North Dakota, North Dakota North Dakota, North Dakota Stats North Dakota, Ohio, Ohio Ohio, Ohio Stats Ohio, Oklahoma, Oklahoma Oklahoma, Oklahoma Stats Oklahoma, Oregon, Oregon Oregon, Oregon Stats Oregon, Participant Alabama, Pennsylvania, Pennsylvania Pennsylvania, Pennsylvania Stats Pennsylvania, PR, Puerto Rico, Rhode Island, Rhode Island Rhode Island, Rhode Island Stats Rhode Island, South Carolina, South Carolina South Carolina, South Carolina Stats South Carolina, South Dakota, South Dakota South Dakota, South Dakota Stats South Dakota, Tables by NDIS Participant Alabama, Tennessee, Tennessee Stats Tennessee, Tennessee Tennessee, Texas, Texas Stats Texas, Texas Texas, U.S. Army, Utah, Utah Utah, Vermont, Vermont Stats Vermont, Vermont Vermont, Virginia, Virginia Stats Virginia, Virginia Virginia, Washington, Washington State Stats Washington, Washington State Washington, West Virginia, West Virginia Stats West Virginia, West Virginia Stats Wyoming, West Virginia West Virginia, Wisconsin, Wisconsin Stats Wisconsin, Wisconsin Wisconsin, Wyoming, Wyoming Wyoming
Show cleaning code (jurisdiction)
# Clean jurisdiction names with Alabama-specific patterns
ndis_data_jurisdiction <- ndis_data %>%
  mutate(
    jurisdiction = case_when(
      # Standard state names
      str_detect(jurisdiction, "Alabama$|Alabama Stats") ~ "Alabama",
      str_detect(jurisdiction, "Alaska$|Alaska Stats") ~ "Alaska",
      str_detect(jurisdiction, "Arizona$|Arizona Stats") ~ "Arizona",
      str_detect(jurisdiction, "Arkansas$|Arkansas Stats") ~ "Arkansas",
      str_detect(jurisdiction, "California$|California Stats") ~ "California",
      str_detect(jurisdiction, "Colorado$|Colorado Stats") ~ "Colorado",
      str_detect(jurisdiction, "Connecticut$|Connecticut Stats") ~ "Connecticut",
      str_detect(jurisdiction, "Delaware$|Delaware Stats") ~ "Delaware",
      str_detect(jurisdiction, "Florida$|Florida Stats") ~ "Florida",
      str_detect(jurisdiction, "Georgia$|Georgia Stats") ~ "Georgia",
      str_detect(jurisdiction, "Hawaii$|Hawaii Stats") ~ "Hawaii",
      str_detect(jurisdiction, "Idaho$|Idaho Stats") ~ "Idaho",
      str_detect(jurisdiction, "Illinois$|Illinois Stats") ~ "Illinois",
      str_detect(jurisdiction, "Indiana$|Indiana Stats") ~ "Indiana",
      str_detect(jurisdiction, "Iowa$|Iowa Stats") ~ "Iowa",
      str_detect(jurisdiction, "Kansas$|Kansas Stats") ~ "Kansas",
      str_detect(jurisdiction, "Kentucky$|Kentucky Stats") ~ "Kentucky",
      str_detect(jurisdiction, "Louisiana$|Louisiana Stats") ~ "Louisiana",
      str_detect(jurisdiction, "Maine$|Maine Stats") ~ "Maine",
      str_detect(jurisdiction, "Maryland$|Maryland Stats") ~ "Maryland",
      str_detect(jurisdiction, "Massachusetts$|Massachusetts Stats") ~ "Massachusetts",
      str_detect(jurisdiction, "Michigan$|Michigan Stats") ~ "Michigan",
      str_detect(jurisdiction, "Minnesota$|Minnesota Stats") ~ "Minnesota",
      str_detect(jurisdiction, "Mississippi$|Mississippi Stats") ~ "Mississippi",
      str_detect(jurisdiction, "Missouri$|Missouri Stats") ~ "Missouri",
      str_detect(jurisdiction, "Montana$|Montana Stats") ~ "Montana",
      str_detect(jurisdiction, "Nebraska$|Nebraska Stats") ~ "Nebraska",
      str_detect(jurisdiction, "Nevada$|Nevada Stats") ~ "Nevada",
      str_detect(jurisdiction, "New Hampshire$|New Hampshire Stats") ~ "New Hampshire",
      str_detect(jurisdiction, "New Jersey$|New Jersey Stats") ~ "New Jersey",
      str_detect(jurisdiction, "New Mexico$|New Mexico Stats|Mexico Stats") ~ "New Mexico",
      str_detect(jurisdiction, "New York$|New York Stats") ~ "New York",
      str_detect(jurisdiction, "North Carolina$|North Carolina Stats") ~ "North Carolina",
      str_detect(jurisdiction, "North Dakota$|North Dakota Stats") ~ "North Dakota",
      str_detect(jurisdiction, "Ohio$|Ohio Stats") ~ "Ohio",
      str_detect(jurisdiction, "Oklahoma$|Oklahoma Stats") ~ "Oklahoma",
      str_detect(jurisdiction, "Oregon$|Oregon Stats") ~ "Oregon",
      str_detect(jurisdiction, "Pennsylvania$|Pennsylvania Stats") ~ "Pennsylvania",
      str_detect(jurisdiction, "Rhode Island$|Rhode Island Stats") ~ "Rhode Island",
      str_detect(jurisdiction, "South Carolina$|South Carolina Stats") ~ "South Carolina",
      str_detect(jurisdiction, "South Dakota$|South Dakota Stats") ~ "South Dakota",
      str_detect(jurisdiction, "Tennessee$|Tennessee Stats") ~ "Tennessee",
      str_detect(jurisdiction, "Texas$|Texas Stats") ~ "Texas",
      str_detect(jurisdiction, "Utah$|Utah Stats") ~ "Utah",
      str_detect(jurisdiction, "Vermont$|Vermont Stats") ~ "Vermont",
      str_detect(jurisdiction, "West Virginia$|West Virginia Stats") ~ "West Virginia",
      str_detect(jurisdiction, "Virginia$|Virginia Stats") ~ "Virginia",
      str_detect(jurisdiction, "Washington$|Washington State Stats") ~ "Washington",
      str_detect(jurisdiction, "Wisconsin$|Wisconsin Stats") ~ "Wisconsin",
      str_detect(jurisdiction, "Wyoming$|Wyoming Stats") ~ "Wyoming",
      
      # Special jurisdictions
      str_detect(jurisdiction, "DC/FBI|Washington DC Stats|Lab") ~ "DC/FBI Lab",
      str_detect(jurisdiction, "DC/Metro|DC") ~ "DC/Metro PD",
      str_detect(jurisdiction, "U.S. Army$|U.S. Army Stats") ~ "U.S. Army",
            str_detect(jurisdiction, "Puerto Rico$|Puerto Rico Stats") ~ "Puerto Rico",
          
      str_detect(jurisdiction, "Tables by NDIS Participant") ~ "Alabama", # Default to Alabama
      
      TRUE ~ jurisdiction
    ),
    
    # Clean up any remaining whitespace
    jurisdiction = str_trim(jurisdiction)
     )  %>%
  
  # Convert to factor with the 54 expected jurisdiction levels
  mutate(
    jurisdiction = factor(jurisdiction,
                         levels = c(sort(state.name), "Puerto Rico", "DC/FBI Lab", "DC/Metro PD", "U.S. Army"))) %>%
  
  # Filter out NA jurisdictions
  filter(!is.na(jurisdiction))

Updated Jurisdiction names:

Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming, Puerto Rico, DC/FBI Lab, DC/Metro PD, U.S. Army
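
As a quick check (added here for illustration), the cleaned variable can be inspected directly to confirm that the 54 expected jurisdiction levels remain:

# Confirm the cleaned jurisdiction factor carries the 54 expected levels
stopifnot(nlevels(ndis_data_jurisdiction$jurisdiction) == 54)
sort(unique(as.character(ndis_data_jurisdiction$jurisdiction)))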

Variables were reformatted into consistent date and time structures.

Key profile counts were combined into a total_profiles measure, and a reporting-period date (asof_date) was derived from the report month and year where available.

Finally, the dataset was reduced to the analysis variables and reordered by jurisdiction and capture time, ensuring a clean, consistent structure for validation and analysis.

Show cleaning code (general)
ndis_data <- ndis_data_jurisdiction %>%
  mutate(
    capture_datetime = as_datetime(timestamp, format = "%Y%m%d%H%M%S"),
    total_profiles = offender_profiles + arrestee + forensic_profiles,
    asof_date = make_date(report_year, report_month, 1)
  )

ndis_intermediate <- ndis_data %>%
  select(
    capture_datetime, asof_date, jurisdiction,
    offender_profiles, arrestee, forensic_profiles,
    total_profiles, ndis_labs, investigations_aided
  ) %>%
  arrange(jurisdiction, capture_datetime)

Saving Intermediate Cleaned Data

The cleaned dataset preserves the core NDIS metrics while standardizing temporal and jurisdictional dimensions for consistent analysis. Key structural improvements include:

  • Temporal Standardization: Unified date handling, with capture_datetime marking when the data were extracted and asof_date marking the reported period

  • Jurisdictional Harmonization: Normalized 54 jurisdiction names (50 states plus Puerto Rico, DC/FBI Lab, DC/Metro PD, and U.S. Army) using consistent naming conventions

  • Derived Metrics: Added total_profiles as the sum of offender, arrestee, and forensic profiles for comprehensive trend analysis

  • Data Integrity: Removed ambiguous records and ensured proper typing for analytical operations

Show intermediate dataset overview code
# Glimpse

enhanced_glimpse <- function(df) {
  glimpse_data <- data.frame(
    Column = names(df),
    Type = sapply(df, function(x) paste(class(x), collapse = ", ")),
    Rows = nrow(df),
    Missing = sapply(df, function(x) sum(is.na(x))),
    Unique = sapply(df, function(x) length(unique(x))),
    First_Values = sapply(df, function(x) {
      if(is.numeric(x)) {
        paste(round(head(x, 3), 2), collapse = ", ")
      } else {
        paste(encodeString(head(as.character(x), 3)), collapse = ", ")
      }
    })
  )
  
  ft <- flextable(glimpse_data) %>%
    theme_zebra() %>%
    set_caption(paste("Enhanced Data Glimpse:", deparse(substitute(df)))) %>%
    autofit() %>%
    align(align = "left", part = "all") %>%
    colformat_num(j = c("Rows", "Missing", "Unique"), big.mark = "") %>%
    bg(j = "Missing", bg = function(x) ifelse(x > 0, "#FFF3CD", "transparent")) %>%
    bg(j = "Unique", bg = function(x) ifelse(x == 1, "#FFF3CD", "transparent")) %>%
    add_footer_lines(paste("Data frame dimensions:", nrow(df), "rows ×", ncol(df), "columns")) %>%
    fontsize(size = 10, part = "all") %>%
    set_table_properties(layout = "autofit", width = 1)
  
  return(ft)
}

enhanced_glimpse(ndis_intermediate)

| Column               | Type            | Rows  | Missing | Unique | First_Values                                                   |
|----------------------|-----------------|-------|---------|--------|----------------------------------------------------------------|
| capture_datetime     | POSIXct, POSIXt | 31873 | 0       | 11269  | 2001-07-15 04:15:59, 2001-08-22 11:55:31, 2001-09-13 00:17:54  |
| asof_date            | Date            | 31873 | 10919   | 123    | <NA>, <NA>, <NA>                                               |
| jurisdiction         | factor          | 31873 | 0       | 54     | Alabama, Alabama, Alabama                                      |
| offender_profiles    | numeric         | 31873 | 0       | 7608   | 0, 0, 0                                                        |
| arrestee             | numeric         | 31873 | 0       | 2515   | 0, 0, 0                                                        |
| forensic_profiles    | numeric         | 31873 | 0       | 6325   | 0, 0, 0                                                        |
| total_profiles       | numeric         | 31873 | 0       | 8360   | 0, 0, 0                                                        |
| ndis_labs            | numeric         | 31873 | 0       | 23     | 4, 4, 4                                                        |
| investigations_aided | numeric         | 31873 | 0       | 4306   | 88, 88, 88                                                     |

Data frame dimensions: 31873 rows × 9 columns

Show intermediate dataset saving code
# Save cleaned data to CSV
write_csv(ndis_intermediate, here::here("data", "ndis", "intermediate", "ndis_intermediate.csv"))

message("✅ Intermediate dataset saved to 'data/ndis/intermediate' folder")

Raw Data Distribution

Show raw data visualization code
# Heatmap
temporal_coverage_intermediate <-  ndis_intermediate %>%
  mutate(year = year(capture_datetime)) %>%
  count(jurisdiction, year) %>%
  complete(jurisdiction, year = 2001:2025, fill = list(n = 0)) %>%
  filter(!is.na(jurisdiction)) %>%
  mutate(jurisdiction = factor(jurisdiction, levels = rev(sort(unique(jurisdiction)))))

heatmap_raw <- ggplot(temporal_coverage_intermediate, aes(x = year, y = jurisdiction, fill = n)) +
  geom_tile(color = "white", linewidth = 0.3) +
  scale_fill_viridis(
    name = "Snapshots\nper Year",
    option = "plasma",
    direction = -1,
    breaks = c(0, 12, 24, 48),
    labels = c("0", "12", "24", "48+")
  ) +
  scale_x_continuous(
    breaks = seq(2001, 2025, by = 1),
    expand = expansion(mult = 0.01)
  ) +
  labs(
    x = "Year",
    y = "Jurisdiction",
    title = " "
  ) +
  theme_ndis(base_size = 11) +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "right",
    legend.key.height = unit(0.6, "cm"), 
    legend.key.width = unit(0.2, "cm")
  )

heatmap_raw

Variables Growth and Corrections

The National DNA Index System (NDIS) data for each jurisdiction is expected to show consistent growth over time. However, reporting issues create anomalies that require systematic detection and correction. This section documents the validation framework, with specific rules tailored to each metric following visual verification and analysis of the raw data.

Detection Rules

1. Spike-Dip Detection

A point is flagged as spike_dip if it deviates significantly from adjacent observations:

\(N_{j,t}^{(x)}\) is flagged if any of the following holds:

  • \(N_{j,t}^{(x)} > 2 \times N_{j,t-1}^{(x)}\) (more than double the previous value)
  • \(N_{j,t}^{(x)} < 0.5 \times N_{j,t-1}^{(x)}\) (less than half the previous value)
  • \(N_{j,t}^{(x)} > 2 \times N_{j,t+1}^{(x)}\) (more than double the next value)
  • \(N_{j,t}^{(x)} < 0.5 \times N_{j,t+1}^{(x)}\) (less than half the next value)

A continuation of spike-dip is flagged as cont_spike_dip when the previous point was flagged as spike_dip AND the current point shows recovery:

\[\text{If } N_{j,t-1}^{(x)} \text{ is flagged as spike\_dip AND } 0.5 \times N_{j,t-1}^{(x)} < N_{j,t}^{(x)} < 2 \times N_{j,t-1}^{(x)}\]

Description: These flags capture temporary data surges or unexplained dips. A spike followed by recovery to near-normal levels, or isolated low values surrounded by higher values, indicate reporting anomalies rather than real changes in profiles.

2. Zero Error Detection

A point is flagged as zero_error if a zero appears after positive values:

\[\text{If } N_{j,t}^{(x)} = 0 \text{ AND } N_{j,t-1}^{(x)} > 0\]

Consecutive zeros after an initial error are flagged as cont_zero_error:

\[\text{If } N_{j,t}^{(x)} = 0 \text{ AND } N_{j,t-1}^{(x)} \text{ flag} = \text{zero\_error}\]

Description: Legitimate DNA profile data cannot drop from positive to zero. When this occurs, it represents a reporting system error. All subsequent zeros until the data recovers to positive values are propagations of the same error and should be marked together.

3. Update Lag Detection

A point is flagged as osc_lag when values oscillate between similar levels in a compressed timeframe:

\[\text{If } \left[ \left( N_{j,t}^{(x)} < N_{j,t-1}^{(x)} \text{ AND } N_{j,t}^{(x)} < N_{j,t+1}^{(x)} \right) \text{ OR } \left( N_{j,t}^{(x)} = N_{j,t-1}^{(x)} \text{ AND } N_{j,t}^{(x)} < N_{j,t+1}^{(x)} \right) \right.\]

\[\left. \text{OR } \left( N_{j,t}^{(x)} < N_{j,t-1}^{(x)} \text{ AND } N_{j,t}^{(x)} = N_{j,t+1}^{(x)} \right) \right]\]

\[\text{AND } [t - (t-1) \leq 2 \text{ days}] \text{ AND } [(t+1) - t \leq 2 \text{ days}]\]

Description: When sequential reports within a 48-hour window show values that decrease or remain flat relative to neighbors, this indicates system synchronization delays where data is updating across multiple databases at different times. The same profile count is being reported inconsistently during the synchronization process.
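
The metric-specific blocks later in this document implement these rules with thresholds tuned to each variable after visual inspection. As a compact reference, the sketch below shows one way all three flags could be computed for an arbitrary metric column using the default thresholds stated above; the helper name flag_anomalies and its generic form are illustrative assumptions, not the exact code applied to each metric.

# Generic sketch of the three detection rules for a single metric column `value`
# (default thresholds from the rules above; the per-metric code below adjusts them)
flag_anomalies <- function(df, value) {
  df %>%
    arrange(jurisdiction, capture_datetime) %>%
    group_by(jurisdiction) %>%
    mutate(
      v         = {{ value }},
      prev_v    = lag(v),
      next_v    = lead(v),
      days_prev = as.numeric(difftime(capture_datetime, lag(capture_datetime), units = "days")),
      days_next = as.numeric(difftime(lead(capture_datetime), capture_datetime, units = "days")),
      # Rule 1: spike-dip (deviates by more than 2x / 0.5x from either neighbour)
      flag_spike_dip = (v > 2 * prev_v | v < 0.5 * prev_v |
                        v > 2 * next_v | v < 0.5 * next_v),
      # Rule 2: zero reported after a positive value
      flag_zero_error = (v == 0 & prev_v > 0),
      # Rule 3: oscillation (at or below both neighbours) within a 2-day window
      flag_osc_lag = ((v < prev_v & v < next_v) |
                      (v == prev_v & v < next_v) |
                      (v < prev_v & v == next_v)) &
                     days_prev <= 2 & days_next <= 2,
      across(starts_with("flag_"), ~ coalesce(.x, FALSE))
    ) %>%
    ungroup() %>%
    select(-v, -prev_v, -next_v, -days_prev, -days_next)
}

# Example usage (illustrative): flag offender profiles with the default thresholds
# flag_anomalies(ndis_intermediate, offender_profiles)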

Correction Rules

For spike_dip and cont_spike_dip Flags

Action: Remove the flagged point from the dataset.

Reason: Temporary data surges or isolated dips do not represent actual growth in profiles. Removing these points preserves the genuine underlying trend while eliminating reporting artifacts.

For zero_error and cont_zero_error Flags

Action: Remove all consecutive zero values starting from the first zero that follows a positive value, continuing until the data recovers to positive numbers.

Reason: Zeros appearing after positive counts are reporting failures, not real data. Removing the entire sequence of consecutive zeros eliminates the error cascade while preserving the valid trajectory before and after the error window.

For osc_lag Flags

Action: Within each oscillation cluster, retain only the highest value and remove all other points in the sequence.

Reason: The highest value represents the true data point; lower values in the cluster are transient states during system synchronization. Keeping the maximum preserves the actual profile count while removing the synchronization noise.
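
For illustration, the sketch below shows one way this rule could be applied: treat each run of consecutive osc_lag-flagged rows within a jurisdiction as a cluster and keep only the flagged row with the highest value. The cluster definition, the reliance on a flag_osc_lag column from the detection step, and the helper name resolve_osc_clusters are assumptions for this sketch; the metric-specific correction code later in this document takes the simpler route of dropping all flagged rows.

# Sketch: within each run of consecutive osc_lag-flagged rows (an assumed cluster
# definition), keep only the flagged row with the highest value; unflagged rows pass through
resolve_osc_clusters <- function(df, value) {
  df %>%
    arrange(jurisdiction, capture_datetime) %>%
    group_by(jurisdiction) %>%
    # a new cluster starts wherever a flagged row follows an unflagged one
    mutate(cluster_id = cumsum(flag_osc_lag & !coalesce(lag(flag_osc_lag), FALSE))) %>%
    group_by(jurisdiction, cluster_id, flag_osc_lag) %>%
    # flagged rows survive only if they equal the maximum of their cluster
    filter(!flag_osc_lag | {{ value }} == max({{ value }}, na.rm = TRUE)) %>%
    ungroup() %>%
    select(-cluster_id)
}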

For Legitimate Decreases

Action: Preserve all decreases that do not match the patterns above.

Reason: Not all decreases are errors. Some reflect genuine profile removals due to expungements, legal stays, or case dismissals. Decreases outside the detection rules represent real changes in the database and should be retained.

Metric-Specific Validation Rules

Offender Profiles

Following visual verification and analysis of the raw data, the following rules were applied to the Offender Profiles metric:

  • Spike-Dip Detection: Flags points where values drop below half the previous value or fall below half the next value
  • Continuation Spike-Dip: Detects recovery points following flagged anomalies
  • Zero Error Detection: Flags any zero value appearing after positive values
  • Continuation Zero Error: Tracks consecutive zeros following the initial error
  • Update Lag Detection: Identifies oscillations within 5-day windows where current values are lower than or equal to previous/next values
  • Value Propagation: Additionally flags any data point with the same value as previously flagged anomalies within the jurisdiction

All flagged points are removed to produce the cleaned dataset.

Forensic Profiles

Following visual verification and analysis of the raw data, the following rules were applied to the Forensic Profiles metric:

  • Spike-Dip Detection: Flags points where values drop below half the previous or next value, or exceed 2.5 times the next value
  • Continuation Spike-Dip: Detects recovery points following flagged anomalies
  • Zero Error Detection: Flags any zero value appearing after positive values
  • Continuation Zero Error: Tracks consecutive zeros following the initial error
  • Update Lag Detection: Identifies oscillations within 2-day windows where current values are lower than or equal to previous/next values
  • Value Propagation: Additionally flags any data point with the same value as previously flagged anomalies within the jurisdiction

All flagged points are removed to produce the cleaned dataset.

Arrestee Profiles

Following visual verification and analysis of the raw data (filtered to January 1, 2012 onwards), a focused rule set was applied to the Arrestee Profiles metric:

  • Zero Error Detection: Flags zero values reported for the California jurisdiction after an earlier observation
  • No value propagation or other rules were applied based on the observed data patterns

All flagged zero error points are removed to produce the cleaned dataset.

Investigations Aided

Following visual verification and analysis of the raw data, the following rules were applied to the Investigations Aided metric:

  • Spike-Dip Detection: Flags points where values increase more than 10-fold relative to the previous value
  • Continuation Spike-Dip: Detects recovery points following flagged anomalies
  • Zero Error Detection: Flags any zero value appearing after positive values
  • Continuation Zero Error: Tracks consecutive zeros following the initial error
  • Value Propagation: Additionally flags any data point with the same value as previously flagged anomalies within the jurisdiction

All flagged points are removed to produce the cleaned dataset.

Participating Laboratories (NDIS Labs)

Following visual verification and analysis of the raw data, jurisdiction-specific rules were applied to the Participating Laboratories metric:

  • Spike-Dip Detection (Oklahoma): Flags points where the value increases more than 3-fold relative to the previous value
  • Spike-Dip Detection (Michigan): Flags points where the value drops to 25% or less of the previous value
  • Continuation Spike-Dip: Detects recovery points following flagged anomalies
  • No value propagation rule applied for this metric

All flagged points are removed to produce the cleaned dataset.


Note: All metrics employ a data deduplication step that retains only the first observation for each jurisdiction within the same capture datetime (rounded to seconds). Yearly summaries report the maximum value per jurisdiction per year, then aggregate across jurisdictions.


Offender Profiles Correction

Show Offender profiles visualization and correction code
# Deduplication: keep the first observation per jurisdiction within each capture_datetime (rounded to seconds)
ndis_intermediate <- ndis_intermediate %>%
  mutate(capture_datetime = lubridate::round_date(capture_datetime, "second")) %>%
  group_by(jurisdiction, capture_datetime) %>%
  slice(1) %>%
  ungroup()

#### Raw Offender profiles plot

# Flag anomalies for offender profiles using formal detection rules
offender_validation <- ndis_intermediate %>%
  arrange(jurisdiction, capture_datetime) %>%
  group_by(jurisdiction) %>%
  mutate(
    prev_value = lag(offender_profiles),
    next_value = lead(offender_profiles),
    
    # Time between observations (in days)
    days_prev = as.numeric(difftime(capture_datetime, lag(capture_datetime), units = "days")),
    days_next = as.numeric(difftime(lead(capture_datetime), capture_datetime, units = "days")),
    
    # Rule 1: Spike-Dip Detection (applied subset for this metric)
    # Flag if: (N_t < 0.5*N_{t-1}) OR (N_t < 0.5*N_{t+1})
    flag_spike_dip = (
      (!is.na(prev_value) & offender_profiles < 0.5 * prev_value) |
      (!is.na(next_value) & offender_profiles < 0.5 * next_value)
    ),
    
    # Continuation of spike-dip: previous was flagged AND current shows recovery
    prev_was_spike_dip = lag(flag_spike_dip),
    flag_cont_spike_dip = (
      !is.na(prev_was_spike_dip) & prev_was_spike_dip &
      ((!is.na(prev_value) & offender_profiles > 0.5 * prev_value & offender_profiles < 2 * prev_value) |
      (!is.na(prev_value) & offender_profiles == prev_value))
    ),
    
    # Rule 2: Zero Error Detection
    # Flag if: (N_t == 0 AND N_{t-1} > 0)
    flag_zero_error = (
      offender_profiles == 0 & !is.na(prev_value) & prev_value > 0
    ),
    
    # Continuation of zero error: previous was flagged zero error AND current is zero
    prev_was_zero_error = lag(flag_zero_error),
    flag_cont_zero_error = (
      offender_profiles == 0 & !is.na(prev_was_zero_error) & prev_was_zero_error
    ),
    
    # Rule 3: Update Lag Detection (oscillation), 5-day window for this metric
    # Flag if: [(N_t < N_{t-1} AND N_t < N_{t+1}) OR (N_t < N_{t-1} AND N_t == N_{t+1})]
    # AND [days_next <= 5]
    flag_osc_lag = (
      !is.na(prev_value) & !is.na(next_value) &
      (
        (offender_profiles < prev_value & offender_profiles < next_value) |
        (offender_profiles < prev_value & offender_profiles == next_value)
      ) &
      (!is.na(days_next) & days_next <= 5)
    ),

    prev_was_osc_lag = lag(flag_osc_lag),

    flag_cont_osc_lag = (
      !is.na(prev_was_osc_lag) & prev_was_osc_lag &
      !is.na(prev_value) & offender_profiles == prev_value
    ),

    # Combine all anomaly flags
    flag_any = flag_spike_dip | flag_cont_spike_dip | flag_zero_error | flag_cont_zero_error | flag_osc_lag,
    
    # Replace NA with FALSE
    across(starts_with("flag_"), ~ifelse(is.na(.), FALSE, .)),

    # --- New rule: propagate by metric value within jurisdiction ---
    # TRUE if this offender_profiles value appears among the flagged values in this jurisdiction
    flag_same_value_propagate = ifelse(
      is.na(offender_profiles), 
      FALSE,
      offender_profiles %in% offender_profiles[flag_any]
    ),

    # Update final flag_any to include this propagated-by-value flag
    flag_any = flag_any | flag_same_value_propagate
  ) %>%
  ungroup()

# Create initial interactive plot for offender profiles with flagged points by type
p_offender_raw <- offender_validation %>%
  plot_ly(x = ~capture_datetime, y = ~offender_profiles, color = ~jurisdiction, 
          type = 'scatter', mode = 'lines+markers', alpha = 0.7,
          name = ~jurisdiction) %>%
  add_markers(data = offender_validation %>% filter(flag_any),
              x = ~capture_datetime, y = ~offender_profiles,
              marker = list(size = 12, symbol = 'x', color = "red",
                           line = list(width = 3, color = 'red'))) %>%
  layout(title = "Convicted Offender Profiles - Raw Data (Flagged Points Marked)",
         xaxis = list(title = "Date and Time",
         tickformat = "%Y-%m-%d %H:%M"),
         yaxis = list(title = "Offender Profiles"))

p_offender_raw
Show Offender profiles visualization and correction code
#### Offender Profiles Correction #####

# Correction: remove all flagged points (spike-dip, zero-error, oscillation, and value-propagated flags)
offender_clean <- offender_validation %>%
  filter(!(flag_any)) %>%
  select(-starts_with("flag_"), -starts_with("prev_"), -starts_with("days_"), -next_value)

# Plot cleaned data
p_offender_clean <- offender_clean %>%
  plot_ly(x = ~capture_datetime, y = ~offender_profiles, color = ~jurisdiction, 
          type = 'scatter', mode = 'lines+markers', alpha = 0.7) %>%
  layout(title = "Convicted Offender Profiles - Cleaned Data",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Offender Profiles"))

p_offender_clean
Show Offender profiles visualization and correction code
# Summarise highest offender profile per jurisdiction per year
offender_yearly <- offender_clean %>%
  mutate(year = year(capture_datetime)) %>%
  group_by(jurisdiction, year) %>%
  summarise(max_offender = max(offender_profiles, na.rm = TRUE), .groups = "drop") %>%
  group_by(year) %>%
  summarise(total_max_offender = sum(max_offender, na.rm = TRUE), .groups = "drop")

# Plot yearly sums
p_offender_yearly <- offender_yearly %>%
  plot_ly(x = ~year, y = ~total_max_offender,
          type = 'scatter', mode = 'lines+markers',
          line = list(color = "steelblue", width = 3),
          marker = list(size = 8, color = "darkred")) %>%
  layout(title = "Yearly Sum of Max Offender Profiles per Jurisdiction",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Total Max Offender Profiles"))

p_offender_yearly

Forensic Profiles Correction

Show Forensic profiles visualization and correction code
#### Raw Forensic profiles plot

# Flag anomalies for forensic profiles using formal detection rules
forensic_validation <- ndis_intermediate %>%
  arrange(jurisdiction, capture_datetime) %>%
  group_by(jurisdiction) %>%
  mutate(
    prev_value = lag(forensic_profiles),
    next_value = lead(forensic_profiles),
    
    # Time between observations (in days)
    days_prev = as.numeric(difftime(capture_datetime, lag(capture_datetime), units = "days")),
    days_next = as.numeric(difftime(lead(capture_datetime), capture_datetime, units = "days")),
    
    # Rule 1: Spike-Dip Detection (applied subset for this metric)
    # Flag if: (N_t < 0.5*N_{t-1}) OR (N_t < 0.5*N_{t+1}) OR (N_t > 2.5*N_{t+1})
    flag_spike_dip = (
      (!is.na(prev_value) & forensic_profiles < 0.5 * prev_value) |
      (!is.na(next_value) & forensic_profiles < 0.5 * next_value) |
      (!is.na(next_value) & forensic_profiles > 2.5 * next_value)
    ),
    
    # Continuation of spike-dip: previous was flagged AND current shows recovery
    prev_was_spike_dip = lag(flag_spike_dip),
    flag_cont_spike_dip = (
      !is.na(prev_was_spike_dip) & prev_was_spike_dip &
      ((!is.na(prev_value) & forensic_profiles > 0.5 * prev_value & forensic_profiles < 2 * prev_value) |
      (!is.na(prev_value) & forensic_profiles == prev_value))
    ),
    
    # Rule 2: Zero Error Detection
    # Flag if: (N_t == 0 AND N_{t-1} > 0)
    flag_zero_error = (
      forensic_profiles == 0 & !is.na(prev_value) & prev_value > 0
    ),
    
    # Continuation of zero error: previous was flagged zero error AND current is zero
    prev_was_zero_error = lag(flag_zero_error),
    flag_cont_zero_error = (
      forensic_profiles == 0 & !is.na(prev_was_zero_error) & prev_was_zero_error
    ),
    
    # Rule 3: Update Lag Detection (oscillation), 2-day window
    # Flag if: [(N_t < N_{t-1} AND N_t < N_{t+1}) OR (N_t < N_{t-1} AND N_t == N_{t+1})]
    # AND [days_next <= 2]
    flag_osc_lag = (
      !is.na(prev_value) & !is.na(next_value) &
      (
        (forensic_profiles < prev_value & forensic_profiles < next_value) |
        (forensic_profiles < prev_value & forensic_profiles == next_value)
      ) &
      (!is.na(days_next) & days_next <= 2)
    ),

    prev_was_osc_lag = lag(flag_osc_lag),

    flag_cont_osc_lag = (
      !is.na(prev_was_osc_lag) & prev_was_osc_lag &
      !is.na(prev_value) & forensic_profiles == prev_value
    ),

    # Combine all anomaly flags
    flag_any = flag_spike_dip | flag_cont_spike_dip | flag_zero_error | flag_cont_zero_error | flag_osc_lag,
    
    # Replace NA with FALSE
    across(starts_with("flag_"), ~ifelse(is.na(.), FALSE, .)),

    # --- New rule: propagate by metric value within jurisdiction ---
    # TRUE if this forensic_profiles value appears among the flagged values in this jurisdiction
    flag_same_value_propagate = ifelse(
      is.na(forensic_profiles), 
      FALSE,
      forensic_profiles %in% forensic_profiles[flag_any]
    ),

    # Update final flag_any to include this propagated-by-value flag
    flag_any = flag_any | flag_same_value_propagate
  ) %>%
  ungroup()

# Create initial interactive plot for forensic profiles with flagged points by type
p_forensic_raw <- forensic_validation %>%
  plot_ly(x = ~capture_datetime, y = ~forensic_profiles, color = ~jurisdiction, 
          type = 'scatter', mode = 'lines+markers', alpha = 0.7,
          name = ~jurisdiction) %>%
  add_markers(data = forensic_validation %>% filter(flag_any),
              x = ~capture_datetime, y = ~forensic_profiles,
              marker = list(size = 12, symbol = 'x', color = "red",
                           line = list(width = 3, color = 'red'))) %>%
  layout(title = "Forensic Profiles - Raw Data (Flagged Points Marked)",
         xaxis = list(title = "Date and Time",
         tickformat = "%Y-%m-%d %H:%M"),
         yaxis = list(title = "Forensic Profiles"))

p_forensic_raw
Show Forensic profiles visualization and correction code
#### Forensic Profiles Correction #####

# Correction: remove all flagged points (spike-dip, zero-error, oscillation, and value-propagated flags)
forensic_clean <- forensic_validation %>%
  filter(!(flag_any)) %>%
  select(-starts_with("flag_"), -starts_with("prev_"), -starts_with("days_"), -next_value)

# Plot cleaned data
p_forensic_clean <- forensic_clean %>%
  plot_ly(x = ~capture_datetime, y = ~forensic_profiles, color = ~jurisdiction, 
          type = 'scatter', mode = 'lines+markers', alpha = 0.7) %>%
  layout(title = "Forensic Profiles - Cleaned Data",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Forensic Profiles"))

p_forensic_clean
Show Forensic profiles visualization and correction code
# Summarise highest forensic profile per jurisdiction per year
forensic_yearly <- forensic_clean %>%
  mutate(year = year(capture_datetime)) %>%
  group_by(jurisdiction, year) %>%
  summarise(max_forensic = max(forensic_profiles, na.rm = TRUE), .groups = "drop") %>%
  group_by(year) %>%
  summarise(total_max_forensic = sum(max_forensic, na.rm = TRUE), .groups = "drop")

# Plot yearly sums
p_forensic_yearly <- forensic_yearly %>%
  plot_ly(x = ~year, y = ~total_max_forensic,
          type = 'scatter', mode = 'lines+markers',
          line = list(color = "steelblue", width = 3),
          marker = list(size = 8, color = "darkred")) %>%
  layout(title = "Yearly Sum of Max Forensic Profiles per Jurisdiction",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Total Max Forensic Profiles"))

p_forensic_yearly

Arrestee Profiles Correction

Show Arrestee profiles visualization and correction code
#### Raw Arrestee profiles plot

# Flag anomalies for arrestee profiles using formal detection rules
arrestee_validation <- ndis_intermediate %>%
  filter(capture_datetime >= as.Date("2012-01-01")) %>%
  arrange(jurisdiction, capture_datetime) %>%
  group_by(jurisdiction) %>%
  mutate(
    prev_value = lag(arrestee),
    next_value = lead(arrestee),
    
    # Time between observations (in days)
    days_prev = as.numeric(difftime(capture_datetime, lag(capture_datetime), units = "days")),
    days_next = as.numeric(difftime(lead(capture_datetime), capture_datetime, units = "days")),
    
    # Rule 2: Zero Error Detection (California only)
    # Flag if: N_t == 0 in the California jurisdiction with a preceding observation
    flag_zero_error = (
      arrestee == 0 & !is.na(prev_value) & jurisdiction == "California"
    ),

    # Combine all anomaly flags
    flag_any = flag_zero_error,
    
    # Replace NA with FALSE
    across(starts_with("flag_"), ~ifelse(is.na(.), FALSE, .))
  ) %>%
  ungroup()

# Create initial interactive plot for arrestee profiles with flagged points
p_arrestee_raw <- arrestee_validation %>%
  plot_ly(x = ~capture_datetime, y = ~arrestee, color = ~jurisdiction, 
          type = 'scatter', mode = 'lines+markers', alpha = 0.7,
          name = ~jurisdiction) %>%
  add_markers(data = arrestee_validation %>% filter(flag_any),
              x = ~capture_datetime, y = ~arrestee, 
              color = ~jurisdiction,
              marker = list(size = 12, symbol = 'x', 
                           line = list(width = 3, color = 'red')),
              name = ~paste0(jurisdiction, " - Flagged"),
              showlegend = FALSE) %>%
  layout(title = "Arrestee Profiles - Raw Data (Flagged Points Marked)",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Arrestee Profiles"))

p_arrestee_raw
Show Arrestee profiles visualization and correction code
#### Arrestee Profiles Correction #####

arrestee_clean <- arrestee_validation %>%
  filter(!(flag_any)) %>%
  select(-starts_with("flag_"), -starts_with("prev_"), -starts_with("days_"), -next_value)

# Plot cleaned data
p_arrestee_clean <- arrestee_clean %>%
  plot_ly(x = ~capture_datetime, y = ~arrestee, color = ~jurisdiction, 
          type = 'scatter', mode = 'lines+markers', alpha = 0.7) %>%
  layout(title = "Arrestee Profiles - Cleaned Data",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Arrestee Profiles"))

p_arrestee_clean
Show Arrestee profiles visualization and correction code
# Summarise highest arrestee profile per jurisdiction per year
arrestee_yearly <- arrestee_clean %>%
  mutate(year = year(capture_datetime)) %>%
  group_by(jurisdiction, year) %>%
  summarise(max_arrestee = max(arrestee, na.rm = TRUE), .groups = "drop") %>%
  group_by(year) %>%
  summarise(total_max_arrestee = sum(max_arrestee, na.rm = TRUE), .groups = "drop")

# Plot yearly sums
p_arrestee_yearly <- arrestee_yearly %>%
  plot_ly(x = ~year, y = ~total_max_arrestee,
          type = 'scatter', mode = 'lines+markers',
          line = list(color = "purple", width = 3),
          marker = list(size = 8, color = "magenta")) %>%
  layout(title = "Yearly Sum of Max Arrestee Profiles per Jurisdiction",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Total Max Arrestee Profiles"))

p_arrestee_yearly

Investigations Aided Correction

Show Investigations Aided visualization and correction code
#### Raw Investigations Aided plot

# Flag anomalies for investigations aided using formal detection rules
investigations_validation <- ndis_intermediate %>%
  arrange(jurisdiction, capture_datetime) %>%
  group_by(jurisdiction) %>%
  mutate(
    prev_value = lag(investigations_aided),
    next_value = lead(investigations_aided),
    
    # Time between observations (in days)
    days_prev = as.numeric(difftime(capture_datetime, lag(capture_datetime), units = "days")),
    days_next = as.numeric(difftime(lead(capture_datetime), capture_datetime, units = "days")),
    
    # Rule 1: Spike-Dip Detection (applied rule for this metric)
    # Flag if: N_t > 10*N_{t-1}
    flag_spike_dip = (
      (!is.na(prev_value) & investigations_aided > 10 * prev_value)
    ),
    
    # Continuation of spike-dip: previous was flagged AND current shows recovery
    prev_was_spike_dip = lag(flag_spike_dip),
    flag_cont_spike_dip = (
      !is.na(prev_was_spike_dip) & prev_was_spike_dip &
      ((!is.na(prev_value) & investigations_aided > 0.5 * prev_value & investigations_aided < 2 * prev_value) |
      (!is.na(prev_value) & investigations_aided == prev_value))
    ),
    
    # Rule 2: Zero Error Detection
    # Flag if: (N_t == 0 AND N_{t-1} > 0)
    flag_zero_error = (
      investigations_aided == 0 & !is.na(prev_value) & prev_value > 0
    ),
    
    # Continuation of zero error: previous was flagged zero error AND current is zero
    prev_was_zero_error = lag(flag_zero_error),
    flag_cont_zero_error = (
      investigations_aided == 0 & !is.na(prev_was_zero_error) & prev_was_zero_error
    ),

    # Combine all anomaly flags
    flag_any = flag_spike_dip | flag_cont_spike_dip | flag_zero_error | flag_cont_zero_error,
    
    # Replace NA with FALSE
    across(starts_with("flag_"), ~ifelse(is.na(.), FALSE, .)),

    # --- New rule: propagate by metric value within jurisdiction ---
    # TRUE if this investigations_aided value appears among the flagged values in this jurisdiction
    flag_same_value_propagate = ifelse(
      is.na(investigations_aided), 
      FALSE,
      investigations_aided %in% investigations_aided[flag_any]
    ),

    # Update final flag_any to include this propagated-by-value flag
    flag_any = flag_any | flag_same_value_propagate
  ) %>%
  ungroup()

# Create initial interactive plot for investigations aided with flagged points
p_investigations_raw <- investigations_validation %>%
  plot_ly(x = ~capture_datetime, y = ~investigations_aided, color = ~jurisdiction, 
          type = 'scatter', mode = 'lines+markers', alpha = 0.7,
          name = ~jurisdiction) %>%
  add_markers(data = investigations_validation %>% filter(flag_any),
              x = ~capture_datetime, y = ~investigations_aided, 
              color = ~jurisdiction,
              marker = list(size = 12, symbol = 'x', 
                           line = list(width = 3, color = 'red')),
              name = ~paste0(jurisdiction, " - Flagged"),
              showlegend = FALSE) %>%
  layout(title = "Investigations Aided - Raw Data (Flagged Points Marked)",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Investigations Aided"))

p_investigations_raw
Show Investigations Aided visualization and correction code
#### Investigations Aided Correction #####

# Correction: remove all flagged points (spike-dip, zero-error, and value-propagated flags)
investigations_clean <- investigations_validation %>%
  filter(!(flag_any)) %>%
  select(-starts_with("flag_"), -starts_with("prev_"), -starts_with("days_"), -next_value)

# Plot cleaned data
p_investigations_clean <- investigations_clean %>%
  plot_ly(x = ~capture_datetime, y = ~investigations_aided, color = ~jurisdiction, 
          type = 'scatter', mode = 'lines+markers', alpha = 0.7) %>%
  layout(title = "Investigations Aided - Cleaned Data",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Investigations Aided"))

p_investigations_clean
Show Investigations Aided visualization and correction code
# Summarise highest investigations_aided per jurisdiction per year
investigations_yearly <- investigations_clean %>%
  mutate(year = year(capture_datetime)) %>%
  group_by(jurisdiction, year) %>%
  summarise(max_investigations = max(investigations_aided, na.rm = TRUE), .groups = "drop") %>%
  group_by(year) %>%
  summarise(total_max_investigations = sum(max_investigations, na.rm = TRUE), .groups = "drop")

# Plot yearly sums
p_investigations_yearly <- investigations_yearly %>%
  plot_ly(x = ~year, y = ~total_max_investigations,
          type = 'scatter', mode = 'lines+markers',
          line = list(color = "darkblue", width = 3),
          marker = list(size = 8, color = "red")) %>%
  layout(title = "Yearly Sum of Max Investigations Aided per Jurisdiction",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Total Max Investigations Aided"))

p_investigations_yearly

Participating Laboratories Correction

Show Labs visualization and correction code
#### Raw NDIS Labs plot

# Flag anomalies for ndis_labs using formal detection rules
labs_validation <- ndis_intermediate %>%
  arrange(jurisdiction, capture_datetime) %>%
  group_by(jurisdiction) %>%
  mutate(
    prev_value = lag(ndis_labs),
    next_value = lead(ndis_labs),
    
    # Time between observations (in days)
    days_prev = as.numeric(difftime(capture_datetime, lag(capture_datetime), units = "days")),
    days_next = as.numeric(difftime(lead(capture_datetime), capture_datetime, units = "days")),
    
    # Rule 1: Spike-Dip Detection (jurisdiction-specific rules)
    # Flag if: N_t > 3*N_{t-1} (Oklahoma) OR N_t <= 0.25*N_{t-1} (Michigan)
    flag_spike_dip = (
      (!is.na(prev_value) & ndis_labs > 3 * prev_value & jurisdiction == "Oklahoma") |
      (!is.na(prev_value) & ndis_labs <= 0.25 * prev_value & jurisdiction == "Michigan")
    ),
    
    # Continuation of spike-dip: previous was flagged AND current shows recovery
    prev_was_spike_dip = lag(flag_spike_dip),
    flag_cont_spike_dip = (
      !is.na(prev_was_spike_dip) & prev_was_spike_dip &
      ((!is.na(prev_value) & ndis_labs > 0.5 * prev_value & ndis_labs < 2 * prev_value) |
      (!is.na(prev_value) & ndis_labs == prev_value))
    ),

    # Combine all anomaly flags
    flag_any = flag_spike_dip | flag_cont_spike_dip,
    
    # Replace NA with FALSE
    across(starts_with("flag_"), ~ifelse(is.na(.), FALSE, .))
  ) %>%
  ungroup()


# Create initial interactive plot for ndis_labs with flagged points
p_labs_raw <- labs_validation %>%
  plot_ly(x = ~capture_datetime, y = ~ndis_labs, color = ~jurisdiction, 
          type = 'scatter', mode = 'lines+markers', alpha = 0.7,
          name = ~jurisdiction) %>%
  add_markers(data = labs_validation %>% filter(flag_any),
              x = ~capture_datetime, y = ~ndis_labs, 
              color = ~jurisdiction,
              marker = list(size = 12, symbol = 'x', 
                           line = list(width = 3, color = 'red')),
              name = ~paste0(jurisdiction, " - Flagged"),
              showlegend = FALSE) %>%
  layout(title = "NDIS Labs - Raw Data (Flagged Points Marked)",
         xaxis = list(title = "Date"),
         yaxis = list(title = "NDIS Labs"))

p_labs_raw
Show Labs visualization and correction code
#### NDIS Labs Correction #####

# Remove spike_dip and cont_spike_dip
labs_clean <- labs_validation %>%
  filter(!(flag_any)) %>%
  select(-starts_with("flag_"), -starts_with("prev_"), -starts_with("days_"), -next_value)

# Plot cleaned data
p_labs_clean <- labs_clean %>%
  plot_ly(x = ~capture_datetime, y = ~ndis_labs, color = ~jurisdiction, 
          type = 'scatter', mode = 'lines+markers', alpha = 0.7) %>%
  layout(title = "NDIS Labs - Cleaned Data",
         xaxis = list(title = "Date"),
         yaxis = list(title = "NDIS Labs"))

p_labs_clean

Total Profiles Aggregation and Correction

Analysis Approach:

The total profiles metric aggregates all DNA profile types (Offender + Arrestee + Forensic) to provide a comprehensive view of the NDIS database size.

The analysis tracks cumulative growth per jurisdiction, shows individual jurisdiction contributions, and reveals relative database sizes across jurisdictions.

Show Total Profiles correction code
# Combine cleaned datasets - preserve all variables
ndis_joined <- forensic_clean %>%
  select(jurisdiction, capture_datetime, asof_date, forensic_profiles) %>%
  distinct(jurisdiction, capture_datetime, .keep_all = TRUE) %>%
  full_join(
    arrestee_clean %>%
      select(jurisdiction, capture_datetime, asof_date, arrestee) %>%
      distinct(jurisdiction, capture_datetime, .keep_all = TRUE),
    by = c("jurisdiction", "capture_datetime", "asof_date")
  ) %>%
  mutate(arrestee = ifelse(is.na(arrestee), 0, arrestee)) %>%
  arrange(jurisdiction, capture_datetime)

# Join investigations_aided
ndis_joined <- ndis_joined %>%
  inner_join(
    investigations_clean %>%
      select(jurisdiction, capture_datetime, asof_date, investigations_aided) %>%
      distinct(jurisdiction, capture_datetime, .keep_all = TRUE),
    by = c("jurisdiction", "capture_datetime", "asof_date")
  ) %>%
  distinct(jurisdiction, capture_datetime, .keep_all = TRUE)

# Join offender_profiles
ndis_joined <- ndis_joined %>%
  inner_join(
    offender_clean %>%
      select(jurisdiction, capture_datetime, asof_date, offender_profiles) %>%
      distinct(jurisdiction, capture_datetime, .keep_all = TRUE),
    by = c("jurisdiction", "capture_datetime", "asof_date")
  ) %>%
  distinct(jurisdiction, capture_datetime, .keep_all = TRUE)

# Join ndis_labs
ndis_joined <- ndis_joined %>%
  inner_join(
    labs_clean %>%
      select(jurisdiction, capture_datetime, asof_date, ndis_labs) %>%
      distinct(jurisdiction, capture_datetime, .keep_all = TRUE),
    by = c("jurisdiction", "capture_datetime", "asof_date")
  ) %>%
  distinct(jurisdiction, capture_datetime, .keep_all = TRUE) %>%
  arrange(jurisdiction, capture_datetime)

# Create final clean dataset with all variables
ndis_clean <- ndis_joined %>%
  mutate(
    year = year(capture_datetime),
    offender_profiles = ifelse(is.na(offender_profiles), 0, offender_profiles),
    arrestee = ifelse(is.na(arrestee), 0, arrestee),
    forensic_profiles = ifelse(is.na(forensic_profiles), 0, forensic_profiles),
    investigations_aided = ifelse(is.na(investigations_aided), 0, investigations_aided),
    ndis_labs = ifelse(is.na(ndis_labs), 0, ndis_labs),
    total_profiles = offender_profiles + arrestee + forensic_profiles
  ) %>%
  filter(!(jurisdiction == "California" & arrestee == 0 & year > 2013)) %>%
  arrange(jurisdiction, capture_datetime)

# Plot cleaned total
p_total_cleaned <- ndis_clean %>%
  plot_ly(x = ~capture_datetime, y = ~total_profiles, color = ~jurisdiction, 
          type = 'scatter', mode = 'lines+markers', alpha = 0.7, connectgaps = TRUE) %>%
  layout(title = "Total Profiles - Cleaned Data",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Total Profiles"))

p_total_cleaned
Show Total Profiles correction code
# Get most recent data for stacked bar plot
most_recent_date <- max(ndis_clean$capture_datetime, na.rm = TRUE)

profiles_latest <- ndis_clean %>%
  filter(capture_datetime == most_recent_date) %>%
  select(jurisdiction, offender_profiles, arrestee, forensic_profiles) %>%
  pivot_longer(
    cols = c(offender_profiles, arrestee, forensic_profiles),
    names_to = "profile_type",
    values_to = "count"
  ) %>%
  mutate(
    profile_type = case_when(
      profile_type == "offender_profiles" ~ "Offender Profiles",
      profile_type == "arrestee" ~ "Arrestee Profiles",
      profile_type == "forensic_profiles" ~ "Forensic Profiles"
    )
  )

p_profiles_stacked <- ggplot(profiles_latest, 
                             aes(x = jurisdiction, y = count, fill = profile_type)) +
  geom_bar(stat = "identity", color = "black", linewidth = 0.3) +
  scale_fill_manual(
    name = "Profile Type",
    values = c(
      "Offender Profiles" = "#1f4e79",
      "Arrestee Profiles" = "#2e75b6", 
      "Forensic Profiles" = "#5b9bd5"
    )
  ) +
  scale_y_continuous(
    labels = function(x) {
      ifelse(x >= 1e6, paste0(x/1e6, "M"), 
             ifelse(x >= 1e3, paste0(x/1e3, "K"), x))
    },
    limits = c(0, NA),
    expand = c(0, 0)
  ) +
  theme_ndis(base_size = 11) +
  theme(
    panel.grid = element_blank(),
    axis.line = element_line(color = "black", linewidth = 0.4),
    axis.ticks = element_line(color = "black", linewidth = 0.4),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.key.width = unit(0.4, "cm"),
    legend.key.height = unit(0.4, "cm"),
    aspect.ratio = 0.65
  ) +
  labs(
    x = "Jurisdiction",
    y = "Number of DNA Profiles",
    title = " "
  )

p_profiles_stacked

Anomaly Detection and Metadata Logging

This section systematically documents all data anomalies detected during the validation process.

Each flagged observation is recorded with comprehensive metadata including anomaly type, jurisdiction, timestamp, and contextual values.

The log serves as both an audit trail for data quality decisions and a source for transparency reporting.

Key outputs include:

  • Detailed anomaly records for technical review
  • Summary statistics for quality assessment
  • Visualization of anomaly distribution patterns
  • Formatted summaries for reporting
Show anomaly detection and logging code
### Anomaly Detection and Metadata Logging

# Helper function to ensure all flag columns exist and are properly set to FALSE
ensure_flag_columns <- function(df) {
  flag_cols <- c("flag_spike_dip", "flag_cont_spike_dip", "flag_zero_error", 
                 "flag_cont_zero_error", "flag_osc_lag")
  
  for (col in flag_cols) {
    if (!col %in% names(df)) {
      df[[col]] <- FALSE
    } else {
      df[[col]] <- coalesce(df[[col]], FALSE)
    }
  }
  return(df)
}

# Function to create standardized anomaly records with formal detection rule names
create_anomaly_record <- function(metric_name, jurisdiction, date, value, 
                                 flag_spike_dip = FALSE, flag_cont_spike_dip = FALSE,
                                 flag_zero_error = FALSE, flag_cont_zero_error = FALSE,
                                 flag_osc_lag = FALSE, prev_value = NA, next_value = NA) {
  
  # Determine primary anomaly type based on flags
  anomaly_type <- case_when(
    flag_spike_dip ~ "spike_dip",
    flag_cont_spike_dip ~ "cont_spike_dip",
    flag_zero_error ~ "zero_error",
    flag_cont_zero_error ~ "cont_zero_error",
    flag_osc_lag ~ "osc_lag",
    TRUE ~ "other"
  )
  
  data.frame(
    metric = metric_name,
    jurisdiction = jurisdiction,
    date = date,
    value = value,
    previous_value = prev_value,
    next_value = next_value,
    flag_spike_dip = flag_spike_dip,
    flag_cont_spike_dip = flag_cont_spike_dip,
    flag_zero_error = flag_zero_error,
    flag_cont_zero_error = flag_cont_zero_error,
    flag_osc_lag = flag_osc_lag,
    anomaly_type = anomaly_type,
    stringsAsFactors = FALSE
  )
}

# Initialize empty anomaly log
anomaly_log <- data.frame()

### Offender Profiles Anomalies
offender_anomalies <- offender_validation %>%
  filter(flag_any) %>%
  select(jurisdiction, capture_datetime, offender_profiles, prev_value, next_value,
         any_of(c("flag_spike_dip", "flag_cont_spike_dip", "flag_zero_error", 
                  "flag_cont_zero_error", "flag_osc_lag"))) %>%
  ensure_flag_columns()

if(nrow(offender_anomalies) > 0) {
  for(i in 1:nrow(offender_anomalies)) {
    row <- offender_anomalies[i, ]
    anomaly_record <- create_anomaly_record(
      metric_name = "Offender Profiles",
      jurisdiction = row$jurisdiction,
      date = row$capture_datetime,
      value = row$offender_profiles,
      flag_spike_dip = row$flag_spike_dip,
      flag_cont_spike_dip = row$flag_cont_spike_dip,
      flag_zero_error = row$flag_zero_error,
      flag_cont_zero_error = row$flag_cont_zero_error,
      flag_osc_lag = row$flag_osc_lag,
      prev_value = row$prev_value,
      next_value = row$next_value
    )
    anomaly_log <- bind_rows(anomaly_log, anomaly_record)
  }
}

### Forensic Profiles Anomalies
forensic_anomalies <- forensic_validation %>%
  filter(flag_any) %>%
  select(jurisdiction, capture_datetime, forensic_profiles, prev_value, next_value,
         any_of(c("flag_spike_dip", "flag_cont_spike_dip", "flag_zero_error", 
                  "flag_cont_zero_error", "flag_osc_lag"))) %>%
  ensure_flag_columns()

if(nrow(forensic_anomalies) > 0) {
  for(i in 1:nrow(forensic_anomalies)) {
    row <- forensic_anomalies[i, ]
    anomaly_record <- create_anomaly_record(
      metric_name = "Forensic Profiles",
      jurisdiction = row$jurisdiction,
      date = row$capture_datetime,
      value = row$forensic_profiles,
      flag_spike_dip = row$flag_spike_dip,
      flag_cont_spike_dip = row$flag_cont_spike_dip,
      flag_zero_error = row$flag_zero_error,
      flag_cont_zero_error = row$flag_cont_zero_error,
      flag_osc_lag = row$flag_osc_lag,
      prev_value = row$prev_value,
      next_value = row$next_value
    )
    anomaly_log <- bind_rows(anomaly_log, anomaly_record)
  }
}

### Arrestee Profiles Anomalies
arrestee_anomalies <- arrestee_validation %>%
  filter(flag_any) %>%
  select(jurisdiction, capture_datetime, arrestee, prev_value, next_value,
         any_of(c("flag_spike_dip", "flag_cont_spike_dip", "flag_zero_error", 
                  "flag_cont_zero_error", "flag_osc_lag"))) %>%
  ensure_flag_columns()

if(nrow(arrestee_anomalies) > 0) {
  for(i in 1:nrow(arrestee_anomalies)) {
    row <- arrestee_anomalies[i, ]
    anomaly_record <- create_anomaly_record(
      metric_name = "Arrestee Profiles",
      jurisdiction = row$jurisdiction,
      date = row$capture_datetime,
      value = row$arrestee,
      flag_spike_dip = row$flag_spike_dip,
      flag_cont_spike_dip = row$flag_cont_spike_dip,
      flag_zero_error = row$flag_zero_error,
      flag_cont_zero_error = row$flag_cont_zero_error,
      flag_osc_lag = row$flag_osc_lag,
      prev_value = row$prev_value,
      next_value = row$next_value
    )
    anomaly_log <- bind_rows(anomaly_log, anomaly_record)
  }
}

### Investigations Aided Anomalies
investigations_anomalies <- investigations_validation %>%
  filter(flag_any) %>%
  select(jurisdiction, capture_datetime, investigations_aided, prev_value, next_value,
         any_of(c("flag_spike_dip", "flag_cont_spike_dip", "flag_zero_error", 
                  "flag_cont_zero_error", "flag_osc_lag"))) %>%
  ensure_flag_columns()

if(nrow(investigations_anomalies) > 0) {
  for(i in 1:nrow(investigations_anomalies)) {
    row <- investigations_anomalies[i, ]
    anomaly_record <- create_anomaly_record(
      metric_name = "Investigations Aided",
      jurisdiction = row$jurisdiction,
      date = row$capture_datetime,
      value = row$investigations_aided,
      flag_spike_dip = row$flag_spike_dip,
      flag_cont_spike_dip = row$flag_cont_spike_dip,
      flag_zero_error = row$flag_zero_error,
      flag_cont_zero_error = row$flag_cont_zero_error,
      flag_osc_lag = row$flag_osc_lag,
      prev_value = row$prev_value,
      next_value = row$next_value
    )
    anomaly_log <- bind_rows(anomaly_log, anomaly_record)
  }
}

### NDIS Labs Anomalies
labs_anomalies <- labs_validation %>%
  filter(flag_any) %>%
  select(jurisdiction, capture_datetime, ndis_labs, prev_value, next_value,
         any_of(c("flag_spike_dip", "flag_cont_spike_dip", "flag_zero_error", 
                  "flag_cont_zero_error", "flag_osc_lag"))) %>%
  ensure_flag_columns()

if(nrow(labs_anomalies) > 0) {
  for(i in 1:nrow(labs_anomalies)) {
    row <- labs_anomalies[i, ]
    anomaly_record <- create_anomaly_record(
      metric_name = "NDIS Labs",
      jurisdiction = row$jurisdiction,
      date = row$capture_datetime,
      value = row$ndis_labs,
      flag_spike_dip = row$flag_spike_dip,
      flag_cont_spike_dip = row$flag_cont_spike_dip,
      flag_zero_error = row$flag_zero_error,
      flag_cont_zero_error = row$flag_cont_zero_error,
      flag_osc_lag = row$flag_osc_lag,
      prev_value = row$prev_value,
      next_value = row$next_value
    )
    anomaly_log <- bind_rows(anomaly_log, anomaly_record)
  }
}
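
# Note: the five per-metric blocks above are nearly identical. A more compact
# (equivalent) alternative would map over a small specification table, e.g.
# with purrr; sketched here only as a comment, with a hypothetical helper:
#   specs <- tibble::tribble(
#     ~data,                 ~value_col,            ~label,
#     offender_validation,   "offender_profiles",   "Offender Profiles",
#     forensic_validation,   "forensic_profiles",   "Forensic Profiles",
#     arrestee_validation,   "arrestee",            "Arrestee Profiles"
#   )
#   anomaly_log <- purrr::pmap_dfr(specs, build_metric_anomalies)
# where build_metric_anomalies() would wrap the filter/select/loop logic above.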

### Set metric and anomaly type ordering for consistent display
metric_order <- c("Offender Profiles", "Forensic Profiles", "Arrestee Profiles", 
                  "Investigations Aided", "NDIS Labs")
anomaly_type_order <- c("spike_dip", "cont_spike_dip", "zero_error", "cont_zero_error", 
                        "osc_lag", "other")  # keep "other" so unmatched records are not coerced to NA

# Convert to factors with desired order
anomaly_log <- anomaly_log %>%
  mutate(
    metric = factor(metric, levels = metric_order),
    anomaly_type = factor(anomaly_type, levels = anomaly_type_order)
  )

### Create Summary Tables for Reporting

# Overview of detected anomalies by metric
anomaly_overview <- anomaly_log %>%
  group_by(metric) %>%
  summarise(
    total_anomalies = n(),
    affected_jurisdictions = n_distinct(jurisdiction),
    .groups = "drop"
  ) %>%
  arrange(metric)

print(knitr::kable(anomaly_overview, format = "simple", 
                   caption = "Overview of Detected Anomalies by Metric"))


Table: Overview of Detected Anomalies by Metric

metric                  total_anomalies   affected_jurisdictions
---------------------  ----------------  -----------------------
Offender Profiles                  1290                       52
Forensic Profiles                  1011                       54
Arrestee Profiles                     6                        1
Investigations Aided               1002                       25
NDIS Labs                            10                        2
Show anomaly detection and logging code
# Detailed summary by metric, jurisdiction, and anomaly type
anomaly_detailed <- anomaly_log %>%
  group_by(metric, jurisdiction, anomaly_type) %>%
  summarise(
    count = n(),
    earliest_date = min(date),
    latest_date = max(date),
    avg_value = mean(value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(metric, jurisdiction, anomaly_type)

print(knitr::kable(anomaly_detailed, format = "simple",
                   caption = "Detailed Anomaly Summary by Metric, Jurisdiction, and Type"))


Table: Detailed Anomaly Summary by Metric, Jurisdiction, and Type

metric                 jurisdiction     anomaly_type       count  earliest_date         latest_date               avg_value
---------------------  ---------------  ----------------  ------  --------------------  --------------------  -------------
Offender Profiles      Alabama          spike_dip              4  2002-10-03 07:38:04   2015-01-07 16:29:32    1.187225e+04
Offender Profiles      Alabama          cont_spike_dip         1  2014-12-26 09:29:17   2014-12-26 09:29:17    2.219200e+04
Offender Profiles      Alabama          osc_lag                1  2008-09-16 15:14:50   2008-09-16 15:14:50    1.571870e+05
Offender Profiles      Alabama          NA                    16  2001-07-15 04:15:59   2009-08-25 08:29:04    3.929675e+04
Offender Profiles      Alaska           osc_lag                1  2008-09-16 15:09:02   2008-09-16 15:09:02    1.095400e+04
Offender Profiles      Alaska           NA                     6  2008-05-24 12:00:21   2009-08-25 08:30:46    1.095400e+04
Offender Profiles      Arizona          osc_lag                3  2008-09-16 15:04:19   2008-11-08 14:56:53    1.256920e+05
Offender Profiles      Arizona          NA                     3  2008-07-09 09:37:20   2009-08-25 08:25:42    1.256920e+05
Offender Profiles      Arkansas         osc_lag                1  2008-09-16 15:08:24   2008-09-16 15:08:24    8.791300e+04
Offender Profiles      Arkansas         NA                     2  2008-07-09 09:33:18   2009-08-25 08:03:42    8.791300e+04
Offender Profiles      California       spike_dip              2  2002-02-07 23:56:00   2002-10-03 07:29:55    2.882400e+04
Offender Profiles      California       osc_lag                4  2008-09-15 03:04:04   2009-05-12 12:40:30    9.965390e+05
Offender Profiles      California       NA                    25  2001-07-15 04:23:36   2010-04-09 22:37:24    6.052421e+05
Offender Profiles      Colorado         osc_lag                3  2008-09-16 15:04:55   2009-05-12 12:40:53    8.436600e+04
Offender Profiles      Colorado         NA                    13  2008-07-09 09:34:09   2010-04-09 22:40:08    8.436600e+04
Offender Profiles      Connecticut      osc_lag                3  2008-09-16 15:06:28   2021-01-24 09:07:28    6.763433e+04
Offender Profiles      Connecticut      NA                    10  2008-07-09 09:32:17   2021-02-14 00:46:27    9.670640e+04
Offender Profiles      Delaware         spike_dip              1  2002-06-06 18:27:45   2002-06-06 18:27:45    0.000000e+00
Offender Profiles      Delaware         osc_lag                1  2021-01-24 09:07:28   2021-01-24 09:07:28    1.970300e+04
Offender Profiles      Delaware         NA                    20  2001-07-15 04:09:05   2021-02-14 00:46:27    6.896050e+03
Offender Profiles      Florida          osc_lag                1  2008-09-16 15:21:04   2008-09-16 15:21:04    4.715620e+05
Offender Profiles      Florida          NA                     4  2008-07-09 09:38:11   2009-08-25 08:21:28    4.715620e+05
Offender Profiles      Hawaii           spike_dip              1  2004-08-05 06:45:30   2004-08-05 06:45:30    0.000000e+00
Offender Profiles      Hawaii           osc_lag                1  2008-09-16 15:14:13   2008-09-16 15:14:13    9.247000e+03
Offender Profiles      Hawaii           NA                    32  2001-07-15 04:14:32   2009-08-25 08:10:39    5.779375e+02
Offender Profiles      Idaho            spike_dip              2  2001-12-29 00:12:07   2002-08-27 02:57:39    2.030000e+02
Offender Profiles      Idaho            osc_lag                1  2015-04-30 12:57:08   2015-04-30 12:57:08    3.020400e+04
Offender Profiles      Idaho            NA                    11  2001-07-15 04:19:55   2015-06-05 17:24:00    5.491636e+03
Offender Profiles      Illinois         osc_lag                1  2008-09-16 15:08:48   2008-09-16 15:08:48    2.989750e+05
Offender Profiles      Illinois         NA                     4  2008-07-09 09:37:29   2009-08-25 08:12:58    2.989750e+05
Offender Profiles      Indiana          osc_lag                3  2005-03-02 11:44:05   2012-10-21 06:13:38    1.084710e+05
Offender Profiles      Indiana          NA                    16  2005-03-06 12:50:22   2012-11-14 07:24:58    1.384814e+05
Offender Profiles      Iowa             spike_dip              1  2002-06-06 18:33:59   2002-06-06 18:33:59    0.000000e+00
Offender Profiles      Iowa             osc_lag                1  2008-09-16 15:19:51   2008-09-16 15:19:51    4.551900e+04
Offender Profiles      Iowa             NA                     6  2001-09-13 04:17:46   2009-08-25 08:17:59    1.517300e+04
Offender Profiles      Kansas           osc_lag                6  2005-03-02 11:46:08   2018-04-25 12:54:59    4.843417e+04
Offender Profiles      Kansas           NA                    30  2005-03-06 12:51:20   2018-05-17 02:41:14    3.575687e+04
Offender Profiles      Kentucky         osc_lag                1  2008-09-16 15:08:05   2008-09-16 15:08:05    1.314200e+04
Offender Profiles      Kentucky         NA                     2  2008-07-09 09:33:43   2009-08-25 07:59:34    1.314200e+04
Offender Profiles      Louisiana        spike_dip              3  2002-10-17 18:05:12   2003-12-23 15:11:16    2.071000e+03
Offender Profiles      Louisiana        osc_lag                1  2006-10-04 09:04:03   2006-10-04 09:04:03    4.559100e+04
Offender Profiles      Louisiana        NA                    15  2001-07-15 04:08:16   2006-11-01 10:14:46    1.836980e+04
Offender Profiles      Maine            osc_lag                1  2008-09-16 15:18:03   2008-09-16 15:18:03    8.853000e+03
Offender Profiles      Maine            NA                     3  2008-06-11 15:59:05   2009-08-25 08:00:50    8.853000e+03
Offender Profiles      Maryland         spike_dip              1  2001-11-05 20:27:23   2001-11-05 20:27:23    2.466000e+03
Offender Profiles      Maryland         osc_lag                3  2004-10-27 02:52:19   2008-09-16 15:15:40    3.522500e+04
Offender Profiles      Maryland         NA                    29  2004-10-28 11:35:44   2009-08-25 07:52:59    2.432641e+04
Offender Profiles      Massachusetts    spike_dip              6  2001-11-09 20:39:14   2007-05-09 05:37:43    1.714667e+03
Offender Profiles      Massachusetts    cont_spike_dip         1  2004-09-29 16:25:10   2004-09-29 16:25:10    0.000000e+00
Offender Profiles      Massachusetts    osc_lag                4  2001-11-05 20:25:21   2009-01-14 20:15:24    3.267475e+04
Offender Profiles      Massachusetts    NA                    51  2001-07-15 04:11:54   2009-08-25 08:28:33    4.101412e+03
Offender Profiles      Michigan         spike_dip              9  2001-07-15 04:16:43   2008-05-23 18:17:46    1.959622e+04
Offender Profiles      Michigan         osc_lag                2  2008-09-16 15:05:50   2024-07-13 14:39:31    3.166135e+05
Offender Profiles      Michigan         NA                     8  2008-07-09 09:34:50   2024-11-28 02:53:27    3.533068e+05
Offender Profiles      Minnesota        osc_lag                5  2008-09-16 15:04:36   2021-05-07 09:09:54    1.304742e+05
Offender Profiles      Minnesota        NA                    18  2008-07-09 09:37:58   2021-05-27 10:08:01    1.506762e+05
Offender Profiles      Mississippi      spike_dip              2  2004-08-05 06:44:45   2008-04-09 08:53:33    3.231500e+03
Offender Profiles      Mississippi      osc_lag                1  2008-09-16 15:19:34   2008-09-16 15:19:34    1.992700e+04
Offender Profiles      Mississippi      NA                    32  2001-07-15 04:24:22   2009-08-25 08:17:29    2.255281e+03
Offender Profiles      Missouri         osc_lag                1  2008-09-16 15:21:16   2008-09-16 15:21:16    1.625010e+05
Offender Profiles      Missouri         NA                     2  2008-07-09 09:38:04   2009-08-25 08:07:44    1.625010e+05
Offender Profiles      Montana          spike_dip              4  2002-02-08 02:44:21   2010-04-09 22:38:43    4.443000e+03
Offender Profiles      Montana          osc_lag                4  2008-09-16 15:15:06   2019-05-02 13:50:53    1.421725e+04
Offender Profiles      Montana          NA                    23  2001-07-15 04:24:31   2019-05-11 00:03:25    9.375130e+03
Offender Profiles      Nebraska         spike_dip              3  2011-04-10 20:28:36   2017-05-16 14:40:07    2.674367e+04
Offender Profiles      Nebraska         cont_spike_dip         1  2017-05-16 20:14:25   2017-05-16 20:14:25    3.762800e+04
Offender Profiles      Nebraska         osc_lag                2  2009-01-14 20:19:16   2009-05-12 12:39:49    4.019000e+03
Offender Profiles      Nebraska         NA                    19  2008-07-09 09:35:43   2017-05-20 05:02:54    7.556789e+03
Offender Profiles      Nevada           osc_lag                2  2008-09-16 15:06:46   2021-01-24 09:07:28    6.589250e+04
Offender Profiles      Nevada           NA                     9  2008-07-09 09:35:04   2021-02-14 00:46:27    8.471778e+04
Offender Profiles      New Hampshire    spike_dip              2  2003-12-23 15:10:10   2004-12-29 03:27:43    2.050000e+01
Offender Profiles      New Hampshire    osc_lag                1  2008-09-16 15:07:27   2008-09-16 15:07:27    1.803000e+03
Offender Profiles      New Hampshire    NA                    49  2001-07-15 04:12:24   2009-08-25 08:11:06    1.045510e+02
Offender Profiles      New Jersey       osc_lag                7  2008-10-12 11:31:42   2010-07-12 18:11:34    1.673140e+05
Offender Profiles      New Jersey       NA                    15  2008-07-09 09:34:57   2010-08-19 16:15:27    1.673140e+05
Offender Profiles      New Mexico       spike_dip              1  2001-09-13 19:14:45   2001-09-13 19:14:45    2.899000e+03
Offender Profiles      New Mexico       osc_lag                3  2008-09-16 15:03:11   2009-05-12 12:39:43    4.632400e+04
Offender Profiles      New Mexico       NA                    17  2001-07-15 04:14:54   2010-04-09 22:40:25    3.610635e+04
Offender Profiles      New York         osc_lag                1  2008-09-16 15:13:11   2008-09-16 15:13:11    2.657890e+05
Offender Profiles      New York         NA                     3  2008-07-09 09:34:31   2010-01-18 20:33:35    2.657890e+05
Offender Profiles      North Carolina   spike_dip              2  2017-04-09 20:26:27   2017-05-13 04:05:12    1.213300e+04
Offender Profiles      North Carolina   cont_spike_dip         1  2017-05-05 05:42:22   2017-05-05 05:42:22    1.213300e+04
Offender Profiles      North Carolina   osc_lag                1  2008-09-16 15:14:33   2008-09-16 15:14:33    1.410710e+05
Offender Profiles      North Carolina   NA                     2  2008-07-09 09:32:04   2009-08-25 08:14:39    1.410710e+05
Offender Profiles      North Dakota     spike_dip              4  2012-09-21 12:24:19   2017-05-16 14:40:07    1.079200e+04
Offender Profiles      North Dakota     cont_spike_dip         2  2012-10-22 16:23:30   2017-05-16 20:14:25    1.092100e+04
Offender Profiles      North Dakota     osc_lag                3  2008-09-16 15:12:02   2009-05-12 12:39:32    4.553000e+03
Offender Profiles      North Dakota     NA                    30  2008-07-09 09:32:24   2017-05-20 05:02:54    7.548467e+03
Offender Profiles      Ohio             osc_lag                2  2008-09-16 15:07:13   2021-09-24 18:42:45    3.743470e+05
Offender Profiles      Ohio             NA                    11  2008-07-09 09:32:44   2021-11-14 09:28:57    4.414827e+05
Offender Profiles      Oklahoma         spike_dip              6  2002-08-27 03:04:00   2008-09-20 14:18:46    4.497250e+04
Offender Profiles      Oklahoma         cont_spike_dip         1  2008-10-12 11:50:00   2008-10-12 11:50:00    7.132500e+04
Offender Profiles      Oklahoma         osc_lag                2  2009-01-14 20:17:02   2009-05-12 12:41:21    5.396700e+04
Offender Profiles      Oklahoma         NA                    20  2001-07-15 04:21:52   2010-04-09 22:42:35    4.134315e+04
Offender Profiles      Oregon           osc_lag                4  2004-08-11 20:52:55   2009-05-12 12:40:17    7.766600e+04
Offender Profiles      Oregon           NA                    26  2004-08-12 02:11:07   2010-04-09 22:38:24    6.859400e+04
Offender Profiles      Pennsylvania     osc_lag                1  2008-09-16 15:12:20   2008-09-16 15:12:20    1.787080e+05
Offender Profiles      Pennsylvania     NA                     2  2008-07-09 09:35:23   2009-08-25 07:57:42    1.787080e+05
Offender Profiles      Rhode Island     spike_dip              2  2003-12-23 15:09:03   2009-08-25 07:54:13    1.109500e+03
Offender Profiles      Rhode Island     osc_lag                1  2008-09-16 15:16:45   2008-09-16 15:16:45    2.219000e+03
Offender Profiles      Rhode Island     NA                    13  2001-09-13 00:18:10   2008-07-09 09:36:54    1.706923e+02
Offender Profiles      South Carolina   spike_dip              1  2002-08-27 02:42:17   2002-08-27 02:42:17    1.658000e+03
Offender Profiles      South Carolina   osc_lag                3  2004-09-19 07:01:32   2015-04-30 12:57:08    1.100120e+05
Offender Profiles      South Carolina   NA                    12  2001-07-15 04:10:37   2015-05-04 10:01:16    4.617083e+04
Offender Profiles      South Dakota     spike_dip              1  2002-10-17 18:37:05   2002-10-17 18:37:05    0.000000e+00
Offender Profiles      South Dakota     osc_lag                3  2008-09-16 15:19:06   2009-05-12 12:40:58    1.554900e+04
Offender Profiles      South Dakota     NA                    20  2001-07-15 04:13:01   2010-04-09 22:41:08    1.010685e+04
Offender Profiles      Tennessee        spike_dip              1  2002-02-08 03:56:33   2002-02-08 03:56:33    0.000000e+00
Offender Profiles      Tennessee        osc_lag                7  2008-09-16 15:13:59   2019-10-26 04:32:49    1.710951e+05
Offender Profiles      Tennessee        NA                    29  2001-07-15 04:11:23   2019-11-24 09:52:29    1.584466e+05
Offender Profiles      Texas            osc_lag                1  2008-09-16 15:06:04   2008-09-16 15:06:04    3.493860e+05
Offender Profiles      Texas            NA                     3  2008-07-09 09:33:12   2009-08-25 08:19:46    3.493860e+05
Offender Profiles      Utah             osc_lag                2  2008-09-16 15:12:34   2014-02-09 00:23:04    5.668850e+04
Offender Profiles      Utah             NA                     4  2008-06-11 15:58:09   2014-02-09 23:32:30    4.115775e+04
Offender Profiles      Vermont          spike_dip              1  2002-02-08 04:05:43   2002-02-08 04:05:43    2.070000e+02
Offender Profiles      Vermont          osc_lag                7  2008-09-16 15:18:25   2021-09-24 18:42:45    1.794729e+04
Offender Profiles      Vermont          NA                    36  2001-08-22 05:47:08   2021-11-14 09:28:57    1.410894e+04
Offender Profiles      Virginia         osc_lag                1  2008-09-16 15:05:19   2008-09-16 15:05:19    2.727530e+05
Offender Profiles      Virginia         NA                     2  2008-07-09 09:36:40   2009-08-25 08:09:15    2.727530e+05
Offender Profiles      Washington       osc_lag                1  2008-09-16 15:18:43   2008-09-16 15:18:43    1.190070e+05
Offender Profiles      Washington       NA                     4  2008-07-09 09:36:25   2009-08-25 08:24:54    1.190070e+05
Offender Profiles      West Virginia    spike_dip             10  2001-09-14 21:17:13   2008-04-09 09:00:28    4.317000e+02
Offender Profiles      West Virginia    cont_spike_dip         1  2005-02-03 08:04:30   2005-02-03 08:04:30    1.630000e+02
Offender Profiles      West Virginia    NA                   118  2001-07-15 04:05:59   2007-02-14 08:03:09    2.293729e+02
Offender Profiles      Wisconsin        osc_lag                1  2008-09-16 15:13:41   2008-09-16 15:13:41    1.104870e+05
Offender Profiles      Wisconsin        NA                     2  2008-07-09 09:35:30   2009-08-25 07:50:46    1.104870e+05
Offender Profiles      Wyoming          spike_dip              8  2004-10-27 02:53:45   2010-04-09 22:39:20    3.914750e+03
Offender Profiles      Wyoming          cont_spike_dip         2  2004-10-28 11:44:56   2007-06-13 05:59:56    4.380000e+02
Offender Profiles      Wyoming          osc_lag                4  2004-08-11 20:54:11   2015-01-20 15:58:13    9.540750e+03
Offender Profiles      Wyoming          NA                   185  2001-07-15 04:06:47   2015-02-14 03:59:09    2.893459e+03
Offender Profiles      Puerto Rico      spike_dip              1  2010-09-02 17:21:59   2010-09-02 17:21:59    0.000000e+00
Offender Profiles      Puerto Rico      NA                     7  2010-01-05 05:42:22   2010-08-19 01:20:22    0.000000e+00
Offender Profiles      DC/FBI Lab       spike_dip              4  2002-08-27 02:41:23   2025-02-01 15:23:05    4.706662e+05
Offender Profiles      DC/FBI Lab       cont_spike_dip         1  2025-01-16 20:53:11   2025-01-16 20:53:11    9.408700e+05
Offender Profiles      DC/FBI Lab       osc_lag                1  2008-09-16 15:17:24   2008-09-16 15:17:24    6.125000e+04
Offender Profiles      DC/FBI Lab       NA                    49  2001-07-15 04:08:23   2025-01-17 08:45:27    2.194880e+04
Offender Profiles      U.S. Army        spike_dip              4  2001-12-29 01:16:26   2007-12-12 09:05:10    1.428750e+03
Offender Profiles      U.S. Army        NA                     9  2001-08-22 11:58:53   2001-11-05 16:10:38    0.000000e+00
Forensic Profiles      Alabama          spike_dip              5  2002-10-03 07:38:04   2005-01-23 03:43:06    7.540000e+01
Forensic Profiles      Alabama          osc_lag                1  2008-09-16 15:14:50   2008-09-16 15:14:50    3.564000e+03
Forensic Profiles      Alabama          NA                    66  2001-07-15 04:15:59   2009-08-25 08:29:04    3.408788e+02
Forensic Profiles      Alaska           spike_dip              1  2003-02-02 14:56:19   2003-02-02 14:56:19    7.500000e+01
Forensic Profiles      Alaska           osc_lag                1  2009-03-11 03:19:28   2009-03-11 03:19:28    7.040000e+02
Forensic Profiles      Alaska           NA                     3  2009-01-14 20:15:24   2009-03-11 16:25:59    7.040000e+02
Forensic Profiles      Arizona          osc_lag                1  2008-09-16 15:04:19   2008-09-16 15:04:19    6.783000e+03
Forensic Profiles      Arizona          NA                     5  2008-07-09 09:37:20   2009-08-25 08:25:42    6.783000e+03
Forensic Profiles      Arkansas         osc_lag                1  2008-09-16 15:08:24   2008-09-16 15:08:24    2.019000e+03
Forensic Profiles      Arkansas         NA                     2  2008-07-09 09:33:18   2009-08-25 08:03:42    2.019000e+03
Forensic Profiles      California       spike_dip              1  2003-02-02 15:06:53   2003-02-02 15:06:53    1.029000e+03
Forensic Profiles      California       osc_lag                3  2008-09-15 03:04:04   2009-05-12 12:40:30    1.637300e+04
Forensic Profiles      California       NA                    16  2008-07-09 09:34:44   2010-04-09 22:37:24    1.637300e+04
Forensic Profiles      Colorado         osc_lag                3  2008-09-16 15:04:55   2014-02-09 00:23:04    5.236667e+03
Forensic Profiles      Colorado         NA                    15  2008-07-09 09:34:09   2014-02-09 23:32:30    3.743333e+03
Forensic Profiles      Connecticut      osc_lag                1  2008-09-16 15:06:28   2008-09-16 15:06:28    2.097000e+03
Forensic Profiles      Connecticut      NA                     2  2008-07-09 09:32:17   2009-08-25 08:11:56    2.097000e+03
Forensic Profiles      Delaware         spike_dip              4  2002-06-06 18:27:45   2006-08-03 02:40:43    4.867500e+02
Forensic Profiles      Delaware         cont_spike_dip         1  2006-08-09 15:31:06   2006-08-09 15:31:06    1.700000e+02
Forensic Profiles      Delaware         osc_lag                2  2008-09-16 15:15:22   2009-03-11 03:19:28    2.600000e+02
Forensic Profiles      Delaware         NA                    31  2001-07-15 04:09:05   2009-08-25 08:15:26    2.521935e+02
Forensic Profiles      Florida          osc_lag                1  2008-09-16 15:21:04   2008-09-16 15:21:04    1.871500e+04
Forensic Profiles      Florida          NA                     4  2008-07-09 09:38:11   2009-08-25 08:21:28    1.871500e+04
Forensic Profiles      Georgia          osc_lag                1  2008-09-16 15:17:05   2008-09-16 15:17:05    6.418000e+03
Forensic Profiles      Georgia          NA                     2  2008-07-09 09:37:07   2009-08-25 08:06:05    6.418000e+03
Forensic Profiles      Hawaii           spike_dip              1  2003-09-19 04:30:50   2003-09-19 04:30:50    0.000000e+00
Forensic Profiles      Hawaii           osc_lag                1  2008-09-16 15:14:13   2008-09-16 15:14:13    1.340000e+02
Forensic Profiles      Hawaii           NA                    19  2001-07-15 04:14:32   2009-08-25 08:10:39    1.410526e+01
Forensic Profiles      Idaho            spike_dip              1  2001-12-29 00:12:07   2001-12-29 00:12:07    0.000000e+00
Forensic Profiles      Idaho            osc_lag                1  2008-09-16 15:13:22   2008-09-16 15:13:22    1.860000e+02
Forensic Profiles      Idaho            NA                    12  2001-07-15 04:19:55   2009-08-25 07:53:31    4.650000e+01
Forensic Profiles      Illinois         osc_lag                1  2008-09-16 15:08:48   2008-09-16 15:08:48    1.539700e+04
Forensic Profiles      Illinois         NA                     4  2008-07-09 09:37:29   2009-08-25 08:12:58    1.539700e+04
Forensic Profiles      Indiana          spike_dip              5  2001-07-15 04:25:09   2007-06-08 23:40:56    3.276600e+03
Forensic Profiles      Indiana          cont_spike_dip         1  2007-06-13 05:58:14   2007-06-13 05:58:14    2.190000e+03
Forensic Profiles      Indiana          osc_lag                1  2008-09-16 15:16:23   2008-09-16 15:16:23    3.409000e+03
Forensic Profiles      Indiana          NA                     3  2007-04-05 08:51:55   2009-08-25 08:32:20    2.916000e+03
Forensic Profiles      Iowa             spike_dip              1  2002-06-06 18:33:59   2002-06-06 18:33:59    0.000000e+00
Forensic Profiles      Iowa             osc_lag                1  2008-09-16 15:19:51   2008-09-16 15:19:51    2.083000e+03
Forensic Profiles      Iowa             NA                     6  2001-09-13 04:17:46   2009-08-25 08:17:59    6.943333e+02
Forensic Profiles      Kansas           spike_dip              1  2002-10-17 18:01:57   2002-10-17 18:01:57    3.100000e+02
Forensic Profiles      Kansas           osc_lag                2  2008-09-16 15:03:27   2009-05-12 12:39:54    2.524000e+03
Forensic Profiles      Kansas           NA                    14  2008-07-09 09:33:49   2010-04-09 22:42:22    2.524000e+03
Forensic Profiles      Kentucky         osc_lag                1  2008-09-16 15:08:05   2008-09-16 15:08:05    2.398000e+03
Forensic Profiles      Kentucky         NA                     2  2008-07-09 09:33:43   2009-08-25 07:59:34    2.398000e+03
Forensic Profiles      Louisiana        spike_dip              5  2002-04-13 05:47:33   2003-08-03 13:36:09    3.680000e+01
Forensic Profiles      Louisiana        osc_lag                1  2008-09-16 15:20:50   2008-09-16 15:20:50    3.923000e+03
Forensic Profiles      Louisiana        NA                     9  2001-07-15 04:08:16   2009-08-25 08:29:57    8.717778e+02
Forensic Profiles      Maine            spike_dip              3  2006-09-06 20:55:22   2006-10-04 09:02:34    2.398667e+03
Forensic Profiles      Maine            cont_spike_dip         1  2006-10-09 04:01:30   2006-10-09 04:01:30    1.130000e+03
Forensic Profiles      Maine            osc_lag                1  2008-09-16 15:18:03   2008-09-16 15:18:03    1.542000e+03
Forensic Profiles      Maine            NA                    17  2006-08-03 02:44:50   2009-08-25 08:00:50    1.841294e+03
Forensic Profiles      Maryland         spike_dip              1  2002-10-03 07:23:52   2002-10-03 07:23:52    3.420000e+02
Forensic Profiles      Maryland         osc_lag                1  2008-09-16 15:15:40   2008-09-16 15:15:40    4.351000e+03
Forensic Profiles      Maryland         NA                     3  2002-08-27 05:55:52   2009-08-25 07:52:59    3.014667e+03
Forensic Profiles      Massachusetts    osc_lag                1  2008-09-16 15:04:08   2008-09-16 15:04:08    3.173000e+03
Forensic Profiles      Massachusetts    NA                     2  2008-07-09 09:35:10   2009-08-25 08:28:33    3.173000e+03
Forensic Profiles      Michigan         spike_dip             15  2001-11-05 20:35:14   2008-05-23 18:17:46    2.414200e+03
Forensic Profiles      Michigan         osc_lag                1  2008-09-16 15:05:50   2008-09-16 15:05:50    7.125000e+03
Forensic Profiles      Michigan         NA                     7  2001-07-15 04:16:43   2009-08-25 08:16:17    4.050857e+03
Forensic Profiles      Minnesota        osc_lag                1  2008-09-16 15:04:36   2008-09-16 15:04:36    3.990000e+03
Forensic Profiles      Minnesota        NA                     3  2008-07-09 09:37:58   2009-08-25 08:33:31    3.990000e+03
Forensic Profiles      Mississippi      spike_dip              2  2005-01-23 03:41:58   2005-02-23 11:06:07    5.000000e-01
Forensic Profiles      Mississippi      osc_lag                1  2008-09-16 15:19:34   2008-09-16 15:19:34    2.010000e+02
Forensic Profiles      Mississippi      NA                    77  2001-07-15 04:24:22   2009-08-25 08:17:29    5.311688e+00
Forensic Profiles      Missouri         osc_lag                1  2008-09-16 15:21:16   2008-09-16 15:21:16    6.680000e+03
Forensic Profiles      Missouri         NA                     2  2008-07-09 09:38:04   2009-08-25 08:07:44    6.680000e+03
Forensic Profiles      Montana          spike_dip              2  2002-10-03 07:38:42   2003-02-13 00:59:12    9.000000e+00
Forensic Profiles      Montana          osc_lag                2  2008-09-16 15:15:06   2009-05-12 12:41:15    2.120000e+02
Forensic Profiles      Montana          NA                    25  2001-07-15 04:24:31   2010-04-09 22:38:43    1.187200e+02
Forensic Profiles      Nebraska         spike_dip              2  2001-09-14 21:16:46   2002-02-08 02:55:59    7.000000e+00
Forensic Profiles      Nebraska         osc_lag                2  2008-09-16 15:09:35   2009-05-12 12:39:49    4.380000e+02
Forensic Profiles      Nebraska         NA                    23  2001-07-15 04:09:23   2010-04-09 22:42:09    2.672174e+02
Forensic Profiles      Nevada           spike_dip              1  2002-02-08 03:35:36   2002-02-08 03:35:36    1.170000e+02
Forensic Profiles      Nevada           osc_lag                1  2008-09-16 15:06:46   2008-09-16 15:06:46    2.094000e+03
Forensic Profiles      Nevada           NA                     3  2001-12-29 00:11:15   2009-08-25 08:02:14    1.435000e+03
Forensic Profiles      New Hampshire    spike_dip              2  2003-02-13 01:19:05   2003-06-25 06:58:07    5.500000e+00
Forensic Profiles      New Hampshire    osc_lag                1  2008-09-16 15:07:27   2008-09-16 15:07:27    4.550000e+02
Forensic Profiles      New Hampshire    NA                    10  2001-07-15 04:12:24   2009-08-25 08:11:06    9.100000e+01
Forensic Profiles      New Jersey       osc_lag                5  2008-10-12 11:31:42   2010-07-12 18:11:34    5.669000e+03
Forensic Profiles      New Jersey       NA                    17  2008-07-09 09:34:57   2010-08-19 16:15:27    5.669000e+03
Forensic Profiles      New Mexico       osc_lag                2  2008-09-16 15:03:11   2009-05-12 12:39:43    1.540000e+03
Forensic Profiles      New Mexico       NA                    14  2008-07-09 09:36:05   2010-04-09 22:40:25    1.540000e+03
Forensic Profiles      New York         spike_dip              1  2001-12-29 00:31:51   2001-12-29 00:31:51    2.234000e+03
Forensic Profiles      New York         osc_lag                1  2008-09-16 15:13:11   2008-09-16 15:13:11    2.217300e+04
Forensic Profiles      New York         NA                     3  2008-07-09 09:34:31   2010-01-18 20:33:35    2.217300e+04
Forensic Profiles      North Carolina   osc_lag                1  2008-09-16 15:14:33   2008-09-16 15:14:33    3.427000e+03
Forensic Profiles      North Carolina   NA                     2  2008-07-09 09:32:04   2009-08-25 08:14:39    3.427000e+03
Forensic Profiles      North Dakota     spike_dip              1  2003-02-13 01:07:34   2003-02-13 01:07:34    0.000000e+00
Forensic Profiles      North Dakota     osc_lag                2  2008-09-16 15:12:02   2009-05-12 12:39:32    2.160000e+02
Forensic Profiles      North Dakota     NA                    23  2001-07-15 04:06:22   2010-04-09 22:40:43    1.314783e+02
Forensic Profiles      Ohio             spike_dip              1  2002-02-08 03:05:31   2002-02-08 03:05:31    4.810000e+02
Forensic Profiles      Ohio             osc_lag                1  2008-09-16 15:07:13   2008-09-16 15:07:13    1.399200e+04
Forensic Profiles      Ohio             NA                     3  2001-12-29 00:29:54   2009-08-25 08:26:43    9.488333e+03
Forensic Profiles      Oklahoma         spike_dip              9  2002-08-27 03:04:00   2008-09-20 14:18:46    5.199556e+03
Forensic Profiles      Oklahoma         cont_spike_dip         1  2008-10-12 11:50:00   2008-10-12 11:50:00    1.042000e+03
Forensic Profiles      Oklahoma         osc_lag                1  2009-05-12 12:41:21   2009-05-12 12:41:21    9.640000e+02
Forensic Profiles      Oklahoma         NA                    19  2001-07-15 04:21:52   2010-04-09 22:42:35    7.144211e+02
Forensic Profiles      Oregon           osc_lag                2  2008-09-16 15:10:51   2009-05-12 12:40:17    5.161000e+03
Forensic Profiles      Oregon           NA                    14  2008-07-09 09:35:17   2010-04-09 22:38:24    5.161000e+03
Forensic Profiles      Pennsylvania     osc_lag                1  2008-09-16 15:12:20   2008-09-16 15:12:20    5.483000e+03
Forensic Profiles      Pennsylvania     NA                     2  2008-07-09 09:35:23   2009-08-25 07:57:42    5.483000e+03
Forensic Profiles      Rhode Island     spike_dip              1  2003-12-23 15:09:03   2003-12-23 15:09:03    0.000000e+00
Forensic Profiles      Rhode Island     NA                    12  2001-09-13 00:18:10   2003-08-03 15:26:50    0.000000e+00
Forensic Profiles      South Carolina   spike_dip              2  2003-02-02 15:11:34   2003-06-25 02:19:18    2.385000e+02
Forensic Profiles      South Carolina   osc_lag                1  2008-09-16 15:07:48   2008-09-16 15:07:48    3.888000e+03
Forensic Profiles      South Carolina   NA                     3  2008-07-09 09:32:38   2010-10-07 21:01:44    3.888000e+03
Forensic Profiles      South Dakota     spike_dip              3  2002-10-17 18:37:05   2003-08-03 18:38:24    4.333333e+00
Forensic Profiles      South Dakota     osc_lag                1  2009-05-12 12:40:58   2009-05-12 12:40:58    1.980000e+02
Forensic Profiles      South Dakota     NA                    25  2001-07-15 04:13:01   2010-04-09 22:41:08    1.425600e+02
Forensic Profiles      Tennessee        spike_dip              2  2002-02-08 03:56:33   2002-04-13 04:34:36    1.165000e+02
Forensic Profiles      Tennessee        osc_lag                3  2008-09-16 15:13:59   2012-09-27 16:19:52    2.651333e+03
Forensic Profiles      Tennessee        NA                    14  2001-07-15 04:11:23   2012-10-18 12:40:13    2.028357e+03
Forensic Profiles      Texas            osc_lag                1  2008-09-16 15:06:04   2008-09-16 15:06:04    1.767400e+04
Forensic Profiles      Texas            NA                     3  2008-07-09 09:33:12   2009-08-25 08:19:46    1.767400e+04
Forensic Profiles      Utah             osc_lag                1  2008-09-16 15:12:34   2008-09-16 15:12:34    3.270000e+02
Forensic Profiles      Utah             NA                     2  2008-07-09 09:36:18   2009-08-25 07:56:01    3.270000e+02
Forensic Profiles      Vermont          spike_dip              4  2001-07-15 04:19:37   2005-07-14 12:01:40    1.150000e+01
Forensic Profiles      Vermont          osc_lag                1  2008-09-16 15:18:25   2008-09-16 15:18:25    2.160000e+02
Forensic Profiles      Vermont          NA                    18  2001-08-22 05:47:08   2009-08-25 08:24:09    3.688889e+01
Forensic Profiles      Virginia         spike_dip              1  2002-10-03 07:25:03   2002-10-03 07:25:03    3.720000e+02
Forensic Profiles      Virginia         osc_lag                1  2008-09-16 15:05:19   2008-09-16 15:05:19    8.973000e+03
Forensic Profiles      Virginia         NA                     2  2008-07-09 09:36:40   2009-08-25 08:09:15    8.973000e+03
Forensic Profiles      Washington       spike_dip              2  2007-09-12 10:04:11   2008-04-09 09:01:02    6.000000e+02
Forensic Profiles      Washington       cont_spike_dip         1  2007-10-10 11:28:54   2007-10-10 11:28:54    5.990000e+02
Forensic Profiles      Washington       osc_lag                1  2008-09-16 15:18:43   2008-09-16 15:18:43    1.672000e+03
Forensic Profiles      Washington       NA                    12  2007-11-14 12:45:59   2009-08-25 08:24:54    9.747500e+02
Forensic Profiles      West Virginia    spike_dip             13  2003-02-13 00:34:21   2008-05-14 09:20:10    1.278462e+02
Forensic Profiles      West Virginia    osc_lag                1  2013-05-11 07:10:18   2013-05-11 07:10:18    7.380000e+02
Forensic Profiles      West Virginia    NA                    17  2001-07-15 04:05:59   2013-05-24 01:53:37    1.172941e+02
Forensic Profiles      Wisconsin        osc_lag                1  2008-09-16 15:13:41   2008-09-16 15:13:41    4.642000e+03
Forensic Profiles      Wisconsin        NA                     2  2008-07-09 09:35:30   2009-08-25 07:50:46    4.642000e+03
Forensic Profiles      Wyoming          spike_dip              2  2007-04-11 11:45:53   2010-04-09 22:39:20    4.250000e+01
Forensic Profiles      Wyoming          osc_lag                1  2009-05-12 12:39:39   2009-05-12 12:39:39    8.100000e+01
Forensic Profiles      Wyoming          NA                   199  2001-07-15 04:06:47   2009-08-25 07:58:54    1.057789e+01
Forensic Profiles      Puerto Rico      spike_dip              4  2010-09-02 17:21:59   2013-06-01 10:41:14    5.425000e+01
Forensic Profiles      Puerto Rico      cont_spike_dip         1  2011-10-22 09:39:59   2011-10-22 09:39:59    3.300000e+01
Forensic Profiles      Puerto Rico      osc_lag                1  2013-05-11 07:10:18   2013-05-11 07:10:18    4.100000e+01
Forensic Profiles      Puerto Rico      NA                    13  2010-01-05 05:42:22   2013-05-24 01:53:37    2.492308e+01
Forensic Profiles      DC/FBI Lab       osc_lag                1  2008-09-16 15:17:24   2008-09-16 15:17:24    1.649000e+03
Forensic Profiles      DC/FBI Lab       NA                     2  2008-07-09 09:37:38   2009-08-25 08:04:28    1.649000e+03
Forensic Profiles      DC/Metro PD      osc_lag                1  2021-08-08 19:06:00   2021-08-08 19:06:00    3.386000e+03
Forensic Profiles      DC/Metro PD      NA                     1  2021-08-09 03:58:09   2021-08-09 03:58:09    3.386000e+03
Forensic Profiles      U.S. Army        spike_dip              1  2007-12-12 09:05:10   2007-12-12 09:05:10    9.630000e+02
Forensic Profiles      U.S. Army        osc_lag                2  2004-08-11 20:53:04   2004-10-27 02:52:13    9.275000e+02
Forensic Profiles      U.S. Army        NA                    25  2004-08-12 02:11:13   2004-12-29 03:27:20    9.272800e+02
Arrestee Profiles      California       zero_error             6  2015-02-09 22:49:06   2015-03-21 23:28:48    0.000000e+00
Investigations Aided   California       spike_dip              1  2025-01-05 16:40:14   2025-01-05 16:40:14    1.304657e+06
Investigations Aided   California       cont_spike_dip         1  2025-01-16 20:53:11   2025-01-16 20:53:11    1.304657e+06
Investigations Aided   Delaware         spike_dip              1  2004-04-16 01:38:23   2004-04-16 01:38:23    1.000000e+00
Investigations Aided   Delaware         cont_spike_dip         1  2004-06-05 15:11:55   2004-06-05 15:11:55    1.000000e+00
Investigations Aided   Delaware         NA                    73  2004-06-27 06:22:41   2005-05-21 05:05:36    1.000000e+00
Investigations Aided   Hawaii           spike_dip              1  2004-04-16 01:55:34   2004-04-16 01:55:34    1.000000e+00
Investigations Aided   Hawaii           cont_spike_dip         1  2004-06-05 15:29:06   2004-06-05 15:29:06    1.000000e+00
Investigations Aided   Hawaii           NA                    73  2004-06-27 06:29:36   2005-04-27 06:34:58    1.000000e+00
Investigations Aided   Idaho            spike_dip              1  2003-06-25 05:09:04   2003-06-25 05:09:04    4.000000e+00
Investigations Aided   Idaho            cont_spike_dip         1  2003-08-03 12:53:50   2003-08-03 12:53:50    4.000000e+00
Investigations Aided   Idaho            NA                   137  2003-12-23 15:10:52   2006-06-29 02:47:50    4.000000e+00
Investigations Aided   Indiana          spike_dip              1  2001-11-05 20:02:54   2001-11-05 20:02:54    5.700000e+01
Investigations Aided   Indiana          cont_spike_dip         1  2001-12-29 00:30:48   2001-12-29 00:30:48    6.200000e+01
Investigations Aided   Indiana          NA                     1  2002-02-08 01:48:48   2002-02-08 01:48:48    6.200000e+01
Investigations Aided   Iowa             spike_dip              1  2003-06-25 05:01:36   2003-06-25 05:01:36    2.000000e+00
Investigations Aided   Kansas           spike_dip              2  2003-12-23 15:11:03   2004-04-16 02:19:56    2.950000e+01
Investigations Aided   Kansas           cont_spike_dip         2  2004-02-05 20:09:39   2004-06-05 15:58:20    2.950000e+01
Investigations Aided   Kansas           NA                     9  2004-06-27 06:30:07   2004-08-05 06:45:40    5.400000e+01
Investigations Aided   Louisiana        spike_dip              1  2004-06-05 16:12:54   2004-06-05 16:12:54    3.250000e+02
Investigations Aided   Louisiana        cont_spike_dip         1  2004-06-27 06:30:39   2004-06-27 06:30:39    3.250000e+02
Investigations Aided   Louisiana        NA                    46  2004-06-30 09:15:59   2004-12-29 03:29:29    3.250000e+02
Investigations Aided   Massachusetts    spike_dip              2  2001-09-13 00:18:01   2002-06-06 18:12:43    5.000000e+01
Investigations Aided   Massachusetts    cont_spike_dip         2  2001-09-13 04:17:52   2002-08-27 02:24:08    5.000000e+01
Investigations Aided   Massachusetts    zero_error             1  2001-11-05 20:25:21   2001-11-05 20:25:21    0.000000e+00
Investigations Aided   Massachusetts    cont_zero_error        1  2001-11-09 20:39:14   2001-11-09 20:39:14    0.000000e+00
Investigations Aided   Massachusetts    NA                     4  2001-07-15 04:11:54   2002-02-08 02:10:24    7.000000e+00
Investigations Aided   Michigan         spike_dip              5  2008-01-03 18:55:32   2008-05-23 18:59:32    2.441800e+03
Investigations Aided   Michigan         cont_spike_dip         3  2008-01-09 18:34:28   2008-06-11 15:55:27    2.474333e+03
Investigations Aided   Michigan         NA                     1  2007-12-12 09:03:36   2007-12-12 09:03:36    2.369000e+03
Investigations Aided   Mississippi      spike_dip              1  2004-08-11 20:53:32   2004-08-11 20:53:32    1.000000e+00
Investigations Aided   Mississippi      cont_spike_dip         1  2004-08-12 02:11:37   2004-08-12 02:11:37    1.000000e+00
Investigations Aided   Mississippi      NA                    11  2004-08-12 07:54:48   2004-09-15 08:05:23    1.000000e+00
Investigations Aided   Montana          spike_dip              1  2003-02-13 00:59:12   2003-02-13 00:59:12    3.000000e+00
Investigations Aided   Nebraska         spike_dip              1  2002-10-17 18:17:12   2002-10-17 18:17:12    3.000000e+00
Investigations Aided   Nebraska         cont_spike_dip         1  2003-02-13 01:08:13   2003-02-13 01:08:13    4.000000e+00
Investigations Aided   New Hampshire    spike_dip              1  2003-06-25 06:58:07   2003-06-25 06:58:07    1.000000e+00
Investigations Aided   North Dakota     spike_dip              1  2004-06-05 17:15:24   2004-06-05 17:15:24    2.000000e+00
Investigations Aided   North Dakota     cont_spike_dip         1  2004-06-27 06:32:10   2004-06-27 06:32:10    2.000000e+00
Investigations Aided   North Dakota     NA                    71  2004-06-30 09:16:13   2005-04-27 06:36:12    2.000000e+00
Investigations Aided   Ohio             spike_dip              1  2006-07-05 15:59:59   2006-07-05 15:59:59    3.120950e+05
Investigations Aided   Ohio             cont_spike_dip         1  2006-07-13 13:28:58   2006-07-13 13:28:58    3.120950e+05
Investigations Aided   Ohio             NA                     2  2006-07-20 04:43:59   2006-07-27 08:26:47    3.120950e+05
Investigations Aided   Oklahoma         spike_dip              4  2002-10-17 18:29:16   2008-09-18 00:33:09    3.239750e+03
Investigations Aided   Oklahoma         cont_spike_dip         1  2003-02-02 14:56:52   2003-02-02 14:56:52    1.000000e+01
Investigations Aided   Oklahoma         NA                     1  2003-12-23 15:12:11   2003-12-23 15:12:11    1.000000e+01
Investigations Aided   Rhode Island     spike_dip              1  2005-05-27 04:35:27   2005-05-27 04:35:27    1.000000e+00
Investigations Aided   Rhode Island     cont_spike_dip         1  2005-06-02 05:03:11   2005-06-02 05:03:11    1.000000e+00
Investigations Aided   Rhode Island     NA                     2  2005-06-09 23:27:12   2005-06-18 02:45:49    1.000000e+00
Investigations Aided   South Carolina   spike_dip              1  2006-03-15 22:20:41   2006-03-15 22:20:41    3.750000e+02
Investigations Aided   South Carolina   cont_spike_dip         1  2006-03-23 00:42:15   2006-03-23 00:42:15    3.750000e+02
Investigations Aided   South Carolina   zero_error             1  2005-06-24 08:24:39   2005-06-24 08:24:39    0.000000e+00
Investigations Aided   South Carolina   cont_zero_error        1  2005-06-30 08:39:15   2005-06-30 08:39:15    0.000000e+00
Investigations Aided   South Carolina   NA                    35  2005-07-14 11:54:00   2006-03-10 12:19:46    0.000000e+00
Investigations Aided   South Dakota     spike_dip              1  2005-08-26 04:56:50   2005-08-26 04:56:50    2.000000e+00
Investigations Aided   South Dakota     cont_spike_dip         1  2005-09-16 10:28:00   2005-09-16 10:28:00    2.000000e+00
Investigations Aided   South Dakota     NA                    12  2005-09-23 21:59:14   2005-12-17 09:07:45    2.000000e+00
Investigations Aided   Vermont          spike_dip              2  2008-12-10 17:11:35   2009-09-03 17:21:06    4.900000e+01
Investigations Aided   Vermont          cont_spike_dip         2  2009-01-14 20:15:24   2009-09-09 15:58:21    4.900000e+01
Investigations Aided   Vermont          zero_error             1  2009-08-25 08:24:09   2009-08-25 08:24:09    0.000000e+00
Investigations Aided   Vermont          NA                   218  2001-07-15 04:19:37   2009-03-11 16:25:59    7.522936e-01
Investigations Aided   West Virginia    spike_dip              1  2002-10-17 18:52:08   2002-10-17 18:52:08    2.000000e+00
Investigations Aided   West Virginia    cont_spike_dip         1  2003-02-13 00:34:21   2003-02-13 00:34:21    2.000000e+00
Investigations Aided   West Virginia    NA                    57  2003-08-03 18:56:16   2005-01-23 03:41:51    2.000000e+00
Investigations Aided   Wisconsin        spike_dip              1  2010-03-17 12:00:23   2010-03-17 12:00:23    2.267000e+03
Investigations Aided   Wisconsin        cont_spike_dip         1  2010-04-09 22:36:56   2010-04-09 22:36:56    2.346000e+03
Investigations Aided   Wyoming          spike_dip              2  2003-02-13 00:31:36   2004-10-27 02:53:45    4.000000e+00
Investigations Aided   Wyoming          cont_spike_dip         2  2003-06-25 09:03:14   2004-10-28 11:44:56    4.000000e+00
Investigations Aided   Wyoming          zero_error             1  2004-08-11 20:54:11   2004-08-11 20:54:11    0.000000e+00
Investigations Aided   Wyoming          cont_zero_error        1  2004-08-12 02:13:10   2004-08-12 02:13:10    0.000000e+00
Investigations Aided   Wyoming          NA                   178  2001-07-15 04:06:47   2007-06-13 05:59:56    4.134831e+00
Investigations Aided   Puerto Rico      spike_dip              1  2011-01-11 09:38:35   2011-01-11 09:38:35    1.900000e+01
Investigations Aided   Puerto Rico      cont_spike_dip         1  2011-01-27 07:55:31   2011-01-27 07:55:31    2.000000e+01
NDIS Labs              Michigan         spike_dip              5  2008-01-03 18:43:43   2008-05-23 18:17:46    1.000000e+00
NDIS Labs              Michigan         cont_spike_dip         2  2008-03-12 06:43:20   2008-05-14 09:16:40    1.000000e+00
NDIS Labs              Oklahoma         spike_dip              3  2008-08-13 17:45:04   2008-09-18 00:33:09    1.100000e+01
Show anomaly detection and logging code
### Publication-ready Visualization

# Prepare data for plot
anomaly_plot_data <- anomaly_log %>%
  group_by(metric, jurisdiction) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(metric = factor(metric, levels = metric_order))

# Custom colors matching scheme
metric_colors <- c(
  "Offender Profiles"    = "#1f4e79",
  "Arrestee Profiles"    = "#2e75b6",
  "Forensic Profiles"    = "#5b9bd5",
  "Investigations Aided" = "#c00000",
  "NDIS Labs"            = "#7030a0"
)

# Create stacked bar plot
p_anomaly_distribution <- ggplot(anomaly_plot_data, 
                                aes(x = jurisdiction, y = count, fill = metric)) +
  geom_bar(stat = "identity", position = "stack", color = "black", 
           linewidth = 0.3, width = 1) +
  scale_fill_manual(name = "Metric", values = metric_colors) +
  scale_y_continuous(limits = c(0, NA), expand = c(0, 0)) +
  theme_ndis(base_size = 11) +
  theme(
    panel.grid = element_blank(),
    axis.line = element_line(color = "black", linewidth = 0.4),
    axis.ticks = element_line(color = "black", linewidth = 0.4),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.key.width = unit(0.4, "cm"),
    legend.key.height = unit(0.4, "cm"),
    aspect.ratio = 0.65
  ) +
  labs(
    title = " ",
    x = "Jurisdiction",
    y = "Number of Anomalies"
  )

p_anomaly_distribution

Show anomaly detection and logging code
# Anomaly type distribution
anomaly_type_summary <- anomaly_log %>%
  group_by(anomaly_type) %>%
  summarise(
    count = n(),
    metrics_affected = n_distinct(metric),
    .groups = "drop"
  ) %>%
  arrange(anomaly_type)

print(knitr::kable(anomaly_type_summary, format = "simple",
                   caption = "Anomaly Count by Detection Rule"))


Table: Anomaly Count by Detection Rule

anomaly_type       count   metrics_affected
----------------  ------  -----------------
spike_dip            231                  4
cont_spike_dip        47                  4
zero_error            10                  2
cont_zero_error        3                  1
osc_lag              186                  2
NA                  2842                  3
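
The residual row above collects observations that were flagged during validation (flag_any) but for which none of the five named detection rules carried through as TRUE. A minimal sketch for isolating those records for manual review (illustrative only, not part of the pipeline above):

# Residual records: flagged during validation but not matched by a named rule
anomaly_log %>%
  filter(anomaly_type == "other" | is.na(anomaly_type)) %>%
  count(metric, jurisdiction, sort = TRUE) %>%
  head(10)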
Show anomaly detection and logging code
# Export anomaly log for archival and transparency
anomaly_export_path <- here("data", "ndis", "intermediate", "anomaly_log.csv")
write.csv(anomaly_log, anomaly_export_path, row.names = FALSE)
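
As a quick round-trip check (a minimal sketch, not part of the pipeline above; assumes readr >= 2.0 for show_col_types), the archived log can be reloaded and compared against the in-memory version:

# Reload the archived anomaly log and confirm the row count is preserved
archived_log <- readr::read_csv(anomaly_export_path, show_col_types = FALSE)
stopifnot(nrow(archived_log) == nrow(anomaly_log))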

Compiled Data Growth

Validation Approach:

· Cross-validation across all metrics

· Yearly aggregation using the maximum value per jurisdiction

· Consistency checks between profile types

· Scale-appropriate visualization

Correction Methods:

· Unified compilation from cleaned sources

· Dual-axis visualization for different scales (see the sketch after this list)

· Analysis-ready dataset creation

· Comprehensive trend analysis
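
The dual-axis approach used below relies on a single scale factor: the secondary series is multiplied by the factor so it can share the primary axis range, and the secondary axis labels are recovered by dividing by the same factor. A minimal sketch with toy numbers (not NDIS data):

# Toy illustration of the scale-factor approach to a secondary axis
library(ggplot2)

toy <- data.frame(
  year           = 2001:2005,
  profiles       = c(1e6, 2e6, 4e6, 6e6, 8e6),   # primary series (left axis)
  investigations = c(5e3, 1e4, 2e4, 3e4, 4e4)    # secondary series (right axis)
)
toy_factor <- max(toy$profiles) / max(toy$investigations)

ggplot(toy, aes(x = year)) +
  geom_line(aes(y = profiles), color = "#0072B2", linewidth = 1) +
  geom_line(aes(y = investigations * toy_factor), color = "#CC79A7", linewidth = 1) +
  scale_y_continuous(
    name = "DNA Profiles",
    sec.axis = sec_axis(~ . / toy_factor, name = "Investigations Aided")
  )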

Show compiled data visualization and correction code
# Get yearly data
growth_data_yearly <- ndis_clean %>%
  mutate(
    offender_profiles = ifelse(is.na(offender_profiles), 0, offender_profiles),
    arrestee = ifelse(is.na(arrestee), 0, arrestee),
    forensic_profiles = ifelse(is.na(forensic_profiles), 0, forensic_profiles),
    investigations_aided = ifelse(is.na(investigations_aided), 0, investigations_aided),
    # Treat missing lab counts as zero as well, so max() below never returns -Inf
    ndis_labs = ifelse(is.na(ndis_labs), 0, ndis_labs)
  ) %>%
  # Get max per jurisdiction per year
  group_by(year, jurisdiction) %>%
  summarise(
    offender = max(offender_profiles, na.rm = TRUE),
    arrestee = max(arrestee, na.rm = TRUE),
    forensic = max(forensic_profiles, na.rm = TRUE),
    investigations = max(investigations_aided, na.rm = TRUE),
    ndis_labs = max(ndis_labs, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  # Sum across jurisdictions per year
  group_by(year) %>%
  summarise(
    jurisdictions = n(),
    offender_total = sum(offender, na.rm = TRUE),
    arrestee_total = sum(arrestee, na.rm = TRUE),
    forensic_total = sum(forensic, na.rm = TRUE),
    investigations_total = sum(investigations, na.rm = TRUE),
    ndis_labs_total = sum(ndis_labs, na.rm = TRUE),
    total_profiles = offender_total + arrestee_total + forensic_total,
    .groups = 'drop'
  ) %>%
  arrange(year) %>%
  mutate(date = as.Date(paste0(year, "-01-01")))

# Calculate scale factor for dual y-axes
max_dna <- max(c(growth_data_yearly$offender_total, 
                 growth_data_yearly$arrestee_total, 
                 growth_data_yearly$forensic_total), na.rm = TRUE)
max_investigations <- max(growth_data_yearly$investigations_total, na.rm = TRUE)
scale_factor <- max_dna / max_investigations

# Prepare data for plotting
dna_data <- growth_data_yearly %>%
  select(date, offender_total, arrestee_total, forensic_total) %>%
  pivot_longer(
    cols = c(offender_total, arrestee_total, forensic_total),
    names_to = "variable",
    values_to = "count"
  ) %>%
  mutate(
    variable = case_when(
      variable == "offender_total" ~ "Offender",
      variable == "arrestee_total" ~ "Arrestee", 
      variable == "forensic_total" ~ "Forensic"
    ),
    count_scaled = count 
  )

investigations_data <- growth_data_yearly %>%
  select(date, investigations_total) %>%
  mutate(
    variable = "Investigations",
    count_scaled = investigations_total * scale_factor
  )

#### Interactive Plot ####

p_interactive <- plot_ly() %>%
  # DNA Profiles
  add_trace(
    data = dna_data %>% filter(variable == "Offender"),
    x = ~date, y = ~count,
    type = 'scatter', mode = 'lines+markers',
    name = 'Offender Profiles',
    line = list(color = '#0072B2', width = 2),
    marker = list(color = '#0072B2', size = 6),
    yaxis = 'y1'
  ) %>%
  add_trace(
    data = dna_data %>% filter(variable == "Arrestee"),
    x = ~date, y = ~count,
    type = 'scatter', mode = 'lines+markers',
    name = 'Arrestee Profiles',
    line = list(color = '#D55E00', width = 2),
    marker = list(color = '#D55E00', size = 6),
    yaxis = 'y1'
  ) %>%
  add_trace(
    data = dna_data %>% filter(variable == "Forensic"),
    x = ~date, y = ~count,
    type = 'scatter', mode = 'lines+markers',
    name = 'Forensic Profiles',
    line = list(color = '#009E73', width = 2),
    marker = list(color = '#009E73', size = 6),
    yaxis = 'y1'
  ) %>%
  # Investigations Aided
  add_trace(
    data = investigations_data,
    x = ~date, y = ~investigations_total,
    type = 'scatter', mode = 'lines+markers',
    name = 'Investigations Aided',
    line = list(color = '#CC79A7', width = 2),
    marker = list(color = '#CC79A7', size = 6),
    yaxis = 'y2'
  ) %>%
  layout(
    title = "DNA Profiles and Investigations Aided Over Time (Yearly)",
    xaxis = list(
      title = "Year",
      tickformat = "%Y"
    ),
    yaxis = list(
      title = "DNA Profiles",
      side = 'left',
      showgrid = TRUE,
      zeroline = TRUE,
      automargin = TRUE
    ),
    yaxis2 = list(
      title = "Investigations Aided",
      side = 'right',
      overlaying = 'y',
      anchor = 'x',
      position = 1,
      showgrid = FALSE,
      zeroline = FALSE,
      automargin = TRUE,
      titlefont = list(size = 12),
      tickfont = list(size = 10)
    ),
    legend = list(
      x = 0.01,
      y = 0.99,
      bgcolor = 'rgba(255,255,255,0.9)',
      bordercolor = 'black',
      borderwidth = 1
    ),
    hovermode = 'x unified'
  )

p_interactive
Show compiled data visualization and correction code
#### Publication-Ready Static Plot ####

# Get the actual date range for proper x-axis limits
date_range <- range(growth_data_yearly$date)
extended_date_range <- c(min(date_range) - years(1), max(date_range))
legend_start_date <- extended_date_range[1]

y_upper_limit <- max_dna * 1.05
y_lower_limit <- 0

p_static <- ggplot() +
  geom_line(data = dna_data, 
            aes(x = date, y = count_scaled, color = variable), 
            linewidth = 1.2) +
  geom_point(data = dna_data, 
             aes(x = date, y = count_scaled, color = variable), 
             size = 2) +
  geom_line(data = investigations_data, 
            aes(x = date, y = count_scaled, color = variable), 
            linewidth = 1.2) +
  geom_point(data = investigations_data, 
             aes(x = date, y = count_scaled, color = variable), 
             size = 2) +
  scale_x_date(
    date_breaks = "1 years",
    date_labels = "%Y",
    limits = extended_date_range, 
    expand = expansion(mult = 0.02)
  ) +
  scale_y_continuous(
    name = "DNA Profiles",
    labels = function(x) {
      ifelse(x >= 1e6, paste0(x/1e6, "M"), 
             ifelse(x >= 1e3, paste0(x/1e3, "K"), x))
    },
    breaks = seq(0, max_dna, by = 2e6),
    limits = c(y_lower_limit, y_upper_limit),
    sec.axis = sec_axis(~./scale_factor, 
                        name = "Investigations Aided",
                        labels = function(x) {
                          ifelse(x >= 1e6, paste0(x/1e6, "M"), 
                                 ifelse(x >= 1e3, paste0(x/1e3, "K"), x))
                        },
                        breaks = seq(0, max_investigations, by = 100000))
  ) +
  scale_color_manual(
    name = NULL,
    values = c("Offender" = "#1f4e79", 
               "Arrestee" = "#2e75b6", 
               "Forensic" = "#5b9bd5",
               "Investigations" = "#c00000") 
  ) +
  theme_ndis(base_size = 12) +
  theme(
    panel.grid = element_blank(),
    axis.line = element_line(color = "black", linewidth = 0.5),
    axis.ticks = element_line(color = "black", linewidth = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title.x = element_text(color = "black", margin = margin(t = 10)),
    axis.title.y.left = element_text(color = "#1f4e79", margin = margin(r = 10)),
    axis.title.y.right = element_text(color = "#c00000", margin = margin(l = 10)),
    legend.position = "none",
    plot.margin = margin(5, 10, 5, 10),
    aspect.ratio = 0.6
  ) +
  labs(
    x = "Year",
    title = " "
  ) +
  # DNA Profiles legend box
  annotate("rect", xmin = legend_start_date, 
           xmax = legend_start_date + years(6), 
           ymin = max_dna * 0.86, ymax = max_dna, 
           fill = "white", color = "black", alpha = 0.9, linewidth = 0.3) +
  # Investigations legend box
  annotate("rect", xmin = legend_start_date, 
           xmax = legend_start_date + years(7), 
           ymin = max_dna * 0.74, ymax = max_dna * 0.80, 
           fill = "white", color = "black", alpha = 0.9, linewidth = 0.3) +
  # DNA Profiles legend items
  annotate("point", 
           x = legend_start_date + years(0) + months(6), 
           y = c(max_dna * 0.97, max_dna * 0.93, max_dna * 0.89),
           color = c("#1f4e79", "#2e75b6", "#5b9bd5"), size = 1.5) +
  annotate("text", 
           x = legend_start_date + years(1), 
           y = c(max_dna * 0.97, max_dna * 0.93, max_dna * 0.89),
           label = c("Offender Profiles", "Arrestee Profiles", "Forensic Profiles"),
           hjust = 0, size = 6) +
  annotate("text", 
           x = legend_start_date + years(0), 
           y = max_dna * 1.01, 
           label = "DNA Profiles (Millions)", 
           fontface = "bold", hjust = 0, size = 6, vjust = 0) +
  # Investigations Aided legend
  annotate("point", 
           x = legend_start_date + years(0) + months(6), 
           y = max_dna * 0.77, 
           color = "#c00000", size = 1.5) +
  annotate("text", 
           x = legend_start_date + years(1), 
           y = max_dna * 0.77, 
           label = "Investigations Aided",
           hjust = 0, size = 6) +
  annotate("text", 
           x = legend_start_date + years(0), 
           y = max_dna * 0.81, 
           label = "Investigations (Thousands)", 
           fontface = "bold", hjust = 0, size = 6 , vjust = 0)

p_static
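Building on growth_data_yearly from the block above, a quick year-over-year growth calculation (a sketch, not part of the original pipeline) helps flag the acceleration points and plateaus described in the objectives.

# Sketch: year-over-year growth in total profiles and investigations aided
growth_rates <- growth_data_yearly %>%
  arrange(year) %>%
  mutate(
    profiles_yoy = total_profiles / lag(total_profiles) - 1,
    investigations_yoy = investigations_total / lag(investigations_total) - 1
  ) %>%
  select(year, total_profiles, profiles_yoy, investigations_total, investigations_yoy)

print(knitr::kable(growth_rates, digits = 3, format = "simple"))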

Temporal Coverage

The heat maps visualize the temporal coverage of NDIS data submissions across jurisdictions over the years, both for the intermediate CSV file (which still contains outliers and reporting errors) and for the cleaned dataset, highlighting periods of active reporting and gaps in submission. The code below builds the cleaned-data version; a sketch for the intermediate file follows it.

Show heatmap code
# Prepare data for heatmap - CLEANED DATASET
temporal_coverage_clean <- ndis_clean %>%
  mutate(year = year(capture_datetime)) %>%
  count(jurisdiction, year) %>%
  complete(jurisdiction, year = 2001:2025, fill = list(n = 0)) %>%
  filter(!is.na(jurisdiction)) %>%
  mutate(jurisdiction = factor(jurisdiction, levels = rev(sort(unique(jurisdiction)))))

# Create the heatmap for cleaned data
heatmap_after_clean <- ggplot(temporal_coverage_clean, aes(x = year, y = jurisdiction, fill = n)) +
  geom_tile(color = "white", linewidth = 0.3) +
  scale_fill_viridis(
    name = "Snapshots\nper Year",
    option = "plasma",
    direction = -1,
    breaks = c(0, 3, 6, 10),
    labels = c("0", "3", "6", "10+")
  ) +
  scale_x_continuous(
    breaks = seq(2001, 2025, by = 1),
    expand = expansion(mult = 0.01)
  ) +
  labs(
    x = "Year",
    y = "Jurisdiction",
    title = " "
  ) +
  theme_ndis(base_size = 11) +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "right",
    legend.key.height = unit(0.6, "cm"), 
    legend.key.width = unit(0.2, "cm")
  )

heatmap_after_clean
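The same construction can be applied to the intermediate file for a before/after comparison. The sketch below assumes a data frame named ndis_intermediate (a hypothetical name for the intermediate CSV loaded from data/ndis/intermediate/) with the same capture_datetime and jurisdiction columns.

# Sketch: parallel heatmap for the intermediate (pre-cleaning) data
# ndis_intermediate is an assumed object holding the intermediate CSV contents
temporal_coverage_raw <- ndis_intermediate %>%
  mutate(year = year(capture_datetime)) %>%
  count(jurisdiction, year) %>%
  complete(jurisdiction, year = 2001:2025, fill = list(n = 0)) %>%
  filter(!is.na(jurisdiction)) %>%
  mutate(jurisdiction = factor(jurisdiction, levels = rev(sort(unique(jurisdiction)))))

# ggplot2's `%+%` operator swaps the data underlying the cleaned-data heatmap,
# reusing its scales and theme unchanged
heatmap_before_clean <- heatmap_after_clean %+% temporal_coverage_raw
heatmap_before_clean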

Comparison with peer-reviewed papers

As an additional check, we compared corrected national aggregates against published NDIS totals from FBI press releases and peer-reviewed articles. As shown in Figure 6, the reconstructed dataset aligns closely with these independent milestones, supporting the technical quality of the NDIS time series.

Show peer-reviewed literature comparison
# Preparation of growth_data_yearly 
growth_data_yearly <- ndis_clean %>%
  mutate(year = year(capture_datetime)) %>%
  group_by(jurisdiction, year) %>%
  arrange(jurisdiction, capture_datetime) %>%
  mutate(
    selection_priority = case_when(
      year <= 2018 ~ arrestee,
      year > 2018 ~ offender_profiles
    )
  ) %>%
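  # Keep one snapshot per jurisdiction-year: the row maximizing the selection
  # metric above (arrestee counts through 2018, offender counts thereafter)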
  slice_max(order_by = selection_priority, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  group_by(year) %>%
  summarise(
    offender_total = sum(offender_profiles, na.rm = TRUE),
    arrestee_total = sum(arrestee, na.rm = TRUE),
    forensic_total = sum(forensic_profiles, na.rm = TRUE),
    investigations_total = sum(investigations_aided, na.rm = TRUE),
    n_jurisdictions = n(),
    .groups = 'drop'
  ) %>%
  mutate(
    total_profiles = offender_total + arrestee_total + forensic_total,
    date = as.Date(paste0(year, "-06-01"))
  )

# Prepare data for plotting DNA profiles
dna_data <- growth_data_yearly %>%
  select(date, offender_total, arrestee_total, forensic_total, total_profiles) %>%
  pivot_longer(
    cols = c(offender_total, arrestee_total, forensic_total, total_profiles),
    names_to = "variable",
    values_to = "count"
  ) %>%
  mutate(
    variable = case_when(
      variable == "offender_total" ~ "Offender",
      variable == "arrestee_total" ~ "Arrestee", 
      variable == "forensic_total" ~ "Forensic",
      variable == "total_profiles" ~ "Total"
    )
  )

# Prepare data for plotting investigations
investigations_data <- growth_data_yearly %>%
  select(date, investigations_total)

# Create literature dataset
literature_data <- tribble(
  ~citation, ~asof_date, ~offender_profiles, ~arrestee_profiles, ~forensic_profiles, ~total_profiles, ~investigations_aided, ~short_label,
  "FBI Brochure", "2000-12-01", 441181, NA, 21625, NA, 1573, "FBI (Dec 2000)",
  "FBI Brochure", "2002-12-01", 1247163, NA, 46177, NA, 6670, "FBI (Dec 2002)",
  "FBI Brochure", "2004-12-01", 2038514, NA, 93956, NA, 21266, "FBI (Dec 2004)",
  "FBI Brochure", "2006-12-01", 3977435, 54313, 160582, NA, 45364, "FBI (Dec 2006)",
  "FBI Brochure", "2008-12-01", 6399200, 140719, 248943, NA, 81955, "FBI (Dec 2008)",
  "FBI Brochure", "2010-12-01", 8564705, 668849, 351951, NA, 130317, "FBI (Dec 2010)",
  "FBI Brochure", "2012-12-01", 10086404, 1332721, 446689, NA, 190560, "FBI (Dec 2012)",
  "FBI Brochure", "2015-06-01", 11822927, 2028734, 638162, NA, 274648, "FBI (Jun 2015)",
  "Ge et al., 2012", "2011-06-01", NA, NA, NA, 10000000, 141300, "Ge et al., 2012",
  "Ge et al., 2014", "2013-05-01", NA, NA, NA, 12000000, 185000, "Ge et al., 2014",
  "Wickenheiser, 2022", "2021-10-01", 14836490, 4513955, 1144255, NA, 587773, "Wickenheiser, 2022",
  "Link et al., 2023", "2022-11-01", NA, NA, NA, 21791620, 622955, "Link et al., 2023",
  "Greenwald & Phiri, 2024", "2024-02-01", 17000000, 5000000, 1300000, NA, 680000, "Greenwald & Phiri, 2024"
) %>%
  mutate(
    asof_date = as.Date(asof_date),
    total_profiles = ifelse(
      is.na(total_profiles),
      rowSums(select(., offender_profiles, arrestee_profiles, forensic_profiles), na.rm = TRUE),
      total_profiles
    )
  )

# Prepare literature data for DNA profiles
literature_dna <- literature_data %>%
  select(short_label, asof_date, offender_profiles, arrestee_profiles, forensic_profiles, total_profiles) %>%
  pivot_longer(
    cols = c(offender_profiles, arrestee_profiles, forensic_profiles, total_profiles),
    names_to = "variable",
    values_to = "count"
  ) %>%
  filter(!is.na(count)) %>%
  mutate(
    variable = case_when(
      variable == "offender_profiles" ~ "Offender",
      variable == "arrestee_profiles" ~ "Arrestee",
      variable == "forensic_profiles" ~ "Forensic",
      variable == "total_profiles" ~ "Total"
    )
  )

# Prepare literature data for investigations
literature_investigations <- literature_data %>%
  select(short_label, asof_date, investigations_aided) %>%
  filter(!is.na(investigations_aided))

# Get date range
date_range <- range(growth_data_yearly$date)
extended_date_range <- c(min(date_range) - years(1), max(date_range))
legend_start_date <- extended_date_range[1]

# Calculate y-axis limits for DNA profiles
max_dna <- max(dna_data$count, na.rm = TRUE)
y_upper_dna <- max_dna * 1.05

# Calculate y-axis limits for investigations
max_inv <- max(investigations_data$investigations_total, na.rm = TRUE)
y_upper_inv <- max_inv * 1.05

# Define colors for each plot type
offender_color <- "#31688e"
arrestee_color <- "#35b779"
forensic_color <- "#440154"
total_color <- "#22a884"
investigations_color <- "#fde724"

# Create individual plots for each DNA profile type
p_offender <- ggplot() +
  geom_line(data = dna_data %>% filter(variable == "Offender"), 
            aes(x = date, y = count), 
            color = offender_color, linewidth = 0.5) +
  geom_point(data = dna_data %>% filter(variable == "Offender"), 
             aes(x = date, y = count), 
             color = offender_color, size = 1.0) +
  geom_point(data = literature_dna %>% filter(variable == "Offender"),
             aes(x = asof_date, y = count),
             shape = 4, size = 2.8, stroke = 1.0, color = offender_color) +
  geom_label_repel(
    data = literature_dna %>% filter(variable == "Offender" & lubridate::year(asof_date) < 2020),
    aes(x = asof_date, y = count, label = short_label),
    size = 3.5,
    nudge_y = 5000000,
    box.padding = 0.35, point.padding = 0.5,
    min.segment.length = 0.5, segment.color = "gray50",
    direction = "both", force = 5, force_pull = 1,
    max.overlaps = Inf,
    fill = "white", label.size = 0.2, label.padding = unit(0.15, "lines")
  ) +
  geom_label_repel(
    data = literature_dna %>% filter(variable == "Offender" & lubridate::year(asof_date) >= 2021),
    aes(x = asof_date, y = count, label = short_label),
    size = 3.5,
    nudge_y = 1000000,
    box.padding = 0.35, point.padding = 0.5,
    min.segment.length = 0.5, segment.color = "gray50",
    direction = "both", force = 5, force_pull = 1,
    max.overlaps = Inf,
    fill = "white", label.size = 0.2, label.padding = unit(0.15, "lines")
  ) +
  scale_x_date(name = "Year", date_breaks = "1 years", date_labels = "%Y",
               limits = extended_date_range, expand = expansion(mult = 0.02)) +
  scale_y_continuous(name = "Offender Profiles",
                     labels = function(x) ifelse(x >= 1e6, paste0(x/1e6, "M"), 
                                                  ifelse(x >= 1e3, paste0(x/1e3, "K"), x)),
                     limits = c(0, max(literature_dna %>% filter(variable == "Offender") %>% pull(count), na.rm = TRUE) * 1.1),
                     expand = expansion(mult = c(0, 0.05))) +
  theme_ndis(base_size = 11) +
  theme(
    panel.grid.major.y = element_line(color = "gray90", linewidth = 0.3),
    panel.grid.minor = element_blank(), 
    panel.grid.major.x = element_blank(),
    axis.line = element_line(color = "black", linewidth = 0.3),
    axis.ticks = element_line(color = "black", linewidth = 0.3),
    axis.text.x = element_text(angle = 45, hjust = 1, margin = margin(t = -5)),
    axis.text.y = element_text(margin = margin(r = -10)),
    axis.title.x = element_blank(),
    axis.title.y = element_text(margin = margin(r = 2)),
    plot.title = element_blank(),
    plot.title.position = "plot"
  ) +
  labs(x = "Year", y = NULL, title = "Offender Profiles")

p_offender

Show peer-reviewed literature comparison
p_arrestee <- ggplot() +
  geom_line(data = dna_data %>% filter(variable == "Arrestee"), 
            aes(x = date, y = count), 
            color = arrestee_color, linewidth = 0.5) +
  geom_point(data = dna_data %>% filter(variable == "Arrestee"), 
             aes(x = date, y = count), 
             color = arrestee_color, size = 1.0) +
  geom_point(data = literature_dna %>% filter(variable == "Arrestee"),
             aes(x = asof_date, y = count),
             shape = 4, size = 2.8, stroke = 1.0, color = arrestee_color) +
  geom_label_repel(
    data = literature_dna %>% filter(variable == "Arrestee"),
    aes(x = asof_date, y = count, label = short_label),
    size = 3.5,
    nudge_y = 500000,
    box.padding = 0.35, point.padding = 0.5,
    min.segment.length = 0.5, segment.color = "gray50",
    direction = "both", force = 5, force_pull = 1,
    max.overlaps = Inf,
    fill = "white", label.size = 0.2, label.padding = unit(0.15, "lines")
  ) +
  scale_x_date(name = "Year", date_breaks = "1 years", date_labels = "%Y",
               limits = extended_date_range, expand = expansion(mult = 0.02)) +
  scale_y_continuous(name = "Arrestee Profiles",
                     labels = function(x) ifelse(x >= 1e6, paste0(x/1e6, "M"), 
                                                  ifelse(x >= 1e3, paste0(x/1e3, "K"), x)),
                     limits = c(0, max(literature_dna %>% filter(variable == "Arrestee") %>% pull(count), na.rm = TRUE) * 1.1),
                     expand = expansion(mult = c(0, 0.05))) +
  theme_ndis(base_size = 11) +
  theme(
    panel.grid.major.y = element_line(color = "gray90", linewidth = 0.3),
    panel.grid.minor = element_blank(), 
    panel.grid.major.x = element_blank(),
    axis.line = element_line(color = "black", linewidth = 0.3),
    axis.ticks = element_line(color = "black", linewidth = 0.3),
    axis.text.x = element_text(angle = 45, hjust = 1, margin = margin(t = -5)),
    axis.text.y = element_text(margin = margin(r = -10)),
    axis.title.x = element_blank(),
    axis.title.y = element_text(margin = margin(r = 2)),
    plot.title = element_blank(),
    plot.title.position = "plot") +
  labs(x = "Year", y = NULL, title = "Arrestee Profiles")

p_arrestee

Show peer-reviewed literature comparison
p_forensic <- ggplot() +
  geom_line(data = dna_data %>% filter(variable == "Forensic"), 
            aes(x = date, y = count), 
            color = forensic_color, linewidth = 0.5) +
  geom_point(data = dna_data %>% filter(variable == "Forensic"), 
             aes(x = date, y = count), 
             color = forensic_color, size = 1.0) +
  geom_point(data = literature_dna %>% filter(variable == "Forensic"),
             aes(x = asof_date, y = count),
             shape = 4, size = 2.8, stroke = 1.0, color = forensic_color) +
  geom_label_repel(
    data = literature_dna %>% filter(variable == "Forensic" & lubridate::year(asof_date) < 2020),
    aes(x = asof_date, y = count, label = short_label),
    size = 3.5,
    nudge_y = 500000,
    box.padding = 0.35, point.padding = 0.5,
    min.segment.length = 0.5, segment.color = "gray50",
    direction = "both", force = 5, force_pull = 1,
    max.overlaps = Inf,
    fill = "white", label.size = 0.2, label.padding = unit(0.15, "lines")
  ) +
  geom_label_repel(
    data = literature_dna %>% filter(variable == "Forensic" & lubridate::year(asof_date) >= 2021),
    aes(x = asof_date, y = count, label = short_label),
    size = 3.5,
    nudge_y = 100000,
    box.padding = 0.35, point.padding = 0.5,
    min.segment.length = 0.5, segment.color = "gray50",
    direction = "both", force = 5, force_pull = 1,
    max.overlaps = Inf,
    fill = "white", label.size = 0.2, label.padding = unit(0.15, "lines")
  ) +
  scale_x_date(name = "Year", date_breaks = "1 years", date_labels = "%Y",
               limits = extended_date_range, expand = expansion(mult = 0.02)) +
  scale_y_continuous(name = "Forensic Profiles",
                     labels = function(x) ifelse(x >= 1e6, paste0(x/1e6, "M"), 
                                                  ifelse(x >= 1e3, paste0(x/1e3, "K"), x)),
                     limits = c(0, max(literature_dna %>% filter(variable == "Forensic") %>% pull(count), na.rm = TRUE) * 1.1),
                     expand = expansion(mult = c(0, 0.05))) +
  theme_ndis(base_size = 11) +
  theme(
    panel.grid.major.y = element_line(color = "gray90", linewidth = 0.3),
    panel.grid.minor = element_blank(), 
    panel.grid.major.x = element_blank(),
    axis.line = element_line(color = "black", linewidth = 0.3),
    axis.ticks = element_line(color = "black", linewidth = 0.3),
    axis.text.x = element_text(angle = 45, hjust = 1, margin = margin(t = -5)),
    axis.text.y = element_text(margin = margin(r = -10)),
    axis.title.x = element_blank(),
    axis.title.y = element_text(margin = margin(r = 2)),
    plot.title = element_blank(),
    plot.title.position = "plot") +
  labs(x = "Year", y = NULL, title = "Forensic Profiles")

p_forensic

Show peer-reviewed literature comparison
p_total <- ggplot() +
  geom_line(data = dna_data %>% filter(variable == "Total"), 
            aes(x = date, y = count), 
            color = total_color, linewidth = 0.5) +
  geom_point(data = dna_data %>% filter(variable == "Total"), 
             aes(x = date, y = count), 
             color = total_color, size = 1.0) +
  geom_point(data = literature_dna %>% filter(variable == "Total"),
             aes(x = asof_date, y = count),
             shape = 4, size = 2.8, stroke = 1.0, color = total_color) +
  geom_label_repel(
    data = literature_dna %>% filter(variable == "Total" & lubridate::year(asof_date) >= 2020),
    aes(x = asof_date, y = count, label = short_label),
    size = 3.5,
    nudge_y = -1000000,
    box.padding = 0.35, point.padding = 0.5,
    min.segment.length = 0.5, segment.color = "gray50",
    direction = "y", force = 5, force_pull = 1,
    max.overlaps = Inf,
    fill = "white", label.size = 0.2, label.padding = unit(0.15, "lines")
  ) +
  geom_label_repel(
    data = literature_dna %>% filter(variable == "Total" & lubridate::year(asof_date) < 2021),
    aes(x = asof_date, y = count, label = short_label),
    size = 3.5,
    nudge_y = 6000000,
    box.padding = 0.35, point.padding = 0.5,
    min.segment.length = 0.5, segment.color = "gray50",
    direction = "both", force = 5, force_pull = 1,
    max.overlaps = Inf,
    fill = "white", label.size = 0.2, label.padding = unit(0.15, "lines")
  ) +
  scale_x_date(name = "Year", date_breaks = "1 years", date_labels = "%Y",
               limits = extended_date_range, expand = expansion(mult = 0.02)) +
  scale_y_continuous(name = "Total Profiles",
                     labels = function(x) ifelse(x >= 1e6, paste0(x/1e6, "M"), 
                                                  ifelse(x >= 1e3, paste0(x/1e3, "K"), x)),
                     limits = c(0, max(literature_dna %>% filter(variable == "Total") %>% pull(count), na.rm = TRUE) * 1.1),
                     expand = expansion(mult = c(0, 0.05))) +
  theme_ndis(base_size = 11) +
  theme(
    panel.grid.major.y = element_line(color = "gray90", linewidth = 0.3),
    panel.grid.minor = element_blank(), 
    panel.grid.major.x = element_blank(),
    axis.line = element_line(color = "black", linewidth = 0.3),
    axis.ticks = element_line(color = "black", linewidth = 0.3),
    axis.text.x = element_text(angle = 45, hjust = 1, margin = margin(t = -5)),
    axis.text.y = element_text(margin = margin(r = -10)),
    axis.title.x = element_blank(),
    axis.title.y = element_text(margin = margin(r = 2)),
    plot.title = element_blank(),
    plot.title.position = "plot" ) +
  labs(x = "Year", y = NULL, title = "Total Profiles")

p_total

Show peer-reviewed literature comparison
p_investigations_grid <- ggplot() +
  geom_line(data = investigations_data, 
            aes(x = date, y = investigations_total), 
            color = investigations_color, linewidth = 0.5) +
  geom_point(data = investigations_data, 
             aes(x = date, y = investigations_total), 
             color = investigations_color, size = 1.0) +
  geom_point(data = literature_investigations,
             aes(x = asof_date, y = investigations_aided),
             shape = 4, size = 2.8, stroke = 1.0, color = investigations_color) +
  geom_label_repel(
    data = literature_investigations %>% filter(lubridate::year(asof_date) < 2021),
    aes(x = asof_date, y = investigations_aided, label = short_label),
    size = 3.5,
    nudge_y = 100000,
    box.padding = 0.35, point.padding = 0.5,
    min.segment.length = 0.5, segment.color = "gray50",
    direction = "both", force = 5, force_pull = 1,
    max.overlaps = Inf,
    fill = "white", label.size = 0.2, label.padding = unit(0.15, "lines")
  ) +
  geom_label_repel(
    data = literature_investigations %>% filter(lubridate::year(asof_date) > 2021),
    aes(x = asof_date, y = investigations_aided, label = short_label),
    size = 3.5,
    nudge_x = -600,
    nudge_y = 15000,
    box.padding = 0.35, point.padding = 0.5,
    min.segment.length = 0.5, segment.color = "gray50",
    direction = "x",
    force = 5, force_pull = 1,
    max.overlaps = Inf,
    fill = "white", label.size = 0.2, label.padding = unit(0.15, "lines")
  ) +
  geom_label_repel(
    data = literature_investigations %>% filter(lubridate::year(asof_date) == 2021),
    aes(x = asof_date, y = investigations_aided, label = short_label),
    size = 3.5,
    nudge_y = 10000,
    nudge_x = 1000,
    box.padding = 0.35, point.padding = 0.5,
    min.segment.length = 0.5, segment.color = "gray50",
    direction = "both",
    force = 5, force_pull = 1,
    max.overlaps = Inf,
    fill = "white", label.size = 0.2, label.padding = unit(0.15, "lines")
  ) +
  scale_x_date(name = "Year", date_breaks = "1 years", date_labels = "%Y",
               limits = extended_date_range, expand = expansion(mult = 0.02)) +
  scale_y_continuous(name = "Investigations Aided",
                     labels = function(x) ifelse(x >= 1e6, paste0(x/1e6, "M"), 
                                                  ifelse(x >= 1e3, paste0(x/1e3, "K"), x)),
                     limits = c(0, max(literature_investigations %>% pull(investigations_aided), na.rm = TRUE) * 1.1),
                     expand = expansion(mult = c(0, 0.05))) +
  theme_ndis(base_size = 11) +
  theme(
    panel.grid.major.y = element_line(color = "gray90", linewidth = 0.3),
    panel.grid.minor = element_blank(), 
    panel.grid.major.x = element_blank(),
    axis.line = element_line(color = "black", linewidth = 0.3),
    axis.ticks = element_line(color = "black", linewidth = 0.3),
    axis.text.x = element_text(angle = 45, margin = margin(t = -5)),
    axis.text.y = element_text(margin = margin(r = -10)),
    axis.title.x = element_text(margin = margin(t = 2)),
    axis.title.y = element_text(margin = margin(r = 2)),
    plot.title = element_blank(),
    plot.title.position = "plot") +
  labs(x = "Year", y = NULL, title = "Investigations Aided")

p_investigations_grid

Show peer-reviewed literature comparison
plots_grid <- (p_offender + p_arrestee) / 
              (p_forensic + p_total) / 
              (p_investigations_grid)

plots_grid <- plots_grid + 
  plot_layout(heights = c(1, 1, 1)) &
  theme(plot.title = element_text(face = "bold"),
        axis.title = element_text(face = "bold"))

plots_grid
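As a quick numeric complement to the visual comparison, the published milestones can be joined to the reconstructed totals for the matching calendar year and the relative difference reported (a sketch using the literature_data and growth_data_yearly objects built above; differences in reporting month account for part of any residual gap).

# Sketch: relative difference between published milestones and reconstructed yearly totals
milestone_check <- literature_data %>%
  mutate(year = lubridate::year(asof_date)) %>%
  inner_join(
    growth_data_yearly %>% select(year, reconstructed_total = total_profiles),
    by = "year"
  ) %>%
  mutate(rel_diff = (reconstructed_total - total_profiles) / total_profiles) %>%
  select(short_label, year, published_total = total_profiles, reconstructed_total, rel_diff)

print(knitr::kable(milestone_check, digits = 3, format = "simple"))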

Summary Statistics

This section presents basic descriptive statistics that summarize the scope and characteristics of the NDIS data.

Show summary statistics code
# Summary statistics table
ndis_summary <- ndis_clean %>%
  mutate(
    year = year(capture_datetime),
    offender_profiles = ifelse(is.na(offender_profiles), 0, offender_profiles),
    arrestee = ifelse(is.na(arrestee), 0, arrestee),
    forensic_profiles = ifelse(is.na(forensic_profiles), 0, forensic_profiles)
  ) %>%
  group_by(year, jurisdiction) %>%
  summarise(
    offender = max(offender_profiles, na.rm = TRUE),
    arrestee = max(arrestee, na.rm = TRUE),
    forensic = max(forensic_profiles, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  group_by(year) %>%
  summarise(
    jurisdictions = n(),
    offender = sum(offender, na.rm = TRUE),
    arrestee = sum(arrestee, na.rm = TRUE),
    forensic = sum(forensic, na.rm = TRUE),
    total_profiles = sum(offender + arrestee + forensic, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  arrange(year)

# Print summary table
kable(ndis_summary, caption = "Annual Summary Statistics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Table: Annual Summary Statistics

 year   jurisdictions    offender   arrestee   forensic   total_profiles
-----  --------------  ----------  ---------  ---------  ---------------
 2001              29      602848          0      21414           624262
 2002              34      922908          0      36070           958978
 2003              41     1406708          0      65446          1472154
 2004              45     1859371          0      91460          1950831
 2005              48     2743367          0     123780          2867147
 2006              49     3714152          0     153738          3867890
 2007              51     5051763          0     195056          5246819
 2008              50     6229082          0     240466          6469548
 2009              51     7000317          0     275850          7276167
 2010              51     8213788          0     331258          8545046
 2011              52     9298626          0     397257          9695883
 2012              52     9993151    1239280     449642         11682073
 2013              52    10667112    1694305     524823         12886240
 2014              52    11192422    2037068     587026         13816516
 2015              52    11934281    2287369     654047         14875697
 2016              52    12227209    2254961     687439         15169609
 2017              54    13332249    2891857     811242         17035348
 2018              54    13566718    3324282     895228         17786228
 2019              54    14013946    3760236     979841         18754023
 2020              54    14377412    4181569    1069155         19628136
 2021              54    14836566    4513962    1144266         20494794
 2022              54    14836490    4513955    1144255         20494700
 2023              54    16532335    5190629    1282432         23005396
 2024              54    17026171    5382544    1321790         23730505
 2025              54    18648655    5954756    1421751         26025162

Data Visualization

Geospatial Mapping of Jurisdiction Participation

Show jurisdiction mapping analysis code
jurisdiction_coords <- tibble::tribble(
  ~jurisdiction_std,        ~lat,     ~lng,
  "Alabama",         32.8067,  -86.7911,
  "Alaska",         66.1605, -153.3691,
  "Arizona",        33.7298, -111.4312,
  "Arkansas",       34.9697,  -92.3731,
  "California",     36.1162, -119.6816,
  "Colorado",       39.0598, -105.3111,
  "Connecticut",    41.5978,  -72.7554,
  "Delaware",       39.3185,  -75.5071,
  "DC/FBI Lab",     38.9072,  -77.0369,
  "DC/Metro PD",    39.9072,  -77.0369,
  "Florida",        27.7663,  -81.6868,
  "Georgia",        33.0406,  -83.6431,
  "Hawaii",         21.3068, -157.7912,
  "Idaho",          44.2405, -114.4788,
  "Illinois",       40.3495,  -88.9861,
  "Indiana",        39.8494,  -86.2583,
  "Iowa",           42.0115,  -93.2105,
  "Kansas",         38.5266,  -96.7265,
  "Kentucky",       37.6681,  -84.6701,
  "Louisiana",      31.1695,  -91.8678,
  "Maine",          44.6939,  -69.3819,
  "Maryland",       39.0639,  -76.8021,
  "Massachusetts",  42.2302,  -71.5301,
  "Michigan",       43.3266,  -84.5361,
  "Minnesota",      45.6945,  -93.9002,
  "Mississippi",    32.7416,  -89.6787,
  "Missouri",       38.4561,  -92.2884,
  "Montana",        46.9219, -110.4544,
  "Nebraska",       41.1254,  -98.2681,
  "Nevada",         38.3135, -117.0554,
  "New Hampshire",  43.4525,  -71.5639,
  "New Jersey",     40.2989,  -74.5210,
  "New Mexico",     34.8405, -106.2485,
  "New York",       42.1657,  -74.9481,
  "North Carolina", 35.6301,  -79.8064,
  "North Dakota",   47.5289,  -99.7840,
  "Ohio",           40.3888,  -82.7649,
  "Oklahoma",       35.5653,  -96.9289,
  "Oregon",         44.5720, -122.0709,
  "Pennsylvania",   40.5908,  -77.2098,
  "Rhode Island",   41.6809,  -71.5118,
  "South Carolina", 33.8569,  -80.9450,
  "South Dakota",   44.2998,  -99.4388,
  "Tennessee",      35.7478,  -86.6923,
  "Texas",          31.0545,  -97.5635,
  "Utah",           40.1500, -111.8624,
  "Vermont",        44.0459,  -72.7107,
  "Virginia",       37.7693,  -78.1700,
  "Washington",     47.4009, -121.4905,
  "West Virginia",  38.4912,  -80.9545,
  "Wisconsin",      44.2685,  -89.6165,
  "Wyoming",        42.7560, -107.3025,
  "US Army",        33.7443,  -86.4733,
  "Puerto Rico",    18.2208,  -66.5901
)

state_abbs <- tibble::tibble(
  state = tolower(c(
    "Alabama","Alaska","Arizona","Arkansas","California","Colorado","Connecticut",
    "Delaware", "Florida","Georgia","Hawaii","Idaho","Illinois",
    "Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts",
    "Michigan","Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada",
    "New Hampshire","New Jersey","New Mexico","New York","North Carolina","North Dakota",
    "Ohio","Oklahoma","Oregon","Pennsylvania","Rhode Island","South Carolina",
    "South Dakota","Tennessee","Texas","Utah","Vermont","Virginia","Washington",
    "West Virginia","Wisconsin","Wyoming",
    "DC/Metro PD",
    "DC/FBI Lab",
    "Puerto Rico",
    "US Army"
  )),
  abb = c(
    "AL","AK","AZ","AR","CA","CO","CT",
    "DE","FL","GA","HI","ID","IL",
    "IN","IA","KS","KY","LA","ME","MD","MA",
    "MI","MN","MS","MO","MT","NE","NV",
    "NH","NJ","NM","NY","NC","ND",
    "OH","OK","OR","PA","RI","SC",
    "SD","TN","TX","UT","VT","VA","WA",
    "WV","WI","WY",
    "DC",
    "FBI",
    "PR",
    "US"
  )
)

map_data <- ndis_clean %>%
  left_join(jurisdiction_coords, by = c("jurisdiction" = "jurisdiction_std"))

set.seed(123) # for reproducibility

map_data <- map_data %>%
  group_by(lat, lng) %>%
  mutate(
    n = n(),
    offset_needed = n > 1,
    lat_offset = ifelse(offset_needed, runif(1, -0.5, 0.5), 0),
    lng_offset = ifelse(offset_needed, runif(1, -0.5, 0.5), 0),
    lat_adj = lat + lat_offset,
    lng_adj = lng + lng_offset
  ) %>%
  ungroup()

jurisdiction_summary <- map_data %>%
  group_by(jurisdiction) %>%
  filter(capture_datetime == max(capture_datetime, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(jurisdiction = tolower(trimws(jurisdiction))) %>% 
  group_by(jurisdiction) %>%
  summarise(
    capture_datetime = max(capture_datetime, na.rm = TRUE),
    offender = if (all(is.na(offender_profiles))) 0 else max(offender_profiles, na.rm = TRUE),
    arrestee = if (all(is.na(arrestee))) 0 else max(arrestee, na.rm = TRUE),
    forensic = if (all(is.na(forensic_profiles))) 0 else max(forensic_profiles, na.rm = TRUE),
    total_profiles = sum(c(offender, arrestee, forensic), na.rm = TRUE),
    lat_adj = first(lat_adj),
    lng_adj = first(lng_adj)
  ) %>%
  left_join(state_abbs, by = c("jurisdiction" = "state")) %>%
  filter(!is.na(lat_adj) & !is.na(lng_adj))

pal <- colorNumeric(palette = "Blues", domain = jurisdiction_summary$total_profiles)

leaflet() %>%
  addTiles() %>%
  addLabelOnlyMarkers(
    data = jurisdiction_summary,
    lng = ~lng_adj,
    lat = ~lat_adj,
    label = ~abb,
    labelOptions = labelOptions(
      noHide = TRUE,
      direction = "center",
      textOnly = FALSE,
      style = list(
        "background" = "white",
        "border" = "2px solid #1a5276",
        "border-radius" = "3px",
        "padding" = "2px 4px",
        "font-weight" = "bold",
        "font-size" = "10px",
        "color" = "#1a5276",
        "box-shadow" = "2px 2px 4px rgba(0,0,0,0.3)"
      )
    )
  ) %>%
  addCircleMarkers(
    data = jurisdiction_summary,
    lng = ~lng_adj,
    lat = ~lat_adj,
    stroke = TRUE,
    weight = 1,
    # Shade each marker by total profiles using the palette defined above
    fillColor = ~pal(total_profiles),
    fillOpacity = 0.85,
    popup = ~paste0(
      "<div style='font-size:12px'>",
      "<b>", tools::toTitleCase(jurisdiction), " (", abb, ")</b><br>",
      "Date: ",  format(capture_datetime, "%Y-%m"), "<br>",
      "Total: ", format(total_profiles, big.mark = ","), "<br>",
      "Offender: ", format(offender, big.mark = ","), "<br>",
      "Arrestee: ", format(arrestee, big.mark = ","), "<br>",
      "Forensic: ", format(forensic, big.mark = ","),
      "</div>"
    )
  ) %>%
  addControl(
    html = "<div style='background:white;padding:5px;border:2px solid #1a5276;border-radius:3px;font-weight:bold;'>NDIS 2025 State Participation</div>",
    position = "topright"
  ) %>%
  setView(lng = -98.5833, lat = 39.8333, zoom = 4)

NDIS Time Series Dataset

The NDIS_time_series.csv dataset retains key temporal, jurisdictional, and operational metrics that can be used for further analysis and visualization.

Column                 Type        Description
---------------------  ----------  ----------------------------------------------------------------------------------------------------
capture_datetime       POSIXct     Full timestamp of data capture, parsed from the raw timestamp field (YYYY-MM-DD HH:MM:SS).
asof_date              Date        Standardized date representing the reporting period (asof_year + asof_month, first day of month).
jurisdiction           Character   Name or code of the reporting jurisdiction (e.g., “California”, “Texas”).
offender_profiles      Numeric     Number of DNA profiles from known offenders in the jurisdiction.
arrestee               Numeric     Number of DNA profiles collected from arrestees.
forensic_profiles      Numeric     Number of DNA profiles developed from forensic (crime scene) samples.
total_profiles         Numeric     Sum of offender, arrestee, and forensic profiles for each record.
ndis_labs              Integer     Count of laboratories actively participating in NDIS for the given jurisdiction and month.
investigations_aided   Numeric     Number of investigations aided by NDIS matches in the reporting period.
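For downstream work, a minimal loading sketch (assuming the exported file at data/ndis/final/NDIS_time_series.csv created in the export step below) reads the CSV with column types that match this schema.

# Sketch: re-import the exported time series with explicit column types
ndis_ts <- readr::read_csv(
  here::here("data", "ndis", "final", "NDIS_time_series.csv"),
  col_types = readr::cols(
    capture_datetime = readr::col_datetime(),
    asof_date = readr::col_date(),
    jurisdiction = readr::col_character(),
    offender_profiles = readr::col_double(),
    arrestee = readr::col_double(),
    forensic_profiles = readr::col_double(),
    total_profiles = readr::col_double(),
    ndis_labs = readr::col_integer(),
    investigations_aided = readr::col_double()
  )
)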
Show interactive table code
summary_table <- ndis_clean %>%
  group_by(jurisdiction, capture_datetime) %>%
  arrange(jurisdiction, capture_datetime) %>%
  select(
    capture_datetime, asof_date, jurisdiction,
    offender_profiles, arrestee, forensic_profiles, total_profiles,
    ndis_labs, investigations_aided
  )

# Numeric columns start after capture_datetime, asof_date, and jurisdiction
numeric_cols_start <- 4
numeric_cols_end <- ncol(summary_table)  

# Interactive table
datatable(
  summary_table,
  extensions = c('Buttons', 'ColReorder', 'Scroller'),
  options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'colvis'),
    scrollX = TRUE,
    scrollY = "600px",
    scroller = TRUE,
    pageLength = 20,
    columnDefs = list(
      list(className = 'dt-right', targets = (numeric_cols_start-1):(numeric_cols_end-1))
    )
  ),
  rownames = FALSE,
  filter = 'top'
)

Export Cleaned Dataset

After cleaning and processing the NDIS data, the final dataset is exported as a CSV file for further analysis or sharing. The file is saved to data/ndis/final/ so it can be referenced by other analyses. Use analysis/version_freeze.qmd whenever you need a versioned snapshot.

Output: NDIS_time_series.csv

Column                 Type              Rows    Missing   Unique   First_Values
---------------------  ----------------  ------  --------  -------  --------------------------------------------------------------
capture_datetime       POSIXct, POSIXt   29512         0     9200   2003-06-26 00:35:13, 2004-02-05 18:34:48, 2005-01-26 15:39:56
asof_date              Date              29512      8827      123   <NA>, <NA>, <NA>
jurisdiction           factor            29512         0       54   Alabama, Alabama, Alabama
offender_profiles      numeric           29512         0     7357   2507, 9231, 31542
arrestee               numeric           29512         0     2494   0, 0, 0
forensic_profiles      numeric           29512         0     6218   22, 24, 676
total_profiles         numeric           29512         0     8017   2529, 9255, 32218
ndis_labs              numeric           29512         0       23   4, 4, 4
investigations_aided   numeric           29512         0     4274   299, 299, 434

Data frame dimensions: 29512 rows × 9 columns

Show dataset exportation code
ndis_clean <- ndis_clean %>%
  select(
    capture_datetime, asof_date, jurisdiction,
    offender_profiles, arrestee, forensic_profiles, total_profiles,
    ndis_labs, investigations_aided
  )

# Write latest final dataset
final_dir <- here("data", "ndis", "final")
dir.create(final_dir, recursive = TRUE, showWarnings = FALSE)
final_relative <- file.path("data", "ndis", "final", "NDIS_time_series.csv")
final_path <- here(final_relative)

write_csv(ndis_clean, final_path)
cat(paste("✓ Created final dataset at:", final_relative, "\n"))
✓ Created final dataset at: data/ndis/final/NDIS_time_series.csv 
Show dataset exportation code
cat("Use analysis/version_freeze.qmd to publish processed-data releases.\n")
Use analysis/version_freeze.qmd to publish processed-data releases.