FOIA Document OCR Processing

Processing Murphy & Tong FOIA Documents about State DNA Database Racial Composition

Author

Tina Lasisi | Edit: João P. Donadio

Published

November 9, 2025

1 Overview

This document details the processing of Freedom of Information Act (FOIA) responses from seven U.S. states regarding the demographic composition of their State DNA Index System (SDIS) databases. These responses were obtained by Professor Erin Murphy (NYU Law) in 2018 as part of research on racial disparities in DNA databases.

2 Materials and Methods

2.1 Data Sources

2.1.1 Raw FOIA Responses

The original FOIA responses are stored in two formats:

  • PDFs: raw/foia_pdfs/ - Original scanned documents
  • HTML: raw/foia_html/ - OCR’d versions for easier extraction
Show setup code
# List of required packages
required_packages <- c(
  "tidyverse",    # Data manipulation and visualization
  "here",         # File path management
  "knitr",        # Dynamic report generation
  "kableExtra",   # Enhanced table formatting
  "ggplot2",      # Data visualization
  "patchwork",    # Plot composition and layout
  "scales",       # Axis scaling and formatting
  "tidyr",        # Data tidying and reshaping
  "tibble",       # Modern data frames
  "flextable",    # Advanced table formatting
  "DT",           # Interactive tables
  "cowplot",      # Plotting composition
  "sf",           # Simple Features for spatial data
  "usmap"        # Mapping US states
)
  

# Function to install missing packages
install_missing <- function(packages) {
  for (pkg in packages) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      message(paste("Installing missing package:", pkg))
      install.packages(pkg, dependencies = TRUE)
    }
  }
}

# Install any missing packages
install_missing(required_packages)

# Load all packages
suppressPackageStartupMessages({
  library(tidyverse)
  library(here)
  library(knitr)
  library(kableExtra)
  library(ggplot2)
  library(patchwork)
  library(scales)
  library(tidyr)
  library(tibble)
  library(flextable)
  library(cowplot)
  library(sf)
  library(usmap)
})

# Verify all packages loaded successfully
loaded_packages <- sapply(required_packages, require, character.only = TRUE)
if (all(loaded_packages)) {
  message("All packages loaded successfully!")
} else {
  warning("The following packages failed to load: ", 
          paste(names(loaded_packages)[!loaded_packages], collapse = ", "))
}

# Display options
options(tibble.width = Inf)
options(dplyr.summarise.inform = FALSE)

# Path to per-state files (run notebook from analysis/)
base_dir <- here("..")
per_state <- here("data", "foia", "intermediate")

# ------------------------------------------------------------------
# 1. Discover available per-state CSV files
# ------------------------------------------------------------------
state_files <- list.files(per_state, pattern = "*_foia_data\\.csv$", full.names = TRUE)

if (length(state_files) == 0) {
  stop(paste("No per-state FOIA files found in", per_state, ". Check the folder path."))
}

stem_to_state <- function(stem) {
  toks <- str_split(stem, "_")[[1]]
  if ("foia" %in% toks) {
    toks <- toks[1:(which(toks == "foia") - 1)]
  }
  paste(tools::toTitleCase(toks), collapse = " ")
}

states_available <- map_chr(basename(state_files), ~ stem_to_state(str_remove(.x, "_foia_data\\.csv")))

cat(paste("✓ Found", length(state_files), "per-state files:\n"))
for (s in states_available) {
  cat(paste("  •", s, "\n"))
}

# ------------------------------------------------------------------
# 2. Initialize empty containers for the loop that follows
# ------------------------------------------------------------------
foia_combined <- tibble()
foia_state_metadata <- list()
✓ Found 7 per-state files:
  • California 
  • Florida 
  • Indiana 
  • Maine 
  • Nevada 
  • South Dakota 
  • Texas 

2.2 Processing workflow

For transparency, each state file is processed independently then merged into a single combined long‑format table (foia_combined):

  1. Load one file per state from data/foia/intermediate/.

  2. Append its rows to foia_combined. A parallel dataframe, foia_state_metadata, records what each state reported (counts, percentages, which categories) and any state-specific characteristics (e.g. Nevada’s “flags” terminology).

  3. Quality‑check each state:

    • verify that race and gender percentages sum to ≈ 100 % when provided,
    • confirm that demographic counts sum to the state’s reported total profiles,
    • calculate any missing counts or percentages and tag those rows value_source = "calculated".
  4. Save outputs

    • data/foia/final/foia_data_clean.csv — the fully combined tidy table with both reported and calculated values,
    • data/foia/intermediate/foia_state_metadata.csv — one row per state summarising coverage and caveats. After QC passes, freeze foia_data_clean.csv to data/foia/final/foia_data_clean.csv. Versioned releases are managed via analysis/version_freeze.qmd.

2.3 Helper Functions

The functions below perform each transformation required for harmonizing the state‑level FOIA tables.

2.3.1 Data Processing Helper Functions Reference

Function Definition Parameters
load_state() Loads and preprocesses state FOIA data files, handling numeric conversion and validation path: File path to state CSV
enhanced_glimpse() Provides an enhanced data overview with column types, missing values, unique counts, and unique values df: Input dataframe
fill_demographic_gaps() Fills missing gender counts and adds Unknown race category when totals permit calculation df: Input dataframe
add_combined() Creates Combined offender type by summing Convicted Offender and Arrestee counts when missing df: Input dataframe
add_percentages() Derives percentage values from counts for all demographic categories df: Input dataframe
counts_consistent() Verifies that demographic counts sum to total_profiles for each offender type df: Input dataframe
percentages_consistent() Verifies that percentages sum to 100 ± 0.5% for each category df: Input dataframe
report_status() Reports what data types (counts/percentages/both) are available for a category df: Input dataframe, category: race or gender
verify_category_totals() Compares demographic sums against reported totals and shows differences df: Input dataframe
verify_percentage_consistency() Compares reported vs calculated percentages for consistency df_combined: Combined dataframe, state_name: State name
calculate_combined_totals() Calculates Combined totals by summing across offender types df: Input dataframe, state_name: State name
calculate_percentages() Calculates percentages from counts for demographic categories df_combined: Combined dataframe, state_name: State name
calculate_counts_from_percentages() Calculates counts from percentages for demographic categories df_combined: Combined dataframe, state_name: State name
standardize_offender_types() Standardizes offender type names to consistent terminology df: Input dataframe
prepare_state_for_combined() Prepares state data for inclusion in combined dataset with proper columns df: Input dataframe, state_name: State name
format_compact() Formats large numbers with K/M suffixes for readability x: Numeric value
create_pie_chart() Creates pie charts for specific demographic categories data: Input data, offender_type, category, value_type, title, show_values
create_state_visualizations() Creates comprehensive pie chart visualizations for all metrics df_combined: Combined dataframe, state_name: State name
create_demographic_bar_charts() Creates side-by-side bar charts for gender and race distributions df_combined: Combined dataframe, state_name: State name
add_state_metadata() Creates and appends a metadata record capturing state data characteristics including available offender types, demographic categories, data formats, and special features df: Input dataframe, state_name: State name
update_state_metadata() Modifies existing state metadata to update QC results (count/percentage consistency) and append validation notes state_name: State name, counts_ok: Count consistency flag, percentages_ok: Percentage consistency flag, notes_text: Additional notes
Show helper functions setup
# Columns retained from every raw table
COLS_NEEDED <- c("state", "offender_type", "variable_category",
                 "variable_detailed", "value", "value_type")

# ------------------------------------------------------------------
# 1. Load and preprocess state files
# ------------------------------------------------------------------
load_state <- function(path) {
  "
  Read a *_foia_data.csv* file, enforce column order,
  and convert <1 to 0.5 so that trace counts are retained.
  Execution halts if non-numeric values remain.
  "
  df <- read_csv(path, show_col_types = FALSE)
  if (!"state" %in% colnames(df)) {
    df <- df %>%
      mutate(state = str_remove(basename(path), "_foia_data\\.csv") %>%
        str_replace_all("_", " ") %>%
        tools::toTitleCase())
  }
  df <- df %>% select(all_of(COLS_NEEDED))
  df$value_source <- "reported"

  df <- df %>%
    mutate(value = ifelse(value == "<1", 0.5, value),
           value = as.numeric(value))
  
  nonnumeric <- df %>% filter(is.na(value))
  if (nrow(nonnumeric) > 0) {
    cat(paste("**Non-numeric rows in", basename(path), "; please amend**\n"))
    print(nonnumeric)
    stop("Numeric coercion failure")
  }
  return(df)
}


# ------------------------------------------------------------------
# 2. Enhanced glimpse
# ------------------------------------------------------------------
# Display data types for each column with unique values
enhanced_glimpse <- function(df) {
  glimpse_data <- data.frame(
    Column = names(df),
    Type = sapply(df, function(x) paste(class(x), collapse = ", ")),
    Rows = nrow(df),
    Missing = sapply(df, function(x) sum(is.na(x))),
    Unique = sapply(df, function(x) length(unique(x))),
    Unique_Values = sapply(df, function(x) {
      unique_vals <- unique(x)
      if (length(unique_vals) > 10) {
        paste(encodeString(as.character(unique_vals[1:10])), collapse = ", ", "...")
      } else {
        paste(encodeString(as.character(unique_vals)), collapse = ", ")
      }
    })
  )
  
  ft <- flextable(glimpse_data) %>%
    theme_zebra() %>%
    set_caption(paste("Enhanced Data Glimpse:", deparse(substitute(df)))) %>%
    autofit() %>%
    align(align = "left", part = "all") %>%
    colformat_num(j = c("Rows", "Missing", "Unique"), big.mark = "") %>%
    bg(j = "Missing", bg = function(x) ifelse(x > 0, "#FFF3CD", "transparent")) %>%
    bg(j = "Unique", bg = function(x) ifelse(x == 1, "#FFF3CD", "transparent")) %>%
    add_footer_lines(paste("Data frame dimensions:", nrow(df), "rows ×", ncol(df), "columns")) %>%
    fontsize(size = 10, part = "all") %>%
    set_table_properties(layout = "autofit", width = 1)
  
  return(ft)
}

# ------------------------------------------------------------------
# 3. Fill missing Male counts and Unknown race counts
# ------------------------------------------------------------------
fill_demographic_gaps <- function(df) {
  "If exactly one gender or the Unknown race category is absent and
  totals permit a residual, calculate and insert the missing count.
  "
  inserts <- list()
  
  for (ot in unique(df$offender_type)) {
    tot <- df %>%
      filter(offender_type == ot,
             variable_category == "total",
             variable_detailed == "total_profiles",
             value_type == "count")
    
    if (nrow(tot) == 0) next
    
    total <- tot$value[1]
    
    # gender residual ----------------------------------------------
    g <- df %>%
      filter(offender_type == ot,
             variable_category == "gender",
             value_type == "count")
    
    missing_gender <- setdiff(c("Male", "Female"), unique(g$variable_detailed))
    if (nrow(g) == 1 && length(missing_gender) == 1) {
      inserts[[length(inserts) + 1]] <- tibble(
        state = df$state[1],
        offender_type = ot,
        variable_category = "gender",
        variable_detailed = missing_gender,
        value = total - sum(g$value),
        value_type = "count",
        value_source = "calculated"
      )
    }
    
    # race residual -----------------------------------------------
    r <- df %>%
      filter(offender_type == ot,
             variable_category == "race",
             value_type == "count")
    
    if (nrow(r) > 0 && !"Unknown" %in% r$variable_detailed) {
      gap <- total - sum(r$value)
      if (gap > 0) {
        inserts[[length(inserts) + 1]] <- tibble(
          state = df$state[1],
          offender_type = ot,
          variable_category = "race",
          variable_detailed = "Unknown",
          value = gap,
          value_type = "count",
          value_source = "calculated"
        )
      }
    }
  }
  
  if (length(inserts) > 0) {
    df <- bind_rows(df, bind_rows(inserts))
  }
  return(df)
}

# ------------------------------------------------------------------
# 4. Construct Combined offender type if absent (add_combined)
# ------------------------------------------------------------------
add_combined <- function(df) {
  "
  When a state reports Convicted Offender and Arrestee counts but
  omits Combined, create a Combined block by summing the two.
  "
  if ("Combined" %in% df$offender_type) return(df)
  
  required <- c("Convicted Offender", "Arrestee")
  if (!all(required %in% df$offender_type)) return(df)  # cannot construct
  
  summed <- df %>%
    filter(value_type == "count") %>%
    group_by(variable_category, variable_detailed, value_type) %>%
    summarise(value = sum(value), .groups = "drop") %>%
    mutate(state = df$state[1],
           offender_type = "Combined",
           value_source = "calculated")
  
  return(bind_rows(df, summed))
}

# ------------------------------------------------------------------
# 5. Derive percentages wherever only counts exist (add_percentages)
# ------------------------------------------------------------------
add_percentages <- function(df) {
  "
  Ensure that every gender and race row has both count and percentage
  values, derived from the offender-type total if necessary.
  "
  totals <- df %>%
    filter(variable_category == "total",
           variable_detailed == "total_profiles",
           value_type == "count") %>%
    select(offender_type, value) %>%
    deframe()
  
  need_pct <- df %>%
    filter(value_type == "count",
           variable_category != "total")
  
  new_pct_rows <- need_pct %>%
    rowwise() %>%
    mutate(has_percentage = nrow(df %>%
           filter(state == state,
                  offender_type == offender_type,
                  variable_category == variable_category,
                  variable_detailed == variable_detailed,
                  value_type == "percentage"))) %>%
    filter(has_percentage == 0) %>%
    mutate(value = round(value / totals[offender_type] * 100, 2),
           value_type = "percentage",
           value_source = "calculated") %>%
    select(-has_percentage)
  
  if (nrow(new_pct_rows) > 0) {
    df <- bind_rows(df, new_pct_rows)
  }
  return(df)
}

# ------------------------------------------------------------------
# 6. Counts consistency checks
# ------------------------------------------------------------------
counts_consistent <- function(df) {
  "
  Verifies that demographic counts sum to total_profiles for each
  offender type and category.
  "
  demo_sum <- df %>%
    filter(value_type == "count",
           variable_category != "total") %>%
    group_by(offender_type, variable_category) %>%
    summarise(sum_value = sum(value), .groups = "drop")
  
  totals <- df %>%
    filter(variable_category == "total",
           variable_detailed == "total_profiles",
           value_type == "count") %>%
    select(offender_type, value)
  
  merged <- demo_sum %>%
    left_join(totals, by = "offender_type") %>%
    mutate(diff = abs(sum_value - value))
  
  all(merged$diff < 1e-6)
}

# ------------------------------------------------------------------
# 7. Percentage consistency checks
# ------------------------------------------------------------------

percentages_consistent <- function(df) {
  "
  Verifies that derived or reported percentages sum to 100 ± 0.5 %.
  "
  result <- df %>%
    filter(value_type == "percentage") %>%
    group_by(offender_type, variable_category) %>%
    summarise(sum_value = sum(value), .groups = "drop") %>%
    mutate(consistent = abs(sum_value - 100) <= 0.5)
  
  all(result$consistent)
}


# ------------------------------------------------------------------
# 8. Report status for each category
# ------------------------------------------------------------------

# Define columns needed for foia_combined
report_status <- function(df, category) {
  values <- unique(df$value_type[df$variable_category == category])
  
  if (all(c("count", "percentage") %in% values)) {
    return("both")
  } else if ("count" %in% values) {
    return("counts")
  } else if ("percentage" %in% values) {
    return("percentages")
  } else {
    return("neither")
  }
}

# ------------------------------------------------------------------
# 9. Verify category totals
# ------------------------------------------------------------------

verify_category_totals <- function(df) {
  # 1 pull total_profiles per offender_type
  total_map <- df %>%
    filter(variable_category == "total", 
           variable_detailed == "total_profiles") %>%
    select(offender_type, value) %>%
    deframe() %>%
    as.list()
  
  # 2 sum counts by offender_type and variable_category
  demo_sum <- df %>%
    filter(value_type == "count",
           variable_category != "total") %>%
    group_by(offender_type, variable_category) %>%
    summarise(sum_counts = sum(value, na.rm = TRUE), .groups = "drop")
  
  # 3 attach total_profiles and compute difference
  demo_sum <- demo_sum %>%
    mutate(total_profiles = map_dbl(offender_type, ~total_map[[.x]]),
           difference = total_profiles - sum_counts)
  
  # tidy columns order
  demo_sum %>%
    select(offender_type, variable_category, total_profiles, 
           sum_counts, difference)
}

# ------------------------------------------------------------------
# 10. Calculate Combined totals
# ------------------------------------------------------------------

calculate_combined_totals <- function(df, state_name) {
  # Get all counts
  counts_df <- df %>%
    filter(value_type == 'count') %>%
    mutate(value_source = 'calculated')
  
  # Group by variable_category and variable_detailed, sum values
  combined_sums <- counts_df %>%
    group_by(variable_category, variable_detailed) %>%
    summarise(value = sum(value, na.rm = TRUE), .groups = "drop")
  
  # Create Combined rows
  combined_rows <- combined_sums %>%
    mutate(state = state_name,
           offender_type = 'Combined',
           value_type = 'count',
           value_source = 'calculated') %>%
    select(all_of(COLS_NEEDED), value_source)
  
  return(combined_rows)
}

# ------------------------------------------------------------------
# 11. Calculate percentages from counts
# ------------------------------------------------------------------

calculate_percentages <- function(df_combined, state_name) {
  # Get total profiles for each offender type
  totals_map <- df_combined %>%
    filter(state == state_name,
           variable_category == 'total',
           variable_detailed == 'total_profiles') %>%
    select(offender_type, value) %>%
    deframe() %>%
    as.list()
  
  percentage_rows <- list()
  
  for (offender_type in names(totals_map)) {
    total <- totals_map[[offender_type]]
    
    # Get all demographic counts
    demo_data <- df_combined %>%
      filter(state == state_name,
             offender_type == !!offender_type,
             variable_category %in% c('gender', 'race'),
             value_type == 'count')
    
    if (nrow(demo_data) > 0) {
      # Calculate percentage for each
      demo_percentages <- demo_data %>%
        mutate(value = round((value / total) * 100, 2),
               value_type = 'percentage',
               value_source = 'calculated') %>%
        select(all_of(COLS_NEEDED), value_source)
      
      percentage_rows <- c(percentage_rows, list(demo_percentages))
    }
  }
  
  bind_rows(percentage_rows)
}

# ------------------------------------------------------------------
# 12. Calculate counts from percentages
# ------------------------------------------------------------------

calculate_counts_from_percentages <- function(df_combined, state_name) {
  # Get total profiles for each offender type
  totals_map <- df_combined %>%
    filter(state == state_name,
           variable_category == 'total',
           variable_detailed == 'total_profiles') %>%
    select(offender_type, value) %>%
    deframe() %>%
    as.list()
  
  count_rows <- list()
  
  for (offender_type in names(totals_map)) {
    total <- totals_map[[offender_type]]
    
    # Get all demographic percentages
    demo_data <- df_combined %>%
      filter(state == state_name,
             offender_type == !!offender_type,
             variable_category %in% c('gender', 'race'),
             value_type == 'percentage')
    
    if (nrow(demo_data) > 0) {
      # Calculate count for each
      demo_counts <- demo_data %>%
        mutate(value = as.integer(round(total * (value / 100))),
               value_type = 'count',
               value_source = 'calculated') %>%
        select(all_of(COLS_NEEDED), value_source)
      
      count_rows <- c(count_rows, list(demo_counts))
    }
  }
  
  bind_rows(count_rows)
}

# ------------------------------------------------------------------
# 13. Standardize offender types
# ------------------------------------------------------------------

standardize_offender_types <- function(df) {
  replacements <- c(
    'Offenders' = 'Convicted Offender',
    'Convicted offenders' = 'Convicted Offender',
    'Arrested offender' = 'Arrestee',
    'All' = 'Combined'
  )
  
  df %>%
    mutate(offender_type = recode(offender_type, !!!replacements))
}

# ------------------------------------------------------------------
# 14. Prepare state data for combined dataset
# ------------------------------------------------------------------

prepare_state_for_combined <- function(df, state_name) {
  
  df_prepared <- df %>%
    select(any_of(COLS_NEEDED), value_source)
  
  df_prepared <- df_prepared %>%
      mutate(value_source = case_when(
        is.na(value_source) ~ "calculated",
        value_source == "" ~ "calculated",
        TRUE ~ value_source
      ))

  
  df_prepared
}

# ------------------------------------------------------------------
# 15. Compare reported vs calculated percentages
# ------------------------------------------------------------------

verify_percentage_consistency <- function(df_combined, state_name) {
  state_data <- df_combined %>%
    filter(state == state_name)
  
  # Get all offender types that have both counts and percentages
  offender_types <- unique(state_data$offender_type)
  
  consistency_results <- list()
  
  for (offender_type in offender_types) {
    offender_data <- state_data %>%
      filter(offender_type == !!offender_type)
    
    # Check if we have both reported and calculated percentages
    for (category in c('gender', 'race')) {
      reported_pcts <- offender_data %>%
        filter(variable_category == !!category,
               value_type == 'percentage',
               value_source == 'reported')
      
      calculated_pcts <- offender_data %>%
        filter(variable_category == !!category,
               value_type == 'percentage',
               value_source == 'calculated')
      
      if (nrow(reported_pcts) > 0 && nrow(calculated_pcts) > 0) {
        # Compare each demographic value
        for (i in 1:nrow(reported_pcts)) {
          rep_row <- reported_pcts[i, ]
          calc_match <- calculated_pcts %>%
            filter(variable_detailed == rep_row$variable_detailed)
          
          if (nrow(calc_match) > 0) {
            diff <- abs(rep_row$value - calc_match$value[1])
            consistency_results <- c(consistency_results, list(data.frame(
              offender_type = offender_type,
              category = category,
              variable = rep_row$variable_detailed,
              reported = rep_row$value,
              calculated = calc_match$value[1],
              difference = diff,
              consistent = diff < 0.5
            )))
          }
        }
      }
    }
  }
  
  if (length(consistency_results) > 0) {
    consistency_df <- bind_rows(consistency_results)
    cat(paste0("\nPercentage consistency check for ", state_name, ":\n"))
    cat(paste0("All values consistent: ", all(consistency_df$consistent), "\n"))
    
    if (!all(consistency_df$consistent)) {
      cat("\nInconsistent values:\n")
      print(consistency_df %>% filter(!consistent))
    }
    
    return(all(consistency_df$consistent))
  } else {
    # No comparison possible - state only has one type of data
    return(TRUE)
  }
}
# ------------------------------------------------------------------
# 16. Add compact formatting for large numbers
# ------------------------------------------------------------------

format_compact <- function(x) {
  sapply(x, function(single_x) {
    if (single_x >= 1000000) {
      if (single_x/1000000 == as.integer(single_x/1000000)) {
        return(paste0(as.integer(single_x/1000000), "M"))
      } else {
        return(paste0(round(single_x/1000000, 1), "M"))
      }
    } else if (single_x >= 1000) {
      return(paste0(as.integer(single_x/1000), "k"))
    } else {
      return(paste0(as.integer(single_x)))
    }
  })
}

# ------------------------------------------------------------------
# 17. Pie chart creation function
# ------------------------------------------------------------------

create_pie_chart <- function(data, offender_type, category, value_type, title, show_values = FALSE) {
  chart_data <- data %>%
    filter(offender_type == !!offender_type,
           variable_category == !!category,
           value_type == !!value_type)
  
  # Check if we have data after filtering
  if (nrow(chart_data) == 0) {
    plot.new()
    title(main = title, cex.main = 0.9)
    text(0.5, 0.5, "No data", cex = 0.8)
    return()
  }
  
  # AGGREGATE DATA TO REMOVE DUPLICATES - KEY FIX
  chart_data <- chart_data %>%
    group_by(variable_detailed) %>%
    summarise(value = sum(value, na.rm = TRUE)) %>%
    ungroup()
  
  # Ensure consistent categories
  if (category == 'gender') {
    all_genders <- data.frame(variable_detailed = c('Male', 'Female', 'Unknown'))
    chart_data <- chart_data %>%
      right_join(all_genders, by = "variable_detailed") %>%
      mutate(value = ifelse(is.na(value), 0, value)) %>%
      arrange(factor(variable_detailed, levels = c('Male', 'Female', 'Unknown')))
  } else if (category == 'race') {
    all_races <- data.frame(variable_detailed = c('White', 'Black', 'Hispanic', 
                                                 'Asian', 'Native American', 'Other', 'Unknown'))
    chart_data <- chart_data %>%
      right_join(all_races, by = "variable_detailed") %>%
      mutate(value = ifelse(is.na(value), 0, value)) %>%
      arrange(factor(variable_detailed, levels = c('White', 'Black', 'Hispanic', 
                                                  'Asian', 'Native American', 'Other', 'Unknown')))
  }
  
  # Filter out zero values and ensure we have data
  chart_data <- chart_data %>% filter(value > 0)
  
  if (nrow(chart_data) == 0) {
    plot.new()
    title(main = title, cex.main = 0.9)
    text(0.5, 0.5, "No data", cex = 0.8)
    return()
  }
  
  # Define colors
  if (category == 'gender') {
    colors <- c('Male' = '#4E79A7', 'Female' = '#E15759', 'Unknown' = '#BAB0AC')
  } else {
    colors <- c('White' = '#4E79A7', 'Black' = '#F25E2B', 'Hispanic' = '#E14759',
               'Asian' = '#76B7B2', 'Native American' = '#59A14F',
               'Other' = '#9C755F', 'Unknown' = '#BAB0AC')
  }
  
  # Filter colors to only include categories present in data
  pie_colors <- colors[names(colors) %in% chart_data$variable_detailed]
  
  # Calculate percentages
  total_value <- sum(chart_data$value)
  chart_data <- chart_data %>%
    mutate(pct = value / total_value * 100)
  
  # Create labels based on value_type and show_values
  if (show_values && value_type == 'count') {
    chart_data <- chart_data %>%
      mutate(base_label = paste0(variable_detailed, "\n(", format(value, big.mark = ","), ")"))
  } else if (value_type == 'percentage') {
    chart_data <- chart_data %>%
      mutate(base_label = paste0(variable_detailed, "\n(", round(value, 1), "%)"))
  } else {
    chart_data <- chart_data %>%
      mutate(base_label = variable_detailed)
  }
  
  # Only show labels for slices >= 3%, otherwise empty string
  chart_data <- chart_data %>%
    mutate(label = ifelse(pct >= 3, base_label, ""))
  
  # Create the pie chart
  pie(chart_data$value, 
      labels = chart_data$label,
      main = title,
      col = pie_colors,
      cex.main = 0.9,
      cex = 0.8)
  
  # Add legend for small slices
  small_slices <- chart_data %>% filter(pct < 3)
  if (nrow(small_slices) > 0) {
    legend_labels <- paste0(small_slices$variable_detailed, " (", round(small_slices$pct, 1), "%)")
    legend_colors <- pie_colors[small_slices$variable_detailed]
    
    legend("bottomright", 
           legend = legend_labels,
           fill = legend_colors,
           cex = 0.7,
           bty = "n")
  }
}

# ------------------------------------------------------------------
# 18. State visualizations with 2 pies per row
# ------------------------------------------------------------------

create_state_visualizations <- function(df_combined, state_name) {
  state_data <- df_combined %>% filter(state == state_name)
  
  offender_types <- sort(unique(state_data$offender_type))
  plots <- list()
  
  for (offender_type in offender_types) {
    plots <- c(plots, list(
      create_pie_chart(state_data, offender_type, 'gender', 'count',
                       paste(offender_type, "Gender Counts"), TRUE),
      create_pie_chart(state_data, offender_type, 'gender', 'percentage',
                       paste(offender_type, "Gender Percentages")),
      create_pie_chart(state_data, offender_type, 'race', 'count',
                       paste(offender_type, "Race Counts"), TRUE),
      create_pie_chart(state_data, offender_type, 'race', 'percentage',
                       paste(offender_type, "Race Percentages"))
    ))
  }
}

# ------------------------------------------------------------------
# 19. Demographic bar chart function
# ------------------------------------------------------------------

create_demographic_bar_charts <- function(df_combined, state_name) {
  state_data <- df_combined %>%
    filter(state == state_name)
  
  # Get offender types and ensure Combined is last
  offender_types <- state_data %>%
    filter(value_type == 'count') %>%
    pull(offender_type) %>%
    unique() %>%
    sort()
  
  if ('Combined' %in% offender_types) {
    offender_types <- c(setdiff(offender_types, 'Combined'), 'Combined')
  }
  
  # Color palettes
  gender_colors <- c('Male' = '#4E79A7', 'Female' = '#E15759', 'Unknown' = '#BAB0AC')
  race_colors <- c(
    'White' = '#4E79A7', 
    'Black' = '#F25E2B', 
    'Hispanic' = '#E14759',
    'Asian' = '#76B7B2',
    'Native American' = '#59A14F',
    'Other' = '#9C755F',
    'Unknown' = '#BAB0AC'
  )
  
  # Gender data - ensure no duplicates by summing values
  gender_data <- state_data %>%
    filter(variable_category == 'gender',
           value_type == 'count') %>%
    group_by(offender_type, variable_detailed) %>%
    summarize(value = sum(value, na.rm = TRUE), .groups = 'drop') %>%
    complete(offender_type, variable_detailed = c('Male', 'Female', 'Unknown'), 
             fill = list(value = 0))
  
  # Race data - ensure no duplicates by summing values
  race_data <- state_data %>%
    filter(variable_category == 'race',
           value_type == 'count') %>%
    group_by(offender_type, variable_detailed) %>%
    summarize(value = sum(value, na.rm = TRUE), .groups = 'drop') %>%
    complete(offender_type, 
             variable_detailed = c('White', 'Black', 'Hispanic', 
                                  'Asian', 'Native American', 'Other', 'Unknown'), 
             fill = list(value = 0))
  
  # Create separate plots - one per row
  par(mfrow = c(2, 1), mar = c(5, 9, 4, 9), oma = c(0, 0, 2, 0)) # Increased right margin for legend
  
  # Gender plot - ordered by total volume
  gender_plot_data <- gender_data %>%
    filter(variable_detailed %in% c('Male', 'Female', 'Unknown')) %>%
    mutate(offender_type = factor(offender_type, levels = rev(offender_types)))
  
  # Order gender categories by total volume (largest at bottom)
  gender_order <- gender_plot_data %>%
    group_by(variable_detailed) %>%
    summarize(total = sum(value)) %>%
    arrange(total) %>%
    pull(variable_detailed)
  
  gender_plot_data <- gender_plot_data %>%
    mutate(variable_detailed = factor(variable_detailed, levels = gender_order))
  
  # Reshape for barplot
  gender_matrix <- gender_plot_data %>%
    pivot_wider(names_from = variable_detailed, values_from = value) %>%
    as.data.frame() %>%
    column_to_rownames("offender_type") %>%
    as.matrix()
  
  # Ensure all columns exist
  for (gender in gender_order) {
    if (!gender %in% colnames(gender_matrix)) {
      gender_matrix <- cbind(gender_matrix, temp = 0)
      colnames(gender_matrix)[ncol(gender_matrix)] <- gender
    }
  }
  
  # Reorder columns by volume
  gender_matrix <- gender_matrix[, as.character(gender_order), drop = FALSE]
  
  # Format x-axis labels with "k" for thousands
  max_x <- max(rowSums(gender_matrix))
  x_breaks <- pretty(c(0, max_x))
  x_labels <- ifelse(x_breaks >= 1000, 
                    paste0(x_breaks/1000, "k"), 
                    as.character(x_breaks))
  
  barplot(t(gender_matrix), 
          horiz = TRUE,
          las = 1,
          col = gender_colors[colnames(gender_matrix)],
          main = 'Gender Distribution',
          xlab = 'Number of Profiles',
          xaxt = 'n',  # Remove default x-axis
          legend.text = FALSE,  # Don't show legend in plot area
          args.legend = list(x = "right", bty = "n", inset = c(-0.2, 0)))
  
  # Add custom x-axis with formatted labels
  axis(1, at = x_breaks, labels = x_labels)
  
  # Add legend outside the plot area
  legend("topright", 
         legend = colnames(gender_matrix), 
         fill = gender_colors[colnames(gender_matrix)],
         bty = "n", 
         xpd = TRUE,  # Allow plotting outside main area
         inset = c(-0.25, 0),  # Move legend to the right
         cex = 0.8)
  
  # Race plot - ordered by total volume
  race_plot_data <- race_data %>%
    mutate(offender_type = factor(offender_type, levels = rev(offender_types)))
  
  # Order race categories by total volume (largest at bottom)
  race_order <- race_plot_data %>%
    group_by(variable_detailed) %>%
    summarize(total = sum(value)) %>%
    arrange(total) %>%
    pull(variable_detailed)
  
  race_plot_data <- race_plot_data %>%
    mutate(variable_detailed = factor(variable_detailed, levels = race_order))
  
  # Reshape for barplot
  race_matrix <- race_plot_data %>%
    pivot_wider(names_from = variable_detailed, values_from = value) %>%
    as.data.frame() %>%
    column_to_rownames("offender_type") %>%
    as.matrix()
  
  # Ensure all columns exist
  for (race in race_order) {
    if (!race %in% colnames(race_matrix)) {
      race_matrix <- cbind(race_matrix, temp = 0)
      colnames(race_matrix)[ncol(race_matrix)] <- race
    }
  }
  
  # Reorder columns by volume
  race_matrix <- race_matrix[, as.character(race_order), drop = FALSE]
  
  # Format x-axis labels with "k" for thousands
  max_x_race <- max(rowSums(race_matrix))
  x_breaks_race <- pretty(c(0, max_x_race))
  x_labels_race <- ifelse(x_breaks_race >= 1000, 
                         paste0(x_breaks_race/1000, "k"), 
                         as.character(x_breaks_race))
  
  barplot(t(race_matrix), 
          horiz = TRUE,
          las = 1,
          col = race_colors[colnames(race_matrix)],
          main = 'Race Distribution',
          xlab = 'Number of Profiles',
          xaxt = 'n',  # Remove default x-axis
          legend.text = FALSE)  # Don't show legend in plot area
  
  # Add custom x-axis with formatted labels
  axis(1, at = x_breaks_race, labels = x_labels_race)
  
  # Add legend outside the plot area
  legend("topright", 
         legend = colnames(race_matrix), 
         fill = race_colors[colnames(race_matrix)],
         bty = "n", 
         xpd = TRUE,  # Allow plotting outside main area
         inset = c(-0.25, 0),  # Move legend to the right
         cex = 0.8)
  
  title(paste(state_name, "Demographic Distribution"), outer = TRUE, cex.main = 1.5)
}

# ------------------------------------------------------------------
# 20. Add state's metadata
# ------------------------------------------------------------------

add_state_metadata <- function(state_name, state_df) {
  
  raw_data <- state_df %>% filter(value_source == "reported")
  offender_types_reported <- unique(raw_data$offender_type)
  
  has_unknown <- any(raw_data$variable_detailed == "Unknown", na.rm = TRUE)
  has_other <- any(raw_data$variable_detailed == "Other", na.rm = TRUE)
  has_crosstab <- any(raw_data$variable_category == "gender_race", na.rm = TRUE)
  
  nonstandard_terms <- any(
    grepl("All|Offenders", raw_data$offender_type, ignore.case = TRUE),
    grepl("Caucasian|African American| American Indian", raw_data$variable_detailed, ignore.case = TRUE),
    grepl("flag", raw_data$variable_detailed, ignore.case = TRUE))
  
  new_row <- tibble(
    state = state_name,
    race_data_provided = report_status(raw_data, "race"),
    gender_data_provided = report_status(raw_data, "gender"),
    total_profiles_provided = report_status(
      raw_data %>% filter(variable_category == "total"), "total"
    ),
    convicted_offender_reported = "Convicted Offender" %in% offender_types_reported,
    arrestee_reported = "Arrestee" %in% offender_types_reported,
    combined_reported = "Combined" %in% offender_types_reported,
    has_unknown_category = has_unknown,
    has_other_category = has_other,
    uses_nonstandard_terminology = nonstandard_terms,
    provides_crosstabulation = has_crosstab,
    counts_sum_to_total = NA,
    percentages_sum_to_100 = NA,
    total_calculated_combined = !("Combined" %in% offender_types_reported),
    notes = ""
  )
  
  foia_state_metadata <<- bind_rows(foia_state_metadata, new_row)
  
  cat("✓ Metadata added for:", state_name, "\n")
  return(invisible(TRUE))
}

# ------------------------------------------------------------------
# 21. Function to update a state's metadata after QC checks
# ------------------------------------------------------------------
update_state_metadata <- function(state_name, 
                                  counts_ok = NA, 
                                  percentages_ok = NA, 
                                  notes_text = NULL) {
  
  row_index <- which(foia_state_metadata$state == state_name)
  
  if (length(row_index) == 0) {
    warning("State not found in metadata: ", state_name)
    return(FALSE)
  }
  
  if (!is.na(counts_ok)) {
    foia_state_metadata$counts_sum_to_total[row_index] <<- counts_ok
  }
  if (!is.na(percentages_ok)) {
    foia_state_metadata$percentages_sum_to_100[row_index] <<- percentages_ok
  }
  if (!is.null(notes_text)) {
    current_notes <- foia_state_metadata$notes[row_index]
    if (current_notes == "") {
      foia_state_metadata$notes[row_index] <<- notes_text
    } else {
      foia_state_metadata$notes[row_index] <<- paste(current_notes, notes_text, sep = "; ")
    }
  }
  
  cat("✓ Metadata updated for:", state_name, "\n")
}

2.4 File Structure and Contents

2.4.1 State-Specific Files: data/foia/intermediate/[state]_foia_data.csv

Purpose: Individual files for each state containing only their reported data.

Structure: Long format with columns:

  • state: State name
  • offender_type: Category of individuals (Convicted Offender, Arrestee, Combined, etc.)
  • variable_category: Type of data (total, gender, race, gender_race)
  • variable_detailed: Specific value (e.g., Male, Female, African American)
  • value: The reported number or percentage
  • value_type: Whether value is a “count” or “percentage”
  • date: Date of data snapshot, if reported
Show per-state files loading code
ca_raw <- load_state(here(per_state, "california_foia_data.csv"))
fl_raw <- load_state(here(per_state, "florida_foia_data.csv"))
in_raw <- load_state(here(per_state, "indiana_foia_data.csv"))
me_raw <- load_state(here(per_state, "maine_foia_data.csv"))
nv_raw <- load_state(here(per_state, "nevada_foia_data.csv"))
sd_raw <- load_state(here(per_state, "south_dakota_foia_data.csv"))
tx_raw <- load_state(here(per_state, "texas_foia_data.csv"))

2.4.2 Raw Data Characteristics

The following table summarizes the structure and content of the data as originally received from each state prior to any standardization, calculation, or processing.

State Offender Types Value Types Total Profiles Action Needed Key Reporting Notes
California CO, A Counts only Reported per offender type Add Unknown Race, Calculate % & Combined, Standardize Terminology Discrepancy in Race: counts < total profiles; Non-standard terminology (Caucasian and African American)
Florida COMB Counts + % Reported Standardize Terminology Non-standard terminology (Caucasian and African American)
Indiana CO, A, COMB Percentage (Counts for totals only) Reported per offender type Calculate Counts & Total Profiles Combined, Fix % inconsistency, Standardize Terminology Demographics only for Combined; Other race category as “<1”; Non-standard terminology (Caucasian)
Maine COMB Counts + % Reported Solve counts and Percentage inconsistency
Nevada CO, A, COMB Counts + % Reported for all types Standardize Terminology Non-standard terminology (All, total_flags and American Indian)
South Dakota COMB Counts + % Reported Standardize Terminology, Solve counts and % inconsistency Includes gender×race cross-tabulation; Non-standard terminology
Texas CO, A Counts only Reported per offender type Calculate Male counts, Solve counts inconsistency, Calculate % & Combined, Standardize Terminology Only female gender was reported; Non-standard term (Offenders, Caucasian, and African American)

Legend:

  • CO: Convicted Offender

  • AR: Arrestee

  • COMB: Combined Total (all profiles)

  • Counts + %: Both raw numbers and percentages were provided

2.5 Prepare Combined Dataset

The goal of this step is to transform each state’s raw data into a standardized format before appending it to the master foia_combined DataFrame. This ensures consistency and enables seamless analysis across all seven states.

The ideal, standardized state dataset ready for combination must have the following columns:

Column Name Description Example Values
state The name of the state. "California", "Florida"
offender_type The category of offender profile. "Convicted Offender", "Arrestee", "Combined"
variable_category The broad demographic category. "race", "gender", "total", "gender_race"
variable_detailed The specific value within the category. "White", "Male", "total_profiles", "Male_White"
value The numerical value for the metric. 150000, 25.8
value_type The type of metric the value represents. "count", "percentage"
value_source Whether the data was provided or derived. "reported", "calculated"
Show the master foia_combined dataframe elaboration code
# ------------------------------------------------------------------
# Initialize the master foia_combined dataframe with correct schema
# This empty structure ensures all state data is appended consistently
# ------------------------------------------------------------------

foia_combined <- tibble( state = character(),
offender_type = character(),
variable_category = character(),
variable_detailed = character(), 
value = numeric(),
value_type = character(),
value_source = character()
)

# Create a data dictionary for foia_combined
schema_dict <- tribble(
  ~Column,             ~Type,        ~Description, 
  "state",             "character",  "'California', 'Florida'",
  "offender_type",     "character",  "'Convicted Offender', 'Arrestee', 'Combined'",
  "variable_category", "character",  "'race', 'gender', 'total', 'gender_race'",
  "variable_detailed", "character",  "'White', 'Male', 'total_profiles', 'Male_White'",
  "value",             "numeric",    "150000, 25.8",
  "value_type",        "character",  "'count', 'percentage'",
  "value_source",      "character",  "'reported', 'calculated'"
)

# Turn into a nice flextable
flextable(schema_dict) %>%
  autofit() %>%
  theme_booktabs() %>%
  set_header_labels(
    Column = "Column Name",
    Type = "Data Type",
    Description = "Example Values to be added"
  )

Column Name

Data Type

Example Values to be added

state

character

'California', 'Florida'

offender_type

character

'Convicted Offender', 'Arrestee', 'Combined'

variable_category

character

'race', 'gender', 'total', 'gender_race'

variable_detailed

character

'White', 'Male', 'total_profiles', 'Male_White'

value

numeric

150000, 25.8

value_type

character

'count', 'percentage'

value_source

character

'reported', 'calculated'

2.6 Prepare Metadata Documentation Table

This section creates a comprehensive metadata table (foia_state_metadata) to document the original content and structure of each state’s FOIA response before any processing or cleaning was applied.

This serves as a permanent record of data provenance, ensuring transparency and reproducibility by clearly distinguishing between what was provided by the states and what was calculated during analysis.

Key Documentation Captured:

  • Data Types Provided: Whether each state reported counts, percentages, or both for race, gender, and total profiles.

  • Offender Categories Reported: Which offender types (Convicted Offender, Arrestee, Combined) were originally included.

  • Demographic Granularity: Presence of ‘Unknown’ or ‘Other’ categories and gender-race cross-tabulations.

  • Terminology & Anomalies: Use of non-standard terms (e.g., “flags,” “offenders”) and other state-specific reporting notes.

  • QC Results: Flags for whether cleaned data passes consistency checks (counts sum to totals, percentages sum to ~100%).

Show the foia_state_metadata table elaboration code
# ------------------------------------------------------------------
# Initialize the foia_state_metadata as a tibble (not a list of lists)
# This makes it easier to add rows and ensures consistent structure.
# ------------------------------------------------------------------

# Define the full schema for our metadata table
foia_state_metadata <- tibble(
  state = character(),
  race_data_provided = character(),
  gender_data_provided = character(),
  total_profiles_provided = character(), 
  convicted_offender_reported = logical(),
  arrestee_reported = logical(),
  combined_reported = logical(),
  has_unknown_category = logical(),
  has_other_category = logical(),
  uses_nonstandard_terminology = logical(),
  provides_crosstabulation = logical(),
  counts_sum_to_total = logical(),
  percentages_sum_to_100 = logical(),
  total_calculated_combined = logical(),
  notes = character()
)

# Build data dictionary for foia_state_metadata
schema_dict_meta <- tribble(
  ~Column,                        ~Type,       ~Description,
  "state",                        "character", "State name (e.g., 'California', 'Florida')",
  "race_data_provided",           "character", "Race data availability: 'counts', 'percentages', 'both', 'none'",
  "gender_data_provided",         "character", "Gender data availability: 'counts', 'percentages', 'both', 'none'",
  "total_profiles_provided",      "character", "Total profiles availability: 'counts', 'percentages', 'both', 'none'",
  "convicted_offender_reported",  "logical",   "Was convicted offender data reported?",
  "arrestee_reported",            "logical",   "Was arrestee data reported?",
  "combined_reported",            "logical",   "Was combined category reported?",
  "has_unknown_category",         "logical",   "Does the state include 'Unknown' category?",
  "has_other_category",           "logical",   "Does the state include 'Other' category?",
  "uses_nonstandard_terminology", "logical",   "Does the state use non-standard terms?",
  "provides_crosstabulation",     "logical",   "Does the state provide crosstabs (e.g., gender x race)?",
  "counts_sum_to_total",          "logical",   "Do reported counts sum to the total?",
  "percentages_sum_to_100",       "logical",   "Do reported percentages sum to ~100%?",
  "total_calculated_combined",    "logical",   "Did we calculate combined total manually?",
  "notes",                        "character", "Free-text notes for state-specific caveats"
)

# Render with flextable
flextable(schema_dict_meta) %>%
  autofit() %>%
  theme_booktabs() %>%
  set_header_labels(
    Column = "Column Name",
    Type = "Data Type",
    Description = "Meaning"
  )

Column Name

Data Type

Meaning

state

character

State name (e.g., 'California', 'Florida')

race_data_provided

character

Race data availability: 'counts', 'percentages', 'both', 'none'

gender_data_provided

character

Gender data availability: 'counts', 'percentages', 'both', 'none'

total_profiles_provided

character

Total profiles availability: 'counts', 'percentages', 'both', 'none'

convicted_offender_reported

logical

Was convicted offender data reported?

arrestee_reported

logical

Was arrestee data reported?

combined_reported

logical

Was combined category reported?

has_unknown_category

logical

Does the state include 'Unknown' category?

has_other_category

logical

Does the state include 'Other' category?

uses_nonstandard_terminology

logical

Does the state use non-standard terms?

provides_crosstabulation

logical

Does the state provide crosstabs (e.g., gender x race)?

counts_sum_to_total

logical

Do reported counts sum to the total?

percentages_sum_to_100

logical

Do reported percentages sum to ~100%?

total_calculated_combined

logical

Did we calculate combined total manually?

notes

character

Free-text notes for state-specific caveats

3 State-by-state Standardization

Each state is processed individually to standardize terminology, fill gaps, and calculate Combined totals where necessary.

3.1 California (CA)

Overview: California supplies counts only for gender and race plus a separate total for each offender type; no percentages are reported.

3.1.1 Examine Raw Data

Establish a baseline understanding of the data exactly as it was received.

Column

Type

Rows

Missing

Unique

Unique_Values

state

character

16

0

1

California

offender_type

character

16

0

2

Convicted Offender, Arrestee

variable_category

character

16

0

3

total, gender, race

variable_detailed

character

16

0

8

total_profiles, Female, Male, Unknown, African American, Caucasian, Hispanic, Asian

value

numeric

16

0

16

2019899 ..., 751822 ..., 309827 ..., 1603222 ..., 106850 ..., 208225 ..., 524231 ..., 19366 ..., 368952 ..., 588555 ...

value_type

character

16

0

1

count

value_source

character

16

0

1

reported

Data frame dimensions: 16 rows × 7 columns

3.1.2 Verify Data Consistency

Runs the first quality check using the verify_category_totals() and counts_consistent() functions.

This identifies any immediate discrepancies, such as the sum of demographic counts not matching the reported total profiles, which flags data issues that need to be resolved.

Verifying that demographic counts match reported totals:
offender_type variable_category total_profiles sum_counts difference
Arrestee gender 751822 751822 0
Arrestee race 751822 655695 96127
Convicted Offender gender 2019899 2019899 0
Convicted Offender race 2019899 1626012 393887

Counts consistency check on raw data:
All counts consistent: FALSE 

3.1.3 Address Data Gaps

3.1.3.1 Create Unknown Category

“Racial classification is not considered a required field on the collection card; thus, an unknown number of offenders may have no racial classification listed.” — California DOJ FOIA letter, July 10 2018 (raw/foia_pdfs/FOIA_RacialComp_California.pdf)

The 393,887 Convicted Offender profiles and 96,127 Arrestee profiles that do not appear in any of the four reported race categories must belong to an unreported “Unknown” category.

The calculated values are added with a value_source = "calculated" tag to maintain transparency about what was provided versus what was derived.

Show unknown addition code
# Start with the raw data
ca_clean <- ca_raw

# Add Unknown race category to reconcile totals
ca_clean <- fill_demographic_gaps(ca_clean)

# Verify the fix
cat("Category totals after adding Unknown race category:\n")
verify_category_totals(ca_clean) %>% kable() %>% kable_styling()

cat("\nCounts consistency after adding Unknown:\n")
cat(paste("All counts consistent:", counts_consistent(ca_clean), "\n"))
Category totals after adding Unknown race category:
offender_type variable_category total_profiles sum_counts difference
Arrestee gender 751822 751822 0
Arrestee race 751822 751822 0
Convicted Offender gender 2019899 2019899 0
Convicted Offender race 2019899 2019899 0

Counts consistency after adding Unknown:
All counts consistent: TRUE 

3.1.3.2 Create Combined Totals

Since California only reported data for “Convicted Offender” and “Arrestee” separately.

This step uses the add_combined() helper function to calculate a new “Combined” offender type by summing the counts from the other two categories.

Show combined addition code
# Calculate Combined totals using helper function
ca_clean <- add_combined(ca_clean)

cat("✓ Created Combined totals for California\n")

# Show the Combined total
combined_total <- ca_clean %>%
  filter(offender_type == "Combined",
         variable_category == "total",
         variable_detailed == "total_profiles") %>%
  pull(value)

cat(paste("Combined total profiles:", format(combined_total, big.mark = ","), "\n"))
✓ Created Combined totals for California
Combined total profiles: 2,771,721 

3.1.3.3 Calculate Percentages

Transforms the data from counts into percentages for comparative analysis.

The add_percentages() helper function calculates each demographic group’s proportion relative to its offender type’s total.

A final consistency check ensures all percentages logically sum to approximately 100%.

Show percentage calculation code
# Derive percentages from counts
ca_clean <- add_percentages(ca_clean)

cat("✓ Added percentages for all demographic categories\n")

# Check percentage consistency
cat("Percentage consistency check:\n")
cat(paste("All percentages sum to ~100%:", percentages_consistent(ca_clean), "\n\n"))

# Show current data availability
cat("Final data availability:\n")
cat(paste("Race data:", report_status(ca_clean, "race"), "\n"))
cat(paste("Gender data:", report_status(ca_clean, "gender"), "\n"))
✓ Added percentages for all demographic categories
Percentage consistency check:
All percentages sum to ~100%: TRUE 

Final data availability:
Race data: both 
Gender data: both 

3.1.3.4 Standardize Terminology

California uses “African American” instead of “Black” and “Caucasian” instead of “White”.

Show terminology standardization code
# Standardize racial terminology
ca_clean <- ca_clean %>%
  mutate(variable_detailed = case_when(
    variable_detailed == "African American" ~ "Black",
    TRUE ~ variable_detailed
  ))

cat("✓ Standardized terminology: 'African American' → 'Black'\n")

ca_clean <- ca_clean %>%
  mutate(variable_detailed = case_when(
    variable_detailed == "Caucasian" ~ "White",
    TRUE ~ variable_detailed
  ))

cat("✓ Standardized terminology: 'Caucasian' → 'White'\n")
✓ Standardized terminology: 'African American' → 'Black'
✓ Standardized terminology: 'Caucasian' → 'White'

3.1.4 Prepare for Combined Dataset

The cleaned data is formatted to match the master schema and appended to the foia_combined dataframe.

Show California data preparation to combined dataset
# Prepare the cleaned data for the combined dataset
ca_prepared <- prepare_state_for_combined(ca_clean, "California")

# Append to the master combined dataframe
foia_combined <- bind_rows(foia_combined, ca_prepared)

cat(paste0("✓ Appended ", nrow(ca_prepared), " California rows to foia_combined\n"))
cat(paste0("✓ Total rows in foia_combined: ", nrow(foia_combined), "\n"))
✓ Appended 51 California rows to foia_combined
✓ Total rows in foia_combined: 51

3.1.5 Document Metadata

The metadata is added with the raw information and updated with the results of the quality checks and a note on the processing steps taken.

Show California data preparation and addition to metadata table
# Add California to the metadata table using the helper function
add_state_metadata("California", ca_raw)

# Update metadata with QC results
update_state_metadata("California", 
                      counts_ok = counts_consistent(ca_clean),
                      percentages_ok = percentages_consistent(ca_clean),
                      notes_text = "Added Unknown race category to reconcile totals; calculated Combined totals and all percentages")
✓ Metadata added for: California 
✓ Metadata updated for: California 

3.1.6 Visualizations

California DNA Database Demographic Distributions

California DNA Database Demographic Distributions

California DNA Database Demographic Distributions

California DNA Database Demographic Distributions

California DNA Database Demographic Distributions

California DNA Database Demographic Distributions

California DNA Database Demographic Distributions

California DNA Database Demographic Distributions

California DNA Database Demographic Distributions

California DNA Database Demographic Distributions

California DNA Database Demographic Distributions

California DNA Database Demographic Distributions

California Demographic Distributions by Offender Type

3.1.7 Summary Statistics

Show the summary statistics code
cat("California DNA Database Summary:\n")
cat("=", strrep("=", 40), "\n")

# Total profiles by offender type
totals <- foia_combined %>%
  filter(state == "California",
         variable_category == "total",
         variable_detailed == "total_profiles",
         value_type == "count") %>%
  select(offender_type, value) %>%
  mutate(value_formatted = format(value, big.mark = ","))

print(totals)

# Data completeness
cat("\nData completeness:\n")
completeness <- foia_combined %>%
  filter(state == "California") %>%
  group_by(offender_type, value_source) %>%
  summarise(n_values = n(), .groups = "drop")

print(completeness)

# Final verification
cat("\nFinal verification:\n")
cat(paste("Counts consistent:", counts_consistent(foia_combined %>% filter(state == "California")), "\n"))
cat(paste("Percentages consistent:", percentages_consistent(foia_combined %>% filter(state == "California")), "\n"))
California DNA Database Summary:
= ======================================== 
# A tibble: 3 × 3
  offender_type        value value_formatted
  <chr>                <dbl> <chr>          
1 Convicted Offender 2019899 "2,019,899"    
2 Arrestee            751822 "  751,822"    
3 Combined           2771721 "2,771,721"    

Data completeness:
# A tibble: 5 × 3
  offender_type      value_source n_values
  <chr>              <chr>           <int>
1 Arrestee           calculated          9
2 Arrestee           reported            8
3 Combined           calculated         17
4 Convicted Offender calculated          9
5 Convicted Offender reported            8

Final verification:
Counts consistent: TRUE 
Percentages consistent: TRUE 

3.1.8 Summary of California Processing

California data processing complete. The dataset now includes:

  • Reported data: Counts for Convicted Offender and Arrestee

  • Calculated additions:

    • Unknown race category to reconcile reported totals

    • Combined totals across all offender types

    • Percentage values for all demographic categories

    • “Caucasian” and “African American” converted to “White” and “Black”.

  • Quality checks: All counts and percentages pass consistency validation

  • Provenance tracking: All values include appropriate value_source indicators

The California data is now standardized and ready for cross-state analysis.

3.2 Florida (FL)

Overview: Florida provides both counts and percentages for gender and race categories and already includes a “Combined” total for all offender types, making it one of the most complete and straightforward datasets.

Only requires to standardize terminology for gender and race categories to match the common data model.

3.2.1 Examine Raw Data

Establish a baseline understanding of the data exactly as it was received.

Column

Type

Rows

Missing

Unique

Unique_Values

state

character

22

0

1

Florida

offender_type

character

22

0

1

Combined

variable_category

character

22

0

3

total, gender, race

variable_detailed

character

22

0

10

total_profiles, Female, Male, Unknown, African American, Asian, Caucasian, Hispanic, Native American, Other

value

numeric

22

0

22

1175391 ..., 100 ..., 260885 ..., 22.2 ..., 901126 ..., 76.67 ..., 13380 ..., 1.14 ..., 413733 ..., 35.2 ...

value_type

character

22

0

2

count, percentage

value_source

character

22

0

1

reported

Data frame dimensions: 22 rows × 7 columns

3.2.2 Verify Data Consistency

Runs the first quality check using the Verify_category_totals() and counts_consistent() functions.

Verifying that demographic counts match reported totals:
offender_type variable_category total_profiles sum_counts difference
Combined gender 1175391 1175391 0
Combined race 1175391 1175391 0

Counts consistency check on raw data:
All counts consistent: TRUE 

Percentage consistency check on raw data:
All percentages sum to ~100%: TRUE 

3.2.3 Address Data Gaps

3.2.3.1 Standardize Terminology

Florida uses “African American” instead of “Black” and “Caucasian” instead of “White”.

Show terminology standardization code
fl_clean <- fl_raw

# Standardize racial terminology
fl_clean <- fl_clean %>%
  mutate(variable_detailed = case_when(
    variable_detailed == "African American" ~ "Black",
    TRUE ~ variable_detailed
  ))

cat("✓ Standardized terminology: 'African American' → 'Black'\n")

fl_clean <- fl_clean %>%
  mutate(variable_detailed = case_when(
    variable_detailed == "Caucasian" ~ "White",
    TRUE ~ variable_detailed
  ))

cat("✓ Standardized terminology: 'Caucasian' → 'White'\n")
✓ Standardized terminology: 'African American' → 'Black'
✓ Standardized terminology: 'Caucasian' → 'White'

3.2.4 Prepare for Combined Dataset

The Florida data is already complete and consistent. It is formatted to match the master schema and appended to the foia_combined dataframe.

Show Florida data preparation to combined dataset
# Prepare the data for the combined dataset
fl_prepared <- prepare_state_for_combined(fl_clean, "Florida")

# Append to the master combined dataframe
foia_combined <- bind_rows(foia_combined, fl_prepared)

cat(paste0("✓ Appended ", nrow(fl_prepared), " Florida rows to foia_combined\n"))
cat(paste0("✓ Total rows in foia_combined: ", nrow(foia_combined), "\n"))
✓ Appended 22 Florida rows to foia_combined
✓ Total rows in foia_combined: 73

3.2.5 Document Metadata

The metadata is added with a note that the data was complete and required no processing.

Show Florida data preparation and addition to metadata table
# Add Florida to the metadata table using the helper function
add_state_metadata("Florida", fl_raw)

# Update metadata with QC results
update_state_metadata("Florida", 
                      counts_ok = counts_consistent(fl_clean),
                      percentages_ok = percentages_consistent(fl_clean),
                      notes_text = "Complete dataset provided. No processing or calculations required. All values are reported.")
✓ Metadata added for: Florida 
✓ Metadata updated for: Florida 

3.2.6 Visualizations

Florida DNA Database Demographic Distributions

Florida DNA Database Demographic Distributions

Florida DNA Database Demographic Distributions

Florida DNA Database Demographic Distributions

3.2.7 Summary Statistics

Show the summary statistics code
cat("Florida DNA Database Summary:\n")
cat("=", strrep("=", 40), "\n")

# Total profiles by offender type
totals <- foia_combined %>%
  filter(state == "Florida",
         variable_category == "total",
         variable_detailed == "total_profiles",
         value_type == "count") %>%
  select(offender_type, value) %>%
  mutate(value_formatted = format(value, big.mark = ","))

print(totals)

# Data completeness
cat("\nData completeness:\n")
completeness <- foia_combined %>%
  filter(state == "Florida") %>%
  group_by(offender_type, value_source) %>%
  summarise(n_values = n(), .groups = "drop")

print(completeness)

# Final verification
cat("\nFinal verification:\n")
cat(paste("Counts consistent:", counts_consistent(foia_combined %>% filter(state == "Florida")), "\n"))
cat(paste("Percentages consistent:", percentages_consistent(foia_combined %>% filter(state == "Florida")), "\n"))
Florida DNA Database Summary:
= ======================================== 
# A tibble: 1 × 3
  offender_type   value value_formatted
  <chr>           <dbl> <chr>          
1 Combined      1175391 1,175,391      

Data completeness:
# A tibble: 1 × 3
  offender_type value_source n_values
  <chr>         <chr>           <int>
1 Combined      reported           22

Final verification:
Counts consistent: TRUE 
Percentages consistent: TRUE 

3.2.8 Summary of Florida Processing

Florida data processing complete. The dataset is exemplary and required no adjustments:

  • Reported data: Both counts and percentages for all Convicted Offender, Arrestee, and Combined categories.

  • Terminology standardization: “Caucasian” and “African American” converted to “White” and “Black”.

  • No calculated additions needed: All values are sourced directly from the state report (value_source = "reported").

  • Quality checks: All counts and percentages pass consistency validation.

  • Provenance tracking: All values maintain their original value_source as “reported”.

The Florida data is now standardized and ready for cross-state analysis.

3.3 Indiana (IN)

Overview: Indiana presents a unique reporting pattern where total counts are provided by offender type, but demographic breakdowns are given only as percentages for the Combined total.

Values were provided as strings, including a “<1” notation, requiring conversion.

3.3.1 Examine Raw Data

Establish a baseline understanding of the data exactly as it was received.

Column

Type

Rows

Missing

Unique

Unique_Values

state

character

8

0

1

Indiana

offender_type

character

8

0

3

Convicted Offender, Arrestee, Combined

variable_category

character

8

0

3

total, gender, race

variable_detailed

character

8

0

7

total_profiles, Female, Male, Caucasian, Black, Hispanic, Other

value

numeric

8

0

8

279654, 21087, 20, 80, 70, 26, 4, 0.5

value_type

character

8

0

2

count, percentage

value_source

character

8

0

1

reported

Data frame dimensions: 8 rows × 7 columns

3.3.2 Verify Data Consistency

Initial checks reveal Indiana’s unique structure: counts for totals, percentages only for Combined demographics.

Initial data availability:
Race data: percentages 
Gender data: percentages 

Value types in raw data:
count, percentage

3.3.3 Address Data Gaps

3.3.3.1 Convert String Values to Numeric

The raw data contains string values including “<1” which we convert to 0.5.

Show value conversion code
# Start with raw data
in_clean <- in_raw

# Convert string values to numeric, handling "<1" as 1
in_clean$value <- sapply(in_clean$value, function(x) {
  if (x == "<1") {
    0.5
  } else {
    as.numeric(x)
  }
})

# Update value_type for converted percentages
in_clean <- in_clean %>%
  mutate(value_type = ifelse(value_type == "percentage", "percentage", value_type))

cat("✓ Converted Indiana values from String to numeric\n")
cat(paste("Unique values after conversion:", paste(unique(in_clean$value), collapse = ", "), "\n"))
✓ Converted Indiana values from String to numeric
Unique values after conversion: 279654, 21087, 20, 80, 70, 26, 4, 0.5 

3.3.3.2 Solve Percentages Inconsistency

Racial percentages summed to 100.5% instead of 100%

Proportional scaling was applied and value_source was updated to “calculated” for all adjusted values.

Show percentage recalculation code
# Adjust percentages to ensure they sum to 100% and mark as calculated
in_clean <- in_clean %>%
  group_by(value_type, variable_category) %>%
  mutate(
    value = ifelse(
      value_type == "percentage" & variable_category == "race",
      value * (100 / sum(value, na.rm = TRUE)),
      value
    ),
    value_source = ifelse(
      value_type == "percentage" & variable_category == "race",
      "calculated",
      value_source
    )
  ) %>%
  ungroup()

# Verify the new sum
percentage_sum <- in_clean %>%
  filter(value_type == "percentage" & variable_category == "race") %>%
  summarise(total = sum(value, na.rm = TRUE))

cat("✓ Recalculated percentages for Indiana - New sum:", percentage_sum$total, "%\n")
✓ Recalculated percentages for Indiana - New sum: 100 %

3.3.3.3 Standardize Terminology

Indiana uses “Caucasian” instead of “White”.

Show terminology standardization code
# Standardize racial terminology
in_clean <- in_clean %>%
  mutate(variable_detailed = case_when(
    variable_detailed == "Caucasian" ~ "White",
    TRUE ~ variable_detailed
  ))

cat("✓ Standardized terminology: 'Caucasian' → 'White'\n")
✓ Standardized terminology: 'Caucasian' → 'White'

3.3.3.4 Create Combined Total Profiles

Indiana provides separate totals for Convicted Offenders and Arrestees, but we need a Combined total to match the demographic percentages.

Show combined total calculation code
# Calculate Combined total from separate offender type totals
convicted_total <- in_clean %>%
  filter(offender_type == "Convicted Offender",
         variable_category == "total",
         variable_detailed == "total_profiles") %>%
  pull(value)

arrestee_total <- in_clean %>%
  filter(offender_type == "Arrestee",
         variable_category == "total",
         variable_detailed == "total_profiles") %>%
  pull(value)

combined_total <- convicted_total + arrestee_total

# Add Combined total to the data
combined_row <- data.frame(
  state = "Indiana",
  offender_type = "Combined",
  variable_category = "total",
  variable_detailed = "total_profiles",
  value = combined_total,
  value_type = "count",
  value_source = "calculated"
)

in_clean <- bind_rows(in_clean, combined_row)

cat(paste("Combined total profiles:", format(combined_total, big.mark = ","), "\n"))
cat("✓ Added Combined total profiles\n")
Combined total profiles: 300,741 
✓ Added Combined total profiles

3.3.3.5 Calculate Counts from Percentages

Indiana only provides percentages for demographic categories. We calculate the actual counts using the Combined total.

Show count calculation code
# Calculate counts from percentages for Combined offender type
in_clean <- bind_rows(in_clean, calculate_counts_from_percentages(in_clean, "Indiana"))

cat("✓ Calculated demographic counts from percentages\n")

# Verify the calculations
cat("Category totals after calculating counts:\n")
verify_category_totals(in_clean) %>% kable() %>% kable_styling()
✓ Calculated demographic counts from percentages
Category totals after calculating counts:
offender_type variable_category total_profiles sum_counts difference
Combined gender 300741 300741 0
Combined race 300741 300741 0

3.3.4 Verify Data Consistency

Final checks to ensure all data is now consistent and complete.

Final data consistency checks:
Counts consistent: TRUE 
Percentages consistent: TRUE 

Final data availability:
Race data: both 
Gender data: both 

3.3.5 Prepare for Combined Dataset

The cleaned data is formatted to match the master schema and appended to the foia_combined dataframe.

Show Indiana data preparation to combined dataset
# Prepare the cleaned data for the combined dataset
in_prepared <- prepare_state_for_combined(in_clean, "Indiana")

# Append to the master combined dataframe
foia_combined <- bind_rows(foia_combined, in_prepared)

cat(paste0("✓ Appended ", nrow(in_prepared), " Indiana rows to foia_combined\n"))
cat(paste0("✓ Total rows in foia_combined: ", nrow(foia_combined), "\n"))
✓ Appended 15 Indiana rows to foia_combined
✓ Total rows in foia_combined: 88

3.3.6 Document Metadata

The metadata is added with details on all processing steps performed.

Show Indiana data preparation and addition to metadata table
# Add Indiana to the metadata table using the helper function
add_state_metadata("Indiana", in_raw)

# Update metadata with QC results and processing notes
update_state_metadata("Indiana", 
                      counts_ok = counts_consistent(in_clean),
                      percentages_ok = percentages_consistent(in_clean),
                      notes_text = "Converted string values to numeric; standardized 'Black' to 'African American'; calculated Combined total profiles; derived all demographic counts from reported percentages")
✓ Metadata added for: Indiana 
✓ Metadata updated for: Indiana 

3.3.7 Visualizations

Indiana DNA Database Demographic Distributions

Indiana DNA Database Demographic Distributions

Indiana DNA Database Demographic Distributions

Indiana DNA Database Demographic Distributions

3.3.8 Summary Statistics

Show the summary statistics code
cat("Indiana DNA Database Summary:\n")
cat("=", strrep("=", 40), "\n")

# Total profiles by offender type
totals <- foia_combined %>%
  filter(state == "Indiana",
         variable_category == "total",
         variable_detailed == "total_profiles",
         value_type == "count") %>%
  select(offender_type, value, value_source) %>%
  mutate(value_formatted = format(value, big.mark = ","))

print(totals)

# Data completeness by value source
cat("\nData completeness by source:\n")
completeness <- foia_combined %>%
  filter(state == "Indiana") %>%
  group_by(value_source) %>%
  summarise(n_values = n(), .groups = "drop")

print(completeness)

# Final verification
cat("\nFinal verification:\n")
cat(paste("Counts consistent:", counts_consistent(foia_combined %>% filter(state == "Indiana")), "\n"))
cat(paste("Percentages consistent:", percentages_consistent(foia_combined %>% filter(state == "Indiana")), "\n"))
Indiana DNA Database Summary:
= ======================================== 
# A tibble: 3 × 4
  offender_type       value value_source value_formatted
  <chr>               <dbl> <chr>        <chr>          
1 Convicted Offender 279654 reported     "279,654"      
2 Arrestee            21087 reported     " 21,087"      
3 Combined           300741 calculated   "300,741"      

Data completeness by source:
# A tibble: 2 × 2
  value_source n_values
  <chr>           <int>
1 calculated         11
2 reported            4

Final verification:
Counts consistent: TRUE 
Percentages consistent: TRUE 

3.3.9 Summary of Indiana Processing

Indiana data processing complete. The unique dataset required:

  • Data conversion: String values converted to numeric, handling “<1” as 0.5

  • Terminology standardization: “Caucasian” converted to “White”

  • Calculated additions:

    • Combined total profiles across offender types

    • All demographic counts derived from reported percentages

  • Quality checks: All counts and percentages pass consistency validation

  • Provenance tracking: Clear distinction between reported and calculated values

The Indiana data is now standardized and ready for cross-state analysis.

3.4 Maine (ME)

Overview: Maine provides comprehensive reporting with both counts and percentages for all gender and race categories across all offender types, including pre-calculated Combined totals. The data is complete and requires no processing.

3.4.1 Examine Raw Data

Establish a baseline understanding of the data exactly as it was received.

Column

Type

Rows

Missing

Unique

Unique_Values

state

character

19

0

1

Maine

offender_type

character

19

0

1

Combined

variable_category

character

19

0

3

total, gender, race

variable_detailed

character

19

0

9

total_profiles, Male, Female, Unknown, White, Black, Native American, Hispanic, Asian

value

numeric

19

0

19

33711 ..., 27694 ..., 82.7 ..., 5734 ..., 17 ..., 83 ..., 0.2 ..., 31298 ..., 92.8 ..., 1299 ...

value_type

character

19

0

2

count, percentage

value_source

character

19

0

1

reported

Data frame dimensions: 19 rows × 7 columns

3.4.2 Verify Data Consistency

Runs quality checks using the verify_category_totals(), counts_consistent(), and percentages_consistent() functions.

Verifying that demographic counts match reported totals:
offender_type variable_category total_profiles sum_counts difference
Combined gender 33711 33511 200
Combined race 33711 33711 0

Counts consistency check on raw data:
All counts consistent: FALSE 

Percentage consistency check on raw data:
All percentages sum to ~100%: TRUE 

3.4.3 Address Data Gaps

3.4.3.1 Solve Percentages Inconsistency

Racial percentages summed to 99.9% instead of 100%

Proportional scaling was applied and value_source was updated to “calculated” for all adjusted values.

Show percentage recalculation code
# Start with the raw data
me_clean <- me_raw

# Adjust percentages to ensure they sum to 100% and mark as calculated
me_clean <- me_clean %>%
  group_by(value_type, variable_category) %>%
  mutate(
    value = ifelse(
      value_type == "percentage" & variable_category == "gender",
      value * (100 / sum(value, na.rm = TRUE)),
      value
    ),
    value_source = ifelse(
      value_type == "percentage" & variable_category == "gender",
      "calculated",
      value_source
    )
  ) %>%
  ungroup()

# Verify the new sum
percentage_sum <- me_clean %>%
  filter(value_type == "percentage" & variable_category == "gender") %>%
  summarise(total = sum(value, na.rm = TRUE))

cat("✓ Recalculated percentages for Maine - New sum:", round(percentage_sum$total, 2), "%\n")
✓ Recalculated percentages for Maine - New sum: 100 %

3.4.3.2 Recalculate Counts from Percentages

Maine’s reported gender counts sum were inconsistent with the total_profiles.

We removed existing gender count data and recalculated counts using percentage values and combined totals.

All recalculated values flagged with value_source = "calculated"

Show count recalculation code
# Remove existing gender count rows to avoid duplication
me_clean <- me_clean %>%
  filter(!(variable_category == "gender" & value_type == "count"))

cat("✓ Removed existing gender count data\n")

me_gender <- me_clean %>%
    filter(variable_category == "gender" | variable_category == "total")

# Calculate counts from percentages for Combined offender type
me_gender <- calculate_counts_from_percentages(me_gender, "Maine")

# Append recalculated gender counts to the main dataset
me_clean <- bind_rows(me_clean, me_gender)

cat("✓ Calculated demographic counts from percentages\n")

# Verify the calculations
cat("Category totals after calculating counts:\n")
verify_category_totals(me_clean) %>% kable() %>% kable_styling()
✓ Removed existing gender count data
✓ Calculated demographic counts from percentages
Category totals after calculating counts:
offender_type variable_category total_profiles sum_counts difference
Combined gender 33711 33711 0
Combined race 33711 33711 0

3.4.4 Verify Data Consistency

Final checks to ensure all data is now consistent and complete.

Final data consistency checks:
Counts consistent: TRUE 
Percentages consistent: TRUE 

Final data availability:
Race data: both 
Gender data: both 

3.4.5 Prepare for Combined Dataset

The Maine data is already complete and consistent. It is formatted to match the master schema and appended to the foia_combined dataframe.

Show Maine data preparation to combined dataset
# Prepare the data for the combined dataset
me_prepared <- prepare_state_for_combined(me_clean, "Maine")

# Append to the master combined dataframe
foia_combined <- bind_rows(foia_combined, me_prepared)

cat(paste0("✓ Appended ", nrow(me_prepared), " Maine rows to foia_combined\n"))
cat(paste0("✓ Total rows in foia_combined: ", nrow(foia_combined), "\n"))
✓ Appended 19 Maine rows to foia_combined
✓ Total rows in foia_combined: 107

3.4.6 Document Metadata

The metadata is added with a note that the data was complete and required no processing.

Show Maine data preparation and addition to metadata table
# Add Maine to the metadata table using the helper function
add_state_metadata("Maine", me_raw)

# Update metadata with QC results
update_state_metadata("Maine", 
                      counts_ok = counts_consistent(me_clean),
                      percentages_ok = percentages_consistent(me_clean),
                      notes_text = "Complete dataset provided with both counts and percentages. No processing or calculations required. All values are reported.")
✓ Metadata added for: Maine 
✓ Metadata updated for: Maine 

3.4.7 Visualizations

Maine DNA Database Demographic Distributions

Maine DNA Database Demographic Distributions

Maine DNA Database Demographic Distributions

Maine DNA Database Demographic Distributions

3.4.8 Summary Statistics

Show the summary statistics code
cat("Maine DNA Database Summary:\n")
cat("=", strrep("=", 40), "\n")

# Total profiles by offender type
totals <- foia_combined %>%
  filter(state == "Maine",
         variable_category == "total",
         variable_detailed == "total_profiles",
         value_type == "count") %>%
  select(offender_type, value) %>%
  mutate(value_formatted = format(value, big.mark = ","))

print(totals)

# Data completeness
cat("\nData completeness:\n")
completeness <- foia_combined %>%
  filter(state == "Maine") %>%
  group_by(offender_type, value_source) %>%
  summarise(n_values = n(), .groups = "drop")

print(completeness)

# Final verification
cat("\nFinal verification:\n")
cat(paste("Counts consistent:", counts_consistent(foia_combined %>% filter(state == "Maine")), "\n"))
cat(paste("Percentages consistent:", percentages_consistent(foia_combined %>% filter(state == "Maine")), "\n"))
Maine DNA Database Summary:
= ======================================== 
# A tibble: 1 × 3
  offender_type value value_formatted
  <chr>         <dbl> <chr>          
1 Combined      33711 33,711         

Data completeness:
# A tibble: 2 × 3
  offender_type value_source n_values
  <chr>         <chr>           <int>
1 Combined      calculated          6
2 Combined      reported           13

Final verification:
Counts consistent: TRUE 
Percentages consistent: TRUE 

3.4.9 Summary of Maine Processing

Maine data processing complete. The dataset is exemplary and required no adjustments:

  • Reported data: Both counts and percentages for all Convicted Offender, Arrestee, and Combined categories

  • No calculated additions needed: All values are sourced directly from the state report (value_source = "reported")

  • Quality checks: All counts and percentages pass consistency validation

  • Provenance tracking: All values maintain their original value_source as “reported”

The Maine data is now standardized and ready for cross-state analysis.

3.5 Nevada (NV)

Overview: Nevada provides both counts and percentages for gender and race categories but uses non-standard terminology that requires conversion for consistency with our schema.

3.5.1 Examine Raw Data

Establish a baseline understanding of the data exactly as it was received.

Column

Type

Rows

Missing

Unique

Unique_Values

state

character

21

0

1

Nevada

offender_type

character

21

0

4

All, Arrestee, Convicted Offender, Combined

variable_category

character

21

0

3

total, gender, race

variable_detailed

character

21

0

9

total_flags, total_profiles, Female, Male, Unknown, White, American Indian, Black, Asian

value

numeric

21

0

21

344097 ..., 185074 ..., 53.785 ..., 159023 ..., 46.215 ..., 63287 ..., 18.392 ..., 280738 ..., 81.587 ..., 72 ...

value_type

character

21

0

2

count, percentage

value_source

character

21

0

1

reported

Data frame dimensions: 21 rows × 7 columns

3.5.2 Verify Data Consistency

Initial check reveals Nevada’s non-standard terminology.

Initial data availability:
Race data: both 
Gender data: both 

Non-standard terminology found:
Offender types: All, Arrestee, Convicted Offender, Combined 

3.5.3 Address Data Gaps

3.5.3.1 Standardize Terminology

Nevada uses “All” instead of “Combined”, “total_flags” instead of “total_profiles” and “American Indian” instead of “Native American”.

Show terminology standardization code
# Start with raw data
nv_clean <- nv_raw

# Standardize offender types and racial terminology
nv_clean <- nv_clean %>%
  mutate(
    offender_type = case_when(
      offender_type == "All" ~ "Combined",
      TRUE ~ offender_type
    ),
    variable_detailed = case_when(
      variable_detailed == "total_flags" ~ "total_profiles",
      TRUE ~ variable_detailed
    ),
    variable_detailed = case_when(
      variable_detailed == "American Indian" ~ "Native American",
      TRUE ~ variable_detailed
    )
  )

cat("✓ Standardized terminology:\n")
cat("  - 'All' → 'Combined'\n")
cat("  - 'total_flags' → 'total_profiles'\n")
cat("  - 'American Indian' → 'Native American'\n")
✓ Standardized terminology:
  - 'All' → 'Combined'
  - 'total_flags' → 'total_profiles'
  - 'American Indian' → 'Native American'

3.5.3.2 Verify Consistency

Now that the offender types are standardized, we can verify the counts and percentages.

Verifying that demographic counts match reported totals:
offender_type variable_category total_profiles sum_counts difference
Combined gender 344097 344097 0
Combined race 344097 344097 0

Counts consistency check on raw data:
All counts consistent: TRUE 

Percentage consistency check on raw data:
All percentages sum to ~100%: TRUE 
Sum of 'race' percentages: 100 %
Sum of 'gender' percentages: 100 %

3.5.4 Prepare for Combined Dataset

The cleaned data is formatted to match the master schema and appended to the foia_combined dataframe.

Show Nevada data preparation to combined dataset
# Prepare the cleaned data for the combined dataset
nv_prepared <- prepare_state_for_combined(nv_clean, "Nevada")

# Append to the master combined dataframe
foia_combined <- bind_rows(foia_combined, nv_prepared)

cat(paste0("✓ Appended ", nrow(nv_prepared), " Nevada rows to foia_combined\n"))
cat(paste0("✓ Total rows in foia_combined: ", nrow(foia_combined), "\n"))
✓ Appended 21 Nevada rows to foia_combined
✓ Total rows in foia_combined: 128

3.5.5 Document Metadata

The metadata is added with details on the terminology standardization performed.

Show Nevada data preparation and addition to metadata table
# Add Nevada to the metadata table using the helper function
add_state_metadata("Nevada", nv_raw)

# Update metadata with QC results and processing notes
update_state_metadata("Nevada", 
                      counts_ok = counts_consistent(nv_clean),
                      percentages_ok = percentages_consistent(nv_clean),
                      notes_text = "Standardized terminology: 'All' to 'Combined' and 'American Indian' to 'Native American'. All values remain reported.")
✓ Metadata added for: Nevada 
✓ Metadata updated for: Nevada 

3.5.6 Visualizations

Nevada DNA Database Demographic Distributions

Nevada DNA Database Demographic Distributions

Nevada DNA Database Demographic Distributions

Nevada DNA Database Demographic Distributions

3.5.7 Summary Statistics

Show the summary statistics code
cat("Nevada DNA Database Summary:\n")
cat("=", strrep("=", 40), "\n")

# Total profiles by offender type
totals <- foia_combined %>%
  filter(state == "Nevada",
         variable_category == "total",
         variable_detailed == "total_profiles",
         value_type == "count") %>%
  select(offender_type, value) %>%
  mutate(value_formatted = format(value, big.mark = ","))

print(totals)

# Data completeness
cat("\nData completeness:\n")
completeness <- foia_combined %>%
  filter(state == "Nevada") %>%
  group_by(offender_type, value_source) %>%
  summarise(n_values = n(), .groups = "drop")

print(completeness)

# Final verification
cat("\nFinal verification:\n")
cat(paste("Counts consistent:", counts_consistent(foia_combined %>% filter(state == "Nevada")), "\n"))
cat(paste("Percentages consistent:", percentages_consistent(foia_combined %>% filter(state == "Nevada")), "\n"))
Nevada DNA Database Summary:
= ======================================== 
# A tibble: 3 × 3
  offender_type       value value_formatted
  <chr>               <dbl> <chr>          
1 Combined           344097 344,097        
2 Arrestee           185074 185,074        
3 Convicted Offender 159023 159,023        

Data completeness:
# A tibble: 3 × 3
  offender_type      value_source n_values
  <chr>              <chr>           <int>
1 Arrestee           reported            2
2 Combined           reported           17
3 Convicted Offender reported            2

Final verification:
Counts consistent: TRUE 
Percentages consistent: FALSE 

3.5.8 Summary of Nevada Processing

Nevada data processing complete. The dataset required minimal adjustments:

  • Terminology standardization:

    • “All” → “Combined” (offender type)

    • “American Indian” → “Native American” (race category)

  • Reported data: Both counts and percentages for all categories

  • Quality checks: All counts and percentages pass consistency validation

  • Provenance tracking: All values maintain value_source = "reported" as only terminology changes were made

The Nevada data is now standardized and ready for cross-state analysis.

3.6 South Dakota (SD)

Overview: South Dakota provides the most comprehensive reporting with both counts and percentages for all standard categories plus unique intersectional gender×race data. Minor terminology standardization is required for consistency.

3.6.1 Examine Raw Data

Establish a baseline understanding of the data exactly as it was received.

Column

Type

Rows

Missing

Unique

Unique_Values

state

character

41

0

1

South Dakota

offender_type

character

41

0

1

Combined

variable_category

character

41

0

4

total, gender, race, gender_race

variable_detailed

character

41

0

21

total_profiles ..., Male ..., Female ..., Asian ..., Black ..., Hispanic ..., Native American ..., Other/Unknown ..., White/Caucasian ..., Male_Asian ...

value

numeric

41

0

38

67753 ..., 51197 ..., 75.56 ..., 16556 ..., 24.44 ..., 5 ..., 0.08 ..., 4041 ..., 5.96 ..., 2949 ...

value_type

character

41

0

2

count, percentage

value_source

character

41

0

1

reported

Data frame dimensions: 41 rows × 7 columns

3.6.2 Gender-race intersection analysis

Since South Dakota is the only state that reported gender-race intersection data, we can analyze it in detail.

South Dakota Intersectional Gender × Race Analysis

3.6.3 Verify Data Consistency

Initial check reveals South Dakota’s comprehensive data structure with some non-standard terminology.

Initial data availability:
Race data: both 
Gender data: both 

Non-standard terminology found:
Race terms: Asian, Black, Hispanic, Native American, Other/Unknown, White/Caucasian 
Verifying that demographic counts match reported totals:
offender_type variable_category total_profiles sum_counts difference
Combined gender 67753 67753 0
Combined race 67753 67702 51

Counts consistency check on raw data:
All counts consistent: FALSE 

Percentage consistency check on raw data:
All percentages sum to ~100%: TRUE 
Sum of 'race' percentages: 100 %
Sum of 'gender' percentages: 100 %

3.6.4 Address Data Gaps

3.6.4.1 Standardize Terminology

South Dakota uses “White/Caucasian” and “Other/Unknown” which need standardization.

Show terminology standardization code
# Standardize racial terminology
sd_clean <- sd_clean %>%
  mutate(
    variable_detailed = case_when(
      variable_detailed == "White/Caucasian" ~ "White",
      variable_detailed == "Other/Unknown" ~ "Unknown",
      TRUE ~ variable_detailed
    )
  )

cat("✓ Standardized terminology:\n")
cat("  - 'White/Caucasian' → 'White'\n")
cat("  - 'Other/Unknown' → 'Unknown'\n")

# Verify the changes
cat("\nRace categories after standardization:\n")
sd_clean %>%
  filter(variable_category == "race") %>%
  distinct(variable_detailed) %>%
  pull() %>%
  paste(collapse = ", ") %>%
  cat()
✓ Standardized terminology:
  - 'White/Caucasian' → 'White'
  - 'Other/Unknown' → 'Unknown'

Race categories after standardization:
Asian, Black, Hispanic, Native American, Unknown, White

3.6.4.2 Recalculate Counts from Percentages

South Dakota’s reported race counts sum were inconsistent with the total_profiles.

We removed existing gender count data and recalculated counts using percentage values and combined totals.

All recalculated values flagged with value_source = "calculated"

Show count recalculation code
# Remove existing gender count rows to avoid duplication
sd_clean <- sd_clean %>%
  filter(!(variable_category == "race" & value_type == "count"))

cat("✓ Removed existing race count data\n")

sd_race <- sd_clean %>%
    filter(variable_category == "race" | variable_category == "total")

# Calculate counts from percentages for Combined offender type
sd_race <- calculate_counts_from_percentages(sd_race, "South Dakota")

# Append recalculated race counts to the main dataset
sd_clean <- bind_rows(sd_clean, sd_race)

cat("✓ Calculated demographic counts from percentages\n")

# Verify the calculations
cat("Category totals after calculating counts:\n")
verify_category_totals(sd_clean) %>% kable() %>% kable_styling()
✓ Removed existing race count data
✓ Calculated demographic counts from percentages
Category totals after calculating counts:
offender_type variable_category total_profiles sum_counts difference
Combined gender 67753 67753 0
Combined race 67753 67752 1

We handled this diffence of 1 by adding it to the most representative race (White).

Show difference handle code
# Handle the difference of 1 by adding it to the most representative race
sd_clean <- sd_clean %>%
  mutate(value = ifelse(variable_detailed == "White" & value_type == "count", value + 1, value))

3.6.5 Verify Data Consistency

Final checks to ensure standardization didn’t affect data integrity.

Final data consistency checks after standardization:
Verifying that demographic counts match reported totals:
offender_type variable_category total_profiles sum_counts difference
Combined gender 67753 67753 0
Combined race 67753 67753 0

Counts consistency check:
All counts consistent: TRUE 

Percentage consistency check:
All percentages sum to ~100%: TRUE 

3.6.6 Prepare for Combined Dataset

The cleaned data is formatted to match the master schema and appended to the foia_combined dataframe.

Show South Dakota data preparation to combined dataset
# Prepare the cleaned data for the combined dataset
sd_prepared <- prepare_state_for_combined(sd_clean, "South Dakota")

# Append to the master combined dataframe
foia_combined <- bind_rows(foia_combined, sd_prepared)

cat(paste0("✓ Appended ", nrow(sd_prepared), " South Dakota rows to foia_combined\n"))
cat(paste0("✓ Total rows in foia_combined: ", nrow(foia_combined), "\n"))

# Show the comprehensive nature of South Dakota's data
cat("\nSouth Dakota's comprehensive data structure:\n")
sd_prepared %>%
  group_by(variable_category) %>%
  summarise(n_rows = n(), .groups = "drop") %>%
  kable() %>% kable_styling()
✓ Appended 17 South Dakota rows to foia_combined
✓ Total rows in foia_combined: 145

South Dakota's comprehensive data structure:
variable_category n_rows
gender 4
race 12
total 1

3.6.7 Document Metadata

The metadata is added with details on South Dakota’s comprehensive reporting and the terminology standardization performed.

Show South Dakota data preparation and addition to metadata table
# Add South Dakota to the metadata table using the helper function
add_state_metadata("South Dakota", sd_raw)

# Update metadata with QC results and processing notes
update_state_metadata("South Dakota", 
                      counts_ok = counts_consistent(sd_clean),
                      percentages_ok = percentages_consistent(sd_clean),
                      notes_text = "Standardized terminology: 'White/Caucasian' to 'White' and 'Other/Unknown' to 'Unknown'. Includes comprehensive gender_race intersectional data. All values remain reported.")
✓ Metadata added for: South Dakota 
✓ Metadata updated for: South Dakota 

3.6.8 Visualizations

South Dakota DNA Database Demographic Distributions

South Dakota DNA Database Demographic Distributions

South Dakota DNA Database Demographic Distributions

South Dakota DNA Database Demographic Distributions

3.6.9 Summary Statistics

Show the summary statistics code
cat("South Dakota DNA Database Summary:\n")
cat("=", strrep("=", 40), "\n")

# Total profiles by offender type
totals <- foia_combined %>%
  filter(state == "South Dakota",
         variable_category == "total",
         variable_detailed == "total_profiles",
         value_type == "count") %>%
  select(offender_type, value) %>%
  mutate(value_formatted = format(value, big.mark = ","))

print(totals)

# Data completeness by category
cat("\nData completeness by category:\n")
completeness <- foia_combined %>%
  filter(state == "South Dakota") %>%
  group_by(variable_category) %>%
  summarise(n_values = n(), .groups = "drop")

print(completeness)

# Final verification
cat("\nFinal verification:\n")
cat(paste("Counts consistent:", counts_consistent(foia_combined %>% filter(state == "South Dakota")), "\n"))
cat(paste("Percentages consistent:", percentages_consistent(foia_combined %>% filter(state == "South Dakota")), "\n"))
South Dakota DNA Database Summary:
= ======================================== 
# A tibble: 1 × 3
  offender_type value value_formatted
  <chr>         <dbl> <chr>          
1 Combined      67753 67,753         

Data completeness by category:
# A tibble: 3 × 2
  variable_category n_values
  <chr>                <int>
1 gender                   4
2 race                    12
3 total                    1

Final verification:
Counts consistent: TRUE 
Percentages consistent: TRUE 

3.6.10 Summary of South Dakota Processing

South Dakota data processing complete. The state provided exemplary data with minimal adjustments needed:

  • Terminology standardization:

    • “White/Caucasian” → “White”

    • “Other/Unknown” → “Unknown”

  • Comprehensive reporting: Standard demographics plus unique gender×race intersectional data

  • Reported data: Both counts and percentages for all categories

  • Quality checks: All counts and percentages pass consistency validation

  • Provenance tracking: All values maintain value_source = "reported" as only terminology changes were made

South Dakota’s data is now standardized and ready for cross-state analysis.

3.7 Texas (TX)

Overview: Texas provides counts only for gender and race categories. The Male gender is missing in the dataset. The state uses non-standard terminology that requires conversion and needs Combined totals and percentages calculated.

3.7.1 Examine Raw Data

Establish a baseline understanding of the data exactly as it was received.

Column

Type

Rows

Missing

Unique

Unique_Values

state

character

16

0

1

Texas

offender_type

character

16

0

2

Offenders, Arrestee

variable_category

character

16

0

3

total, gender, race

variable_detailed

character

16

0

8

total_profiles, Female, Asian, African American, Caucasian, Hispanic, Native American, Other

value

numeric

16

0

16

845322 ..., 73631 ..., 121434 ..., 18721 ..., 3361 ..., 254366 ..., 309010 ..., 276245 ..., 138 ..., 2173 ...

value_type

character

16

0

1

count

value_source

character

16

0

1

reported

Data frame dimensions: 16 rows × 7 columns

3.7.2 Verify Data Consistency

Initial checks reveal Texas’s reporting structure and terminology differences.

Initial data availability:
Race data: counts 
Gender data: counts 

Non-standard terminology found:
Offender types: Offenders, Arrestee 
Race terms: Asian, African American, Caucasian, Hispanic, Native American, Other 

3.7.3 Address Data Gaps

3.7.3.1 Add Missing Male category

Texas data reports only Female counts explicitly. We calculated Male counts by subtracting Female counts from total profiles, assuming binary gender classification in the dataset.

Show male addition code
# First, let's examine the current structure of gender data
gender_data <- tx_raw %>%
  filter(variable_category == "gender")

cat("Current gender structure:\n")
print(unique(gender_data$variable_detailed))

# Get total profiles for each offender type
total_profiles <- tx_raw %>%
  filter(variable_category == "total" & variable_detailed == "total_profiles") %>%
  select(offender_type, total_value = value)

# Join total profiles with gender data
gender_with_totals <- gender_data %>%
  left_join(total_profiles, by = "offender_type")

# Create Male entries for each offender type
male_entries <- gender_with_totals %>%
  filter(variable_detailed == "Female") %>%
  mutate(
    variable_detailed = "Male",
    value = total_value - value, 
    value_source = "calculated",
    total_value = NULL 
  )

# Add these entries to the original dataset
tx_raw_with_male <- tx_raw %>%
  bind_rows(male_entries)

# Update the tx_raw object
tx_clean <- tx_raw_with_male

# Verify the addition
cat("\nAfter adding Male entries - gender categories:\n")
print(unique(tx_clean %>% 
       filter(variable_category == "gender") %>% 
       pull(variable_detailed)))
Current gender structure:
[1] "Female"

After adding Male entries - gender categories:
[1] "Female" "Male"  

3.7.3.2 Standardize Terminology

Texas uses “Offenders” instead of “Convicted Offender” and “Caucasian” instead of “White”.

Show terminology standardization code
# Standardize offender types and racial terminology
tx_clean <- tx_clean %>%
  mutate(
    offender_type = case_when(
      offender_type == "Offenders" ~ "Convicted Offender",
      TRUE ~ offender_type
    ),
    variable_detailed = case_when(
      variable_detailed == "Caucasian" ~ "White",
      variable_detailed == "African American" ~ "Black",
      TRUE ~ variable_detailed
    )
  )

cat("✓ Standardized terminology:\n")
cat("  - 'Offenders' → 'Convicted Offender'\n")
cat("  - 'Caucasian' → 'White'\n")
cat("  - 'African American' → 'Black'\n")
cat(paste("Offender types after standardization:", paste(sort(unique(tx_clean$offender_type)), collapse = ", "), "\n"))
✓ Standardized terminology:
  - 'Offenders' → 'Convicted Offender'
  - 'Caucasian' → 'White'
  - 'African American' → 'Black'
Offender types after standardization: Arrestee, Convicted Offender 

3.7.3.3 Create Unknown Category

Texas race count is inconsistent, with a significant number of profiles not reported in any racial category.

Unknown category was created to account for these missing profiles.

The calculated values are added with a value_source = "calculated" tag to maintain transparency about what was provided versus what was derived.

Show unknown addition code
# Add Unknown race category to reconcile totals
tx_clean <- fill_demographic_gaps(tx_clean)

# Verify the fix
cat("Category totals after adding Unknown race category:\n")
verify_category_totals(tx_clean) %>% kable() %>% kable_styling()

cat("\nCounts consistency after adding Unknown:\n")
cat(paste("All counts consistent:", counts_consistent(tx_clean), "\n"))
Category totals after adding Unknown race category:
offender_type variable_category total_profiles sum_counts difference
Arrestee gender 73631 73631 0
Arrestee race 73631 73631 0
Convicted Offender gender 845322 845322 0
Convicted Offender race 845322 845322 0

Counts consistency after adding Unknown:
All counts consistent: TRUE 

3.7.3.4 Create Combined Totals

Texas only reported data for “Convicted Offender” and “Arrestee” separately. We calculate Combined totals.

Show combined addition code
# Calculate Combined totals using helper function
tx_clean <- add_combined(tx_clean)

cat("✓ Created Combined totals for Texas\n")

# Show the Combined total
combined_total <- tx_clean %>%
  filter(offender_type == "Combined",
         variable_category == "total",
         variable_detailed == "total_profiles") %>%
  pull(value)

cat(paste("Combined total profiles:", format(combined_total, big.mark = ","), "\n"))
✓ Created Combined totals for Texas
Combined total profiles: 918,953 

3.7.3.5 Calculate Percentages

Transforms the data from counts into percentages for comparative analysis.

Show percentage calculation code
# Derive percentages from counts
tx_clean <- add_percentages(tx_clean)

cat("✓ Added percentages for all demographic categories\n")

# Check percentage consistency
cat("Percentage consistency check:\n")
cat(paste("All percentages sum to ~100%:", percentages_consistent(tx_clean), "\n\n"))

# Show current data availability
cat("Final data availability:\n")
cat(paste("Race data:", report_status(tx_clean, "race"), "\n"))
cat(paste("Gender data:", report_status(tx_clean, "gender"), "\n"))
✓ Added percentages for all demographic categories
Percentage consistency check:
All percentages sum to ~100%: TRUE 

Final data availability:
Race data: both 
Gender data: both 

3.7.4 Verify Data Consistency

Final checks to ensure all processing maintained data integrity.

Final data consistency checks:
Verifying that demographic counts match reported totals:
offender_type variable_category total_profiles sum_counts difference
Arrestee gender 73631 73631 0
Arrestee race 73631 73631 0
Combined gender 918953 918953 0
Combined race 918953 918953 0
Convicted Offender gender 845322 845322 0
Convicted Offender race 845322 845322 0

Counts consistency check:
All counts consistent: TRUE 

Percentage consistency check:
All percentages sum to ~100%: TRUE 

3.7.5 Prepare for Combined Dataset

The cleaned data is formatted to match the master schema and appended to the foia_combined dataframe.

Show Texas data preparation to combined dataset
# Prepare the cleaned data for the combined dataset
tx_prepared <- prepare_state_for_combined(tx_clean, "Texas")

# Append to the master combined dataframe
foia_combined <- bind_rows(foia_combined, tx_prepared)

cat(paste0("✓ Appended ", nrow(tx_prepared), " Texas rows to foia_combined\n"))
cat(paste0("✓ Total rows in foia_combined: ", nrow(foia_combined), "\n"))
✓ Appended 57 Texas rows to foia_combined
✓ Total rows in foia_combined: 202

3.7.6 Document Metadata

The metadata is added with details on all processing steps performed.

Show Texas data preparation and addition to metadata table
# Add Texas to the metadata table using the helper function
add_state_metadata("Texas", tx_raw)

# Update metadata with QC results and processing notes
update_state_metadata("Texas", 
                      counts_ok = counts_consistent(tx_clean),
                      percentages_ok = percentages_consistent(tx_clean),
                      notes_text = "Standardized terminology: 'Offenders' to 'Convicted Offender', 'Caucasian' to 'White', 'African American' to 'Black'; calculated Combined totals and all percentages")
✓ Metadata added for: Texas 
✓ Metadata updated for: Texas 

3.7.7 Visualizations

Texas DNA Database Demographic Distributions

Texas DNA Database Demographic Distributions

Texas DNA Database Demographic Distributions

Texas DNA Database Demographic Distributions

Texas DNA Database Demographic Distributions

Texas DNA Database Demographic Distributions

Texas DNA Database Demographic Distributions

Texas DNA Database Demographic Distributions

Texas DNA Database Demographic Distributions

Texas DNA Database Demographic Distributions

Texas DNA Database Demographic Distributions

Texas DNA Database Demographic Distributions

Texas Demographic Distributions by Offender Type

3.7.8 Summary Statistics

Show the summary statistics code
cat("Texas DNA Database Summary:\n")
cat("=", strrep("=", 40), "\n")

# Total profiles by offender type
totals <- foia_combined %>%
  filter(state == "Texas",
         variable_category == "total",
         variable_detailed == "total_profiles",
         value_type == "count") %>%
  select(offender_type, value, value_source) %>%
  mutate(value_formatted = format(value, big.mark = ","))

print(totals)

# Data completeness by value source
cat("\nData completeness by source:\n")
completeness <- foia_combined %>%
  filter(state == "Texas") %>%
  group_by(value_source) %>%
  summarise(n_values = n(), .groups = "drop")

print(completeness)

# Final verification
cat("\nFinal verification:\n")
cat(paste("Counts consistent:", counts_consistent(foia_combined %>% filter(state == "Texas")), "\n"))
cat(paste("Percentages consistent:", percentages_consistent(foia_combined %>% filter(state == "Texas")), "\n"))
Texas DNA Database Summary:
= ======================================== 
# A tibble: 3 × 4
  offender_type       value value_source value_formatted
  <chr>               <dbl> <chr>        <chr>          
1 Convicted Offender 845322 reported     "845,322"      
2 Arrestee            73631 reported     " 73,631"      
3 Combined           918953 calculated   "918,953"      

Data completeness by source:
# A tibble: 2 × 2
  value_source n_values
  <chr>           <int>
1 calculated         41
2 reported           16

Final verification:
Counts consistent: TRUE 
Percentages consistent: TRUE 

3.7.9 Summary of Texas Processing

Texas data processing complete. The dataset required several adjustments:

  • Male Category Addition:

    • “Male” added to variable_detailed
  • Terminology standardization:

    • “Offenders” → “Convicted Offender”

    • “Caucasian” → “White”

    • “African American” → “Black”

  • Calculated additions:

    • Combined totals across all offender types

    • Percentage values for all demographic categories

  • Quality checks: All counts and percentages pass consistency validation

  • Provenance tracking: Clear distinction between reported and calculated values

The Texas data is now standardized and ready for cross-state analysis.

3.8 Combined Dataset

3.9 Metadata table

4 Saving Processed Data

Show final saving process code
# Define output paths
output_dir <- here("data", "foia", "final")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

# Save the combined dataset
foia_output_path <- here(output_dir, "foia_data_clean.csv")
write_csv(foia_combined, foia_output_path)
cat(paste("✓ Saved combined FOIA data to:", foia_output_path, "\n"))

# Save the metadata
metadata_dir <- here("data", "foia", "intermediate")
metadata_output_path <- here(metadata_dir, "foia_state_metadata.csv")
write_csv(foia_state_metadata, metadata_output_path)
cat(paste("✓ Saved state metadata to:", metadata_output_path, "\n"))

# No additional copies are written to shared version folders here; use
# analysis/version_freeze.qmd when a full processed-data release is needed.
cat("\n✅ All processing complete! Final files written to data/foia/.\n")
✓ Saved combined FOIA data to: /Users/tlasisi/GitHub/PODFRIDGE-Databases/data/foia/final/foia_data_clean.csv 
✓ Saved state metadata to: /Users/tlasisi/GitHub/PODFRIDGE-Databases/data/foia/intermediate/foia_state_metadata.csv 

✅ All processing complete! Final files written to data/foia/.

5 Conclusions

  1. Data Acquisition and Harmonization: We ingested seven unique state datasets (california_foia_data.csv through texas_foia_data.csv), each with distinct reporting formats, terminology, and levels of completeness. Through a systematic processing workflow, we harmonized these into a single, tidy long-format dataset (foia_combined), ensuring consistency across all variables.

  2. Standardization of Terminology: A significant challenge was the non-standard terminology used across states. We implemented a rigorous process to map all state-specific terms to a common data model:

    • Offender Types: Standardized to "Convicted Offender", "Arrestee", and "Combined".

    • Race Categories: Mapped terms like "Caucasian", "African American", and "American Indian" to standardized categories ("White", "Black", "Native American").

    • Total Profiles: Consolidated terms like "total_flags" to "total_profiles".

  3. Imputation and Calculation of Missing Data: To ensure comparability, we calculated values that were not directly provided by the states:

    • Derived Percentages: For states providing only counts (CA, TX), we calculated percentage compositions.

    • Derived Counts: For states providing only percentages (IN), we calculated absolute numbers using reported totals.

    • Calculated Totals: We created "Combined" offender type totals for states that only reported separate "Convicted Offender" and "Arrestee" figures.

    • Inferred Categories: We added "Unknown" race and "Male" gender categories where they were logically missing but necessary to reconcile reported totals (CA, TX).

  4. Quality Assurance and Transparency: A core principle of this project was maintaining transparency and data provenance. This allows future researchers to understand exactly what was provided by the state versus what was derived during processing.

    • Validation checks: counts_consistent(), percentages_consistent()

    • Value tagging: "reported" or "calculated".

    • Metadata table: foia_state_metadata provides a clear audit trail of each state’s original characteristics and the processing steps applied.