This reproducible R Markdown
analysis was created with workflowr (version
1.7.1). The Checks tab describes the reproducibility checks
that were applied when the results were created. The Past
versions tab lists the development history.
Great job! The global environment was empty. Objects defined in the
global environment can affect the analysis in your R Markdown file in
unknown ways. For reproducibility it’s best to always run the code in an
empty environment.
The command set.seed(20230302) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great! You are using Git for version control. Tracking code development
and connecting the code version to the results is critical for
reproducibility.
The results in this page were generated with repository version
35e5bd5.
See the Past versions tab to see a history of the changes made
to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Note that any generated files, e.g. HTML, png, CSS, etc., are not
included in this status report because it is ok for generated content to
have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/racial_proportion.Rmd) and
HTML (docs/racial_proportion.html) files. If you’ve
configured a remote Git repository (see ?wflow_git_remote),
click on the hyperlinks in the table below to view the files as they
were in that past version.
To compare the M&T forensic DNA database with the census database,
we first need the racial breakdown (e.g., the proportions of black and
white people) of the state-level forensic databases across the US.
However, this breakdown is available for only 7 states: California,
Florida, Indiana, Maine, Nevada, South Dakota and Texas. Thus, we
develop a statistical model to estimate the proportions of black and
white Americans in the remaining 43 states. Specifically, we focus on
the differences in racial breakdown between the forensic database and
the census database.
2 Binomial Logistic Regression Model
In this analysis, we fit binomial regression models to produce these
estimates. To keep the main findings interpretable, we first give a
brief introduction to this common model.
If you don’t care about the details of the binomial regression model,
feel free to skip this part.
First, we introduce the concept of a binomial distribution. Suppose
that in each observation an event has two possible states, success or
failure, and the probability of success is \(p\). Then over \(n\) observations, the number of successes
\(y\in \{0,1,\ldots, n\}\) follows a
binomial distribution \(\text{B}(n,p)\). The special case
\(n=1\) is the simple Bernoulli distribution. The success
probability \(p\) determines the
characteristics of the binomial distribution; an important property is
that the expectation is \(np\).
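As a quick numerical illustration, here is a minimal R sketch (with arbitrary choices of \(n\) and \(p\)) that draws from \(\text{B}(n,p)\) and checks the expectation \(np\):

# Draws from a binomial distribution B(n, p); n and p are arbitrary choices
set.seed(1)
n <- 100   # number of trials per observation
p <- 0.3   # success probability
y <- rbinom(1e5, size = n, prob = p)   # 100,000 simulated success counts
mean(y)    # close to the theoretical expectation n * p = 30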
Now we can turn to binomial regression. Let the predictor be
\(x\) and the response variable be
\(y\); we assume \[y|x\sim \text{B}(n, g(x^\top \beta)),\]
where \(\beta\) is the coefficient vector
and \(g\) is the inverse link (mean) function taking
values in \([0,1]\). Popular
choices of \(g\) correspond to the logit and
probit links. Here we choose the logit link for its
interpretability; its inverse is the logistic function \[g:\mathbb{R}\rightarrow [0,1], \quad
g(z)=\frac{\exp(z)}{\exp(z)+1}.\] To fit this model, we
estimate the linear coefficients \(\hat{\beta}\), so that on a new point
\(x_*\) we can predict the success probability \(\hat{p}=g(x_*^\top
\hat{\beta})\), which can be interpreted as the
proportion of success events given \(x_*\).
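For concreteness, here is a minimal R sketch of the logistic function and a binomial regression fit on simulated data; all names and values are illustrative, not from the actual analysis:

# Logistic (inverse logit) function g(z) = exp(z) / (exp(z) + 1); same as plogis()
g <- function(z) exp(z) / (exp(z) + 1)

set.seed(2)
x <- runif(50)                                    # one toy predictor
n <- rep(200, 50)                                 # trials per observation
y <- rbinom(50, size = n, prob = g(-1 + 2 * x))   # simulated success counts

# glm() with family = binomial uses the logit link by default
fit <- glm(cbind(y, n - y) ~ x, family = binomial)

# Predicted success probability p_hat = g(x*' beta_hat) at a new point x* = 0.5
predict(fit, newdata = data.frame(x = 0.5), type = "response")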
3 Data and Model Setting
In this section, we demonstrate the response variable and predictors
used in the binomial regression, and give the concrete model
equation.
3.1 Response variable:
The total number of people for each state in the M&T forensic
database.
The number of people of each race for each state in the M&T database.
3.2 Predictors:
The proportion of black and white people for each state in the
census database.
The proportion of black and white people in the prison population
for each state.
3.3 Stick-breaking:
We divide the US population into three categories: black, white, and
other. We then need to estimate the complete racial breakdown over these
three categories in each state. To ensure these proportions sum
to 1, we use a simple stick-breaking technique. That is, we separately
estimate the proportion of white people among all people, \(p_{1}\), and the proportion of black people among
non-white people, \(p_{2}\). The
racial breakdown is then given by \[\pi_{white}=p_1, \quad
\pi_{black}=(1-p_1)p_2,\quad \pi_{other}=(1-p_1)(1-p_2).\] We run
two binomial regression models to predict \(p_1\) and \(p_2\) for each state.
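In code, the stick-breaking map from \((p_1, p_2)\) to the three proportions is simply (a minimal sketch with illustrative inputs):

# Stick-breaking: p1 = white share of everyone, p2 = black share of non-white
stick_break <- function(p1, p2) {
  c(white = p1,
    black = (1 - p1) * p2,
    other = (1 - p1) * (1 - p2))
}
stick_break(p1 = 0.6, p2 = 0.7)   # 0.60, 0.28, 0.12 -- sums to 1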
3.4 Model equation
Compared to the predictors used by Hanna, we remove the racial
indicator and its interaction with the census/prison proportion to
avoid collinearity, which would cause a singularity problem in the
regression. The model equations are therefore defined as \[ \frac{white}{all} =
g(\beta_{00} + \beta_{01}\,white_{census}+ \beta_{02}\,white_{prison}),
\]\[ \frac{black}{nonwhite} =
g(\beta_{10} + \beta_{11}\,black_{census} + \beta_{12}\,black_{prison}).
\]
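Matching the glm() calls reported in Section 4.1 below, the two fits can be sketched as follows; train_data and the column names are taken from that output, so treat this as an outline of the setup rather than the exact analysis code:

# p1: white / all, with census and prison white proportions as predictors
fit_white <- glm(cbind(y[, 1], y[, 3] - y[, 1]) ~
                   census.percent.white + incarc.percent.white,
                 family = binomial, data = train_data)

# p2: black / non-white, with census and prison black proportions as predictors
fit_black <- glm(cbind(y[, 2], y[, 3] - y[, 1] - y[, 2]) ~
                   census.remain.percent.black + incarc.remain.percent.black,
                 family = binomial, data = train_data)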
3.5 Coefficient interpretability
To interpret these coefficients, write the binomial
logistic regression model as \[p = g(\beta_0
+ \sum_{j=1}^{k}\beta_j x_{j}),\] where \(g\) is the logistic function above and \(x_j\) for \(j=1,\ldots,k\) are covariates. We define the
odds as \(p/(1-p)\), the ratio
of the success probability to the failure probability. With the
logit link, the model equation can be written as \[\log\Big(\frac{p}{1-p}\Big)=\beta_0 + \sum_{j=1}^{k}\beta_j
x_{j}.\] Therefore, each coefficient \(\beta_j\) can be interpreted as the increase in the log odds
for every unit increase in \(x_j\).
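Equivalently, exponentiating a coefficient gives the multiplicative change in the odds per unit increase in the covariate. Assuming a fitted model object such as fit_white from the sketch above:

# exp(beta_j) multiplies the odds p / (1 - p) for each one-unit increase in x_j
exp(coef(fit_white))
# Wald confidence intervals, exponentiated onto the odds-ratio scale
exp(confint.default(fit_white))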
4 Model Evaluation
If you just want to read the answers to the
key questions, please skip this part and go directly to the next
section.
4.1 Coefficient estimation and hypothesis testing
We estimate the linear coefficients \(\beta\) of the binomial logistic regression
for black Americans and white Americans using the glm()
function, and run a Wald test on the coefficients. The \(p\)-values show that all coefficients are
statistically significant.
Call:
glm(formula = cbind(y[, 1], y[, 3] - y[, 1]) ~ census.percent.white +
    incarc.percent.white, family = binomial, data = train_data)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)          -2.89315    0.01097 -263.80   <2e-16 ***
census.percent.white  3.29887    0.03424   96.34   <2e-16 ***
incarc.percent.white  2.17351    0.03143   69.16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 599698  on 6  degrees of freedom
Residual deviance:  85204  on 4  degrees of freedom
AIC: 85300

Number of Fisher Scoring iterations: 4
Call:
glm(formula = cbind(y[, 2], y[, 3] - y[, 1] - y[, 2]) ~ census.remain.percent.black +
    incarc.remain.percent.black, family = binomial, data = train_data)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -3.923344   0.008174  -480.0   <2e-16 ***
census.remain.percent.black -0.432640   0.027555   -15.7   <2e-16 ***
incarc.remain.percent.black  8.109374   0.032434   250.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 960397  on 6  degrees of freedom
Residual deviance:  29623  on 4  degrees of freedom
AIC: 29710

Number of Fisher Scoring iterations: 4
4.2 Goodness-of-Fit
Furthermore, we plot the racial proportions fitted by the
stick-breaking binomial regression against the ground truth for the 7
states with available data. The figure shows that the data points lie
close to the identity line, which means that, at a minimum, our model
fits the training data well.
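A figure of this kind can be sketched as below, where observed and fitted_prop are hypothetical vectors of the true and model-fitted proportions for the 7 training states:

# observed / fitted_prop: illustrative placeholder values only
observed    <- c(0.62, 0.55, 0.70, 0.80, 0.85, 0.75, 0.45)
fitted_prop <- c(0.60, 0.57, 0.69, 0.78, 0.86, 0.74, 0.47)

# Fitted vs. observed proportions; points near the identity line indicate good fit
plot(observed, fitted_prop, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Observed proportion", ylab = "Fitted proportion")
abline(0, 1, lty = 2)   # identity line y = x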
Black Americans are significantly overrepresented, while White
Americans are underrepresented, in the M&T forensic database
compared to their census representation.
5.2 Racial breakdown
We generate side-by-side pie charts for each state showing the racial
composition according to the census (left) versus the estimated racial
composition of CODIS (right). The figure shows that
Black/African Americans are overrepresented and white Americans are
underrepresented in CODIS compared to the Census.
To compare racial proportions between CODIS and the Census in each state,
we visualize the absolute and relative differences in racial
proportions, defined as follows: \[\begin{equation*}
\begin{split}
absolute~difference&=Proportion_{CODIS}-Proportion_{Census},\\
relative~difference&=\frac{Proportion_{CODIS}-Proportion_{Census}}{Proportion_{Census}}.
\end{split}
\end{equation*}\] The difference bar charts show that
Black/African Americans are significantly overrepresented in all states,
and white Americans are underrepresented in most states, in the M&T
forensic database compared to census representation.
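In code, with per-state proportion vectors (hypothetical placeholder values shown):

# prop_codis / prop_census: hypothetical proportions for illustration only
prop_codis  <- c(0.45, 0.30)
prop_census <- c(0.60, 0.13)
abs_diff <- prop_codis - prop_census                  # CODIS minus Census
rel_diff <- (prop_codis - prop_census) / prop_census  # relative to Census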
Based on the asymptotic theory for maximum likelihood estimation, as
the sample size increases, \[\sqrt{n}(\hat{\beta}-\beta) \to N(0, ~
I^{-1}(\beta)),\quad \text{as } n\to \infty,\] where \(I(\beta)\) is the Fisher information. Thus
the log odds approximately follows a normal distribution, \[\log\Big(\frac{\hat{p}}{1-\hat{p}}\Big) \sim N\Big(x^\top
\beta, ~ \frac{1}{n}x^\top I^{-1}(\beta)x\Big),\] since the total
population of the 7 states is very large. This normal approximation is
used in the hypothesis tests and confidence interval construction
below.
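In R, the estimated log odds \(\hat{z}=x^\top\hat{\beta}\) and its standard error come directly from predict() on the link scale; fit_white and new_data here are carried over from the earlier sketches:

# Linear predictor (log odds) and its standard error on the link scale
lp    <- predict(fit_white, newdata = new_data, type = "link", se.fit = TRUE)
z_hat <- lp$fit      # estimated log odds x' beta_hat
se_z  <- lp$se.fit   # estimated standard error se(z_hat)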
For each state, we consider the hypothesis testing problem for the
difference in white proportions between the Census and CODIS, \[H_0:p_{CODIS,White}=p_{Census,White}
\leftrightarrow H_1:p_{CODIS,White}>p_{Census,White}.\] Using
the logit link, we work on the log odds instead of the
probability. The normal approximation yields a one-sided
test statistic, and the \(p\text{-values}\) are below \(10^{-15}\) for all states.
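A one-sided z-test on the log-odds scale can be sketched as follows, with p_census a hypothetical census white proportion for the state and z_hat, se_z as above:

p_census <- 0.65                               # hypothetical census proportion
# H0: p_CODIS = p_Census  vs  H1: p_CODIS > p_Census, on the log-odds scale
z_stat  <- (z_hat - qlogis(p_census)) / se_z   # qlogis() is the logit function
p_value <- pnorm(z_stat, lower.tail = FALSE)   # one-sided upper-tail p-value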
Finally, we calculate the \(1-\alpha\) confidence intervals for the
fitted probability from the binomial regression. The normal approximation
for the log odds \(z=x^\top \beta\)
yields the confidence interval \(ConfInt=[g(\hat{z}-c_{1-\alpha/2}\,se(\hat{z})),~g(\hat{z}+c_{1-\alpha/2}\,se(\hat{z}))]\)
for the fitted white proportion, where \(c_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of the standard
normal distribution. This in turn gives a confidence interval for the
difference, namely \(ConfInt-p_{Census}\). For black
Americans, we use a Bonferroni method to construct the confidence
interval. Because the population of each state with available data
is very large, the estimated standard errors \(se(\hat{z})\) are very small, so the
interval widths are almost zero relative to the point
estimates. This also explains why the \(p\)-values in the hypothesis tests above
are nearly zero.
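The interval on the probability scale, and the implied interval for the CODIS-minus-Census difference, can be sketched as follows (alpha, z_hat, se_z and p_census as above):

alpha   <- 0.05
crit    <- qnorm(1 - alpha / 2)                    # c_{1 - alpha/2}
ci_p    <- plogis(z_hat + c(-1, 1) * crit * se_z)  # g(z_hat -/+ crit * se) in [0, 1]
ci_diff <- ci_p - p_census                         # interval for the difference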