Last updated: 2025-06-19

Checks: 6 1

Knit directory: PODFRIDGE/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20230302)

The command set.seed(20230302) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 7604e1d

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 7604e1d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    data/sims/.DS_Store
    Ignored:    output/.DS_Store
    Ignored:    output/simulation_20240726-155743/.DS_Store
    Ignored:    output/simulation_20240726-162034_11228488/.DS_Store
    Ignored:    output/simulation_20240726-163235_11228791/.DS_Store

Unstaged changes:
    Modified:   analysis/fertility_assumptions.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/fertility_assumptions.Rmd) and HTML (docs/fertility_assumptions.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	b9c32a9	Tina Lasisi	2025-06-18	Created fertility assumptions supplementary doc
html	b9c32a9	Tina Lasisi	2025-06-18	Created fertility assumptions supplementary doc

U.S. Census Data on Population Differences in Fertility

Here we extract the mean fertility (μ), dispersion/size (θ), and zero‑inflation probability (π₀) for U.S. women aged 50‑59 in the 1990 Census. Women aged 50–59 in the 1990 U.S. Census were born between 1931 and 1940. Assuming a typical childbearing age range of 25–45, their children would have been born between approximately 1956 and 1985. Therefore, the fertility distributions and relative counts estimated from this cohort primarily reflect the family-building behaviors and social contexts of the mid-1950s through the mid-1980s. We filter to Black/African American and White American groups as these are of interest to our paper.

Libraries

library(dplyr)
library(tidyr)
library(readr)
library(MASS)     ## glm.nb
library(pscl)     ## zeroinfl
library(knitr)    ## kable

##knitr::opts_knit$set(root.dir = ".")
knitr::opts_chunk$set(eval = TRUE, echo = TRUE, warning = FALSE, fig.width = 9, fig.height = 7)

Load the pre‑filtered IPUMS subset

path         <- file.path(".", "data")
prop_race_year <- file.path(path, "proportions_table_by_race_year.csv")
data_filter    <- file.path(path, "data_filtered_recoded.csv")

## original script loads both; we only need mother_data for this step
mother_data <- read_csv(data_filter, show_col_types = FALSE)

mother_data already contains:

YEAR – Census year (1960,1970,1980,1990)
AGE – respondent age
RACE – recoded as “White” or “Black/African American”
chborn_num – completed fertility (children ever born)

Select the 1990, age 50‑59 cohort

cohort <- mother_data %>%
  filter(YEAR == 1990,
         AGE  >= 50, AGE <= 59,
         RACE %in% c("White", "Black/African American")) %>%
  mutate(RACE = droplevels(factor(RACE)))

cohort %>% count(RACE, name = "n_women")

# A tibble: 2 × 2
  RACE                   n_women
  <fct>                    <int>
1 Black/African American   10945
2 White                    96079

Fit Poisson, NB, ZIP, and ZINB per race

fit_four <- function(dat) {
  ## Poisson and NB use standard glm / glm.nb
  pois <- glm(chborn_num ~ 1, family = poisson, data = dat)
  nb   <- glm.nb(chborn_num ~ 1, data = dat)
  
  ## Zero‑inflated models (pscl::zeroinfl)
  zip  <- zeroinfl(chborn_num ~ 1 | 1, data = dat, dist = "poisson")
  zinb <- zeroinfl(chborn_num ~ 1 | 1, data = dat, dist = "negbin")
  
  tibble(
    Model = c("Poisson", "NB", "ZIP", "ZINB"),
    mu    = c(exp(coef(pois)[1]),                    ## log‑link → exp() gives mean
              exp(coef(nb)[1]),
              exp(zip$coefficients$count[1]),        ## intercept of count part
              exp(zinb$coefficients$count[1])),
    size  = c(Inf,                                   ## dispersion θ; Inf = Poisson
              nb$theta,
              Inf,
              zinb$theta),
    pi0   = c(0,                                      ## zero‑inflation prob.
              0,
              plogis(zip$coefficients$zero[1]),
              plogis(zinb$coefficients$zero[1])),
    AIC   = c(AIC(pois), AIC(nb), AIC(zip), AIC(zinb))
  )
}

param_tbl <- cohort %>%
  group_by(RACE) %>%
  group_modify(~ fit_four(.x)) %>%
  ungroup()

Results

knitr::kable(param_tbl, digits = 3,
             caption = "Fertility‑model parameter estimates for women aged50‑59 (1990 Census)")

Fertility‑model parameter estimates for women aged50‑59 (1990 Census)
RACE	Model	mu	size	pi0	AIC
Black/African American	Poisson	3.724	Inf	0.000	55107.97
Black/African American	NB	3.724	3.007	0.000	51184.00
Black/African American	ZIP	4.121	Inf	0.096	52841.82
Black/African American	ZINB	3.932	4.289	0.053	51024.46
White	Poisson	2.895	Inf	0.000	381107.37
White	NB	2.895	18.402	0.000	380052.99
White	ZIP	3.079	Inf	0.060	376840.14
White	ZINB	3.079	257628.938	0.060	376842.17

Table. Fertility-model parameter estimates for women aged 50–59 in the 1990 U.S. Census. Columns are: Model (distribution fitted: Poisson, Negative Binomial [NB], Zero-Inflated Poisson [ZIP], or Zero-Inflated Negative Binomial [ZINB]); μ (mean completed fertility, i.e., average number of children per woman); size (dispersion parameter, only meaningful for NB/ZINB models—set to Inf where not applicable); π₀ (probability of “excess” zeroes, i.e., extra childless women above model expectation, relevant only for ZIP/ZINB models); and AIC (Akaike Information Criterion for model comparison; lower values indicate better fit).

write_csv(param_tbl, "data/fertility_params_1990_50-59.csv")

For women aged 50–59 in the 1990 Census, mean completed fertility (μ) was higher in Black/African American (μ ≈ 3.9, best-fit ZINB) than in White (μ ≈ 3.1, best-fit ZIP) populations. The best-fitting model for Black women was the Zero-Inflated Negative Binomial (ZINB), reflecting both significant overdispersion (θ ≈ 4.3) and a moderate excess of childless women (π₀ ≈ 0.05). For White women, the best fit was the Zero-Inflated Poisson (ZIP), with low overdispersion (θ ≫ 1) and similar zero-inflation (π₀ ≈ 0.06); however, the ZINB model provided a nearly identical fit (ΔAIC < 2), so we use ZINB parameters for both populations in all downstream simulations for consistency and comparability. This approach ensures that our estimates reflect both mean differences and any potential overdispersion or zero inflation in the data for both groups.

Effect of distribution assumptions on relative counts

Here we evaluate how different fertility distribution assumptions (Poisson, Negative Binomial [NB], Zero-Inflated Poisson [ZIP], and Zero-Inflated Negative Binomial [ZINB]) influence estimates of genetically detectable relatives. Using model parameters (param_tbl) previously estimated for each race, we simulate family structures and compare the expected number of siblings, aunts/uncles, and cousins. Grandparents and parents are treated as constants. We present both variable relative counts and totals that include these constants, and provide formal statistical comparisons of distributions. In addition, we compare results to a scenario where the mean number of children is fixed at 2.5 for all distributions.

Calculating the overall mean fertility

For each distribution, we calculate the overall (population-level) mean number of children per woman. For zero-inflated models (ZIP and ZINB), this is calculated as (1 – π₀) × μ, while for Poisson and NB models it is simply μ. This table summarizes the fitted values for each race and model:

param_tbl_total <- param_tbl %>%
  mutate(overall_mean = (1 - pi0) * mu)

knitr::kable(param_tbl_total %>%
  dplyr::select(RACE, Model, mu, pi0, overall_mean), digits = 3,
  caption = "Overall mean fertility by model and race")

Overall mean fertility by model and race
RACE	Model	mu	pi0	overall_mean
Black/African American	Poisson	3.724	0.000	3.724
Black/African American	NB	3.724	0.000	3.724
Black/African American	ZIP	4.121	0.096	3.724
Black/African American	ZINB	3.932	0.053	3.724
White	Poisson	2.895	0.000	2.895
White	NB	2.895	0.000	2.895
White	ZIP	3.079	0.060	2.895
White	ZINB	3.079	0.060	2.895

Simulation function

To estimate how family structure varies across models, we simulate pedigrees using each set of parameters. The function below draws the number of children for grandparents, parents, and aunts/uncles using the given distribution and parameters, then returns the resulting counts of siblings, aunts/uncles, and cousins for 10,000 simulated individuals per model/race.

simulate_relatives <- function(Model, mu, size, pi0,
                               n_sim = 10000,
                               max_kids = 12) {
  draw_kids <- function(n) {
    if      (Model == "Poisson") pmin(rpois(n, mu), max_kids)
    else if (Model == "NB")      pmin(rnbinom(n, size, mu = mu), max_kids)
    else if (Model == "ZIP")     ifelse(runif(n) < pi0, 0L,
                                        pmin(rpois(n, mu), max_kids))
    else if (Model == "ZINB")    ifelse(runif(n) < pi0, 0L,
                                        pmin(rnbinom(n, size, mu = mu), max_kids))
    else stop("Unknown model type")
  }

  grandparent_children <- matrix(draw_kids(4 * n_sim), nrow = n_sim, ncol = 4)
  aunts_uncles <- rowSums(pmax(grandparent_children - 1, 0))

  parent_children <- matrix(draw_kids(2 * n_sim), nrow = n_sim, ncol = 2)
  siblings <- rowSums(pmax(parent_children - 1, 0))

  # FIXED: use vapply, always returns integer vector
  cousins <- vapply(aunts_uncles, function(n)
    if (n == 0) 0L else as.integer(sum(draw_kids(n))),
    integer(1)
  )

  tibble::tibble(
    Siblings    = siblings,
    AuntsUncles = aunts_uncles,
    Cousins     = cousins
  )
}

Scenario 1: Using empirically estimated model means

We first simulate relatives using the best-fit parameters for each model and race, which reflect the true underlying distribution in the data.

library(purrr)
n_sim <- 10000

relative_counts_emp <- param_tbl %>%
  mutate(Scenario = "Empirical mean",
         sim = pmap(list(Model, mu, size, pi0),
                    ~ simulate_relatives(..1, ..2, ..3, ..4, n_sim = n_sim))) %>%
  tidyr::unnest(sim)

Scenario 2: Using a fixed mean of 2.5 for all models

To isolate the impact of model shape from differences in mean fertility, we also simulate each model with a fixed mean of 2.5 children. For ZIP and ZINB, μ is rescaled to ensure (1 – π₀) × μ = 2.5.

param_tbl_fixed <- param_tbl %>%
  mutate(mu = ifelse(pi0 == 0, 2.5, 2.5 / (1 - pi0)))

relative_counts_fix <- param_tbl_fixed %>%
  mutate(Scenario = "Fixed mean 2.5",
         sim = pmap(list(Model, mu, size, pi0),
                    ~ simulate_relatives(..1, ..2, ..3, ..4, n_sim = n_sim))) %>%
  tidyr::unnest(sim)

Combine both scenarios and summarize

We then combine all simulated results and summarize the mean counts for each combination of scenario, model, and race.

relative_counts_both <- bind_rows(relative_counts_emp, relative_counts_fix)
summary_variable <- relative_counts_both %>%
  group_by(Scenario, RACE, Model) %>%
  summarise(across(c(Siblings, AuntsUncles, Cousins), mean), .groups = "drop")

knitr::kable(summary_variable, digits = 2,
             caption = "Expected counts of variable relatives by scenario, model, and race")

Expected counts of variable relatives by scenario, model, and race
Scenario	RACE	Model	Siblings	AuntsUncles	Cousins
Empirical mean	Black/African American	NB	5.52	11.03	40.62
Empirical mean	Black/African American	Poisson	5.50	10.93	40.70
Empirical mean	Black/African American	ZINB	5.62	11.18	41.30
Empirical mean	Black/African American	ZIP	5.69	11.30	42.01
Empirical mean	White	NB	3.88	7.85	22.70
Empirical mean	White	Poisson	3.89	7.76	22.45
Empirical mean	White	ZINB	4.01	8.01	23.22
Empirical mean	White	ZIP	3.99	8.00	23.26
Fixed mean 2.5	Black/African American	NB	3.33	6.68	16.69
Fixed mean 2.5	Black/African American	Poisson	3.15	6.40	15.98
Fixed mean 2.5	Black/African American	ZINB	3.32	6.68	16.68
Fixed mean 2.5	Black/African American	ZIP	3.30	6.58	16.50
Fixed mean 2.5	White	NB	3.20	6.42	16.00
Fixed mean 2.5	White	Poisson	3.18	6.30	15.71
Fixed mean 2.5	White	ZINB	3.23	6.49	16.30
Fixed mean 2.5	White	ZIP	3.26	6.47	16.18

Expected counts including constant relatives

The table below includes the two parents and four grandparents per individual, added to the mean expected number of variable relatives.

summary_total <- summary_variable %>%
  mutate(
    Parents = 2,
    Grandparents = 4,
    Total_1st_2nd = Parents + Grandparents + Siblings + AuntsUncles + Cousins
  ) %>%
  dplyr::select(Scenario, RACE, Model, Parents, Grandparents, Siblings, AuntsUncles, Cousins, Total_1st_2nd)

knitr::kable(summary_total, digits = 2,
             caption = "Expected total counts including constant parents and grandparents")

Expected total counts including constant parents and grandparents
Scenario	RACE	Model	Parents	Grandparents	Siblings	AuntsUncles	Cousins	Total_1st_2nd
Empirical mean	Black/African American	NB	2	4	5.52	11.03	40.62	63.17
Empirical mean	Black/African American	Poisson	2	4	5.50	10.93	40.70	63.14
Empirical mean	Black/African American	ZINB	2	4	5.62	11.18	41.30	64.09
Empirical mean	Black/African American	ZIP	2	4	5.69	11.30	42.01	64.99
Empirical mean	White	NB	2	4	3.88	7.85	22.70	40.44
Empirical mean	White	Poisson	2	4	3.89	7.76	22.45	40.11
Empirical mean	White	ZINB	2	4	4.01	8.01	23.22	41.24
Empirical mean	White	ZIP	2	4	3.99	8.00	23.26	41.25
Fixed mean 2.5	Black/African American	NB	2	4	3.33	6.68	16.69	32.70
Fixed mean 2.5	Black/African American	Poisson	2	4	3.15	6.40	15.98	31.53
Fixed mean 2.5	Black/African American	ZINB	2	4	3.32	6.68	16.68	32.68
Fixed mean 2.5	Black/African American	ZIP	2	4	3.30	6.58	16.50	32.38
Fixed mean 2.5	White	NB	2	4	3.20	6.42	16.00	31.63
Fixed mean 2.5	White	Poisson	2	4	3.18	6.30	15.71	31.20
Fixed mean 2.5	White	ZINB	2	4	3.23	6.49	16.30	32.03
Fixed mean 2.5	White	ZIP	2	4	3.26	6.47	16.18	31.91

Visualization: Comparing models and mean specification

We visualize the full distribution of simulated relatives across models, mean scenarios, and races. Each box represents the distribution of simulated counts; means are shown as white dots.

library(ggplot2)

rel_long_all <- relative_counts_both %>%
  tidyr::pivot_longer(c(Siblings, AuntsUncles, Cousins),
                      names_to = "Relative",
                      values_to = "Count") %>%
  mutate(
    Model      = factor(Model, levels = c("Poisson", "ZIP", "NB", "ZINB")),
    Relative   = factor(Relative, levels = c("Siblings","AuntsUncles","Cousins")),
    Scenario   = factor(Scenario, levels = c("Empirical mean","Fixed mean 2.5")),
    RACE       = factor(RACE,    levels = c("Black/African American","White"))
  )

ggplot(rel_long_all,
       aes(x = Model, y = Count, fill = Relative)) +
  geom_boxplot(position = position_dodge(width = 0.8),
               width = 0.7, alpha = .8, outlier.size = .3) +
  stat_summary(fun = mean, geom = "point", shape = 21, size = 1.4,
               aes(group = Relative),
               position = position_dodge(width = 0.8),
               colour = "black", fill = "white") +
  facet_grid(Scenario ~ RACE) +
  scale_fill_viridis_d(option = "D") +
  labs(title   = "Distribution of simulated relatives by model, mean assumption and race",
       x       = "Fertility model",
       y       = "Relative count",
       fill    = "Relative type") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "top",
        axis.title.x    = element_text(vjust = -0.2),
        strip.text      = element_text(size = 10))

Version	Author	Date
b9c32a9	Tina Lasisi	2025-06-18

Statistical modeling: Effects of model, scenario, and their interaction

We formally test the contribution of model choice, mean specification, and race on the simulated number of relatives, using three-way ANOVA for each relative type. We also generate a compact-letter display (CLD) to indicate which groups are or are not statistically different after correction for multiple comparisons. In the summary plot, each box is labeled with its CLD group letter: boxes sharing a letter are not significantly different at the 0.05 level (Bonferroni-adjusted).

library(broom)
library(knitr)

relatives <- c("Siblings", "AuntsUncles", "Cousins")

for (rel in relatives) {
  cat("\n\n### Model for", rel, "\n\n")
  
  analysis_data <- rel_long_all %>%
    mutate(
      RACE = factor(RACE, levels = c("White", "Black/African American")),
      Model = factor(Model, levels = c("Poisson", "ZIP", "NB", "ZINB")),
      Scenario = factor(Scenario, levels = c("Fixed mean 2.5", "Empirical mean")),
      Relative = factor(Relative, levels = c("Siblings", "AuntsUncles", "Cousins"))
    ) %>%
    filter(Relative == rel)
  
  lm_fit <- lm(Count ~ RACE * Model * Scenario, data = analysis_data)
  lm_tidy <- broom::tidy(lm_fit, conf.int = TRUE)
  
  print(knitr::kable(
    lm_tidy,
    digits = 3,
    caption = paste("Regression coefficients for", rel, 
                    "(reference: White, Poisson, Fixed mean 2.5)")
  ))
}

Model for Siblings

Regression coefficients for Siblings (reference: White, Poisson, Fixed mean 2.5)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	3.184	0.026	121.692	0.000	3.133	3.235
RACEBlack/African American	-0.039	0.037	-1.049	0.294	-0.111	0.034
ModelZIP	0.079	0.037	2.130	0.033	0.006	0.151
ModelNB	0.017	0.037	0.468	0.640	-0.055	0.090
ModelZINB	0.043	0.037	1.176	0.240	-0.029	0.116
ScenarioEmpirical mean	0.706	0.037	19.078	0.000	0.633	0.778
RACEBlack/African American:ModelZIP	0.075	0.052	1.426	0.154	-0.028	0.177
RACEBlack/African American:ModelNB	0.168	0.052	3.211	0.001	0.065	0.271
RACEBlack/African American:ModelZINB	0.127	0.052	2.437	0.015	0.025	0.230
RACEBlack/African American:ScenarioEmpirical mean	1.653	0.052	31.586	0.000	1.550	1.755
ModelZIP:ScenarioEmpirical mean	0.020	0.052	0.382	0.702	-0.083	0.123
ModelNB:ScenarioEmpirical mean	-0.023	0.052	-0.436	0.663	-0.125	0.080
ModelZINB:ScenarioEmpirical mean	0.076	0.052	1.460	0.144	-0.026	0.179
RACEBlack/African American:ModelZIP:ScenarioEmpirical mean	0.008	0.074	0.115	0.909	-0.137	0.154
RACEBlack/African American:ModelNB:ScenarioEmpirical mean	-0.145	0.074	-1.962	0.050	-0.290	0.000
RACEBlack/African American:ModelZINB:ScenarioEmpirical mean	-0.134	0.074	-1.812	0.070	-0.279	0.011

Model for AuntsUncles

Regression coefficients for AuntsUncles (reference: White, Poisson, Fixed mean 2.5)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	6.301	0.037	170.397	0.000	6.229	6.374
RACEBlack/African American	0.097	0.052	1.860	0.063	-0.005	0.200
ModelZIP	0.165	0.052	3.147	0.002	0.062	0.267
ModelNB	0.122	0.052	2.331	0.020	0.019	0.224
ModelZINB	0.192	0.052	3.681	0.000	0.090	0.295
ScenarioEmpirical mean	1.464	0.052	27.984	0.000	1.361	1.566
RACEBlack/African American:ModelZIP	0.019	0.074	0.254	0.799	-0.126	0.164
RACEBlack/African American:ModelNB	0.164	0.074	2.221	0.026	0.019	0.309
RACEBlack/African American:ModelZINB	0.094	0.074	1.268	0.205	-0.051	0.239
RACEBlack/African American:ScenarioEmpirical mean	3.071	0.074	41.523	0.000	2.926	3.216
ModelZIP:ScenarioEmpirical mean	0.075	0.074	1.009	0.313	-0.070	0.220
ModelNB:ScenarioEmpirical mean	-0.035	0.074	-0.473	0.636	-0.180	0.110
ModelZINB:ScenarioEmpirical mean	0.054	0.074	0.734	0.463	-0.091	0.199
RACEBlack/African American:ModelZIP:ScenarioEmpirical mean	0.107	0.105	1.022	0.307	-0.098	0.312
RACEBlack/African American:ModelNB:ScenarioEmpirical mean	-0.159	0.105	-1.516	0.129	-0.364	0.046
RACEBlack/African American:ModelZINB:ScenarioEmpirical mean	-0.099	0.105	-0.942	0.346	-0.304	0.107

Model for Cousins

Regression coefficients for Cousins (reference: White, Poisson, Fixed mean 2.5)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	15.713	0.128	122.316	0.000	15.461	15.965
RACEBlack/African American	0.270	0.182	1.487	0.137	-0.086	0.626
ModelZIP	0.467	0.182	2.570	0.010	0.111	0.823
ModelNB	0.291	0.182	1.601	0.109	-0.065	0.647
ModelZINB	0.591	0.182	3.256	0.001	0.235	0.948
ScenarioEmpirical mean	6.739	0.182	37.092	0.000	6.383	7.095
RACEBlack/African American:ModelZIP	0.048	0.257	0.185	0.853	-0.456	0.551
RACEBlack/African American:ModelNB	0.412	0.257	1.606	0.108	-0.091	0.916
RACEBlack/African American:ModelZINB	0.107	0.257	0.415	0.678	-0.397	0.610
RACEBlack/African American:ScenarioEmpirical mean	17.978	0.257	69.974	0.000	17.474	18.481
ModelZIP:ScenarioEmpirical mean	0.337	0.257	1.312	0.189	-0.166	0.841
ModelNB:ScenarioEmpirical mean	-0.037	0.257	-0.145	0.885	-0.541	0.466
ModelZINB:ScenarioEmpirical mean	0.174	0.257	0.677	0.498	-0.330	0.677
RACEBlack/African American:ModelZIP:ScenarioEmpirical mean	0.456	0.363	1.256	0.209	-0.256	1.169
RACEBlack/African American:ModelNB:ScenarioEmpirical mean	-0.746	0.363	-2.054	0.040	-1.459	-0.034
RACEBlack/African American:ModelZINB:ScenarioEmpirical mean	-0.275	0.363	-0.756	0.450	-0.987	0.438

library(dplyr)
library(broom)
library(tidyr)
library(purrr)

relatives <- c("Siblings", "AuntsUncles", "Cousins")
models <- c("Poisson", "ZIP", "NB", "ZINB")

anova_per_model <- purrr::cross_df(list(Relative = relatives, Model = models)) %>%
  rowwise() %>%
  mutate(
    anova = list({
      data_fixed <- rel_long_all %>%
        filter(Relative == Relative, Scenario == "Fixed mean 2.5", Model == Model) %>%
        mutate(RACE = factor(RACE, levels = c("White", "Black/African American")))
      if(nrow(data_fixed) < 2) return(tibble(Df = NA, F_value = NA, p_value = NA))
      lm0 <- lm(Count ~ 1, data = data_fixed)
      lm1 <- lm(Count ~ RACE, data = data_fixed)
      atab <- as.data.frame(anova(lm0, lm1))
      tibble(Df = atab$Df[2], F_value = atab$`F value`[2], p_value = atab$`Pr(>F)`[2])
    })
  ) %>%
  unnest(anova)

knitr::kable(anova_per_model, digits = 4, caption = "Test of Race effect within each model (Fixed mean = 2.5 only)")

Test of Race effect within each model (Fixed mean = 2.5 only)
Relative	Model	Df
Siblings	Poisson	1
AuntsUncles	Poisson	1
Cousins	Poisson	1
Siblings	ZIP	1
AuntsUncles	ZIP	1
Cousins	ZIP	1
Siblings	NB	1
AuntsUncles	NB	1
Cousins	NB	1
Siblings	ZINB	1
AuntsUncles	ZINB	1
Cousins	ZINB	1

Summary

When mean fertility is held constant at 2.5 children for both populations, the expected number of siblings, aunts/uncles, and cousins is virtually identical for Black and White groups, regardless of model choice (Poisson, NB, ZIP, or ZINB). Although our fitted NB/ZINB/ZIP models allow for group-specific overdispersion and zero-inflation, these parameters had negligible practical effect on expected kin counts for the distributions observed in US Census data. We conclude that, in practice, mean fertility is the primary driver of group differences in simulated kinship structure for recent cohorts.

Individual-Level Sibling Distribution and Kinship Risk

We now assess close-kin counts (siblings, aunts/uncles, cousins) directly from the perspective of the focal individual, fitting and simulating the distribution of siblings per person, and comparing the effect of distributional assumptions (Poisson, NB, ZIP, ZINB) and race, as before.

Fit models to individual-level sibling counts

# Prepare individual-level sibling count
sibs_long <- cohort %>%
  filter(chborn_num > 0) %>%
  tidyr::uncount(chborn_num, .remove = FALSE) %>%
  mutate(n_sib = chborn_num - 1)

fit_sib_models <- function(dat) {
  pois <- glm(n_sib ~ 1, family = poisson, data = dat)
  nb   <- glm.nb(n_sib ~ 1, data = dat)
  zip  <- zeroinfl(n_sib ~ 1 | 1, data = dat, dist = "poisson")
  zinb <- zeroinfl(n_sib ~ 1 | 1, data = dat, dist = "negbin")
  tibble(
    Model = c("Poisson", "NB", "ZIP", "ZINB"),
    mu    = c(exp(coef(pois)[1]),
              exp(coef(nb)[1]),
              exp(zip$coefficients$count[1]),
              exp(zinb$coefficients$count[1])),
    size  = c(Inf, nb$theta, Inf, zinb$theta),
    pi0   = c(0, 0, plogis(zip$coefficients$zero[1]), plogis(zinb$coefficients$zero[1])),
    AIC   = c(AIC(pois), AIC(nb), AIC(zip), AIC(zinb))
  )
}
sib_param_tbl <- sibs_long %>%
  group_by(RACE) %>%
  group_modify(~ fit_sib_models(.x)) %>%
  ungroup()
knitr::kable(sib_param_tbl, digits = 3, caption = "Model fits for sibling count distribution by race")

Model fits for sibling count distribution by race
RACE	Model	mu	size	pi0	AIC
Black/African American	Poisson	4.803	Inf	0.000	203758.5
Black/African American	NB	4.803	6.474000e+00	0.000	197096.3
Black/African American	ZIP	4.930	Inf	0.026	201893.1
Black/African American	ZINB	4.930	2.100902e+09	0.026	201895.1
White	Poisson	3.053	Inf	0.000	1104146.8
White	NB	3.053	1.837100e+01	0.000	1100190.5
White	ZIP	3.053	Inf	0.000	1104149.3
White	ZINB	3.053	6.323171e+10	0.000	1104150.8

Table. For individual-level sibling counts, the negative binomial (NB) model provides the best fit for both populations, with substantially lower AIC than the Poisson, ZIP, or ZINB models. This indicates that sibship size is more variable than would be expected under a simple Poisson process, consistent with substantial overdispersion in family sizes, but there is little evidence for zero-inflation (excess of only children).

Because individuals from large families are more common in an individual-level sample, the average number of siblings per person is higher than the mean family size minus one (the “sibship size paradox”). This effect is important to account for in forensic or genetic risk models that simulate from the perspective of a random individual.

Simulate kin counts for the focal individual

We estimate the expected number and full probability distributions of close genetic relatives (siblings, aunts/uncles, and cousins) for a focal individual. This is done across all fitted models (Poisson, NB, ZIP, ZINB), for both Black/African American and White groups, using both empirical mean and fixed mean scenarios. We then compare these focal-individual predictions to a hybrid “best-fit” model and a Poisson-fertility baseline.

simulate_focal_kin <- function(
  sib_Model, sib_mu, sib_size, sib_pi0,      # Sibling distribution params
  fert_Model, fert_mu, fert_size, fert_pi0,  # Fertility distribution params
  n_sim = 10000, max_sib = 12, max_kids = 12
) {
  # Helper: draw siblings under a distribution (always return integer)
  draw_sibs <- function(nn) {
    out <- if      (sib_Model == "Poisson") rpois(nn, sib_mu)
           else if (sib_Model == "NB")      rnbinom(nn, sib_size, mu = sib_mu)
           else if (sib_Model == "ZIP")     ifelse(runif(nn) < sib_pi0, 0L, rpois(nn, sib_mu))
           else if (sib_Model == "ZINB")    ifelse(runif(nn) < sib_pi0, 0L, rnbinom(nn, sib_size, mu = sib_mu))
           else stop("Unknown sib_model")
    as.integer(pmin(out, max_sib))
  }
  draw_fert <- function(nn) {
    out <- if      (fert_Model == "Poisson") rpois(nn, fert_mu)
           else if (fert_Model == "NB")      rnbinom(nn, fert_size, mu = fert_mu)
           else if (fert_Model == "ZIP")     ifelse(runif(nn) < fert_pi0, 0L, rpois(nn, fert_mu))
           else if (fert_Model == "ZINB")    ifelse(runif(nn) < fert_pi0, 0L, rnbinom(nn, fert_size, mu = fert_mu))
           else stop("Unknown fert_model")
    as.integer(pmin(out, max_kids))
  }

  # 1. Siblings for the focal individual
  siblings <- draw_sibs(n_sim)

  # 2. Aunts & uncles (no subtraction needed)
  aunts_maternal <- draw_sibs(n_sim)   # mother’s siblings
  aunts_paternal <- draw_sibs(n_sim)   # father’s siblings
  aunts_uncles   <- aunts_maternal + aunts_paternal


  # 3. Cousins: for each aunt/uncle, simulate their fertility and sum
  cousins <- purrr::map_int(aunts_uncles, function(n_au) {
    if (n_au == 0L) 0L else as.integer(sum(draw_fert(n_au)))
  })

  tibble::tibble(
    Siblings    = siblings,
    AuntsUncles = aunts_uncles,
    Cousins     = cousins,
    Parents     = 2L,
    Grandparents= 4L,
    TotalKin = siblings + aunts_uncles + cousins + 2 + 4
  )
}

Prepare Parameter Tables

# Sibling (focal individual) parameter tables: empirical and fixed mean scenarios
sib_param_tbl_exp <- sib_param_tbl %>%
  rename(sib_Model = Model, sib_mu = mu, sib_size = size, sib_pi0 = pi0, sib_AIC = AIC)

sib_param_tbl_fixed <- sib_param_tbl %>%
  mutate(sib_mu = ifelse(pi0 == 0, 1.5, 1.5 / (1 - pi0))) %>%
  rename(sib_Model = Model, sib_size = size, sib_pi0 = pi0, sib_AIC = AIC)

# Fertility (collateral relatives) parameter tables: empirical and fixed mean scenarios
fert_param_tbl_exp <- param_tbl %>%
  rename(fert_Model = Model, fert_mu = mu, fert_size = size, fert_pi0 = pi0, fert_AIC = AIC)

fert_param_tbl_fixed <- param_tbl %>%
  mutate(fert_mu = ifelse(pi0 == 0, 2.5, 2.5 / (1 - pi0))) %>%
  rename(fert_Model = Model, fert_size = size, fert_pi0 = pi0, fert_AIC = AIC)

Create Parameter Pairings and Run Simulations

# Parameter pairings for each scenario (all RACE x Model combinations)
param_pairs_emp <- expand_grid(
  RACE = unique(sib_param_tbl_exp$RACE),
  Model = unique(sib_param_tbl_exp$sib_Model)
) %>%
  left_join(sib_param_tbl_exp, by = c("RACE", "Model" = "sib_Model")) %>%
  left_join(fert_param_tbl_exp, by = c("RACE", "Model" = "fert_Model"))

param_pairs_fixed <- expand_grid(
  RACE = unique(sib_param_tbl_fixed$RACE),
  Model = unique(sib_param_tbl_fixed$sib_Model)
) %>%
  left_join(sib_param_tbl_fixed, by = c("RACE", "Model" = "sib_Model")) %>%
  left_join(fert_param_tbl_fixed, by = c("RACE", "Model" = "fert_Model"))

# Simulate kin counts for all parameter combinations (empirical and fixed mean)
focal_kin_sim_results_emp <- param_pairs_emp %>%
  mutate(
    sim = pmap(
      list(Model, sib_mu, sib_size, sib_pi0, fert_mu, fert_size, fert_pi0),
      ~ simulate_focal_kin(
        sib_Model = ..1, sib_mu = ..2, sib_size = ..3, sib_pi0 = ..4,
        fert_Model = ..1, fert_mu = ..5, fert_size = ..6, fert_pi0 = ..7,
        n_sim = 10000
      )
    )
  )

focal_kin_sim_results_fixed <- param_pairs_fixed %>%
  mutate(
    sim = pmap(
      list(Model, sib_mu, sib_size, sib_pi0, fert_mu, fert_size, fert_pi0),
      ~ simulate_focal_kin(
        sib_Model = ..1, sib_mu = ..2, sib_size = ..3, sib_pi0 = ..4,
        fert_Model = ..1, fert_mu = ..5, fert_size = ..6, fert_pi0 = ..7,
        n_sim = 10000
      )
    )
  )

Tidy Results and Label Scenarios

# Unnest and label scenario
focal_kin_counts_emp   <- focal_kin_sim_results_emp   %>% mutate(Scenario = "Empirical mean") %>% unnest(sim)
focal_kin_counts_fixed <- focal_kin_sim_results_fixed %>% mutate(Scenario = "Fixed mean")    %>% unnest(sim)

# Combine for downstream comparison
focal_kin_counts_all <- bind_rows(focal_kin_counts_emp, focal_kin_counts_fixed)

Summarize Mean Kin Counts

summary_focal_kin_all <- focal_kin_counts_all %>%
  group_by(Scenario, RACE, Model) %>%
  summarise(
    mean_sibs         = mean(Siblings),
    mean_aunts_uncles = mean(AuntsUncles),
    mean_cousins      = mean(Cousins),
    mean_total        = mean(TotalKin),
    .groups = "drop"
  )

knitr::kable(summary_focal_kin_all, digits = 2,
             caption = "Expected number of first-, second-, and third-degree relatives for a focal individual (simulated, by model, scenario, and race)")

Expected number of first-, second-, and third-degree relatives for a focal individual (simulated, by model, scenario, and race)
Scenario	RACE	Model	mean_sibs	mean_aunts_uncles	mean_cousins	mean_total
Empirical mean	Black/African American	NB	4.74	9.62	35.68	56.03
Empirical mean	Black/African American	Poisson	4.77	9.59	35.83	56.20
Empirical mean	Black/African American	ZINB	4.80	9.65	35.66	56.12
Empirical mean	Black/African American	ZIP	4.78	9.59	35.65	56.02
Empirical mean	White	NB	3.08	6.11	17.67	32.86
Empirical mean	White	Poisson	3.09	6.15	17.81	33.04
Empirical mean	White	ZINB	3.07	6.14	17.85	33.06
Empirical mean	White	ZIP	3.04	6.08	17.58	32.70
Fixed mean	Black/African American	NB	1.51	2.97	7.41	17.89
Fixed mean	Black/African American	Poisson	1.51	2.99	7.42	17.92
Fixed mean	Black/African American	ZINB	1.50	3.02	7.60	18.13
Fixed mean	Black/African American	ZIP	1.52	3.00	7.50	18.02
Fixed mean	White	NB	1.49	3.00	7.52	18.02
Fixed mean	White	Poisson	1.49	2.99	7.48	17.95
Fixed mean	White	ZINB	1.53	3.01	7.54	18.08
Fixed mean	White	ZIP	1.52	3.00	7.52	18.04

Visualize the Simulated Distributions

# Pivot for plotting
focal_kin_long <- focal_kin_counts_all %>%
  pivot_longer(c(Siblings, AuntsUncles, Cousins),
               names_to = "RelativeType", values_to = "Count") %>%
  mutate(
    Model = factor(Model, levels = c("Poisson", "NB", "ZIP", "ZINB")),
    RelativeType = factor(RelativeType, levels = c("Siblings", "AuntsUncles", "Cousins")),
    RACE = factor(RACE, levels = c("Black/African American", "White"))
  )

# Plot
library(ggplot2)
ggplot(focal_kin_long,
       aes(x = Model, y = Count, fill = RelativeType)) +
  geom_boxplot(position = position_dodge(width = 0.8),
               width = 0.7, alpha = .8, outlier.size = .3) +
  stat_summary(fun = mean, geom = "point", shape = 21, size = 1.4,
               aes(group = RelativeType),
               position = position_dodge(width = 0.8),
               colour = "black", fill = "white") +
  facet_grid(Scenario ~ RACE) +
  scale_fill_viridis_d(option = "D") +
  labs(title   = "Simulated number of close relatives for a focal individual",
       x       = "Sibling-count model",
       y       = "Relative count",
       fill    = "Relative type") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "top")

Version	Author	Date
b9c32a9	Tina Lasisi	2025-06-18

Regression Models Comparing Scenarios, Models, and Groups

library(broom)
library(knitr)

relatives <- c("Siblings", "AuntsUncles", "Cousins")

for (rel in relatives) {
  cat("\n\n### Focal-individual model for", rel, "\n\n")
  analysis_data <- focal_kin_long %>%
    filter(RelativeType == rel) %>%
    mutate(
      RACE     = factor(RACE, levels = c("White", "Black/African American")),   # White as reference
      Model    = factor(Model, levels = c("Poisson", "NB", "ZIP", "ZINB")),    # Poisson as reference
      Scenario = factor(Scenario, levels = c("Fixed mean", "Empirical mean"))   # Fixed mean as reference
    )
  lm_fit <- lm(Count ~ RACE * Model * Scenario, data = analysis_data)
  lm_tidy <- broom::tidy(lm_fit, conf.int = TRUE)
  print(knitr::kable(
    lm_tidy,
    digits = 3,
    caption = paste(
      "Regression coefficients for", rel, 
      "(focal-individual, reference: White, Poisson, Fixed mean\nSib mean=1.5, Fertility mean=2.5)"
    )
  ))
}

Focal-individual model for Siblings

Regression coefficients for Siblings (focal-individual, reference: White, Poisson, Fixed mean Sib mean=1.5, Fertility mean=2.5)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	1.491	0.017	85.741	0.000	1.457	1.525
RACEBlack/African American	0.020	0.025	0.825	0.409	-0.028	0.069
ModelNB	0.002	0.025	0.069	0.945	-0.047	0.050
ModelZIP	0.024	0.025	0.984	0.325	-0.024	0.072
ModelZINB	0.038	0.025	1.537	0.124	-0.010	0.086
ScenarioEmpirical mean	1.597	0.025	64.917	0.000	1.549	1.645
RACEBlack/African American:ModelNB	-0.008	0.035	-0.227	0.820	-0.076	0.060
RACEBlack/African American:ModelZIP	-0.020	0.035	-0.561	0.575	-0.088	0.049
RACEBlack/African American:ModelZINB	-0.045	0.035	-1.288	0.198	-0.113	0.023
RACEBlack/African American:ScenarioEmpirical mean	1.661	0.035	47.757	0.000	1.593	1.730
ModelNB:ScenarioEmpirical mean	-0.010	0.035	-0.276	0.783	-0.078	0.059
ModelZIP:ScenarioEmpirical mean	-0.071	0.035	-2.044	0.041	-0.139	-0.003
ModelZINB:ScenarioEmpirical mean	-0.052	0.035	-1.503	0.133	-0.120	0.016
RACEBlack/African American:ModelNB:ScenarioEmpirical mean	-0.014	0.049	-0.293	0.770	-0.111	0.082
RACEBlack/African American:ModelZIP:ScenarioEmpirical mean	0.076	0.049	1.541	0.123	-0.021	0.172
RACEBlack/African American:ModelZINB:ScenarioEmpirical mean	0.090	0.049	1.827	0.068	-0.007	0.186

Focal-individual model for AuntsUncles

Regression coefficients for AuntsUncles (focal-individual, reference: White, Poisson, Fixed mean Sib mean=1.5, Fertility mean=2.5)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	2.988	0.025	120.959	0.000	2.939	3.036
RACEBlack/African American	0.003	0.035	0.080	0.936	-0.066	0.071
ModelNB	0.017	0.035	0.484	0.629	-0.052	0.085
ModelZIP	0.017	0.035	0.490	0.624	-0.051	0.086
ModelZINB	0.020	0.035	0.564	0.573	-0.049	0.088
ScenarioEmpirical mean	3.158	0.035	90.401	0.000	3.089	3.226
RACEBlack/African American:ModelNB	-0.037	0.049	-0.741	0.459	-0.133	0.060
RACEBlack/African American:ModelZIP	-0.007	0.049	-0.148	0.883	-0.104	0.090
RACEBlack/African American:ModelZINB	0.015	0.049	0.294	0.769	-0.082	0.111
RACEBlack/African American:ScenarioEmpirical mean	3.446	0.049	69.755	0.000	3.349	3.543
ModelNB:ScenarioEmpirical mean	-0.050	0.049	-1.012	0.311	-0.147	0.047
ModelZIP:ScenarioEmpirical mean	-0.081	0.049	-1.632	0.103	-0.177	0.016
ModelZINB:ScenarioEmpirical mean	-0.027	0.049	-0.545	0.586	-0.124	0.070
RACEBlack/African American:ModelNB:ScenarioEmpirical mean	0.094	0.070	1.346	0.178	-0.043	0.231
RACEBlack/African American:ModelZIP:ScenarioEmpirical mean	0.065	0.070	0.932	0.351	-0.072	0.202
RACEBlack/African American:ModelZINB:ScenarioEmpirical mean	0.049	0.070	0.709	0.479	-0.087	0.186

Focal-individual model for Cousins

Regression coefficients for Cousins (focal-individual, reference: White, Poisson, Fixed mean Sib mean=1.5, Fertility mean=2.5)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	7.476	0.094	79.779	0.000	7.292	7.659
RACEBlack/African American	-0.059	0.133	-0.446	0.656	-0.319	0.201
ModelNB	0.047	0.133	0.355	0.723	-0.213	0.307
ModelZIP	0.042	0.133	0.318	0.751	-0.218	0.302
ModelZINB	0.069	0.133	0.519	0.604	-0.191	0.329
ScenarioEmpirical mean	10.335	0.133	77.992	0.000	10.075	10.595
RACEBlack/African American:ModelNB	-0.049	0.187	-0.259	0.795	-0.416	0.319
RACEBlack/African American:ModelZIP	0.041	0.187	0.219	0.826	-0.326	0.408
RACEBlack/African American:ModelZINB	0.111	0.187	0.590	0.555	-0.257	0.478
RACEBlack/African American:ScenarioEmpirical mean	18.082	0.187	96.487	0.000	17.715	18.450
ModelNB:ScenarioEmpirical mean	-0.189	0.187	-1.010	0.313	-0.557	0.178
ModelZIP:ScenarioEmpirical mean	-0.277	0.187	-1.477	0.140	-0.644	0.091
ModelZINB:ScenarioEmpirical mean	-0.032	0.187	-0.172	0.863	-0.400	0.335
RACEBlack/African American:ModelNB:ScenarioEmpirical mean	0.032	0.265	0.121	0.904	-0.487	0.551
RACEBlack/African American:ModelZIP:ScenarioEmpirical mean	0.012	0.265	0.046	0.963	-0.507	0.532
RACEBlack/African American:ModelZINB:ScenarioEmpirical mean	-0.317	0.265	-1.198	0.231	-0.837	0.202

Note: The reference group for all regressions is White, Poisson, Fixed mean (sibling mean = 1.5, fertility mean = 2.5). All coefficients are interpreted relative to this group.

Hybrid Best-Fit Model and Poisson Fertility Baseline

# 1. Best sibling model per race (individual-level sibs)
sib_best <- sib_param_tbl %>%
  group_by(RACE) %>% slice_min(AIC, n = 1) %>% ungroup()

# 2. Force ZINB for fertility model in both races (mother-level fertility)
fert_zinb <- param_tbl %>%
  filter(Model == "ZINB") %>%
  rename(
    fert_Model = Model, fert_mu = mu,
    fert_size  = size, fert_pi0 = pi0,
    fert_AIC   = AIC
  ) %>%
  # Now select using the NEW (renamed) column names
  dplyr::select(RACE, fert_Model, fert_mu, fert_size, fert_pi0, fert_AIC)


# 3. Rename sibling columns and join to form hybrid parameter table
sib_best_renamed <- sib_best %>%
  rename(
    sib_Model = Model, sib_mu = mu,
    sib_size  = size,  sib_pi0 = pi0,
    sib_AIC   = AIC
  )

hybrid_param_zinb <- sib_best_renamed %>%
  left_join(fert_zinb, by = "RACE")

knitr::kable(
  hybrid_param_zinb,
  digits = 3,
  caption = "Hybrid best-fit parameter table for focal-individual kin simulation (ZINB fertility for both groups)"
)

Hybrid best-fit parameter table for focal-individual kin simulation (ZINB fertility for both groups)
RACE	sib_Model	sib_mu	sib_size	sib_pi0	sib_AIC	fert_Model	fert_mu	fert_size	fert_pi0	fert_AIC
Black/African American	NB	4.803	6.474	0	197096.3	ZINB	3.932	4.289	0.053	51024.46
White	NB	3.053	18.371	0	1100190.5	ZINB	3.079	257628.938	0.060	376842.17

# 1. Fixed Poisson baseline: siblings = 1.5, fertility = 2.5, all else ignored
fixed_poisson_param <- tibble(
  RACE = unique(hybrid_param_zinb$RACE),
  sib_Model = "Poisson",
  sib_mu    = 1.5,
  sib_size  = Inf,
  sib_pi0   = 0,
  fert_Model = "Poisson",
  fert_mu    = 2.5,
  fert_size  = Inf,
  fert_pi0   = 0
)

# 2. Simulate
set.seed(1)
hybrid_sim <- hybrid_param_zinb %>%
  mutate(
    Scenario = "Hybrid best-fit (ZINB fert)",  # Add BEFORE sim
    sim = pmap(
      list(sib_Model, sib_mu, sib_size, sib_pi0,
           fert_Model, fert_mu, fert_size, fert_pi0),
      ~ simulate_focal_kin(..1, ..2, ..3, ..4,
                           ..5, ..6, ..7, ..8,
                           n_sim = 10000)
    )
  ) %>%
  unnest(sim)

fixed_poisson_sim <- fixed_poisson_param %>%
  mutate(
    Scenario = "Fixed Poisson (1.5/2.5)",  # Add BEFORE sim
    sim = pmap(
      list(sib_Model, sib_mu, sib_size, sib_pi0,
           fert_Model, fert_mu, fert_size, fert_pi0),
      ~ simulate_focal_kin(..1, ..2, ..3, ..4,
                           ..5, ..6, ..7, ..8,
                           n_sim = 10000)
    )
  ) %>%
  unnest(sim)

# 3. Summarize & compare (don't create comparison_long here)
compare_means <- bind_rows(hybrid_sim, fixed_poisson_sim) %>%
  group_by(Scenario, RACE) %>%
  summarise(across(c(Siblings, AuntsUncles, Cousins, TotalKin), mean),
            .groups = "drop") %>%
  arrange(RACE, Scenario)

knitr::kable(compare_means, digits = 2,
  caption = "Mean close-kin counts: hybrid best-fit (ZINB fertility both groups) vs. fixed Poisson baseline (siblings=1.5, fertility=2.5)"
)

Mean close-kin counts: hybrid best-fit (ZINB fertility both groups) vs. fixed Poisson baseline (siblings=1.5, fertility=2.5)
Scenario	RACE	Siblings	AuntsUncles	Cousins	TotalKin
Fixed Poisson (1.5/2.5)	Black/African American	1.49	3.00	7.52	18.01
Hybrid best-fit (ZINB fert)	Black/African American	4.77	9.52	35.19	55.48
Fixed Poisson (1.5/2.5)	White	1.52	3.01	7.53	18.06
Hybrid best-fit (ZINB fert)	White	3.04	6.15	17.81	33.01

Probability Mass Functions (PMFs) for Siblings, Aunts/Uncles, and Cousins

# Function to get PMF for a variable, scenario, and race
get_pmf <- function(df, var) {
  df %>%
    count(RACE, !!sym(var), name = "n") %>%
    group_by(RACE) %>%
    mutate(p = n / sum(n)) %>%
    ungroup()
}

# Hybrid scenario
sib_pmf_hybrid   <- get_pmf(hybrid_sim,   "Siblings")
aunts_pmf_hybrid <- get_pmf(hybrid_sim,   "AuntsUncles")
cousins_pmf_hybrid <- get_pmf(hybrid_sim, "Cousins")

# Fixed Poisson baseline
sib_pmf_pois   <- get_pmf(fixed_poisson_sim,   "Siblings")
aunts_pmf_pois <- get_pmf(fixed_poisson_sim,   "AuntsUncles")
cousins_pmf_pois <- get_pmf(fixed_poisson_sim, "Cousins")

# Write to CSV (useful for supplement)
write.csv(sib_pmf_hybrid,    "data/output_pmf_sibling_hybrid.csv",     row.names = FALSE)
write.csv(aunts_pmf_hybrid,  "data/output_pmf_avuncular_hybrid.csv",   row.names = FALSE)
write.csv(cousins_pmf_hybrid,"data/output_pmf_cousins_hybrid.csv",     row.names = FALSE)
write.csv(sib_pmf_pois,      "data/output_pmf_sibling_poisson.csv",    row.names = FALSE)
write.csv(aunts_pmf_pois,    "data/output_pmf_avuncular_poisson.csv",  row.names = FALSE)
write.csv(cousins_pmf_pois,  "data/output_pmf_cousins_poisson.csv",    row.names = FALSE)

print_pmf_table <- function(pmf, category, scenario, n = 12) {
  pmf %>%
    group_by(RACE) %>%
    slice_head(n = n) %>%
    ungroup() %>%
    knitr::kable(
      digits = 4,
      caption = paste0("First ", n, " rows of PMF for ", category, " (", scenario, ")")
    )
}

print_pmf_table(sib_pmf_hybrid,   "Siblings", "Hybrid best-fit")

First 12 rows of PMF for Siblings (Hybrid best-fit)
RACE	Siblings	n	p
Black/African American	0	281	0.0281
Black/African American	1	746	0.0746
Black/African American	2	1168	0.1168
Black/African American	3	1503	0.1503
Black/African American	4	1494	0.1494
Black/African American	5	1285	0.1285
Black/African American	6	1037	0.1037
Black/African American	7	833	0.0833
Black/African American	8	549	0.0549
Black/African American	9	445	0.0445
Black/African American	10	264	0.0264
Black/African American	11	157	0.0157
White	0	589	0.0589
White	1	1579	0.1579
White	2	2113	0.2113
White	3	2128	0.2128
White	4	1579	0.1579
White	5	985	0.0985
White	6	539	0.0539
White	7	272	0.0272
White	8	131	0.0131
White	9	48	0.0048
White	10	24	0.0024
White	11	8	0.0008

print_pmf_table(sib_pmf_pois,     "Siblings", "Fixed Poisson (1.5/2.5)")

First 12 rows of PMF for Siblings (Fixed Poisson (1.5/2.5))
RACE	Siblings	n	p
Black/African American	0	2203	0.2203
Black/African American	1	3404	0.3404
Black/African American	2	2489	0.2489
Black/African American	3	1281	0.1281
Black/African American	4	444	0.0444
Black/African American	5	142	0.0142
Black/African American	6	26	0.0026
Black/African American	7	10	0.0010
Black/African American	8	1	0.0001
White	0	2118	0.2118
White	1	3419	0.3419
White	2	2484	0.2484
White	3	1326	0.1326
White	4	487	0.0487
White	5	119	0.0119
White	6	38	0.0038
White	7	8	0.0008
White	8	1	0.0001

print_pmf_table(aunts_pmf_hybrid, "Aunts/Uncles", "Hybrid best-fit")

First 12 rows of PMF for Aunts/Uncles (Hybrid best-fit)
RACE	AuntsUncles	n	p
Black/African American	0	7	0.0007
Black/African American	1	46	0.0046
Black/African American	2	112	0.0112
Black/African American	3	260	0.0260
Black/African American	4	468	0.0468
Black/African American	5	669	0.0669
Black/African American	6	814	0.0814
Black/African American	7	950	0.0950
Black/African American	8	954	0.0954
Black/African American	9	1032	0.1032
Black/African American	10	961	0.0961
Black/African American	11	817	0.0817
White	0	36	0.0036
White	1	179	0.0179
White	2	490	0.0490
White	3	905	0.0905
White	4	1297	0.1297
White	5	1436	0.1436
White	6	1464	0.1464
White	7	1303	0.1303
White	8	1035	0.1035
White	9	743	0.0743
White	10	469	0.0469
White	11	298	0.0298

print_pmf_table(aunts_pmf_pois,   "Aunts/Uncles", "Fixed Poisson (1.5/2.5)")

First 12 rows of PMF for Aunts/Uncles (Fixed Poisson (1.5/2.5))
RACE	AuntsUncles	n	p
Black/African American	0	527	0.0527
Black/African American	1	1428	0.1428
Black/African American	2	2279	0.2279
Black/African American	3	2228	0.2228
Black/African American	4	1675	0.1675
Black/African American	5	1024	0.1024
Black/African American	6	509	0.0509
Black/African American	7	228	0.0228
Black/African American	8	66	0.0066
Black/African American	9	21	0.0021
Black/African American	10	13	0.0013
Black/African American	11	2	0.0002
White	0	466	0.0466
White	1	1511	0.1511
White	2	2275	0.2275
White	3	2199	0.2199
White	4	1676	0.1676
White	5	1021	0.1021
White	6	512	0.0512
White	7	210	0.0210
White	8	100	0.0100
White	9	19	0.0019
White	10	11	0.0011

print_pmf_table(cousins_pmf_hybrid, "Cousins", "Hybrid best-fit")

First 12 rows of PMF for Cousins (Hybrid best-fit)
RACE	Cousins	n	p
Black/African American	0	14	0.0014
Black/African American	1	14	0.0014
Black/African American	2	22	0.0022
Black/African American	3	21	0.0021
Black/African American	4	26	0.0026
Black/African American	5	47	0.0047
Black/African American	6	48	0.0048
Black/African American	7	49	0.0049
Black/African American	8	82	0.0082
Black/African American	9	87	0.0087
Black/African American	10	90	0.0090
Black/African American	11	82	0.0082
White	0	72	0.0072
White	1	38	0.0038
White	2	85	0.0085
White	3	116	0.0116
White	4	153	0.0153
White	5	166	0.0166
White	6	218	0.0218
White	7	287	0.0287
White	8	306	0.0306
White	9	353	0.0353
White	10	382	0.0382
White	11	399	0.0399

print_pmf_table(cousins_pmf_pois,   "Cousins", "Fixed Poisson (1.5/2.5)")

First 12 rows of PMF for Cousins (Fixed Poisson (1.5/2.5))
RACE	Cousins	n	p
Black/African American	0	669	0.0669
Black/African American	1	384	0.0384
Black/African American	2	579	0.0579
Black/African American	3	716	0.0716
Black/African American	4	789	0.0789
Black/African American	5	798	0.0798
Black/African American	6	818	0.0818
Black/African American	7	789	0.0789
Black/African American	8	737	0.0737
Black/African American	9	642	0.0642
Black/African American	10	512	0.0512
Black/African American	11	498	0.0498
White	0	615	0.0615
White	1	365	0.0365
White	2	590	0.0590
White	3	734	0.0734
White	4	869	0.0869
White	5	826	0.0826
White	6	741	0.0741
White	7	821	0.0821
White	8	668	0.0668
White	9	622	0.0622
White	10	628	0.0628
White	11	478	0.0478

Plotting PMFs for Siblings, Aunts/Uncles, and Cousins

library(ggplot2)
library(viridisLite)

plot_pmf_mirror <- function(pmf_df, count_var = "Siblings", category_label = "Siblings", cap = NULL) {
  cats <- sort(unique(pmf_df[[count_var]]))
  if (!is.null(cap)) {
    pmf_df[[count_var]] <- ifelse(pmf_df[[count_var]] > cap, cap, pmf_df[[count_var]])
    cats <- sort(unique(pmf_df[[count_var]]))
  }
  n_cat <- length(cats)
  color_palette <- viridis(n_cat, option = "D")
  df <- pmf_df %>%
    mutate(y = ifelse(RACE == "White", -p, p),
           count_cat = factor(!!rlang::sym(count_var), levels = cats))
  ggplot(df, aes(x = count_cat, y = y, fill = count_cat, alpha = RACE)) +
    geom_col() +
    geom_hline(yintercept = 0, color = "black", size = 0.5) +
    coord_flip() +
    scale_y_continuous(labels = function(x) abs(x)) +
    scale_fill_manual(values = color_palette, name = paste(category_label, "Count")) +
    scale_alpha_manual(values = c("White" = 0.7, "Black/African American" = 1), guide = "none") +
    labs(
      title = paste0("Mirrored PMF of ", category_label, " Counts (Simulated, Hybrid Best-fit)"),
      x = paste("Number of", category_label),
      y = "Probability",
      caption = "White (left, negative y), Black/African American (right, positive y)."
    ) +
    theme_minimal(base_size = 13) +
    theme(
      plot.title = element_text(size = 14, hjust = 0.5),
      axis.text.y = element_text(size = 10),
      legend.position = "right",
      panel.grid.major.y = element_blank(),
      panel.grid.minor.y = element_blank()
    )
}

plot_pmf_mirror(sib_pmf_hybrid, count_var = "Siblings", category_label = "Siblings", cap = 12)

Version	Author	Date
b9c32a9	Tina Lasisi	2025-06-18

plot_pmf_mirror(aunts_pmf_hybrid, count_var = "AuntsUncles", category_label = "Aunts/Uncles", cap = 32)

Version	Author	Date
b9c32a9	Tina Lasisi	2025-06-18

plot_pmf_mirror(cousins_pmf_hybrid, count_var = "Cousins", category_label = "Cousins", cap = 80)

Version	Author	Date
b9c32a9	Tina Lasisi	2025-06-18

# Combine hybrid and Poisson baseline sims
compare_means <- bind_rows(hybrid_sim, fixed_poisson_sim) %>%
  group_by(Scenario, RACE) %>%
  summarise(
    mean_sibs = mean(Siblings),
    mean_aunts_uncles = mean(AuntsUncles),
    mean_cousins = mean(Cousins),
    mean_total = mean(TotalKin),
    .groups = "drop"
  )

knitr::kable(compare_means, digits = 2,
  caption = "Mean close-kin counts: hybrid best-fit vs. fixed Poisson baseline (siblings=1.5, fertility=2.5)"
)

Mean close-kin counts: hybrid best-fit vs. fixed Poisson baseline (siblings=1.5, fertility=2.5)
Scenario	RACE	mean_sibs	mean_aunts_uncles	mean_cousins	mean_total
Fixed Poisson (1.5/2.5)	Black/African American	1.49	3.00	7.52	18.01
Fixed Poisson (1.5/2.5)	White	1.52	3.01	7.53	18.06
Hybrid best-fit (ZINB fert)	Black/African American	4.77	9.52	35.19	55.48
Hybrid best-fit (ZINB fert)	White	3.04	6.15	17.81	33.01

summarize_kin <- function(df) {
  df %>%
    tidyr::pivot_longer(
      c(Siblings, AuntsUncles, Cousins, TotalKin),
      names_to = "Relative",
      values_to = "Count"
    ) %>%
    group_by(RACE, Relative) %>%
    summarise(
      min    = min(Count),
      q1     = quantile(Count, 0.25),
      mean   = mean(Count),
      median = median(Count),
      q3     = quantile(Count, 0.75),
      max    = max(Count),
      .groups = "drop"
    )
}

hybrid_summary <- summarize_kin(hybrid_sim) %>%
  rename_with(~ paste0("hybrid_", .), c("min","q1","mean","median","q3","max"))

poisson_summary <- summarize_kin(fixed_poisson_sim) %>%
  rename_with(~ paste0("poisson_", .), c("min","q1","mean","median","q3","max"))

library(dplyr)

side_by_side <- hybrid_summary %>%
  left_join(poisson_summary, by = c("RACE", "Relative"))

knitr::kable(
  side_by_side,
  digits = 2,
  caption = "Summary statistics for simulated kin counts: Hybrid (NB/ZINB) vs. Fixed Poisson baseline"
)

Summary statistics for simulated kin counts: Hybrid (NB/ZINB) vs. Fixed Poisson baseline
RACE	Relative	hybrid_min	hybrid_q1	hybrid_mean	hybrid_median	hybrid_q3	hybrid_max	poisson_min	poisson_q1	poisson_mean	poisson_median	poisson_q3	poisson_max
Black/African American	AuntsUncles	0	7	9.52	9.0	12	24	0	2	3.00	3	4	11
Black/African American	Cousins	0	23	35.19	34.0	46	117	0	4	7.52	7	11	36
Black/African American	Siblings	0	3	4.77	4.0	6	12	0	1	1.49	1	2	8
Black/African American	TotalKin	7	40	55.48	53.5	69	151	6	13	18.01	17	22	55
White	AuntsUncles	0	4	6.15	6.0	8	20	0	2	3.01	3	4	10
White	Cousins	0	11	17.81	17.0	23	61	0	4	7.53	7	11	35
White	Siblings	0	2	3.04	3.0	4	12	0	1	1.52	1	2	8
White	TotalKin	6	25	33.01	32.0	40	87	6	13	18.06	17	22	52

library(ggplot2)

# Create comparison_long for plotting (check if Scenario exists first)
comparison_long <- bind_rows(hybrid_sim, fixed_poisson_sim) %>%
  tidyr::pivot_longer(
    cols = c(Siblings, AuntsUncles, Cousins, TotalKin),
    names_to = "RelativeType",
    values_to = "Count"
  ) %>%
  mutate(
    Scenario = factor(Scenario, levels = c("Fixed Poisson (1.5/2.5)", "Hybrid best-fit (ZINB fert)")),
    RelativeType = factor(RelativeType, levels = c("Siblings", "AuntsUncles", "Cousins", "TotalKin"),
                         labels = c("Siblings", "Aunts/Uncles", "Cousins", "Total Kin"))
  )

# Function to plot each relative type separately (with free y-axis)
plot_kin_box <- function(rel_type) {
  ggplot(comparison_long %>% filter(RelativeType == rel_type),
         aes(x = Scenario, y = Count, fill = RACE)) +
    geom_boxplot(outlier.size = .3, width = 0.7, alpha = 0.8, position = position_dodge(width = 0.7)) +
    stat_summary(
      fun = mean, geom = "point", shape = 21, size = 1.7,
      aes(group = RACE), color = "black", fill = "white",
      position = position_dodge(width = 0.7)
    ) +
    scale_fill_viridis_d(option = "D") +
    labs(
      title = paste("Distribution of", rel_type),
      x = "Simulation scenario",
      y = "Count",
      fill = "Race"
    ) +
    theme_minimal(base_size = 13) +
    theme(
      legend.position = "top",
      panel.grid.major.x = element_blank(),
      strip.text = element_text(size = 12),
      axis.text.x = element_text(angle = 15, vjust = 1, hjust = 1)
    )
}

# Now, print all four plots (one for each relative type)
plot_kin_box("Siblings")

Version	Author	Date
b9c32a9	Tina Lasisi	2025-06-18

plot_kin_box("Aunts/Uncles")

Version	Author	Date
b9c32a9	Tina Lasisi	2025-06-18

plot_kin_box("Cousins")

Version	Author	Date
b9c32a9	Tina Lasisi	2025-06-18

plot_kin_box("Total Kin")

Version	Author	Date
b9c32a9	Tina Lasisi	2025-06-18

# Make sure Scenario is present and named correctly
names(hybrid_sim)

 [1] "RACE"         "sib_Model"    "sib_mu"       "sib_size"     "sib_pi0"     
 [6] "sib_AIC"      "fert_Model"   "fert_mu"      "fert_size"    "fert_pi0"    
[11] "fert_AIC"     "Scenario"     "Siblings"     "AuntsUncles"  "Cousins"     
[16] "Parents"      "Grandparents" "TotalKin"

names(fixed_poisson_sim)

 [1] "RACE"         "sib_Model"    "sib_mu"       "sib_size"     "sib_pi0"     
 [6] "fert_Model"   "fert_mu"      "fert_size"    "fert_pi0"     "Scenario"    
[11] "Siblings"     "AuntsUncles"  "Cousins"      "Parents"      "Grandparents"
[16] "TotalKin"

# Filter for just TotalKin before running the regression
comparison_total <- bind_rows(
  hybrid_sim %>% mutate(Scenario = "Hybrid best-fit (ZINB fert)"),
  fixed_poisson_sim %>% mutate(Scenario = "Fixed Poisson (1.5/2.5)")
) %>%
  mutate(
    Scenario = factor(Scenario, levels = c("Fixed Poisson (1.5/2.5)", "Hybrid best-fit (ZINB fert)")),
    RACE = factor(RACE, levels = c("White", "Black/African American"))
  )

# Now run the regression on TotalKin
lm_hybrid_vs_pois <- lm(TotalKin ~ Scenario * RACE, data = comparison_total)
summary(lm_hybrid_vs_pois)


Call:
lm(formula = TotalKin ~ Scenario * RACE, data = comparison_total)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.479  -7.011  -1.011   5.989  95.521 

Coefficients:
                                                               Estimate
(Intercept)                                                     18.0622
ScenarioHybrid best-fit (ZINB fert)                             14.9490
RACEBlack/African American                                      -0.0507
ScenarioHybrid best-fit (ZINB fert):RACEBlack/African American  22.5188
                                                               Std. Error
(Intercept)                                                        0.1268
ScenarioHybrid best-fit (ZINB fert)                                0.1794
RACEBlack/African American                                         0.1794
ScenarioHybrid best-fit (ZINB fert):RACEBlack/African American     0.2537
                                                               t value Pr(>|t|)
(Intercept)                                                    142.415   <2e-16
ScenarioHybrid best-fit (ZINB fert)                             83.345   <2e-16
RACEBlack/African American                                      -0.283    0.777
ScenarioHybrid best-fit (ZINB fert):RACEBlack/African American  88.777   <2e-16
                                                                  
(Intercept)                                                    ***
ScenarioHybrid best-fit (ZINB fert)                            ***
RACEBlack/African American                                        
ScenarioHybrid best-fit (ZINB fert):RACEBlack/African American ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.68 on 39996 degrees of freedom
Multiple R-squared:  0.5935,    Adjusted R-squared:  0.5935 
F-statistic: 1.946e+04 on 3 and 39996 DF,  p-value: < 2.2e-16

knitr::kable(broom::tidy(lm_hybrid_vs_pois), digits = 3,
             caption = "Regression coefficients: effect of simulation scenario (Hybrid vs Poisson) on total close relatives")

Regression coefficients: effect of simulation scenario (Hybrid vs Poisson) on total close relatives
term	estimate	std.error	statistic	p.value
(Intercept)	18.062	0.127	142.415	0.000
ScenarioHybrid best-fit (ZINB fert)	14.949	0.179	83.345	0.000
RACEBlack/African American	-0.051	0.179	-0.283	0.777
ScenarioHybrid best-fit (ZINB fert):RACEBlack/African American	22.519	0.254	88.777	0.000

sessionInfo()

R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Detroit
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] viridisLite_0.4.2 broom_1.0.8       ggplot2_3.5.2     purrr_1.0.4      
 [5] knitr_1.50        pscl_1.5.9        MASS_7.3-65       readr_2.1.5      
 [9] tidyr_1.3.1       dplyr_1.1.4       workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] sass_0.4.10        utf8_1.2.5         generics_0.1.4     stringi_1.8.7     
 [5] hms_1.1.3          digest_0.6.37      magrittr_2.0.3     RColorBrewer_1.1-3
 [9] grid_4.5.0         evaluate_1.0.3     fastmap_1.2.0      rprojroot_2.0.4   
[13] jsonlite_2.0.0     processx_3.8.6     whisker_0.4.1      backports_1.5.0   
[17] ps_1.9.1           promises_1.3.3     httr_1.4.7         scales_1.4.0      
[21] jquerylib_0.1.4    cli_3.6.5          rlang_1.1.6        crayon_1.5.3      
[25] bit64_4.6.0-1      withr_3.0.2        cachem_1.1.0       yaml_2.3.10       
[29] tools_4.5.0        parallel_4.5.0     tzdb_0.5.0         httpuv_1.6.16     
[33] vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.4    git2r_0.36.2      
[37] stringr_1.5.1      fs_1.6.6           bit_4.6.0          vroom_1.6.5       
[41] pkgconfig_2.0.3    callr_3.7.6        gtable_0.3.6       pillar_1.10.2     
[45] bslib_0.9.0        later_1.4.2        glue_1.8.0         Rcpp_1.0.14       
[49] xfun_0.52          tibble_3.3.0       tidyselect_1.2.1   dichromat_2.0-0.1 
[53] rstudioapi_0.17.1  farver_2.1.2       htmltools_0.5.8.1  labeling_0.4.3    
[57] rmarkdown_2.29     compiler_4.5.0     getPass_0.2-4

Family structure distribution fitting for genetic surveillance simulations

Tina Lasisi

2025-06-19