Phase 1 — Conservation & Expression Breadth

PhyloP evolutionary conservation and GTEx tissue expression

Overview

Phase 1 adds two independent constraint axes to the network analysis:

  • Phase 1.1 — PhyloP: Mean vertebrate conservation score (100-way, hg38) per gene, fetched from UCSC. Tests whether functional categories differ in evolutionary conservation.
  • Phase 1.2 — GTEx: Tissue expression breadth (number of tissues with median TPM > 1, GTEx v8, 54 tissues). Tests Hypothesis 2: broadly expressed genes show stronger constraint.

Both metrics are merged with the Raghunath 129-gene LOEUF dataset in:

  • data/network_constraint_phylop.csv
  • data/network_constraint_gtex.csv

Phase 1.1 — PhyloP evolutionary conservation

Spearman ρ(PhyloP, LOEUF) = -0.166, p = 6.164e-02: genes with higher PhyloP conservation scores tend to have lower LOEUF (more LoF-intolerant), the direction expected if both metrics capture evolutionary constraint, but the correlation is weak and does not reach significance at α = 0.05. The Kruskal-Wallis test across functional categories is significant (p = 3.234e-02): Developmental/NC genes are the most conserved (median PhyloP 0.288), while Apoptosis/cell death (0.068) and Pigment-specific (0.087) genes are the least conserved.
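As an illustration of how a Spearman correlation of this kind is computed, the sketch below ranks both variables and takes the Pearson correlation of the ranks. It uses only the nine key genes listed in this section, so its result is not the 129-gene statistic reported above, and the helper names (average_ranks, spearman_rho) are hypothetical rather than taken from analysis/phase1_phylop_analysis.py.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# (gene, LOEUF, mean PhyloP) for the nine key genes reported in this section.
key_genes = [
    ("TFAP2A", 0.261, 1.360723),
    ("SOX10", 0.209, 0.485107),
    ("PAX3", 0.475, 0.465089),
    ("TYRP1", 1.889, 0.265219),
    ("MITF", 0.308, 0.171432),
    ("DCT", 1.174, 0.151017),
    ("TYR", 1.964, 0.071985),
    ("OCA2", 0.856, -0.009525),
    ("MC1R", 1.967, -0.114877),
]
loeuf = [g[1] for g in key_genes]
phylop = [g[2] for g in key_genes]

rho_key_genes = spearman_rho(phylop, loeuf)
print(f"rho (9 key genes only) = {rho_key_genes:.3f}")
```

On this hand-picked subset the correlation is about -0.78, far stronger than the network-wide ρ = -0.166, which is a reminder that the key genes were chosen for illustration and are not representative of all 129 genes.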

Key genes

gene    functional_category  LOEUF  Mean PhyloP 100-way
TFAP2A  Developmental/NC     0.261   1.360723
SOX10   Developmental/NC     0.209   0.485107
PAX3    Developmental/NC     0.475   0.465089
TYRP1   Pigment-specific     1.889   0.265219
MITF    Developmental/NC     0.308   0.171432
DCT     Pigment-specific     1.174   0.151017
TYR     Pigment-specific     1.964   0.071985
OCA2    Pigment-specific     0.856  -0.009525
MC1R    Pigment-specific     1.967  -0.114877

PhyloP summary by functional category

functional_category       N   Median PhyloP  Mean PhyloP
Pigment-specific          7   0.087          0.154
Developmental/NC          8   0.288          0.419
Generic signaling         46  0.185          0.231
Cytokines/growth factors  13  0.233          0.199
Apoptosis/cell death      18  0.068          0.067
Other                     35  0.178          0.301

Phase 1.2 — GTEx tissue expression breadth

Spearman ρ(tissue breadth, LOEUF) = -0.322, p = 2.261e-04, the strongest correlation with LOEUF in Phase 1 (compare ρ = -0.166 for PhyloP). Genes expressed broadly across tissues tend to be more LoF-intolerant than tissue-specific genes, although the correlation is moderate in size. The Kruskal-Wallis test is highly significant (p = 6.276e-09).

Hypothesis 2 test — Regression: LOEUF ~ tissue breadth + functional category

coef std err t P>|t| [0.025 0.975]
Intercept 1.6758 0.145 11.524 0.000 1.388 1.964
C(functional_category, Treatment("Pigment-specific"))[T.Apoptosis/cell death] -0.4717 0.183 -2.575 0.011 -0.834 -0.109
C(functional_category, Treatment("Pigment-specific"))[T.Cytokines/growth factors] -0.8837 0.175 -5.038 0.000 -1.231 -0.536
C(functional_category, Treatment("Pigment-specific"))[T.Developmental/NC] -1.0342 0.205 -5.054 0.000 -1.439 -0.629
C(functional_category, Treatment("Pigment-specific"))[T.Generic signaling] -1.0130 0.174 -5.809 0.000 -1.358 -0.668
C(functional_category, Treatment("Pigment-specific"))[T.Other] -0.6559 0.165 -3.977 0.000 -0.982 -0.329
tissue_breadth -0.0054 0.003 -2.142 0.034 -0.010 -0.000

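After controlling for functional category, tissue breadth remains a significant predictor of LOEUF: each additional tissue with median TPM > 1 is associated with a 0.0054 decrease in LOEUF (p = 0.034; the upper 95% bound shown as -0.000 is a small negative value lost to rounding). Pigment-specific genes are the reference category, so each category coefficient is that category's LOEUF offset relative to Pigment-specific genes at equal tissue breadth.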
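To make the coefficients concrete, the following sketch plugs them into the fitted linear equation LOEUF = intercept + category offset + slope × tissue breadth. The coefficients are copied from the regression table in this section; the function name predicted_loeuf is hypothetical, and the two breadth values (9 and 49) are the category medians from the tissue breadth summary, used here only as example inputs.

```python
# Coefficients copied from the regression table in this section.
INTERCEPT = 1.6758
BREADTH_COEF = -0.0054  # change in LOEUF per additional expressed tissue

# Treatment coding: the reference level (Pigment-specific) has offset 0.
CATEGORY_COEF = {
    "Pigment-specific": 0.0,
    "Apoptosis/cell death": -0.4717,
    "Cytokines/growth factors": -0.8837,
    "Developmental/NC": -1.0342,
    "Generic signaling": -1.0130,
    "Other": -0.6559,
}


def predicted_loeuf(category, tissue_breadth):
    """Fitted LOEUF for a gene in `category` expressed in `tissue_breadth` tissues."""
    return INTERCEPT + CATEGORY_COEF[category] + BREADTH_COEF * tissue_breadth


pigment = predicted_loeuf("Pigment-specific", 9)
developmental = predicted_loeuf("Developmental/NC", 49)
full_range_effect = BREADTH_COEF * 54  # breadth 0 versus all 54 tissues

print(f"Pigment-specific, breadth 9:   {pigment:.3f}")
print(f"Developmental/NC, breadth 49:  {developmental:.3f}")
print(f"Breadth effect across 0 to 54: {full_range_effect:.3f}")
```

The fitted values are about 1.63 and 0.38, and the breadth term accounts for at most about 0.29 LOEUF units across the full 0 to 54 range, so in this model the category offsets are larger than the breadth effect.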

Tissue breadth summary by functional category

functional_category       N   Median breadth  Mean breadth
Pigment-specific          7   9.0             16.9
Developmental/NC          7   49.0            39.3
Generic signaling         47  54.0            52.6
Cytokines/growth factors  13  34.0            29.6
Apoptosis/cell death      18  54.0            49.2
Other                     35  53.0            41.8

Phase 1.2 supplementary — Tissue specificity (Tau) vs. LOEUF

Tau (τ) is a continuous tissue-specificity index computed from log2(TPM+1) expression: τ = 0 means uniform expression across all 54 tissues, and τ = 1 means expression in only one tissue. It gives a stronger signal than the tissue-breadth count, which is built from binary per-tissue expression calls (TPM > 1).


Phase 1.2 supplementary — Per-tissue effect on LOEUF

For each of the 54 GTEx tissues, ΔLOEUF is the difference in median LOEUF between genes expressed there (TPM > 1) and genes not expressed there, tested with a Mann-Whitney test. Negative ΔLOEUF means genes expressed in that tissue are more LoF-constrained.
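A minimal sketch of the per-tissue ΔLOEUF calculation for one tissue follows. The function name and the TPM values are hypothetical (the TPM numbers are toy inputs, not GTEx data); the LOEUF values are taken from the key genes table. The Mann-Whitney p-value is not computed here; one option would be scipy.stats.mannwhitneyu on the same two groups.

```python
from statistics import median


def delta_median_loeuf(loeuf_by_gene, tpm_by_gene, threshold=1.0):
    """Median LOEUF of expressed genes minus median LOEUF of non-expressed genes.

    Returns None when either group is empty, for example in a tissue where
    every gene passes the expression threshold.
    """
    expressed, not_expressed = [], []
    for gene, loeuf in loeuf_by_gene.items():
        if gene not in tpm_by_gene:
            continue  # gene has no expression value for this tissue
        if tpm_by_gene[gene] > threshold:
            expressed.append(loeuf)
        else:
            not_expressed.append(loeuf)
    if not expressed or not not_expressed:
        return None
    return median(expressed) - median(not_expressed)


# LOEUF values from the key genes table.
loeuf = {"SOX10": 0.209, "MITF": 0.308, "TYR": 1.964, "MC1R": 1.967}
# Toy median TPM values for a single hypothetical tissue (NOT real GTEx data).
toy_tpm = {"SOX10": 12.0, "MITF": 8.5, "TYR": 0.2, "MC1R": 0.4}

delta = delta_median_loeuf(loeuf, toy_tpm)
print(f"delta LOEUF = {delta:.3f}")
```

With these toy inputs the difference is negative, meaning the genes expressed in the toy tissue have the lower (more constrained) median LOEUF.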


Phase 1.2 supplementary — Clustered expression heatmap

Genes (rows) sorted by LOEUF (top = most constrained) with functional category strip; tissues (columns) hierarchically clustered by co-expression (correlation distance). Color = log2(TPM + 1).


Phase 1.2 supplementary — UpSet plot of tissue intersections

Top intersections (≥2 genes) of expression across all 54 tissues, with LOEUF distribution per intersection. Most network genes (n = 71) are expressed in all 54 tissues — true housekeeping pattern.


Alternative categorizations (data-driven)

The hand-curated functional_category column (Pigment-specific, Developmental/NC, Generic signaling, etc.) comes from analysis/notebooks/melanogenesis_network_constraint_v2.ipynb and is persisted in data/LOEUF_by_functional_category.xlsx. To compare the hand curation against data-driven schemes, two additional categorical columns (gtex_tissue_category and kegg_primary_pathway) are computed in analysis/phase1_new_categories.py and merged into data/network_constraint_categorized.csv.

gtex_tissue_category — derived from GTEx v8 expression

Computed on log2(TPM + 1) across all 54 GTEx tissues:

  • τ (tau): Yanai 2005 tissue specificity index. τ = 0 → uniform; τ = 1 → expressed in one tissue.
  • n_expr: number of tissues with median TPM > 1
  • max_tissue: tissue with the highest median TPM
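The three quantities above can be sketched for a single gene as follows. The formula is the standard Yanai τ, Σ(1 - x_i / x_max) / (n - 1), applied to log2(TPM + 1) values; the function name tissue_metrics and the handling of all-zero genes (τ returned as NaN) are assumptions, not necessarily what phase1_new_categories.py does.

```python
import math


def tissue_metrics(median_tpm, tpm_threshold=1.0):
    """Compute (tau, n_expr, max_tissue) for one gene.

    median_tpm: dict mapping tissue name -> median TPM for that gene.
    """
    log_expr = {t: math.log2(v + 1.0) for t, v in median_tpm.items()}
    x_max = max(log_expr.values())
    n = len(log_expr)
    if x_max == 0 or n < 2:
        tau = math.nan  # tau is undefined for a gene with no expression
    else:
        tau = sum(1.0 - x / x_max for x in log_expr.values()) / (n - 1)
    n_expr = sum(1 for v in median_tpm.values() if v > tpm_threshold)
    max_tissue = max(median_tpm, key=median_tpm.get)
    return tau, n_expr, max_tissue


# Two toy genes over three hypothetical tissues (not GTEx data).
uniform_gene = {"Tissue A": 5.0, "Tissue B": 5.0, "Tissue C": 5.0}
specific_gene = {"Tissue A": 10.0, "Tissue B": 0.0, "Tissue C": 0.0}

print(tissue_metrics(uniform_gene))
print(tissue_metrics(specific_gene))
```

The uniformly expressed toy gene gets τ = 0 with all three tissues counted as expressed, and the single-tissue gene gets τ = 1 with one expressed tissue, matching the two extremes described above.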

Decision tree (first match wins):

Category                 Rule
Housekeeping             τ < 0.4 AND n_expr ≥ 40
Skin-restricted          τ ≥ 0.6 AND max_tissue ∈ skin
Brain-restricted         τ ≥ 0.6 AND max_tissue ∈ CNS / nerve / pituitary
Reproductive-restricted  τ ≥ 0.6 AND max_tissue ∈ gonads / uterus / cervix / prostate
Immune-restricted        τ ≥ 0.6 AND max_tissue ∈ blood / spleen / lymphocytes
Liver-restricted         τ ≥ 0.6 AND max_tissue == Liver
Other-restricted         τ ≥ 0.6 AND any other tissue
Broad                    everything else (intermediate τ, or τ < 0.4 with n_expr < 40)

Thresholds (τ ≥ 0.6, n_expr ≥ 40) follow conventions in the tissue-specificity literature (Sonawane 2017, Kryuchkova-Mostacci 2017). To change them, edit the constants at the top of phase1_new_categories.py.
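The decision tree can be sketched as a first-match-wins function. The constant names and the mapping from GTEx tissue names to groups (matched by name prefix) are assumptions made for this illustration; the actual constants and tissue lists live in phase1_new_categories.py and may differ.

```python
# Thresholds as stated in the decision tree (constant names are hypothetical).
TAU_HOUSEKEEPING = 0.4
TAU_RESTRICTED = 0.6
N_EXPR_HOUSEKEEPING = 40

# Assumed mapping of GTEx tissue-name prefixes to restricted categories,
# checked in the same order as the decision tree.
TISSUE_GROUPS = [
    ("Skin-restricted", ("Skin",)),
    ("Brain-restricted", ("Brain", "Nerve", "Pituitary")),
    ("Reproductive-restricted",
     ("Testis", "Ovary", "Uterus", "Cervix", "Prostate")),
    ("Immune-restricted",
     ("Whole Blood", "Spleen", "Cells - EBV-transformed lymphocytes")),
    ("Liver-restricted", ("Liver",)),
]


def gtex_tissue_category(tau, n_expr, max_tissue):
    """Assign one category per gene; the first matching rule wins."""
    if tau < TAU_HOUSEKEEPING and n_expr >= N_EXPR_HOUSEKEEPING:
        return "Housekeeping"
    if tau >= TAU_RESTRICTED:
        for category, prefixes in TISSUE_GROUPS:
            if max_tissue.startswith(prefixes):
                return category
        return "Other-restricted"
    # Intermediate tau, or low tau with fewer than 40 expressed tissues.
    return "Broad"


print(gtex_tissue_category(0.2, 54, "Liver"))
print(gtex_tissue_category(0.9, 3, "Skin - Sun Exposed (Lower leg)"))
print(gtex_tissue_category(0.5, 30, "Lung"))
```

Note that in this sketch a gene with low τ but fewer than 40 expressed tissues falls through to Broad, as does a gene whose τ is undefined (NaN), because neither comparison succeeds.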

kegg_primary_pathway — derived from KEGG pathway membership

analysis/fetch_kegg_pathways.py fetches link/pathway/hsa and list/pathway/hsa from the KEGG REST API and writes the long-form per-gene pathway list to data/kegg_pathway_lists.csv.
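As a rough sketch of the parsing step, the code below turns the tab-separated text returned by the two KEGG REST operations into long-form (gene, pathway, name) rows. It works on small inline sample strings in the response format rather than making network requests; the function names are hypothetical, and the real script may be organized differently. KEGG identifies human genes by NCBI Gene ID, so a separate ID-to-symbol mapping step (not shown, and not described in this document) would be needed to join against the network genes.

```python
def parse_kegg_table(text):
    """Split KEGG REST tab-separated output into (left, right) string pairs."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            left, right = line.split("\t")[:2]
            rows.append((left.strip(), right.strip()))
    return rows


def strip_prefix(kegg_id):
    """Drop a database prefix such as 'hsa:' or 'path:' if present."""
    return kegg_id.split(":", 1)[-1]


def long_form(link_text, list_text, organism_suffix=" - Homo sapiens (human)"):
    """Join gene-to-pathway links with pathway names into long-form rows."""
    names = {}
    for pathway_id, name in parse_kegg_table(list_text):
        if name.endswith(organism_suffix):
            name = name[: -len(organism_suffix)]
        names[strip_prefix(pathway_id)] = name
    rows = []
    for gene_id, pathway_id in parse_kegg_table(link_text):
        pid = strip_prefix(pathway_id)
        rows.append((strip_prefix(gene_id), pid, names.get(pid, "")))
    return rows


# Small samples in the response format of link/pathway/hsa and list/pathway/hsa
# (7299 and 4286 are the NCBI Gene IDs of TYR and MITF).
SAMPLE_LINK = (
    "hsa:7299\tpath:hsa00350\n"
    "hsa:7299\tpath:hsa04916\n"
    "hsa:4286\tpath:hsa04916\n"
)
SAMPLE_LIST = (
    "hsa00350\tTyrosine metabolism - Homo sapiens (human)\n"
    "hsa04916\tMelanogenesis - Homo sapiens (human)\n"
)

rows = long_form(SAMPLE_LINK, SAMPLE_LIST)
for row in rows:
    print(row)
```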

Each gene is assigned a single primary pathway using this priority order (first match wins, so pigmentation-specific Melanogenesis outranks generic signaling cascades):

Priority Pathway KEGG ID
1 Melanogenesis hsa04916
2 MAPK signaling hsa04010
3 PI3K-Akt signaling hsa04151
4 Apoptosis hsa04210
5 Cytokine-cytokine receptor hsa04060
6 Wnt signaling hsa04310
7 JAK-STAT signaling hsa04630
8 NF-κB signaling hsa04064
9 Other (in KEGG) (any pathway, none above)
10 Not in KEGG (gene absent from KEGG hsa)
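The priority rule above amounts to scanning an ordered list and returning the first pathway the gene belongs to. A minimal sketch, assuming each gene's KEGG memberships are available as a collection of pathway IDs (the function name and input shape are hypothetical):

```python
# Priority order copied from the table above (highest priority first).
# "NF-kB signaling" is spelled as in the gene-count table.
PRIORITY = [
    ("hsa04916", "Melanogenesis"),
    ("hsa04010", "MAPK signaling"),
    ("hsa04151", "PI3K-Akt signaling"),
    ("hsa04210", "Apoptosis"),
    ("hsa04060", "Cytokine-cytokine receptor"),
    ("hsa04310", "Wnt signaling"),
    ("hsa04630", "JAK-STAT signaling"),
    ("hsa04064", "NF-kB signaling"),
]


def kegg_primary_pathway(pathway_ids):
    """Return the single primary pathway label for one gene.

    pathway_ids: iterable of KEGG pathway IDs the gene belongs to, or
    None / empty if the gene is absent from KEGG hsa.
    """
    ids = set(pathway_ids or ())
    if not ids:
        return "Not in KEGG"
    for pathway_id, label in PRIORITY:
        if pathway_id in ids:
            return label  # first match wins
    return "Other (in KEGG)"


print(kegg_primary_pathway(["hsa04010", "hsa04916"]))
print(kegg_primary_pathway(["hsa00350"]))
print(kegg_primary_pathway([]))
```

A gene annotated to both MAPK signaling and Melanogenesis is labeled Melanogenesis, which is the intended effect of putting the pigmentation-specific pathway first.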

Network gene counts under each scheme

GTEx tissue category:
gtex_tissue_category     Genes
Housekeeping             61
Broad                    31
Other-restricted         14
Skin-restricted          7
Immune-restricted        7
Reproductive-restricted  4
Brain-restricted         3
Liver-restricted         1

KEGG primary pathway:
kegg_primary_pathway        Genes
MAPK signaling              36
Other (in KEGG)             29
Melanogenesis               23
PI3K-Akt signaling          14
Apoptosis                   10
Not in KEGG                 7
Cytokine-cytokine receptor  4
JAK-STAT signaling          3
NF-kB signaling             2
Wnt signaling               1

Cross-tabulation: GTEx category × KEGG pathway

kegg_primary_pathway Apoptosis Cytokine-cytokine receptor JAK-STAT signaling MAPK signaling Melanogenesis NF-kB signaling Not in KEGG Other (in KEGG) PI3K-Akt signaling Wnt signaling
gtex_tissue_category
Brain-restricted 0 0 0 0 1 0 0 2 0 0
Broad 1 0 1 9 7 1 0 7 5 0
Housekeeping 8 0 2 21 11 0 1 9 8 1
Immune-restricted 1 3 0 3 0 0 0 0 0 0
Liver-restricted 0 0 0 0 0 0 0 1 0 0
Other-restricted 0 1 0 1 1 1 2 7 1 0
Reproductive-restricted 0 0 0 2 0 0 0 2 0 0
Skin-restricted 0 0 0 0 3 0 3 1 0 0

Data provenance

File Description Source
data/phylop_scores.csv Mean PhyloP 100-way per gene (hg38) UCSC REST API via analysis/fetch_phylop_scores.py
data/GTEx_v8_gene_median_tpm.gct.gz GTEx v8 median TPM, 54 tissues GTEx Portal (auto-downloaded)
data/network_constraint_phylop.csv LOEUF + PhyloP merged analysis/phase1_phylop_analysis.py
data/network_constraint_gtex.csv LOEUF + tissue breadth merged analysis/phase1_gtex_analysis.py
data/gtex_tissue_membership.csv Per-gene boolean expression in all 54 GTEx tissues analysis/phase1_gtex_upset.py
output/table_phase1_gtex_per_tissue.csv ΔLOEUF and Mann-Whitney p per tissue analysis/phase1_gtex_extras.py
data/kegg_pathway_lists.csv Long-form gene × KEGG pathway membership with names analysis/fetch_kegg_pathways.py
data/network_constraint_categorized.csv Network LOEUF + tau, n_expr, max_tissue, gtex_tissue_category, kegg_primary_pathway analysis/phase1_new_categories.py