5. Pathway analysis

Pathway analysis methods provide meaningful insights to better understand how differentially expressed genes (DEG) impact biological processes. Let’s see how we can run Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) on our data, using gseapy package. As the returned objects are pretty big, we delete them at the end of each section for the notebook to save memory.

For more details about the functions used here, you can refer to gseapy documentation, and for more information about GSEA in general (input data etc) : gsea-msigdb website

[1]:

from pylluminator.utils import load_object
from pylluminator.dm import combine_p_values_stouffer
from pylluminator.utils import set_logger
from pylluminator.visualizations import visualize_gene

import numpy as np
import gseapy as gp
import networkx as nx
import matplotlib.pyplot as plt

set_logger('WARNING')  # set the verbosity level, can be DEBUG, INFO, WARNING, ERROR

5.1. Data preparation

To run ORA and Pre-rank methods, we need to load DMPs or DMRs. If you haven’t done it yet, check out notebook 3 - DMPs and DMRs.

For GSEA method, we use samples data directly - we use the samples stored in your DM object.

[2]:

my_dms = load_object('dms')  # load a DM object

[3]:

# get the genes associated with each probe
annotation_colname = 'genes'
gene_info = my_dms.samples.annotation.probe_infos[['probe_id', annotation_colname]].drop_duplicates().dropna()
# if some probes are associated to several genes, make it one row per gene. Make sure genes are upper case for GSEApy compatibilty
gene_info[annotation_colname] = gene_info[annotation_colname].apply(lambda x: x.upper().split(';'))
gene_info = gene_info.explode(annotation_colname).drop_duplicates()
gene_info.head()

[3]:

	probe_id	genes
illumina_id
41791408	cg00000029_TC21	ENSG00000302917
66725308	cg00000109_TC21	FNDC3B
87669537	cg00000155_BC21	BRAT1
47668938	cg00000158_BC21	IARS1
45649248	cg00000165_TC21	ENSG00000287076

For each gene, we use the DMPs or the DMRs to compute the fold change (average beta difference between the two conditions) and the significance (by combining the p-values of all associated probes using Stouffer’s method).

[4]:

# chose whether to use DMPs or DMRs
input_type = 'DMP'

# if we work on DMRs, add the probe_ids to the DMRs - for DMPs, just use the DMPs dataframe directly
dm_df = my_dms.dmr.join(my_dms.segments.reset_index().set_index('segment_id')) if input_type == 'DMR' else my_dms.dmp
# add the gene information
dm_df = dm_df.merge(gene_info, on='probe_id')

[5]:

# set the columns to use to select or rank genes (you can check available columns with dm_df.columns)
significance_colname = 'sample_type[T.PREC]_p_value_adjusted'
fold_change_colname = 'sample_type[T.PREC]_avg_beta_delta'
# set the column of the sample sheet that define the type of sample
class_colname = 'sample_type'

# keep only useful columns and remove NAs
dm_df = dm_df[[annotation_colname, fold_change_colname, significance_colname]].dropna()
# aggregate values for each gene
gene_fc_sig_df = dm_df.groupby(annotation_colname).agg({fold_change_colname: 'mean', significance_colname: combine_p_values_stouffer})

gene_fc_sig_df.head()

[5]:

	sample_type[T.PREC]_avg_beta_delta	sample_type[T.PREC]_p_value_adjusted
genes
A1BG	4.272091	3.344050e-04
A1BG-AS1	5.190440	6.649850e-08
A2M	0.126499	6.235096e-01
A2ML1	0.019203	8.019561e-01
A2ML1-AS1	-1.116004	8.374415e-04

[6]:

dm_df[dm_df[significance_colname] < 0.05].sort_values(fold_change_colname, ascending=False)

[6]:

	genes	sample_type[T.PREC]_avg_beta_delta	sample_type[T.PREC]_p_value_adjusted
48550	ADD3-AS1	11.131983	0.000006
48551	ADD3	11.131983	0.000006
48909	MICALL2	11.052155	0.000007
54463	ENSG00000249526	10.982889	0.000006
47863	ENSG00000309175	10.912512	0.000007
...	...	...	...
51252	SHISA9	-10.041313	0.000007
42354	WSCD1	-10.047127	0.000012
67052	REXO1L5P	-10.138548	0.000007
50033	TMEM132D	-10.182215	0.000006
5064	ENSG00000296007	-10.659273	0.000012

31828 rows × 3 columns

[7]:

gene_fc_sig_df[gene_fc_sig_df[significance_colname] < 0.05].sort_values(fold_change_colname, ascending=False)

[7]:

	sample_type[T.PREC]_avg_beta_delta	sample_type[T.PREC]_p_value_adjusted
genes
CSF1	10.830772	3.274393e-06
GPR88	10.662909	1.103371e-04
CLEC14A	10.336624	1.010700e-05
ENSG00000301685	10.303629	4.258591e-05
ENSG00000298776	10.254025	1.820109e-05
...	...	...
REXO1L5P	-9.414345	4.298256e-10
ENSG00000286749	-9.451847	9.377621e-06
NMUR2	-9.451847	9.377621e-06
ENSG00000259862	-9.483982	8.264240e-06
ENSG00000296007	-10.154915	8.800554e-10

11235 rows × 2 columns

For GSEA, we compute the mean beta values per gene, for each sample

[8]:

probes_betas_df = my_dms.samples.get_betas().reset_index().set_index('probe_id').drop(columns=['type', 'channel', 'probe_type'])
genes_betas_df = probes_betas_df.join(gene_info.set_index('probe_id'), how='right').groupby(annotation_colname).mean()

# get the list of sample types, ordered like the beta dataframe
sample_info = my_dms.samples.sample_sheet.set_index(my_dms.samples.sample_label_name)
sample_types = sample_info.loc[probes_betas_df.columns, class_colname]

genes_betas_df.head()

[8]:

	LNCAP_500_1	LNCAP_500_2	LNCAP_500_3	PREC_500_1	PREC_500_2	PREC_500_3
genes
5S_RRNA	0.966769	0.958945	0.962262	0.957187	0.953667	0.960831
7SK	NaN	NaN	NaN	NaN	NaN	NaN
A1BG	0.585800	0.585304	0.550419	0.227678	0.211937	0.223345
A1BG-AS1	0.813419	0.808735	0.808907	0.472105	0.458322	0.467913
A1CF	0.331372	0.354633	0.322745	0.779023	0.780717	0.784422

[9]:

del dm_df
del probes_betas_df
del gene_info

5.1.1. Gene set selection

You will first need to select the gene set(s) to use in you pathway analysis. You can browse enrichr website or use the gp.get_library_name() function to list available libraries

[10]:

gene_sets = ['GO_Biological_Process_2025', 'KEGG_2021_Human']

5.1.2. Organism specification

Specify the organism you’re working on: Human, Mouse, Yeast, Fly, Fish or Worm

[11]:

organism = 'human'

5.2. Over-Representation Analysis

5.2.1. ORA without background

The minimal input you need to run an ORA is a list of differentially expressed genes (DEG) from your dataset. Here we use the threshold of p-value < 0.02 and abs(fold-change) > 0.85

[12]:

deg = list(set(gene_fc_sig_df[(gene_fc_sig_df[significance_colname] < 0.02) & (abs(gene_fc_sig_df[fold_change_colname]) > 0.85)].index.to_numpy()))
print(f'Number of genes selected: {len(deg)}/{len(set(gene_fc_sig_df.index))}\n')

enr = gp.enrichr(gene_list=deg, gene_sets=gene_sets, organism=organism)

# output most significant results
enr.results.sort_values(['Adjusted P-value', 'P-value']).head()

Number of genes selected: 8051/21863

[12]:

	Gene_set	Term	Overlap	P-value	Adjusted P-value	Odds Ratio	Combined Score	Genes
4770	KEGG_2021_Human	Maturity onset diabetes of the young	17/26	0.008432	0.999997	2.807236	13.406480	PKLR;ONECUT1;PDX1;HNF1B;PAX6;HNF1A;MAFA;GCK;IN...
4771	KEGG_2021_Human	Neomycin, kanamycin and gentamicin biosynthesis	3/5	0.321838	0.999997	2.226702	2.524427	HK3;HKDC1;GCK
4772	KEGG_2021_Human	Arrhythmogenic right ventricular cardiomyopathy	33/77	0.360780	0.999997	1.113588	1.135288	ITGB1;RYR2;LEF1;TCF7;ATP2A2;CACNA1D;SLC8A1;ACT...
4773	KEGG_2021_Human	Nitrogen metabolism	8/17	0.367660	0.999997	1.319574	1.320361	CA12;CA2;CA5A;CA4;CA7;CA6;CA9;CA13
4774	KEGG_2021_Human	ECM-receptor interaction	37/88	0.404867	0.999997	1.077100	0.973910	ITGB1;LAMC3;LAMA3;TNC;DMP1;THBS2;THBS1;THBS4;C...

Adjusted P-Values are too high to extract any significant pathway. One reason can be that we are working on a limited number of probes (in notebook 4, we selected only 10% of the probes to compute the DMP, to speed up the demo - you can try with all the probes and). We can also try a different approach, ORA with background, to focus our analysis on the genes we are actually studying.

[13]:

# Here is the code to visualize the results. Use the cutoff parameter to adjust the p-value threshold
# ax = gp.dotplot(enr.results, x='Gene_set', size=3, top_term=5, title='ORA without background', xticklabels_rot=30, show_ring=True, figsize=(5,5))
# # use smaller fonts
# for item in ax.get_xticklabels() + ax.get_yticklabels():
#     item.set_fontsize(8)
# ax.xaxis.label.set_size(10)
# ax.title.set_size(15)

5.2.2. ORA with background

By default, selected genes are tested against all genes available in the gene sets. For better results, it can be relevant to narrow it down to a list of genes of interest.

Let’s use the genes that were detected in our experiment as background genes, and see how it changes the result.

[14]:

background_genes = list(set(gene_fc_sig_df.index))
print('Number of selected genes: ', len(deg))
print('Number of background genes: ', len(background_genes), '\n')

enr_bg = gp.enrichr(gene_list=deg, gene_sets=gene_sets, organism=organism, background=background_genes)
# output top 5 results results
enr_bg.results.sort_values('Adjusted P-value')[:5]

Number of selected genes:  8051
Number of background genes:  21863

[14]:

	Gene_set	Term	P-value	Adjusted P-value	Odds Ratio	Combined Score	Genes
4770	KEGG_2021_Human	Neuroactive ligand-receptor interaction	4.249144e-14	1.338480e-11	2.919893	89.901983	CHRM2;CHRM3;OXTR;NPFFR1;CHRM4;NR3C1;RXFP1;RXFP...
4771	KEGG_2021_Human	Cytokine-cytokine receptor interaction	1.090152e-07	1.716989e-05	2.571359	41.223459	ACVRL1;CNTFR;CXCL6;IL25;TNFRSF13B;CSF1;EPO;TNF...
4772	KEGG_2021_Human	Viral protein interaction with cytokine and cy...	3.331175e-05	3.497734e-03	4.299153	44.322547	CX3CR1;CXCL6;CSF1;TNF;CXCL2;CX3CL1;CXCL5;IL18R...
4773	KEGG_2021_Human	Maturity onset diabetes of the young	4.548521e-05	3.581961e-03	7.304456	73.030852	PKLR;ONECUT1;PDX1;HNF1B;PAX6;HNF1A;MAFA;GCK;IN...
4775	KEGG_2021_Human	Complement and coagulation cascades	2.738363e-04	1.437641e-02	3.095040	25.388550	CFD;SERPINA1;C1R;SERPINE1;C5AR1;F13A1;PLAT;SER...

Visualize the top 5 terms for each gene set, with a significance cutoff set at 0.1

[15]:

ax = gp.dotplot(enr_bg.results, x='Gene_set', size=2, top_term=5, figsize=(5, 5), y_order=True, title='ORA with background', xticklabels_rot=30, cutoff=0.1)
# use smaller fonts
for item in ax.get_xticklabels() + ax.get_yticklabels():
    item.set_fontsize(8)
ax.xaxis.label.set_size(10)
ax.title.set_size(15)

../_images/tutorials_5_-_Pathway_analysis_25_0.png

[16]:

del deg
del enr
del enr_bg

5.3. Pre-rank GSEA

As we have already calculate DMPs an DMRs, we can use them to rank the genes and use the pre-rank method provided by gseapy.

Here we use the formula sign(FC) * - log10(significance) to rank the genes, that bring the most upregulated genes at the beginning of the list, the most downregulated at the end, and the less differentialy expressed genes in the middle.

[17]:

def rank_formula(row):
    if row[significance_colname] == 0:
        return np.sign(row[fold_change_colname]) * np.inf
    return np.sign(row[fold_change_colname]) * -np.log10(row[significance_colname])

rank_data = gene_fc_sig_df.apply(rank_formula, axis=1).sort_values(ascending=False)

# replace infinity values by the defined min and max
max_val = rank_data[rank_data != np.inf].iloc[0]
min_val = rank_data[rank_data != -np.inf].iloc[-1]
rank_data[rank_data == np.inf] = max_val
rank_data[rank_data == -np.inf] = min_val

[18]:

# we chose a low permutation number to speed up the demo
pre_res = gp.prerank(rnk=rank_data, gene_sets=gene_sets, verbose=True, permutation_num=500, max_size=300, threads=4)

2026-04-08 09:12:20,280 [WARNING] Duplicated values found in preranked stats: 5.58% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2026-04-08 09:12:20,281 [INFO] Parsing data files for GSEA.............................
2026-04-08 09:12:20,402 [INFO] Downloading and generating Enrichr library gene sets......
2026-04-08 09:12:23,430 [INFO] Downloading and generating Enrichr library gene sets......
2026-04-08 09:12:24,064 [INFO] 3468 gene_sets have been filtered out when max_size=300 and min_size=15
2026-04-08 09:12:24,065 [INFO] 2195 gene_sets used for further statistical testing.....
2026-04-08 09:12:24,065 [INFO] Start to run GSEA...Might take a while..................
2026-04-08 09:13:39,253 [INFO] Congratulations. GSEApy runs successfully................

Order the pathways by their False Discovery Rate (FDR), and show the ones that have a FDR lower than 0.05

[19]:

pre_res.res2d = pre_res.res2d.sort_values('FDR q-val').reset_index(drop=True)
pre_res.res2d[pre_res.res2d['FDR q-val'] < 0.05]

[19]:

	Name	Term	ES	NES	NOM p-val	FDR q-val	FWER p-val	Tag %	Gene %	Lead_genes
0	prerank	GO_Biological_Process_2025__Anterior/Posterior...	0.788043	1.786364	0.0	0.001958	0.002	23/49	10.44%	HOXA3;HOXB3;HOXD3;HOXC4;GATA4;WT1;SKI;HELT;HOX...
1	prerank	GO_Biological_Process_2025__Epithelium Develop...	0.713769	1.726445	0.0	0.012137	0.054	32/104	12.64%	WT1;SALL1;PAX2;SIX1;TBX5;TGFB2;TBX2;NKX2-5;WNT...
2	prerank	GO_Biological_Process_2025__Embryonic Limb Mor...	0.805667	1.73306	0.002304	0.013213	0.048	14/33	11.37%	SALL1;TBX5;RARG;SKI;TGFB2;TBX2;SHH;TBX4;ECE1;O...
3	prerank	GO_Biological_Process_2025__Mismatch Repair (G...	0.860914	1.738964	0.0	0.013703	0.038	1/21	0.51%	TP73
4	prerank	GO_Biological_Process_2025__Branching Involved...	0.894966	1.694931	0.0	0.018053	0.144	9/15	7.08%	WT1;SALL1;PAX2;BMP4;SIX1;SHH;PAX8;HOXD11;GDNF
5	prerank	GO_Biological_Process_2025__Renal System Devel...	0.75731	1.690884	0.0	0.018596	0.158	14/43	6.70%	WT1;SALL1;PAX2;BMP4;SIX1;TFAP2B;GATA3;TGFB2;HN...
6	prerank	GO_Biological_Process_2025__Ureteric Bud Morph...	0.901892	1.710776	0.0	0.018923	0.1	10/15	7.08%	WT1;SALL1;PAX2;BMP4;SIX1;GATA3;SHH;PAX8;HOXD11...
7	prerank	GO_Biological_Process_2025__Bicellular Tight J...	0.813137	1.686176	0.002262	0.019397	0.182	8/27	6.49%	CLDN6;RAMP2;CLDN10;TBCD;CLDN14;CLDN5;MICALL2;P...
8	prerank	GO_Biological_Process_2025__Embryonic Skeletal...	0.800425	1.701648	0.0	0.019575	0.124	17/30	7.15%	HOXA3;HOXB3;HOXD3;HOXC4;SIX1;HOXC9;HOXB4;COL2A...
9	prerank	GO_Biological_Process_2025__Kidney Development...	0.73744	1.696753	0.0	0.01982	0.142	19/55	7.08%	WT1;SALL1;PAX2;BMP4;SIX1;TFAP2B;GATA3;TGFB2;GD...
10	prerank	GO_Biological_Process_2025__Ear Morphogenesis ...	0.848769	1.739532	0.0	0.020554	0.038	13/21	10.24%	PAX2;SIX1;ZIC1;PAX8;OSR2;GSC;TBX1;TMIE;OSR1;TF...
11	prerank	GO_Biological_Process_2025__Dopaminergic Neuro...	0.803457	1.666185	0.0	0.033604	0.324	12/24	8.69%	WNT3A;EN1;LMX1A;WNT3;SHH;PHOX2B;LMX1B;FGF8;OTX...
12	prerank	GO_Biological_Process_2025__Metanephros Develo...	0.808404	1.6632	0.0	0.03388	0.344	12/23	12.01%	WT1;PAX2;SIX1;GDF6;SHH;PAX8;OSR2;FGF8;OSR1;GDN...
13	prerank	GO_Biological_Process_2025__Embryonic Digit Mo...	0.855405	1.656641	0.002577	0.036913	0.376	7/16	8.32%	SALL1;TBX2;SHH;ECE1;OSR2;OSR1;MSX1
14	prerank	GO_Biological_Process_2025__Apical Junction As...	0.780649	1.650685	0.002262	0.040977	0.44	8/32	6.49%	CLDN6;RAMP2;CLDN10;TBCD;CLDN14;CLDN5;MICALL2;P...
15	prerank	GO_Biological_Process_2025__Regulation of Smoo...	0.745976	1.641383	0.0	0.049916	0.504	12/39	12.61%	RAB34;GLIS2;SHH;MGRN1;PTCH2;ZIC1;OTX2;FGFR2;KI...

Now we can visualize the most significant term

[20]:

term = pre_res.res2d.Term[0]
fig = pre_res.plot(terms=term, figsize=(10, 5))
# use smaller fonts
for ax in fig.axes:
    ax.set_xlabel(ax.get_xlabel(), fontsize=10)
    ax.set_ylabel(ax.get_ylabel(), fontsize=10)
_ = fig.suptitle(term, fontsize=15)

../_images/tutorials_5_-_Pathway_analysis_33_0.png

Or visualize the top 5 terms on the same plot:

[21]:

fig = pre_res.plot(terms=pre_res.res2d.Term[:5],figsize=(3,4))
# use smaller fonts
for ax in fig.axes:
    ax.set_xlabel(ax.get_xlabel(), fontsize=10)
    ax.set_ylabel(ax.get_ylabel(), fontsize=10)

../_images/tutorials_5_-_Pathway_analysis_35_0.png

[22]:

ax = gp.dotplot(pre_res.res2d, column='FDR q-val', title='Pre-ranked GSEA', cmap=plt.cm.viridis, size=3, figsize=(3, 4), show_ring=True)
# use smaller fonts
for item in ax.get_xticklabels() + ax.get_yticklabels():
    item.set_fontsize(8)
ax.xaxis.label.set_size(10)
ax.title.set_size(15)

../_images/tutorials_5_-_Pathway_analysis_36_0.png

[23]:

term_idx = 0  # chose the first term
genes = pre_res.res2d.Lead_genes[term_idx].split(";")
ax = gp.heatmap(df = genes_betas_df.loc[genes], z_score=0, title=pre_res.res2d.Term[term_idx], figsize=(7,6), yticklabels=False)
# use smaller fonts and show all labels
genes.reverse()  #for labels to be in the right order
ax.title.set_size(15)
_ = ax.set_xticks(range(len(genes_betas_df.columns)), labels=genes_betas_df.columns, rotation=30, ha="center",  fontsize=8)
_ = ax.set_yticks(range(len(genes)), labels=genes, va="bottom", fontsize=8)

../_images/tutorials_5_-_Pathway_analysis_37_0.png

Using pylluminator’s function, we can go back to checking the beta values of the probes associated to a given gene

[24]:

visualize_gene(my_dms.samples, 'PAX2', figsize=(20, 6))

../_images/tutorials_5_-_Pathway_analysis_39_0.png

Finally, compute the enrichment map to show the relations between the detected relevant pathways

[25]:

nodes, edges = gp.enrichment_map(pre_res.res2d, cutoff=0.1)  # chose a higher p-value cutoff to show more nodes, default is 0.05

# build graph
G = nx.from_pandas_edgelist(edges,
                            source='src_idx',
                            target='targ_idx',
                            edge_attr=['jaccard_coef', 'overlap_coef', 'overlap_genes'])

nodes = nodes.loc[G.nodes.keys()]  # remove nodes that are not connected by any edge

fig, ax = plt.subplots(figsize=(10, 6), layout='constrained')

# init node coordinates
pos=nx.layout.bfs_layout(G, start=nodes.index[0])

# add a colorbar
nodes_cmap = plt.cm.RdYlBu
max_NES = max(abs(min(nodes.NES)), max(nodes.NES))
sm = plt.cm.ScalarMappable(cmap=nodes_cmap, norm=plt.Normalize(vmin=-max_NES, vmax=max_NES))
sm.set_array([])
cbar = plt.colorbar(sm, ax=ax, shrink=0.25)
cbar.set_label('NES (Normalized Enrichment Score)', rotation=270, labelpad=15)

# draw nodes
nx.draw_networkx_nodes(G, pos=pos,
                       cmap=nodes_cmap, node_color=nodes.NES, vmin=-max_NES, vmax=max_NES,
                       margins=0.3, node_size=list(nodes.Hits_ratio *1000))
# draw node labels - wrap labels so that they don't overlap
max_length = 35
wrapped_labels = {k: "\n".join([v[i:i+max_length] for i in range(0, len(v), max_length)]) for k, v in nodes.Term.to_dict().items()}
nx.draw_networkx_labels(G, pos=pos, font_size= 6, labels=wrapped_labels)

# draw edges
edge_weight = nx.get_edge_attributes(G, 'jaccard_coef').values()
nx.draw_networkx_edges(G, pos=pos, width=list(map(lambda x: x*10, edge_weight)))

plt.axis('off')
plt.show()

../_images/tutorials_5_-_Pathway_analysis_41_0.png

[26]:

del gene_fc_sig_df
del pre_res
del my_dms

5.4. GSEA

Since we are working with two groups (healthy control cells and prostate cancer cells), we can directly use the GSEA function on our ‘raw’ data (here, the genes beta values) with the corresponding phenotype dataframe (sample_types), to find the enriched pathways. Uncomment the following code to run the analysis.

[27]:

# gs = gp.GSEA(data=genes_betas_df.dropna(), gene_sets=gene_sets, classes=sample_types, permutation_num=1000)
# gs.pheno_neg = 'PREC'  # control samples
# gs.pheno_pos = 'LNCAP'  # samples of prostate cancer cells
# gs.run()

Visualize the top 5 identified pathways

[28]:

# _ = gs.plot(gs.res2d.Term[:5])

Visualize, for a given pathway and for each gene, the samples z-score.

[29]:

# term_idx = 0  # chose the first term
# genes = gs.res2d.Lead_genes[term_idx].split(";")
# ax = gp.heatmap(df = genes_betas_df.loc[genes], z_score=0, title=gs.res2d.Term[term_idx], figsize=(8,4))