DM

class pylluminator.dm.DM(samples: Samples, formula: str, reference_value: dict | None = None, custom_sheet: None | DataFrame = None, drop_na=False, apply_mask=True, probe_ids: None | list[str] = None, group_column: str | None = None)

Bases: object

Methods

`__init__`(samples, formula[, ...])	Initialize the object by calculating the Differentially Methylated Probes (DMPs).
`compute_dmp`(samples, formula[, ...])	Find Differentially Methylated Probes (DMPs) by fitting an Ordinary Least Squares (OLS) model for each probe, following the given formula.
`compute_dmr`([contrast, dist_cutoff, ...])	Find Differentially Methylated Regions (DMRs) based on the Euclidean distance between beta values.
`get_top`(dm_type, contrast[, chromosome_col, ...])	Get the top DMPs or DMRs, ranked by the p-value of the given contrast.
`select_dmps`([effect_size_th, p_value_th, ...])	Select DMPs based on effect size and p-value thresholds.

Methods and attributes detail

__init__(samples: Samples, formula: str, reference_value: dict | None = None, custom_sheet: None | DataFrame = None, drop_na=False, apply_mask=True, probe_ids: None | list[str] = None, group_column: str | None = None)

Initialize the object by calculating the Differentially Methylated Probes (DMPs). It fits an Ordinary Least Squares (OLS) model for each probe, following the given formula. The predictors used in the formula are column names of the sample sheet. If a group column name is given, a Linear Mixed Model (LMM) is used instead to account for random effects. The p-values are adjusted with the Benjamini-Hochberg procedure.

More info on design matrices and formulas:

Parameters:

samples (Samples) – samples to use
formula (str) – R-like formula used in the design matrix to describe the statistical model. e.g. ‘~age + sex’
reference_value (dict | None) – reference value for each predictor. Dictionary where keys are the predictor names, and values are their reference value. For example, {‘sex’: ‘female’} to set females as the reference. Default: None
custom_sheet (pandas.DataFrame | None) – a sample sheet to use instead of the samples’ own sheet. Useful if you want to filter the samples to use. Default: None
drop_na (bool) – drop probes that have NA values. Default: False
apply_mask (bool) – set to True to apply mask. Default: True
probe_ids (list[str] | None) – list of probe IDs to use. Useful to work on a subset for testing purposes. Default: None
group_column (str | None) – name of the column of the sample sheet that holds replicates information. If provided, a Linear Mixed Model will be used to account for replicates instead of an Ordinary Least Squares model. Default: None

Returns:

dataframe with probes as rows and p-values and model estimates in columns, list of contrast levels

Return type:

pandas.DataFrame, list[str]
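The per-probe fit described above can be illustrated with a minimal, pure-Python sketch. It is not the library's implementation: it handles a single numeric predictor, skips p-values, and the names `ols_fit`, `sex` and `betas` are hypothetical.

```python
from statistics import mean

def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: one predictor value per sample (0 = reference level, 1 = other level)
sex = [0, 0, 1, 1]
# Hypothetical beta values, one list per probe, in the same sample order
betas = {
    "probe_a": [0.10, 0.20, 0.50, 0.60],
    "probe_b": [0.30, 0.30, 0.30, 0.30],
}

# One independent model per probe
estimates = {probe: ols_fit(sex, values) for probe, values in betas.items()}
```

With a 0/1 predictor, the intercept is the mean beta value of the reference level and the slope is the difference between the two levels, which is why the choice of reference value changes how the estimates read.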

compute_dmp(samples: Samples, formula: str, reference_value: dict | None = None, custom_sheet: None | DataFrame = None, drop_na=False, apply_mask=True, probe_ids: None | list[str] = None, group_column: str | None = None)

Find Differentially Methylated Probes (DMPs) by fitting an Ordinary Least Squares (OLS) model for each probe, following the given formula. The predictors used in the formula are column names of the sample sheet. If a group column name is given, a Linear Mixed Model (LMM) is used instead to account for random effects. The p-values are adjusted with the Benjamini-Hochberg procedure.

More info on design matrices and formulas:

Parameters:

samples (Samples) – samples to use
formula (str) – R-like formula used in the design matrix to describe the statistical model. e.g. ‘~age + sex’
reference_value (dict | None) – reference value for each predictor. Dictionary where keys are the predictor names, and values are their reference value. For example, {‘sex’: ‘female’} to set females as the reference. Default: None
custom_sheet (pandas.DataFrame | None) – a sample sheet to use instead of the samples’ own sheet. Useful if you want to filter the samples to use. Default: None
drop_na (bool) – drop probes that have NA values. Default: False
apply_mask (bool) – set to True to apply mask. Default: True
probe_ids (list[str] | None) – list of probe IDs to use. Useful to work on a subset for testing purposes. Default: None
group_column (str | None) – name of the column of the sample sheet that holds replicates information. If provided, a Linear Mixed Model will be used to account for replicates instead of an Ordinary Least Squares model. Default: None

Returns:

dataframe with probes as rows and p-values and model estimates in columns, list of contrast levels

Return type:

pandas.DataFrame, list[str]
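The Benjamini-Hochberg adjustment mentioned above can be sketched in pure Python as follows. This shows the standard procedure only, not the library's own code, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values, one per probe
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
```

Each p-value is scaled by the number of tests divided by its rank, then capped by the adjusted value of the next larger p-value, so the adjusted values never decrease as the raw p-values increase.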

compute_dmr(contrast: str | list[str] | None = None, dist_cutoff: float | None = None, seg_per_locus: float = 0.5, probe_ids: None | list[str] = None)

Find Differentially Methylated Regions (DMRs) based on the Euclidean distance between beta values.

Parameters:

contrast (str | list[str] | None) – contrast(s) to use for DMRs detection
dist_cutoff (float | None) – cutoff used to find change points between DMRs, applied to the Euclidean distance between beta values. If set to None (default), it is calculated from the seg_per_locus parameter value. Default: None
seg_per_locus (float) – used if dist_cutoff is not set: defines which quantile should be used as the distance cutoff. Higher values lead to more segments. Should be 0 < seg_per_locus < 1. Default: 0.5
probe_ids (list[str] | None) – list of probe IDs to use. Useful to work on a subset for testing purposes. Default: None
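The interplay between dist_cutoff and seg_per_locus can be illustrated with a small pure-Python sketch. It rests on an assumption that is not stated in this reference: the cutoff is taken as the (1 - seg_per_locus) quantile of the distances between consecutive probes, which is consistent with higher values producing more segments. The helpers `quantile` and `segment_probes` are hypothetical names.

```python
import math

def quantile(values, q):
    """Linear-interpolation quantile of a list, for 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    low, high = math.floor(pos), math.ceil(pos)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)

def segment_probes(betas, dist_cutoff=None, seg_per_locus=0.5):
    """Assign a segment ID to each probe.

    betas holds one list of per-sample beta values per probe,
    ordered by genomic position.
    """
    # Euclidean distance between each probe and the next one
    distances = [math.dist(a, b) for a, b in zip(betas, betas[1:])]
    if dist_cutoff is None:
        # Assumption: a higher seg_per_locus lowers the cutoff
        dist_cutoff = quantile(distances, 1 - seg_per_locus)
    segment_ids = [0]
    for d in distances:
        # A distance above the cutoff marks a change point
        segment_ids.append(segment_ids[-1] + (1 if d > dist_cutoff else 0))
    return segment_ids

# Hypothetical beta values for four consecutive probes and two samples
betas = [[0.25, 0.25], [0.25, 0.5], [0.75, 0.5], [0.75, 0.5]]
default_segments = segment_probes(betas)
finer_segments = segment_probes(betas, seg_per_locus=0.9)
```

In this toy input, the default setting splits the probes into two segments, while seg_per_locus=0.9 lowers the cutoff and yields three.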

get_top(dm_type: DM_TYPE | str, contrast: str, chromosome_col='chromosome', annotation_col: str = 'genes', n_dms=10, columns_to_keep: list[str] = None) → DataFrame | None

Get the top DMPs or DMRs, ranked by the p-value of the given contrast. By default, the results will be annotated with the genes associated with the probes in the DMPs/DMRs. You can control the annotation information with the annotation_col parameter.

Parameters:

dm_type (DM_TYPE | str) – type of Differentially Methylated object to get (DMPs or DMRs).
contrast (str) – contrast to use for ranking the DMPs or DMRs
chromosome_col (str) – name of the column holding the chromosome information. Default: ‘chromosome’
annotation_col (str) – name of the column holding the annotation information. Default: ‘genes’
n_dms (int) – number of DM probes/segments to return. Default: 10
columns_to_keep (list[str] | None) – list of columns to keep in the output dataframe. Default: None

Returns:

dataframe with the top DMPs or DMRs

Return type:

pandas.DataFrame | None
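The ranking step can be pictured with the following pure-Python sketch, which works on plain dictionaries rather than a DataFrame and ignores annotation. The function `top_by_p_value` and the column name `p_value` are hypothetical, not the library's actual names.

```python
def top_by_p_value(rows, p_value_col, n_dms=10):
    """Return the n_dms rows with the smallest p-value in p_value_col."""
    # Rows without a p-value for this contrast are skipped
    scored = [row for row in rows if row.get(p_value_col) is not None]
    return sorted(scored, key=lambda row: row[p_value_col])[:n_dms]

# Hypothetical results, one row per probe
rows = [
    {"probe": "cg01", "p_value": 0.03},
    {"probe": "cg02", "p_value": 0.001},
    {"probe": "cg03", "p_value": None},
    {"probe": "cg04", "p_value": 0.2},
]
top = top_by_p_value(rows, "p_value", n_dms=2)
```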

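select_dmps(effect_size_th: float | None = None, p_value_th: float | None = None, p_value_th_col: str | None = None, sort_by: str | None = None, ascending: bool = False) → DataFrame | None

Select DMPs based on effect size and p-value thresholds. If several p-value columns are available, you can specify which one to use with the p_value_th_col parameter. If not specified, the function will try to find a p-value column automatically.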

Parameters:

effect_size_th (float | None) – effect size threshold. Default: None
p_value_th (float | None) – p-value threshold. Must be between 0 and 1. Default: None
p_value_th_col (str | None) – name of the p-value column to use for filtering. If not specified, the function will try to find a p-value column automatically, taking either the F-statistic p-value if it exists, or the predictor p-value column if there is only one. Default: None
sort_by (str | None) – column name to use for sorting the results. If not specified, the effect_size column will be used. Default: None
ascending (bool) – set to True to sort values in ascending order. Default: False

Returns:

dataframe with the selected DMPs, or None if an error occurred

Return type:

pandas.DataFrame | None
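The threshold logic can be sketched in pure Python as below. This is an illustration under assumptions, not the library's implementation: rows are kept when their effect size is at least effect_size_th and their p-value is at most p_value_th (whether the real comparisons are strict is not stated here), and the name `select_rows` and the column names are hypothetical.

```python
def select_rows(rows, effect_size_th=None, p_value_th=None,
                p_value_col="p_value", sort_by="effect_size", ascending=False):
    """Filter rows on effect size and p-value, then sort them."""
    selected = rows
    if effect_size_th is not None:
        # Assumption: non-strict comparison on the effect size
        selected = [r for r in selected if r["effect_size"] >= effect_size_th]
    if p_value_th is not None:
        if not 0 <= p_value_th <= 1:
            raise ValueError("p_value_th must be between 0 and 1")
        # Assumption: non-strict comparison on the p-value
        selected = [r for r in selected if r[p_value_col] <= p_value_th]
    return sorted(selected, key=lambda r: r[sort_by], reverse=not ascending)

# Hypothetical DMP results, one row per probe
rows = [
    {"probe": "cg01", "effect_size": 0.30, "p_value": 0.001},
    {"probe": "cg02", "effect_size": 0.05, "p_value": 0.0005},
    {"probe": "cg03", "effect_size": 0.40, "p_value": 0.20},
    {"probe": "cg04", "effect_size": 0.25, "p_value": 0.01},
]
selected = select_rows(rows, effect_size_th=0.2, p_value_th=0.05)
```

Applying both thresholds keeps only probes that are both large in effect and statistically significant; here cg02 is dropped for its small effect size and cg03 for its high p-value.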