DM

class pylluminator.dm.DM(samples: Samples, formula: str, reference_value: dict | None = None, custom_sheet: None | DataFrame = None, drop_na=False, apply_mask=True, probe_ids: None | list[str] = None, group_column: str | None = None, use_m_values: bool = True)

Bases: object

Methods

`__init__`(samples, formula[, ...])	Initialize the object by calling the compute_dmp function, which estimates Differentially Methylated Probes (DMPs) by fitting an Ordinary Least Squares (OLS) linear regression model to each probe's methylation value (either M-values (default) or beta values), following the given formula.
`compute_dmp`(samples, formula[, ...])	Estimate Differentially Methylated Probes (DMPs) by fitting an Ordinary Least Squares (OLS) linear regression model to each probe's methylation value (either M-values (default) or beta values), following the given formula.
`compute_dmr`([contrast, dist_cutoff, ...])	Find Differentially Methylated Regions (DMRs) based on the Euclidean distance between beta values.
`get_top_dmp`([contrast, annotation_col, ...])	Get the top DMPs, ranked by the p-value of the given contrast.
`get_top_dmr`([contrast, chromosome_col, ...])	Get the top DMRs, ranked by the p-value of the given contrast.
`select_dmps`([effect_size_th, p_value_th, ...])	Select DMPs based on effect size and p-value thresholds.

Methods and attributes detail

__init__(samples: Samples, formula: str, reference_value: dict | None = None, custom_sheet: None | DataFrame = None, drop_na=False, apply_mask=True, probe_ids: None | list[str] = None, group_column: str | None = None, use_m_values: bool = True)

Initialize the object by calling the compute_dmp function, which estimates Differentially Methylated Probes (DMPs) by fitting an Ordinary Least Squares (OLS) linear regression model to each probe’s methylation value (either M-values (default) or beta values), following the given formula. The predictors used in the formula must be column names of the sample sheet. To account for random effects in the model (e.g. technical replicates, batch effects), set the group_column parameter to the name of the sample sheet column describing the random effect (e.g. “batch_id”). If this parameter is set, a Linear Mixed Model (LMM) is used instead of the OLS model. The Benjamini-Hochberg procedure is used to adjust the p-values.

More info on design matrices and formulas:

Parameters:

samples (Samples) – samples to use
formula (str) – R-like formula used in the design matrix to describe the statistical model. e.g. ‘~age + sex’
reference_value (dict | None) – reference value for each predictor. Dictionary where keys are the predictor names and values are their reference values. For example, {‘sex’: ‘female’} sets females as the reference. Default: None
custom_sheet (pandas.DataFrame | None) – a sample sheet to use. By default, the samples’ own sheet is used. Useful if you want to filter the samples used. Default: None
drop_na (bool) – drop probes that have NA values. Default: False
apply_mask (bool) – set to True to apply mask. Default: True
probe_ids (list[str] | None) – list of probe IDs to use. Useful to work on a subset for testing purposes. Default: None
group_column (str | None) – name of the sample sheet column that holds replicate information. If provided, a Mixed Model is used instead of Ordinary Least Squares to account for replicates. Default: None
use_m_values (bool) – if True, fits the linear regression on M-values instead of beta values, so that fitted values are not constrained to the [0, 1] range. Set to False to match SeSAMe. Recommended value: True. Default: True

Returns:

dataframe with probes as rows and p-values and model estimates in columns, list of contrast levels

Return type:

pandas.DataFrame, list[str]
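
To illustrate why use_m_values removes the [0, 1] constraint, here is a minimal sketch of the standard beta-to-M-value transform, M = log2(beta / (1 - beta)). The function names and the clamping constant eps are assumptions made for this example; this is not pylluminator's own code.

```python
import math


def beta_to_m(beta: float, eps: float = 1e-6) -> float:
    """Convert a beta value in [0, 1] to an M-value (log2 of the odds)."""
    # Clamp away from exactly 0 and 1 so the ratio stays finite.
    # The value of eps is an assumption made for this sketch.
    clamped = min(max(beta, eps), 1.0 - eps)
    return math.log2(clamped / (1.0 - clamped))


def m_to_beta(m_value: float) -> float:
    """Inverse transform: map an M-value back to a beta value."""
    odds = 2.0 ** m_value
    return odds / (odds + 1.0)
```

M-values are unbounded, so a linear model fitted on them cannot predict a methylation level outside the valid range once mapped back to beta values.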

compute_dmp(samples: Samples, formula: str, reference_value: dict | None = None, custom_sheet: None | DataFrame = None, drop_na=False, apply_mask=True, probe_ids: None | list[str] = None, group_column: str | None = None, use_m_values: bool = True)

Estimate Differentially Methylated Probes (DMPs) by fitting an Ordinary Least Squares (OLS) linear regression model to each probe’s methylation value (either M-values (default) or beta values), following the given formula. The predictors used in the formula must be column names of the sample sheet. To account for random effects in the model (e.g. technical replicates, batch effects), set the group_column parameter to the name of the sample sheet column describing the random effect (e.g. “batch_id”). If this parameter is set, a Linear Mixed Model (LMM) is used instead of the OLS model. The Benjamini-Hochberg procedure is used to adjust the p-values.

More info on design matrices and formulas:

Parameters:

samples (Samples) – samples to use
formula (str) – R-like formula used in the design matrix to describe the statistical model. e.g. ‘~age + sex’
reference_value (dict | None) – reference value for each predictor. Dictionary where keys are the predictor names and values are their reference values. For example, {‘sex’: ‘female’} sets females as the reference. Default: None
custom_sheet (pandas.DataFrame | None) – a sample sheet to use. By default, the samples’ own sheet is used. Useful if you want to filter the samples used. Default: None
drop_na (bool) – drop probes that have NA values. Default: False
apply_mask (bool) – set to True to apply mask. Default: True
probe_ids (list[str] | None) – list of probe IDs to use. Useful to work on a subset for testing purposes. Default: None
group_column (str | None) – name of the sample sheet column that holds replicate information. If provided, a Mixed Model is used instead of Ordinary Least Squares to account for replicates. Default: None
use_m_values (bool) – if True, fits the linear regression on M-values instead of beta values, so that fitted values are not constrained to the [0, 1] range. Set to False to match SeSAMe. Recommended value: True. Default: True

Returns:

dataframe with probes as rows and p-values and model estimates in columns, list of contrast levels

Return type:

pandas.DataFrame, list[str]
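
The description above states that p-values are adjusted with the Benjamini-Hochberg procedure. As a rough illustration of what that adjustment does, here is a pure-Python sketch. The function name is hypothetical, and the library's actual implementation may differ (for instance in how it handles NA values).

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down to the smallest, scaling each by
    # n / rank and keeping a running minimum so that adjusted values stay
    # monotonic with respect to the raw p-values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

Each adjusted value is at least as large as the raw p-value, so a probe that passes a threshold after adjustment also passes it before.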

compute_dmr(contrast: str | list[str] | None = None, dist_cutoff: float | None = None, seg_per_locus: float = 0.5, probe_ids: None | list[str] = None)

Find Differentially Methylated Regions (DMRs) based on the Euclidean distance between beta values.

Parameters:

contrast (str | list[str] | None) – contrast(s) to use for DMRs detection
dist_cutoff (float | None) – cutoff used to find change points between DMRs, applied to the Euclidean distance between beta values. If set to None (default), it is calculated from the seg_per_locus parameter value. Default: None
seg_per_locus (float) – used if dist_cutoff is not set: defines which quantile should be used as the distance cutoff. Higher values lead to more segments. Should be 0 < seg_per_locus < 1. Default: 0.5.
probe_ids (list[str] | None) – list of probe IDs to use. Useful to work on a subset for testing purposes. Default: None
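
The interplay between dist_cutoff and seg_per_locus can be sketched as follows. This is an illustration under stated assumptions, not pylluminator's algorithm: it assumes distances are taken between neighbouring probes' beta vectors, that a change point is any distance above the cutoff, and that the automatic cutoff is the (1 - seg_per_locus) quantile, chosen here so that higher seg_per_locus values yield more segments as the parameter description says. All function names are hypothetical.

```python
import math


def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length beta value vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quantile(values: list[float], q: float) -> float:
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def segment_probes(betas, seg_per_locus=0.5, dist_cutoff=None):
    """Assign a segment index to each probe (rows of betas, in genomic order)."""
    dists = [euclidean(betas[i], betas[i + 1]) for i in range(len(betas) - 1)]
    if not dists:
        return [0] * len(betas)
    if dist_cutoff is None:
        # Assumption: a higher seg_per_locus lowers the cutoff, creating more segments.
        dist_cutoff = quantile(dists, 1.0 - seg_per_locus)
    segments = [0]
    for d in dists:
        # Start a new segment whenever neighbouring probes are too far apart.
        segments.append(segments[-1] + (1 if d > dist_cutoff else 0))
    return segments
```

Passing an explicit dist_cutoff bypasses the quantile computation entirely, which mirrors how the two parameters are documented above.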

get_top_dmp(contrast: str | None = None, annotation_col: str = 'genes', sort_by: str = 'effect_size', ascending=False, pval_threshold=0.05, effect_size_threshold: float | None = None, n_dms=10, columns_to_keep: list[str] = None) → DataFrame | None

Get the top DMPs, ranked by the p-value of the given contrast. By default, the results will be annotated with the genes associated with the probes in the DMPs. You can control the annotation information with the annotation_col parameter.

Parameters:

contrast (str) – contrast to use for ranking the DMPs.
annotation_col (str) – name of the column holding the annotation information. Default: ‘genes’
n_dms (int) – number of DMPs to return. Default: 10
columns_to_keep (list[str] | None) – list of columns to keep in the output dataframe. Default: None

Returns:

dataframe with the top DMPs

Return type:

pandas.DataFrame | None

get_top_dmr(contrast: str | None = None, chromosome_col='chromosome', annotation_col: str = 'genes', sort_by: str = 'effect_size', ascending=False, pval_threshold=0.05, effect_size_threshold: float | None = None, n_dms=10, columns_to_keep: list[str] = None) → DataFrame | None

Get the top DMRs, ranked by the p-value of the given contrast. By default, the results will be annotated with the genes associated with the probes in the DMRs. You can control the annotation information with the annotation_col parameter.

Parameters:

contrast (str) – contrast to use for ranking the DMRs. None works only if there is only one possible contrast. Default: None.
chromosome_col (str) – name of the column holding the chromosome information. Default: ‘chromosome’
annotation_col (str) – name of the column holding the annotation information. Default: ‘genes’
n_dms (int) – number of DMRs to return. Default: 10
columns_to_keep (list[str] | None) – list of columns to keep in the output dataframe. Default: None

Returns:

dataframe with the top DMRs

Return type:

pandas.DataFrame | None

select_dmps([effect_size_th, p_value_th, ...])

Select DMPs based on effect size and p-value thresholds. If several p-value columns are available, you can specify which one to use with the p_value_th_col parameter. If not specified, the function will try to find a p-value column automatically.

Parameters:

effect_size_th (float | None) – effect size threshold. Default: None
p_value_th (float | None) – p-value threshold. Must be between 0 and 1. Default: None
p_value_th_col (str | None) – name of the p-value column to use for filtering. If not specified, the function will try to find a p-value column automatically, taking either the F-statistic p-value if it exists, or the predictor p-value column if there is only one. Default: None
sort_by (str | None) – column name to use for sorting the results. If not specified, the effect_size column will be used. Default: None
ascending (bool) – set to True to sort values in ascending order. Default: False

Returns:

dataframe with the selected DMPs, or None if an error occurred

Return type:

pandas.DataFrame | None
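
Putting the methods together, a typical session might look like the sketch below. It relies only on the signatures documented on this page. The wrapper function run_dm_workflow, the ‘sex’ column, the ‘female’ level and the threshold values are assumptions made for illustration. It also assumes that the formula yields a single contrast (so that contrast can be left as None) and that compute_dmr must be called before get_top_dmr.

```python
def run_dm_workflow(samples, n_top: int = 10):
    """Hedged usage sketch; `samples` is an already-loaded pylluminator Samples object."""
    # Imported lazily so the sketch can be defined without pylluminator installed.
    from pylluminator.dm import DM

    # Assumes the sample sheet has a 'sex' column containing the level 'female'.
    dm = DM(samples, '~ sex', reference_value={'sex': 'female'})

    # Top probes, then a threshold-based selection (threshold values are arbitrary).
    top_dmps = dm.get_top_dmp(n_dms=n_top)
    selected = dm.select_dmps(effect_size_th=0.1, p_value_th=0.05)

    # Regions are computed in a separate step, then ranked.
    dm.compute_dmr()
    top_dmrs = dm.get_top_dmr(n_dms=n_top)
    return top_dmps, selected, top_dmrs
```

Since the get_top_* methods and select_dmps are documented as returning pandas.DataFrame | None, callers should check for None before using the results.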