Samples

class pylluminator.samples.Samples(sample_sheet_df: DataFrame | None = None)

Bases: object

Samples objects hold sample methylation signal in a dataframe, as well as annotation information, sample sheet data and probes masks.

Variables:
  • annotation (Annotations | None) – probes metadata. Default: None.

  • sample_sheet (pandas.DataFrame | None) – samples information given by the csv sample sheet. Default: None

  • min_beads (int | None) – minimum number of beads required for a probe to be considered. Default: None

  • idata (dict[str, dict[Channel, pandas.DataFrame]]) – dictionary of dataframes containing the raw signal values for each sample and channel. Default: {}

  • masks (MaskCollection) – collection of probes masks. Default: MaskCollection()

Methods

__init__([sample_sheet_df])

Initialize the object with only a sample-sheet.

add_annotation_info(annotation, label_column)

Merge manifest dataframe with probe signal values read from idat files to build the signal dataframe, adding channel information, methylation state and mask names for each probe.

batch_correction(batch[, apply_mask, ...])

Applies ComBat algorithm for batch correction on beta values.

calculate_betas([include_out_of_band])

Calculate beta values for all probes.

cg_probes([apply_mask, sigdf])

Get CG (CpG) type probes, and apply the mask if apply_mask is True

ch_probes([apply_mask, sigdf])

Get CH (CpH) type probes, and apply the mask if apply_mask is True

controls([apply_mask, pattern, sigdf])

Get the subset of control probes, matching the pattern with the probe_ids if a pattern is provided

copy()

Create a copy of the Samples object

drop_samples(sample_labels)

Remove some samples.

dye_bias_correction([sample_label, ...])

Correct dye bias by linear scaling.

dye_bias_correction_l([sample_label, ...])

Correct dye bias by linear scaling.

dye_bias_correction_nl([sample_labels, ...])

Dye bias correction by matching green and red to mid-point.

get_betas([sample_label, drop_na, ...])

Get the beta values for the sample.

get_mean_ib_intensity([sample_label, apply_mask])

Computes the mean intensity of all the in-band measurements.

get_negative_controls([apply_mask, sigdf])

Get negative control signal

get_normalization_controls([apply_mask, ...])

Returns the control values to normalize green and red probes.

get_probes(probe_ids[, apply_mask, sigdf])

Returns the probes dataframe filtered on a list of probe IDs

get_probes_with_probe_type(probe_type[, ...])

Select probes by probe type, meaning e.g. CG, Control, SNP.

get_signal_df([apply_mask])

Get the methylation signal dataframe, and apply the mask if apply_mask is True

get_total_ib_intensity([sample_label, ...])

Computes the total intensity of all the in-band measurements.

has_betas()

ib([apply_mask, sigdf])

Get the subset of in-band probes (for type I probes only), and apply the mask if apply_mask is True

ib_green([apply_mask, sigdf])

Get the subset of in-band green probes (for type I probes only), and apply the mask if apply_mask is True

ib_red([apply_mask, sigdf])

Get the subset of in-band red probes (for type I probes only), and apply the mask if apply_mask is True

infer_type1_channel([sample_labels, ...])

For Infinium type I probes, infer the channel from the signal values, setting it to the channel with the max signal.

load(filepath)

Load a pickled Samples object from filepath

mask_control_probes([sample_label])

Shortcut to mask control probes

mask_non_cg_probes([sample_label])

Shortcut to mask non-CpG probes

mask_non_unique_probes([sample_label])

Shortcut to mask non-unique probes on this sample

mask_probes_by_names(names_to_mask[, ...])

Match the names provided in names_to_mask with the probes mask info and mask these probes, adding them to the current mask if there is any.

mask_quality_probes([sample_label])

Shortcut to mask quality probes

mask_snp_probes([sample_label])

Shortcut to mask snp probes

mask_xy_probes([sample_label])

Shortcut to mask probes from XY chromosome

merge_samples_by(by[, apply_mask])

Merge the beads signal values of different samples by averaging them.

meth([apply_mask, sigdf])

Get the subset of methylated probes, and apply the mask if apply_mask is True

noob_background_correction([sample_labels, ...])

Subtract the background for a sample.

oob([apply_mask, sigdf])

Get the subset of out-of-band probes (for type I probes only), and apply the mask if apply_mask is True

oob_green([apply_mask, sigdf])

Get the subset of out-of-band green probes (for type I probes only), and apply the mask if apply_mask is True

oob_red([apply_mask, sigdf])

Get the subset of out-of-band red probes (for type I probes only), and apply the mask if apply_mask is True

poobah([sample_labels, apply_mask, ...])

Detection P-value based on empirical cumulative distribution function (ECDF) of out-of-band signal aka pOOBAH (p-vals by Out-Of-Band Array Hybridization).

remove_probes_suffix([apply_mask])

Merge probes that have the same ID but different suffixes (e.g. _BC11, _TC21..) by averaging their signal values.

reset_betas()

Remove betas dataframe

reset_poobah()

Remove poobah pvalues from the signal dataframe

save(filepath)

Save the current Samples object to filepath, as a pickle file

scrub_background_correction([sample_label, ...])

Subtract residual background using background median.

snp_probes([apply_mask, sigdf])

Get SNP type probes ('rs' probes in manifest, but replaced by 'snp' when loaded), and apply the mask if apply_mask is True

subset(sample_labels)

Keep only the specified samples.

type1([apply_mask, sigdf])

Get the subset of Infinium type I probes, and apply the mask if apply_mask is True

type1_green([apply_mask, sigdf])

Get the subset of type I green probes, and apply the mask if apply_mask is True

type1_red([apply_mask, sigdf])

Get the subset of type I red probes, and apply the mask if apply_mask is True

type2([apply_mask, sigdf])

Get the subset of Infinium type II probes, and apply the mask if apply_mask is True

unmeth([apply_mask, sigdf])

Get the subset of unmethylated probes, and apply the mask if apply_mask is True

Attributes

nb_probes

Count the number of probes in the signal dataframe

nb_samples

Count the number of samples contained in the object

probe_ids

Return the list of probe IDs contained in the signal dataframe

sample_label_name

Return the name of the sample sheet column used as sample labels.

sample_labels

Return the names of the samples contained in this object, that also exist in the sample sheet.

Methods and attributes detail

__init__(sample_sheet_df: DataFrame | None = None)

Initialize the object with only a sample-sheet.

Parameters:

sample_sheet_df (pandas.DataFrame | None) – sample sheet dataframe. Default: None

add_annotation_info(annotation: Annotations, label_column: str, keep_idat=False, min_beads=1) None

Merge manifest dataframe with probe signal values read from idat files to build the signal dataframe, adding channel information, methylation state and mask names for each probe.

For manifest file, merging is done on Illumina IDs, contained in columns address_a and address_b of the manifest file.

Parameters:
  • annotation (Annotations) – annotation data corresponding to the sample

  • label_column (str) – the name of the sample sheet column used for sample labels (e.g. sample_id, sample_name)

  • min_beads (int) – filter out probes with fewer than min_beads beads. Default: 1

  • keep_idat (bool) – if set to True, keep idat data after merging the annotations. Default: False

Returns:

None

batch_correction(batch: list | str, apply_mask: bool = True, covariates: str | list[str] | None = None, par_prior=True, mean_only=False, ref_batch=None, precision=None, na_cov_action='raise') None

Applies ComBat algorithm for batch correction on beta values.

Parameters:
  • batch (str | list) – If a string is provided, it’s interpreted as the name of the column in the sample sheet that contains the batch information. If a list is provided, it should contain the batch indices, with as many values as samples.

  • apply_mask (bool) – set to False if you don’t want any mask to be applied. Default: True

  • covariates (str | list[str] | None) – a list of column names from the sample sheet to use as covariates in the model. It only supports categorical or string variables. Default: None

  • par_prior (bool) – False for non-parametric estimation of batch effects. Default: True

  • mean_only (bool) – if True, only adjust the means and not the individual batch effects. Default: False

  • ref_batch – batch id of the batch to use as reference. Default: None

  • precision (float) – level of precision for precision computing. Default: None

  • na_cov_action – how to handle missing covariates: 'raise' raises an error and stops; 'remove' removes samples with missing covariates and raises a warning; 'fill' handles missing covariates by creating a distinct covariate per batch. Default: 'raise'

Returns:

None

calculate_betas(include_out_of_band=False) None

Calculate beta values for all probes. Values are stored in a dataframe and can be accessed via the get_betas() function

Parameters:

include_out_of_band (bool) – if set to True, the type I probes beta values are calculated on in-band AND out-of-band signal values. If set to False, they are calculated on in-band values only. Equivalent to sumTypeI in sesame. Default: False

Returns:

None
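As a rough illustration of the quantity being computed: a beta value is commonly defined as the methylated signal divided by the total (methylated plus unmethylated) signal. The sketch below assumes that definition and uses plain pandas. The function name beta_values, the column names and the choice to return NaN when the total signal is zero are illustrative assumptions, not pylluminator's implementation.

```python
import numpy as np
import pandas as pd


def beta_values(meth: pd.Series, unmeth: pd.Series) -> pd.Series:
    """Return meth / (meth + unmeth), with NaN where the total signal is zero."""
    total = meth + unmeth
    # where() turns non-positive totals into NaN so the division never hits zero
    return meth / total.where(total > 0)


# Toy signal values for three probes (hypothetical data)
signal = pd.DataFrame(
    {"meth": [900.0, 100.0, 0.0], "unmeth": [100.0, 900.0, 0.0]},
    index=["cg01", "cg02", "cg03"],
)
betas = beta_values(signal["meth"], signal["unmeth"])
print(betas)
```

With include_out_of_band=True, the description above suggests the type I numerator and denominator would also include the out-of-band signal; that variant is not shown here.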

cg_probes(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get CG (CpG) type probes, and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

ch_probes(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get CH (CpH) type probes, and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

controls(apply_mask: bool = True, pattern: str | None = None, sigdf: DataFrame | None = None) DataFrame | None

Get the subset of control probes, matching the pattern with the probe_ids if a pattern is provided

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Default: True

  • pattern (str | None) – pattern to match against control probe IDs, case is ignored. Default: None

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe of the control probes, or None if none were found

Return type:

pandas.DataFrame | None

copy()

Create a copy of the Samples object

Returns:

a copy of the object

Return type:

Samples

drop_samples(sample_labels: str | list[str]) None

Remove some samples. Delete the signal information, beta values, sample sheet rows and masks. Ignores non-existent sample names

Parameters:

sample_labels (str | list[str]) – list of the labels of the samples to drop

Returns:

None

dye_bias_correction(sample_label: str | None = None, apply_mask: bool = True, reference: dict | None = None) None

Correct dye bias by linear scaling. Scale both the green and red signal to a reference level. If the reference level is not given, it is set to the mean intensity of all the in-band signals.

Parameters:
  • sample_label (str | None) – the name of the sample to correct dye bias for. If None, correct dye bias for all samples.

  • apply_mask (bool) – set to False if you don’t want any mask to be applied. Default: True

  • reference (dict | None) – values to use as reference to scale the red and green signals, with one entry per sample (sample labels are the dict keys). Default: None

Returns:

None
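The idea of linear scaling to a reference level can be sketched as follows. This is a simplified illustration under stated assumptions: each channel is multiplied by reference / channel mean, and the default reference is the mean of all provided values. The function linear_dye_bias_scaling and the toy data are hypothetical; the library may derive its scaling factors differently (for example from normalization control probes).

```python
from typing import Optional, Tuple

import pandas as pd


def linear_dye_bias_scaling(
    green: pd.Series, red: pd.Series, reference: Optional[float] = None
) -> Tuple[pd.Series, pd.Series]:
    """Scale both channels so that their means equal the reference level."""
    if reference is None:
        # Assumed default: mean intensity over both channels
        reference = pd.concat([green, red]).mean()
    green_factor = reference / green.mean()
    red_factor = reference / red.mean()
    return green * green_factor, red * red_factor


# Toy in-band intensities (hypothetical data)
green = pd.Series([100.0, 200.0, 300.0])
red = pd.Series([400.0, 500.0, 600.0])
scaled_green, scaled_red = linear_dye_bias_scaling(green, red)
print(scaled_green.mean(), scaled_red.mean())
```

After scaling, both channels share the same mean, which is the point of the correction.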

dye_bias_correction_l(sample_label: str | None = None, apply_mask: bool = True, reference: dict | None = None) None

Correct dye bias by linear scaling. Scale both the green and red signal to a reference level. If the reference level is not given, it is set to the mean intensity of all the in-band signals.

Parameters:
  • sample_label (str | None) – the name of the sample to correct dye bias for. If None, correct dye bias for all samples.

  • apply_mask (bool) – set to False if you don’t want any mask to be applied. Default: True

  • reference (dict | None) – values to use as reference to scale the red and green signals, with one entry per sample (sample labels are the dict keys). Default: None

Returns:

None

dye_bias_correction_nl(sample_labels: str | list[str] | None = None, apply_mask: bool = True) None

Dye bias correction by matching green and red to mid-point. Each sample is handled separately.

This function compares the Type-I Red probes and Type-I Grn probes and generates a mapping to correct the signal of the two channels to the middle.

Parameters:
  • sample_labels (str | list[str] | None) – the name(s) of the sample(s) to correct dye bias for. If None, correct dye bias for all samples.

  • apply_mask (bool) – if True include masked probes in Infinium-I probes. No big difference is noted in practice. More probes are generally better. Default: True

Returns:

None

get_betas(sample_label: str | None = None, drop_na: bool = False, custom_sheet: DataFrame | None = None, apply_mask: bool = True) DataFrame | Series | None

Get the beta values for the sample. If no sample name is provided, return beta values for all samples.

Parameters:
  • sample_label (str | None) – the name of the sample to get beta values for. If None, return beta values for all samples.

  • drop_na (bool) – if set to True, drop rows with NA values. Default: False

  • custom_sheet (pandas.DataFrame | None) – a custom sample sheet to filter samples. Ignored if sample_label is provided. Default: None

  • apply_mask (bool) – set to False if you don’t want any mask to be applied. Default: True

Returns:

beta values as a DataFrame, or Series if sample_label is provided. If no beta values are found, return None

Return type:

pandas.DataFrame | pandas.Series | None

get_mean_ib_intensity(sample_label: str | None = None, apply_mask=True) dict

Computes the mean intensity of all the in-band measurements. This includes all Type-I in-band measurements and all Type-II probe measurements. Both methylated and unmethylated alleles are considered.

Parameters:
  • sample_label (str | None) – the name of the sample to get mean in-band intensity values for. If None, return mean in-band intensity values for every sample.

  • apply_mask (bool) – set to False if you don’t want any mask to be applied. Default: True

Returns:

mean in-band intensity values

Return type:

dict

get_negative_controls(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame | None

Get negative control signal

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

the negative controls, or None if none were found

Return type:

pandas.DataFrame | None

get_normalization_controls(apply_mask: bool = True, average=False, sigdf: DataFrame | None = None) dict | DataFrame | None

Returns the control values to normalize green and red probes.

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Default: True

  • average (bool) – if set to True, returns a dict with keys ‘G’ and ‘R’ containing the average of the control probes. Otherwise, returns a dataframe with selected probes. Default: False

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

the normalization controls as a dict or a dataframe, or None if none were found

Return type:

dict | pandas.DataFrame | None

get_probes(probe_ids: list[str] | str, apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Returns the probes dataframe filtered on a list of probe IDs

Parameters:
  • probe_ids (list[str] | str) – the ID(s) of the probes to select

  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

get_probes_with_probe_type(probe_type: str, apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Select probes by probe type, meaning e.g. CG, Control, SNP… (not infinium type I/II type), and apply the mask if apply_mask is True

Parameters:
  • probe_type (str) – the type of probe to select (e.g. ‘cg’, ‘snp’…)

  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

get_signal_df(apply_mask: bool = True) DataFrame

Get the methylation signal dataframe, and apply the mask if apply_mask is True

Parameters:

apply_mask (bool) – if True, set masked probes values to None. Default: True

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

get_total_ib_intensity(sample_label: str | list[str] | None = None, apply_mask: bool = True) DataFrame

Computes the total intensity of all the in-band measurements. This includes all Type-I in-band measurements and all Type-II probe measurements. Both methylated and unmethylated alleles are considered.

Parameters:
  • sample_label (str | list[str] | None) – the name(s) of the sample(s) to get total in-band intensity values for. If None, return total in-band intensity values for every sample.

  • apply_mask (bool) – set to False if you don’t want any mask to be applied. Default: True

Returns:

the total in-band intensity values

Return type:

pandas.DataFrame

ib(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get the subset of in-band probes (for type I probes only), and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

ib_green(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get the subset of in-band green probes (for type I probes only), and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

ib_red(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get the subset of in-band red probes (for type I probes only), and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

infer_type1_channel(sample_labels: str | list[str] | None = None, switch_failed=False, mask_failed=False, summary_only=False) DataFrame

For Infinium type I probes, infer the channel from the signal values, setting it to the channel with the max signal. If max values are equal, the channel is set to R (as opposed to G in sesame).

Parameters:
  • sample_labels (str | list[str] | None) – the name(s) of the sample(s) to infer the channel for. If None, infer with all samples. Default: None

  • switch_failed (bool) – if set to True, probes with NA values or whose max values are under a threshold (the 95th percentile of the background signals) will be switched back to their original value. Default: False.

  • mask_failed (bool) – mask failed probes (same probes as switch_failed). Default: False.

  • summary_only (bool) – does not replace the sample dataframe, only return the summary (useful for QC). Default: False

Returns:

the summary of the switched channels

Return type:

pandas.DataFrame
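The channel inference rule described above (pick the channel with the highest signal, ties going to R) can be sketched with pandas and NumPy. The column names and toy values are hypothetical, and this sketch ignores the switch_failed and mask_failed handling.

```python
import numpy as np
import pandas as pd

# Toy type I signal values per probe and channel (hypothetical column names)
signal = pd.DataFrame(
    {
        "red_meth": [1200, 100, 500],
        "red_unmeth": [300, 150, 200],
        "green_meth": [200, 900, 500],
        "green_unmeth": [100, 400, 100],
    },
    index=["probe_a", "probe_b", "probe_c"],
)

# Highest signal observed in each channel, per probe
red_max = signal[["red_meth", "red_unmeth"]].max(axis=1)
green_max = signal[["green_meth", "green_unmeth"]].max(axis=1)

# ">=" sends ties to the red channel, as stated in the description above
inferred = pd.Series(np.where(red_max >= green_max, "R", "G"), index=signal.index)
print(inferred)
```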

static load(filepath: str)

Load a pickled Samples object from filepath

Parameters:

filepath (str) – path to the file to read

Returns:

the loaded object

mask_control_probes(sample_label: str | None = None) None

Shortcut to mask control probes

Parameters:

sample_label (str | None) – The name of the sample to mask. If None, mask indexes for all samples.

Returns:

None

mask_non_cg_probes(sample_label: str | None = None) None

Shortcut to mask non-CpG probes

Parameters:

sample_label (str | None) – The name of the sample to mask. If None, mask indexes for all samples.

Returns:

None

mask_non_unique_probes(sample_label: str | None = None) None

Shortcut to mask non-unique probes on this sample

Parameters:

sample_label (str | None) – The name of the sample to mask. If None, mask indexes for all samples.

Returns:

None

mask_probes_by_names(names_to_mask: str | list[str], sample_label: str | None = None, mask_name: str | None = None) None

Match the names provided in names_to_mask with the probes mask info and mask these probes, adding them to the current mask if there is any.

Parameters:
  • names_to_mask (str | list[str]) – name(s) to match against the probes mask info; can be a regex

  • sample_label (str | None) – The name of the sample to mask. If None, mask indexes for all samples.

  • mask_name (str | None) – the name of the mask to create. If None, the name of the mask will be the same as the names to mask

Returns:

None

mask_quality_probes(sample_label: str | None = None) None

Shortcut to mask quality probes

Parameters:

sample_label (str | None) – The name of the sample to mask. If None, mask indexes for all samples.

Returns:

None

mask_snp_probes(sample_label: str | None = None) None

Shortcut to mask SNP probes

Parameters:

sample_label (str | None) – The name of the sample to mask. If None, mask indexes for all samples.

Returns:

None

mask_xy_probes(sample_label: str | None = None) None

Shortcut to mask probes from the X and Y chromosomes

Parameters:

sample_label (str | None) – The name of the sample to mask. If None, mask indexes for all samples.

Returns:

None

merge_samples_by(by: str, apply_mask=True) None

Merge the beads signal values of different samples by averaging them. Modifies the signal dataframe directly and removes the p-values column, since its values need to be updated. Beta values are averaged so as not to lose the batch correction result if needed. Masks are reset, but masked probes values are ignored if apply_mask is True

Parameters:
  • by (str) – the column name in the sample sheet to group samples by

  • apply_mask (bool) – skip masked probes values when merging samples if True. Default: True
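Averaging samples that share a value in a sample sheet column can be sketched with a pandas groupby. The sample sheet, the "donor" column and the values below are hypothetical, and the sketch only covers the averaging step, not the mask or p-value handling described above.

```python
import pandas as pd

# Hypothetical sample sheet: two samples come from donor A, one from donor B
sample_sheet = pd.DataFrame(
    {"sample_name": ["s1", "s2", "s3"], "donor": ["A", "A", "B"]}
)

# One column of values per sample, one row per probe (toy data)
values = pd.DataFrame(
    {"s1": [0.2, 0.8], "s2": [0.4, 0.6], "s3": [0.5, 0.5]},
    index=["cg01", "cg02"],
)

# Map each sample label to its group, then average the columns of each group
groups = sample_sheet.set_index("sample_name")["donor"]
merged = values.T.groupby(groups).mean().T
print(merged)
```

The result has one column per distinct value of the grouping column, which matches the note under sample_label_name that the label column changes after merging.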

meth(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get the subset of methylated probes, and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

property nb_probes: int

Count the number of probes in the signal dataframe

Returns:

number of probes

Return type:

int

property nb_samples: int

Count the number of samples contained in the object

Returns:

number of samples

Return type:

int

noob_background_correction(sample_labels: str | list[str] | None = None, apply_mask: bool = True, use_negative_controls=True, offset=15) None

Subtract the background for a sample.

The background is modelled as a normal distribution and the true signal as an exponential distribution. The Norm-Exp deconvolution is parameterized using Out-Of-Band (oob) probes. Multi-mapping probes are excluded.

Parameters:
  • sample_labels (str | list[str] | None) – the name(s) of the sample(s) to correct the background for. If None, correct the background for all samples.

  • apply_mask (bool) – True removes masked probes, False keeps them. Default: True

  • use_negative_controls (bool) – if True, the background will be calculated with both negative control and out-of-band probes. Default: True

  • offset (int | float) – A constant value to add to the corrected signal for padding. Default: 15

Returns:

None

oob(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame | None

Get the subset of out-of-band probes (for type I probes only), and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame | None

oob_green(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get the subset of out-of-band green probes (for type I probes only), and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

oob_red(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get the subset of out-of-band red probes (for type I probes only), and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

poobah(sample_labels: str | list[str] | None = None, apply_mask: bool = True, use_negative_controls=True, threshold=0.05) None

Detection P-value based on empirical cumulative distribution function (ECDF) of out-of-band signal aka pOOBAH (p-vals by Out-Of-Band Array Hybridization). Each sample is handled separately.

Adds two columns to the signal dataframe, ‘p_value’ and ‘poobah_mask’. Probes whose p-value is (strictly) above the defined threshold are added to the mask.

Parameters:
  • sample_labels (str | list[str] | None) – the name(s) of the sample(s) to use for the pOOBAH calculation. If None, use all samples. Default: None

  • apply_mask (bool) – True removes masked probes from background, False keeps them. Default: True

  • use_negative_controls (bool) – add negative controls as part of the background. Default: True

  • threshold (float) – used to output a mask based on the p_values. Default: 0.05

Returns:

None
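The ECDF-based detection p-value can be illustrated with NumPy: the p-value of a signal is one minus the fraction of background (out-of-band) values that are less than or equal to it. This is a simplified sketch under that assumption; the function ecdf_p_values and the toy data are hypothetical, and the library's per-channel handling is not reproduced.

```python
import numpy as np


def ecdf_p_values(signal, background):
    """Return 1 - ECDF_background(signal) for each signal value."""
    sorted_background = np.sort(np.asarray(background, dtype=float))
    # Number of background values <= each signal value, divided by the total
    ecdf = np.searchsorted(sorted_background, signal, side="right") / sorted_background.size
    return 1.0 - ecdf


# Toy background: the integers 1..100; toy probe signals below
background = np.arange(1, 101)
signal = np.array([50, 99, 1000])

p_values = ecdf_p_values(signal, background)
# Probes strictly above the threshold are flagged for masking
poobah_mask = p_values > 0.05
print(p_values, poobah_mask)
```

A probe whose signal sits in the middle of the background distribution gets a high p-value and is masked, while a probe far above the background is kept.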

property probe_ids: list[str]

Return the list of probe IDs contained in the signal dataframe

Returns:

list of probe IDs

Return type:

list[str]

remove_probes_suffix(apply_mask=True)

Merge probes that have the same ID but different suffixes (e.g. _BC11, _TC21..) by averaging their signal values. Resets calculated pvalues and betas. TODO: to match ChAMP, take the values of the probe with the best poobah pvalue

Parameters:

apply_mask (bool) – skip masked probes values when merging samples if True. Default: True
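Merging probes that differ only by suffix can be sketched as a groupby on the probe ID with the suffix removed. The regular expression below (an underscore, two capital letters and digits at the end of the ID) is an assumption based on the examples _BC11 and _TC21, and the data is hypothetical.

```python
import re

import pandas as pd

# Assumed suffix pattern, inferred from the examples "_BC11" and "_TC21"
SUFFIX_PATTERN = re.compile(r"_[A-Z]{2}\d+$")


def strip_suffix(probe_id: str) -> str:
    """Return the probe ID without its trailing suffix, if any."""
    return SUFFIX_PATTERN.sub("", probe_id)


# Toy signal values indexed by suffixed probe IDs
values = pd.DataFrame(
    {"sample_1": [100.0, 300.0, 50.0]},
    index=["cg001_BC11", "cg001_TC21", "cg002_TC11"],
)

# groupby calls strip_suffix on each index label, then averages each group
merged = values.groupby(strip_suffix).mean()
print(merged)
```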

reset_betas() None

Remove betas dataframe

Returns:

None

reset_poobah() None

Remove poobah pvalues from the signal dataframe

Returns:

None

property sample_label_name: str

Return the name of the sample sheet column used as sample labels. By default, sample_name is used when creating the signal dataframe, but it can be changed by using the function merge_samples_by

Returns:

the name of the identifier

Return type:

str

property sample_labels: list[str]

Return the names of the samples contained in this object, that also exist in the sample sheet.

Returns:

the list of names

Return type:

list[str]

save(filepath: str) None

Save the current Samples object to filepath, as a pickle file

Parameters:

filepath (str) – path to the file to create

Returns:

None
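Since save and load are described as working with pickle files, the round trip presumably resembles the standard-library pattern below. The sketch pickles a plain dict as a stand-in for a Samples object, and the file name is hypothetical. As with any pickle file, only load files from sources you trust, because unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for the object to persist (hypothetical content)
state = {"sample_labels": ["s1", "s2"], "min_beads": 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = os.path.join(tmp_dir, "samples.pkl")

    # Save: serialize the object to a binary file
    with open(filepath, "wb") as handle:
        pickle.dump(state, handle)

    # Load: read the object back from the same file
    with open(filepath, "rb") as handle:
        restored = pickle.load(handle)

print(restored)
```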

scrub_background_correction(sample_label: str | None = None, apply_mask: bool = True) None

Subtract residual background using background median.

This function is meant to be used after noob.

Parameters:
  • sample_label (str | None) – the name of the sample to scrub background for. If None, scrub background for all samples.

  • apply_mask (bool) – True removes masked probes, False keeps them. Default: True

Returns:

None
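Subtracting a residual background using its median can be sketched as follows. The floor of 1 applied to corrected values is an assumption made here to keep intensities positive, and the function scrub and the toy data are hypothetical rather than the library's implementation.

```python
import numpy as np


def scrub(signal, background, floor=1.0):
    """Subtract the background median from the signal and clip at a floor."""
    corrected = np.asarray(signal, dtype=float) - np.median(background)
    # Assumed behaviour: values falling below the floor are set to the floor
    return np.clip(corrected, floor, None)


# Toy data: background median is 100
background = np.array([80.0, 100.0, 120.0])
signal = np.array([500.0, 120.0, 90.0])

scrubbed = scrub(signal, background)
print(scrubbed)
```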

snp_probes(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get SNP type probes (‘rs’ probes in manifest, but replaced by ‘snp’ when loaded), and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

subset(sample_labels: str | list[str]) None

Keep only the specified samples. Delete the signal information, beta values, sample sheet rows and masks of all the samples that are not in the list. Ignores non-existent sample names

Parameters:

sample_labels (str | list[str]) – list of the labels of the samples to keep

Returns:

None

type1(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get the subset of Infinium type I probes, and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

type1_green(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get the subset of type I green probes, and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

type1_red(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get the subset of type I red probes, and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

type2(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get the subset of Infinium type II probes, and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame

unmeth(apply_mask: bool = True, sigdf: DataFrame | None = None) DataFrame

Get the subset of unmethylated probes, and apply the mask if apply_mask is True

Parameters:
  • apply_mask (bool) – True removes masked probes, False keeps them. Ignored if sigdf is provided. Default: True

  • sigdf (pd.DataFrame | None) – signal dataframe to use. Useful to save time applying the mask. Default: None

Returns:

methylation signal dataframe

Return type:

pandas.DataFrame