1. Read data and get beta values

First, import the necessary packages from pylluminator, and set the logger level to your convenience.

[1]:

from pylluminator.annotations import Annotations, ArrayType, GenomeVersion
from pylluminator.samples import Samples, read_samples
from pylluminator.visualizations import betas_density, betas_dendrogram, betas_2D, nb_probes_per_chr_and_type_hist, plot_mean_beta_diff_per_group
from pylluminator.utils import set_logger, download_from_geo

# path to your data, or a folder where the downloaded data will be saved
data_path = '~/data/pylluminator-tutorial/'

# set the verbosity level, can be DEBUG, INFO, WARNING, ERROR
set_logger('WARNING')

1.1. Read .idat files

1.1.1. Download test data

For the purpose of this tutorial, you can download data from ‘GEO database https://www.ncbi.nlm.nih.gov/geo/’, or skip this step if you want to use you own data.

We downloaded 3 samples from healthy prostate cells (PrEC) and 3 samples from prostate cancer cell (LNCaP)

[2]:

download_from_geo(['GSM7698438', 'GSM7698446', 'GSM7698462', 'GSM7698435', 'GSM7698443', 'GSM7698459'], data_path)

1.1.2. Define and read annotation

First, define the array type of your data and the genome version in order to read the associated information files (manifest, masks, genome information…)

The first run of this code might take a while, as the script will download the manifest and genome info files.

[3]:

# define those parameters to match your data
array_type = ArrayType.HUMAN_EPIC_V2  # run ArrayType? to see available types
genome_version = GenomeVersion.HG38  # run GenomeVersion? to see available genome versions

# read annotation
annos = Annotations(array_type, genome_version, 'updated')

#-------------------------------------------------------------

# alternative: if you don't know what your data is, you can set annos to None and let the script detect the array version when reading samples
# annos = None

1.1.3. Read samples .idat

Now set the paths to your data and to the sample sheet if you have one.

[4]:

# minimum number of detected beads per probe
min_beads = 4

# if no sample sheet is provided, it will be rendered automatically. Use parameter sample_sheet_path or sample_sheet_name to specify one.
my_samples = read_samples(data_path, annotation=annos, min_beads=min_beads, keep_idat=True)

# let's check the generated sample sheet !
my_samples.sample_sheet

[4]:

	sample_id	sample_name
0	GSM7698443	PREC_500_2
1	GSM7698459	PREC_500_3
2	GSM7698435	PREC_500_1
3	GSM7698446	LNCAP_500_2
4	GSM7698462	LNCAP_500_3
5	GSM7698438	LNCAP_500_1

Now is a good time to check if any sample sheet value has the wrong type. Having a category value encoded as a number can lead to unexpected results, so be sure to convert them to string if need (e.g. the sample ID or sentrix ID)

[5]:

my_samples.sample_sheet.dtypes
# we're all good here, since sentrix_id and sentrix_position are unknown - but if needed here is an example of how to change a column type:
# my_samples = my_samples.astype({'sentrix_id': str})

[5]:

sample_id      str
sample_name    str
dtype: object

Example of the signal dataframe for a sample

[6]:

my_samples.get_signal_df()['PREC_500_3']

[6]:

			signal_channel	G		R
			methylation_state	M	U	M	U
type	channel	probe_type	probe_id
I	G	cg	cg00000769_BC11	171.0	5796.0	275.0	1561.0
			cg00000957_TC11	3127.0	414.0	445.0	510.0
			cg00001349_TC11	7365.0	2321.0	1526.0	280.0
			cg00002033_TC11	205.0	3947.0	195.0	378.0
			cg00002033_TC12	232.0	4576.0	217.0	711.0
...	...	...	...	...	...	...	...
II	nan	snp	rs877309_BC21	190.0	NaN	NaN	8029.0
			rs9363764_BC21	2458.0	NaN	NaN	4125.0
			rs966367_BC21	1264.0	NaN	NaN	2420.0
			rs966367_BC22	1117.0	NaN	NaN	1991.0
			rs9839873_BC21	879.0	NaN	NaN	1363.0

937688 rows × 4 columns

Let’s add a custom column to our sample sheet

[7]:

my_samples.sample_sheet['sample_type'] = [n.split('_')[0] for n in my_samples.sample_sheet.sample_name]
my_samples.sample_sheet

[7]:

	sample_id	sample_name	sample_type
0	GSM7698443	PREC_500_2	PREC
1	GSM7698459	PREC_500_3	PREC
2	GSM7698435	PREC_500_1	PREC
3	GSM7698446	LNCAP_500_2	LNCAP
4	GSM7698462	LNCAP_500_3	LNCAP
5	GSM7698438	LNCAP_500_1	LNCAP

1.1.4. Save pylluminator samples (optional)

As reading lots of samples can take a couple of minutes, you might want to save the Samples object to a file for later use. Here is how.

[8]:

my_samples.save('raw_samples')

# For later use, here is how to load samples :
# my_samples = Samples.load('raw_samples')

1.2. Calculate and plot beta values

Once your samples are loaded, you can already calculate and plot the beta values of the raw data :

[9]:

my_samples.calculate_betas(include_out_of_band=True)
betas_density(my_samples, color_column='sample_type')

../_images/tutorials_1_-_Read_data_and_get_betas_18_0.png

1.3. Preprocessing

Here is the classical preprocessing pipeline for human samples.

Note that each step modifies the Samples object directly, so it’s useful to save the raw samples first (see step 1.4) if you want to try different preprocessing methods.

[10]:

my_samples.mask_quality_probes()
my_samples.infer_type1_channel()
my_samples.dye_bias_correction_nl()
my_samples.poobah()
my_samples.noob_background_correction()
my_samples.calculate_betas(include_out_of_band=True)

[11]:

my_samples.save('preprocessed_samples')

Now let’s see what the preprocessing changed in our beta values!

[12]:

betas_density(my_samples, color_column='sample_type')

../_images/tutorials_1_-_Read_data_and_get_betas_23_0.png

1.4. Data insights

There are a few more plots you can already check to get to know your data:

[13]:

nb_probes_per_chr_and_type_hist(my_samples)

../_images/tutorials_1_-_Read_data_and_get_betas_26_0.png

Using a principal component analysis (PCA), we can visualize the spatial distribution of the samples

[14]:

betas_2D(my_samples, model='PCA', color_column='sample_type')

../_images/tutorials_1_-_Read_data_and_get_betas_28_0.png

The hierarchical clustering of the samples - nothing really surprising in the output :)

[15]:

betas_dendrogram(my_samples)

../_images/tutorials_1_-_Read_data_and_get_betas_30_0.png

And finally, check the distribution of hypo- and hyper- methylated probes, depending on the probes location

[16]:

plot_mean_beta_diff_per_group(my_samples, group_column='sample_type', figsize=(8, 5))

../_images/tutorials_1_-_Read_data_and_get_betas_32_0.png