Annotations
Manifests and other annotation files are built from the SeSAMe package and from Illumina (cf. Illumina docs) for all genome versions and array types. An updated version of the manifest is also available for EPICv2/hg38, as defined in https://www.biorxiv.org/content/10.1101/2025.03.12.642895v2
All the files, restructured for use with pylluminator, are stored and versioned in the pylluminator-data GitHub repository.
Manifest (probe_infos)
Description of the columns of the probe_infos.csv file. If you want to use a custom manifest, you will need to provide this information.
illumina_id : ID that matches probe IDs in .idat files
probe_id : probe ID used in annotation files:
First letters : one of cg (CpG), ch (CpH), mu (multi-unique), rp (repetitive element), rs (SNP probes), ctl (control), nb (somatic mutations found in cancer)
Last 4 characters : top or bottom strand (T/B), converted or opposite strand (C/O), Infinium probe type (1/2), and the number of synthesis for representation of the probe on the array (1,2,3,…,n).
type : probe type, Infinium-I or Infinium-II
probe_type : cg (CpG), ch (CpH), mu (multi-unique), rp (repetitive element), rs (SNP probes), ctl (control), nb (somatic mutations found in cancer)
channel: color channel, green (methylated) or red (unmethylated)
address_[A/B]: Chip/tango address for A-allele and B-allele. For Infinium type I, allele A is unmethylated and allele B is methylated. For type II, address B is not set, as there is only one probe. Addresses match the Illumina IDs found in .idat files.
start: the start position of the probe sequence
end: the end position of the probe sequence. Usually the start position +1 because probes typically span a single CpG site.
chromosome: chromosome number/letter
mask_info: name of the masks for this probe. Multiple masks are separated by semicolons. (details below)
genes: genes encoded by this sequence. Multiple gene names are separated by semicolons.
promoter_or_body: b for body, p or Promoter for promoter
cgi: position of the probe relative to the CpG island. Possible values: Island, Shelf, Shore, OpenSea
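The probe_id suffix described above can be decoded with a small regular expression. This is a minimal sketch and not part of pylluminator: the helper name parse_probe_id and the example ID are illustrative assumptions, and IDs that do not carry the four-part suffix (possibly some control probes) simply return None.

```python
import re

# Prefix letters, the rest of the ID, then the suffix:
# strand (T/B), converted/opposite (C/O), Infinium type (1/2), synthesis number.
PROBE_ID_PATTERN = re.compile(
    r"^(?P<probe_type>cg|ch|mu|rp|rs|ctl|nb)"
    r"(?P<body>.*)_"
    r"(?P<strand>[TB])(?P<converted>[CO])(?P<infinium>[12])(?P<synthesis>\d+)$"
)


def parse_probe_id(probe_id):
    """Split a probe_id into its documented parts (hypothetical helper)."""
    match = PROBE_ID_PATTERN.match(probe_id)
    if match is None:
        return None  # the ID does not carry the suffix described above
    parts = match.groupdict()
    parts["infinium"] = int(parts["infinium"])
    parts["synthesis"] = int(parts["synthesis"])
    return parts


example = parse_probe_id("cg00000029_TC21")
print(example)
```

The synthesis number is matched with one or more digits, so the sketch also accepts suffixes longer than four characters when a probe has ten or more representations.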
Masks
Common masks
M_mapping: unmapped probes, probes with too low a mapping quality (alignment score under 35, either probe for Infinium-I), or Infinium-I probes whose alleles A and B map to different locations
M_nonuniq: mapped probes with mapping quality below 10, either probe for Infinium-I
M_uncorr_titration: CpGs with titration correlation under 0.9. Functioning probes should have very high correlation with titrated methylation fraction.
Human masks (general and population-specific)
M_commonSNP5_5pt: mapped probes having at least one common SNP with MAF>=5% within 5bp from 3’-extension
M_commonSNP5_1pt: mapped probes having at least one common SNP with MAF>=1% within 5bp from 3’-extension
M_1baseSwitchSNPcommon_1pt: mapped Infinium-I probes with SNP (MAF>=1%) hitting the extension base and changing the color channel
M_2extBase_SNPcommon_1pt: mapped Infinium-II probes with SNP (MAF>=1%) hitting the extension base.
M_SNP_EAS_1pt: EAS population-specific mask (MAF>=1%).
M_1baseSwitchSNP_EAS_1pt: EAS population-specific mask (MAF>=1%).
M_2extBase_SNP_EAS_1pt: EAS population-specific mask (MAF>=1%).
… more populations, e.g., EAS, EUR, AFR, AMR, SAS.
Mouse masks (general and strain-specific)
M_PWK_PhJ: mapped probes having at least one PWK_PhJ strain-specific SNP within 5bp from 3’-extension
M_1baseSwitchPWK_PhJ: mapped Infinium-I probes with PWK_PhJ strain-specific SNP hitting the extension base and changing the color channel
M_2extBase_PWK_PhJ: mapped Infinium-II probes with PWK_PhJ strain-specific SNP hitting the extension base.
… more strains, e.g., AKR_J, A_J, NOD_ShiLtJ, MOLF_EiJ, 129P2_OlaHsd …
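Because mask_info stores several mask names in one semicolon-separated field, selecting probes by mask usually starts by splitting that field. The sketch below uses only the standard library and a tiny in-memory table with made-up rows; the helper names split_masks and probes_with_masks are hypothetical and are not pylluminator functions.

```python
import csv
import io

# Made-up rows using the probe_id and mask_info columns described earlier.
SAMPLE_CSV = """probe_id,mask_info
cg00000001_TC21,M_mapping;M_nonuniq
cg00000002_BC11,
cg00000003_TC21,M_commonSNP5_1pt
"""


def split_masks(mask_info):
    """Return the set of mask names in a semicolon-separated field."""
    return {name.strip() for name in mask_info.split(";") if name.strip()}


def probes_with_masks(rows, wanted):
    """List the probe IDs carrying at least one of the wanted masks."""
    wanted = set(wanted)
    return [row["probe_id"] for row in rows if split_masks(row["mask_info"]) & wanted]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
masked = probes_with_masks(rows, {"M_mapping", "M_commonSNP5_1pt"})
print(masked)
```

Working with sets makes it easy to combine a common mask with a population- or strain-specific one in a single query.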
Genome information
Gap info
Contains information on gaps in the genomic sequence. These gaps represent regions that are not sequenced or that are known to be problematic in the data, such as areas that may have low coverage or difficult-to-sequence regions.
chromosome: number or name of the chromosome
start: the start position of the gap
end: the end position of the gap
width: the size of the gap
strand: strand of the gap, usually * (not specified)
type: region type. Possible values: telomere, contig (continuous region), scaffold (group of regions that might contain gaps), heterochromatin (tightly packed DNA, less transcriptionally active), short_arm (p arm of the chromosome)
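One typical use of the gap table is to check whether a genomic position falls inside a gap. The following is a minimal sketch with made-up coordinates. It assumes that start and end are both inclusive, which should be verified against the actual files, and position_in_gap is a hypothetical helper rather than a pylluminator function.

```python
from typing import NamedTuple, Optional


class Gap(NamedTuple):
    chromosome: str
    start: int
    end: int
    type: str


# Made-up gap records, for illustration only.
GAPS = [
    Gap("1", 1, 10_000, "telomere"),
    Gap("1", 207_666, 257_666, "contig"),
    Gap("2", 1, 10_000, "telomere"),
]


def position_in_gap(chromosome, position, gaps=GAPS) -> Optional[Gap]:
    """Return the first gap containing the position, or None.

    Assumes inclusive start and end coordinates.
    """
    for gap in gaps:
        if gap.chromosome == chromosome and gap.start <= position <= gap.end:
            return gap
    return None


hit = position_in_gap("1", 5_000)
miss = position_in_gap("2", 50_000)
print(hit, miss)
```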
Sequence lengths
Keys are chromosome identifiers (e.g., 1, 2, … X, etc.), and values are the corresponding sequence lengths (in base pairs).
chromosome: number or name of the chromosome
seq_length: chromosome size in number of base pairs
Transcript list
Details of the exons contained in each transcript.
group_name: unique identifier for the transcript (e.g., ENST00000456328.2), corresponds to transcript_id in the transcript_exons file.
start: the start position of the exon
end: the end position of the exon
width: the size of the exon
exon_number: exon ID within the transcript
Transcript exons
Transcript-level information, one entry per group of exons (type, gene name, gene ID…). Details on transcript_type values can be found in the GRCh37 database.
chromosome: number or name of the chromosome
transcript_start: start position of the transcript on the chromosome
transcript_end: end position of the transcript on the chromosome
transcript_strand: strand of the transcript, either ‘+’ (forward) or ‘-’ (reverse)
transcript_id: unique identifier for the transcript (e.g., ENST00000456328.2)
transcript_type: type of the transcript (e.g., processed_transcript, lncRNA, miRNA)
transcript_name: name of the transcript (e.g., DDX11L1-202, WASH7P-201)
gene_name: name of the gene associated with the transcript (e.g., DDX11L1, WASH7P)
gene_id: unique identifier for the gene (e.g., ENSG00000223972.5)
gene_type: type of the gene (e.g., transcribed_unprocessed_pseudogene, protein_coding)
source: source of the annotation (e.g., HAVANA, ENSEMBL)
level: level of annotation confidence or quality, from 1 to 3
cds_start: start position of the coding sequence within the transcript, if the transcript is protein_coding
cds_end: end position of the coding sequence within the transcript, if the transcript is protein_coding
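Since group_name in the transcript list corresponds to transcript_id in the transcript exons table, the two can be joined on that key. The sketch below does so with plain dictionaries and made-up values; only a subset of the columns is shown, and the helper name exons_by_gene is an illustrative assumption.

```python
from collections import defaultdict

# Made-up rows: one dict per exon (transcript list) ...
transcript_list = [
    {"group_name": "TX_A.1", "start": 100, "end": 200, "exon_number": 1},
    {"group_name": "TX_A.1", "start": 300, "end": 450, "exon_number": 2},
    {"group_name": "TX_B.1", "start": 900, "end": 950, "exon_number": 1},
]
# ... and one dict per transcript (transcript exons).
transcript_exons = [
    {"transcript_id": "TX_A.1", "gene_name": "GENE_A", "transcript_strand": "+"},
    {"transcript_id": "TX_B.1", "gene_name": "GENE_B", "transcript_strand": "-"},
]


def exons_by_gene(exons, transcripts):
    """Group (start, end) exon intervals by gene name via the transcript ID."""
    gene_of = {t["transcript_id"]: t["gene_name"] for t in transcripts}
    grouped = defaultdict(list)
    for exon in exons:
        gene = gene_of.get(exon["group_name"])
        if gene is not None:  # skip exons whose transcript is not described
            grouped[gene].append((exon["start"], exon["end"]))
    return dict(grouped)


result = exons_by_gene(transcript_list, transcript_exons)
print(result)
```

Building the transcript-to-gene lookup once keeps the join linear in the number of exons.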
Chromosome regions
Names, positions and Giemsa staining pattern of all chromosome regions.
chromosome: number or name of the chromosome
start: start position of the region on the chromosome
end: end position of the region on the chromosome
name: name of the region, e.g. p36.33, where p means the region is on the short arm and q means it is on the long arm
giemsa_staining: Giemsa staining pattern of the region. Possible values: gneg for Giemsa-negative (light) bands; gpos25, gpos50, gpos75 and gpos100 for Giemsa-positive bands of increasing staining intensity; gvar for variable (often polymorphic) regions; acen for the centromere; and stalk for the stalk. Giemsa-negative bands tend to be GC-rich and gene-rich, whereas darker Giemsa-positive bands tend to be AT-rich and gene-poor.