Package org.snpeff.snpEffect.factory
Class SnpEffPredictorFactoryRefSeq
- java.lang.Object
  - org.snpeff.snpEffect.factory.SnpEffPredictorFactory
    - org.snpeff.snpEffect.factory.SnpEffPredictorFactoryRefSeq
public class SnpEffPredictorFactoryRefSeq extends SnpEffPredictorFactory
This class creates a SnpEffectPredictor from a TXT file dumped using the UCSC Table Browser.

RefSeq table schema: http://genome.ucsc.edu/cgi-bin/hgTables

field         example              SQL type                                info    description
bin           585                  smallint(5)                             range   Indexing field to speed chromosome range queries.
name          NR_026818            varchar(255)                            values  Name of gene (usually transcript_id from GTF)
chrom         chr1                 varchar(255)                            values  Reference sequence chromosome or scaffold
strand        -                    char(1)                                 values  + or - for strand
txStart       34610                int(10)                                 range   Transcription start position
txEnd         36081                int(10)                                 range   Transcription end position
cdsStart      36081                int(10)                                 range   Coding region start
cdsEnd        36081                int(10)                                 range   Coding region end
exonCount     3                    int(10)                                 range   Number of exons
exonStarts    34610,35276,35720,   longblob                                        Exon start positions
exonEnds      35174,35481,36081,   longblob                                        Exon end positions
score         0                    int(11)                                 range
name2         FAM138A              varchar(255)                            values  Alternate name (e.g. gene_id from GTF)
cdsStartStat  unk                  enum('none', 'unk', 'incmpl', 'cmpl')   values  enum('none','unk','incmpl','cmpl')
cdsEndStat    unk                  enum('none', 'unk', 'incmpl', 'cmpl')   values  enum('none','unk','incmpl','cmpl')
exonFrames    -1,-1,-1,            longblob                                        Exon frame {0,1,2}, or -1 if no frame for exon

RefSeq accession format (i.e. NM_, NR_ codes): http://www.ncbi.nlm.nih.gov/RefSeq/key.html

Accession        Molecule  Method     Note
AC_123456        Genomic   Mixed      Alternate complete genomic molecule. This prefix is used for records that are provided to reflect an alternate assembly or annotation. Primarily used for viral, prokaryotic records.
AP_123456        Protein   Mixed      Protein products; alternate protein record. This prefix is used for records that are provided to reflect an alternate assembly or annotation. The AP_ prefix was originally designated for bacterial proteins but this usage was changed.
NC_123456        Genomic   Mixed      Complete genomic molecules including genomes, chromosomes, organelles, plasmids.
NG_123456        Genomic   Mixed      Incomplete genomic region; supplied to support the NCBI genome annotation pipeline. Represents either non-transcribed pseudogenes, or larger regions representing a gene cluster that is difficult to annotate via automatic methods.
NM_123456789     mRNA      Mixed      Transcript products; mature messenger RNA (mRNA) transcripts.
NP_123456789     Protein   Mixed      Protein products; primarily full-length precursor products but may include some partial proteins and mature peptide products.
NR_123456        RNA       Mixed      Non-coding transcripts including structural RNAs, transcribed pseudogenes, and others.
NT_123456        Genomic   Automated  Intermediate genomic assemblies of BAC and/or Whole Genome Shotgun sequence data.
NW_123456789     Genomic   Automated  Intermediate genomic assemblies of BAC or Whole Genome Shotgun sequence data.
NZ_ABCD12345678  Genomic   Automated  A collection of whole genome shotgun sequence data for a project. Accessions are not tracked between releases. The first four characters following the underscore (e.g. 'ABCD') identify a genome project.
XM_123456789     mRNA      Automated  Transcript products; model mRNA provided by a genome annotation process; sequence corresponds to the genomic contig.
XP_123456789     Protein   Automated  Protein products; model proteins provided by a genome annotation process; sequence corresponds to the genomic contig.
XR_123456        RNA       Automated  Transcript products; model non-coding transcripts provided by a genome annotation process; sequence corresponds to the genomic contig.
YP_123456789     Protein   Mixed      Protein products; no corresponding transcript record provided. Primarily used for bacterial, viral, and mitochondrial records.
ZP_12345678      Protein   Automated  Protein products; annotated on NZ_ accessions (often via computational methods).
NS_123456        Genomic   Automated  Genomic records that represent an assembly which does not reflect the structure of a real biological molecule. The assembly may represent an unordered assembly of unplaced scaffolds, or it may represent an assembly of DNA sequences generated from a biological sample that may not represent a single organism.

$ zcat genes.txt.gz | cut -f 2 | cut -b 1,2 | sort | uniq -c
  34466 NM
   6548 NR

- Author:
- pcingola
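The accession table in the class description can be turned into a small prefix lookup. The following is only a sketch: the class RefSeqAccessionPrefix and its methods are hypothetical names introduced for illustration and are not part of SnpEff. It assumes accessions always have the form of a two-letter prefix followed by an underscore.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of SnpEff): maps a RefSeq accession prefix to its molecule type. */
final class RefSeqAccessionPrefix {

    // Molecule column of the accession table, keyed by the two-letter prefix
    private static final Map<String, String> MOLECULE_BY_PREFIX = new HashMap<>();

    static {
        for (String p : new String[] {"AC", "NC", "NG", "NT", "NW", "NZ", "NS"}) MOLECULE_BY_PREFIX.put(p, "Genomic");
        for (String p : new String[] {"AP", "NP", "XP", "YP", "ZP"}) MOLECULE_BY_PREFIX.put(p, "Protein");
        for (String p : new String[] {"NM", "XM"}) MOLECULE_BY_PREFIX.put(p, "mRNA");
        for (String p : new String[] {"NR", "XR"}) MOLECULE_BY_PREFIX.put(p, "RNA");
    }

    /** Returns the two letters before the underscore, or null if the name does not look like an accession. */
    static String prefix(String accession) {
        if (accession == null) return null;
        return accession.indexOf('_') == 2 ? accession.substring(0, 2) : null;
    }

    /** Returns the molecule type for an accession, or null if the prefix is unknown. */
    static String moleculeType(String accession) {
        String p = prefix(accession);
        return p == null ? null : MOLECULE_BY_PREFIX.get(p);
    }

    public static void main(String[] args) {
        System.out.println(moleculeType("NM_123456789"));
        System.out.println(moleculeType("NR_026818"));
    }
}
```

In a file like the one counted by the zcat pipeline above, only the NM (mRNA) and NR (non-coding RNA) prefixes occur in the name column.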
Field Summary
Fields
Modifier and Type         Field               Description
static java.lang.String   CDS_STAT_COMPLETE
Fields inherited from class org.snpeff.snpEffect.factory.SnpEffPredictorFactory
MARK, MIN_TOTAL_FRAME_COUNT
Constructor Summary
Constructors
Constructor                                   Description
SnpEffPredictorFactoryRefSeq(Config config)
Method Summary
Modifier and Type    Method             Description
SnpEffectPredictor   create()
protected void       readRefSeqFile()   Read and parse RefSeq file
Methods inherited from class org.snpeff.snpEffect.factory.SnpEffPredictorFactory
add, add, add, add, add, add, addMarker, addSequences, adjustChromosomes, adjustTranscripts, beforeExonSequences, codingFromCds, collapseZeroLenIntrons, createRandSequences, deleteRedundant, exonsFromCds, exonsFromCds, findGene, findGene, findMarker, findTranscript, findTranscript, getOrCreateChromosome, getProteinByTrId, parsePosition, readExonSequences, replaceTranscript, setCreateRandSequences, setDebug, setFastaFile, setFileName, setRandom, setReadSequences, setStoreSequences, setVerbose, showChromoNamesDifferences
Field Detail
-
CDS_STAT_COMPLETE
public static final java.lang.String CDS_STAT_COMPLETE
- See Also:
- Constant Field Values
Constructor Detail
-
SnpEffPredictorFactoryRefSeq
public SnpEffPredictorFactoryRefSeq(Config config)
Method Detail
-
create
public SnpEffectPredictor create()
- Specified by:
- create in class SnpEffPredictorFactory
-
readRefSeqFile
protected void readRefSeqFile()
Read and parse RefSeq file
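To make the file layout concrete, here is a minimal sketch of parsing one tab-separated row that follows the RefSeq table schema in the class description. It is not the actual implementation of readRefSeqFile(): the class RefSeqRow and its methods are hypothetical, and the sketch assumes exactly the 16 columns listed in the schema, in that order, with comma-terminated exon lists.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value class (not part of SnpEff): one row of the RefSeq table. */
final class RefSeqRow {
    final String name, chrom, name2, cdsStartStat, cdsEndStat;
    final boolean strandMinus;
    final int txStart, txEnd, cdsStart, cdsEnd, exonCount;
    final int[] exonStarts, exonEnds, exonFrames;

    private RefSeqRow(String[] f) {
        // f[0] is the 'bin' indexing field and f[11] is 'score'; neither is kept here
        name = f[1];
        chrom = f[2];
        strandMinus = f[3].equals("-");
        txStart = Integer.parseInt(f[4]);
        txEnd = Integer.parseInt(f[5]);
        cdsStart = Integer.parseInt(f[6]);
        cdsEnd = Integer.parseInt(f[7]);
        exonCount = Integer.parseInt(f[8]);
        exonStarts = parseIntList(f[9]);
        exonEnds = parseIntList(f[10]);
        name2 = f[12];
        cdsStartStat = f[13];
        cdsEndStat = f[14];
        exonFrames = parseIntList(f[15]);
        if (exonStarts.length != exonCount || exonEnds.length != exonCount)
            throw new IllegalArgumentException("exonCount does not match exon lists for " + name);
    }

    /** Parses one line; the -1 limit keeps trailing empty fields so the column count check is reliable. */
    static RefSeqRow parse(String line) {
        String[] f = line.split("\t", -1);
        if (f.length < 16)
            throw new IllegalArgumentException("Expected 16 tab-separated fields, got " + f.length);
        return new RefSeqRow(f);
    }

    /** Parses a comma-separated list such as "34610,35276,35720," (note the trailing comma). */
    static int[] parseIntList(String s) {
        List<Integer> values = new ArrayList<>();
        for (String part : s.split(",")) {
            String t = part.trim();
            if (!t.isEmpty()) values.add(Integer.parseInt(t));
        }
        int[] out = new int[values.size()];
        for (int i = 0; i < out.length; i++) out[i] = values.get(i);
        return out;
    }

    public static void main(String[] args) {
        // Example row assembled from the 'example' column of the schema
        String line = "585\tNR_026818\tchr1\t-\t34610\t36081\t36081\t36081\t3\t"
                + "34610,35276,35720,\t35174,35481,36081,\t0\tFAM138A\tunk\tunk\t-1,-1,-1,";
        RefSeqRow row = parse(line);
        System.out.println(row.name + " " + row.chrom + " exons=" + row.exonCount);
    }
}
```

The length check against exonCount is a design choice of this sketch: it rejects rows whose exon lists were truncated before any coordinates are used.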