Thursday, October 2, 2014

You probably haven't spent much time thinking about how we represent genes in a genomic reference sequence context. And by genes, I really mean transcripts, since genes are just a collection of transcripts that produce the same product.
But in fact, there is more complexity here than you ever really wanted to know about. Andrew Jesaitis covered some of this in detail as he dove deep into the analysis of variant annotation against transcripts in his recent post, The State of Variant Annotation: A Comparison of AnnoVar, snpEff and VEP.
The databases that catalog and provide labels for transcripts are really cataloging transcribed RNA, like you might identify from an RNA-Seq experiment, as well as the spliced and translated protein products. To represent these RNA sequences in genomic space we must actually take a step backward and align them to the genomic reference. In fact, it takes a special type of alignment algorithm (see my webcast on alignment algorithms; the same family of algorithms is applied to this problem).
First, we have our new RefSeq Genes track that I will be discussing in this post. This is built off of a GFF file that is part of the NCBI Human Annotation Release 105, using NCBI's Splign algorithm. The second track is the latest Ensembl release on GRCh37, which uses the Ensembl genebuild algorithm, and finally the UCSC alignments of the same RNA using their BLAT algorithm.
You can see they disagree on how to handle a gap between the NM_006331 transcript's RNA sequence and the human genomic sequence. The Ensembl alignments most correctly preserve the actual protein sequence of NP_006322.4 in this case by introducing a 2bp intron (which has no real biological reality). Note that because this is in the first exon, the disagreement extends for the rest of the protein sequence.
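To see why a 2bp gap early in a coding sequence matters for everything downstream, here is a minimal sketch with a toy sequence. The bases below are made up for illustration and are not the NM_006331 sequence; `translate`, `genomic` and `spliced` are hypothetical names. It compares translating the genomic bases straight through against translating after skipping two bases as an artificial "intron".

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate every complete codon in dna; trailing bases are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Toy genomic sequence: two extra bases ("GT") relative to the transcript.
genomic = "ATGGCC" + "GT" + "AAGCTGACC"

# Skipping the two extra bases, as a 2bp "intron" would, restores the frame.
spliced = genomic[:6] + genomic[8:]

print(translate(genomic))  # frame is shifted after the second codon
print(translate(spliced))  # frame matches the transcript
```

In this toy case the two translations share only their first two residues; every codon after the gap is read in a different frame, which is the same effect described above for the rest of the protein.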
Of course, they are trying to make the best of a bad situation where the reference sequence is not ideal. In fact, this is a great example of the real improvements in the latest GRCh38 reference sequence, which incorporated a patch that changes the local sequence at this region. On the GRCh38 genomic reference, all three algorithms agree on a simple alignment for the same gene.
In our latest releases, Golden Helix SVS 8.2.0 and GenomeBrowse 2.0.4, we ship the RefSeq Genes from NCBI Human Annotation Release 105 as the default gene track. Going forward we will curate RefSeq genes from NCBI directly and no longer from UCSC.
There are a few reasons we believe this to be the right path going forward:

- NCBI provides mitochondrial gene mappings to NC_012920.1, the MT reference used in GRCh37_g1k and GRCh38, against which nearly all NGS alignment is done.
- NCBI has official releases of annotations for species, compared to UCSC's more continuously changing approach.
- NCBI provides extremely good cross references for their transcripts with other resources, providing great hyperlinks and summary information for analysis.
- The NCBI set of alignments includes more non-coding transcripts such as microRNA and tRNA, as well as various annotations of transcripts such as "pseudo" and "predicted".
The UCSC-provided RefSeq genes do not actually include any mitochondrial gene annotations. As inherited mitochondrial gene disorders are testable with NGS gene panels, it's important to provide as much annotation support as possible for diagnostic (Dx) tests based on variant data.
The Revised Cambridge Reference Sequence (rCRS) has been adopted by the community (and now by GRCh in 38) as the standard reference to use. UCSC would probably have used it had it been available when they defined their hg19 reference, and they decided not to update hg19 once it was clear that rCRS was the way forward.
We plan to consistently support the approach of having a full set of RNA products represented in our gene track with predicted mRNA, pseudo (incomplete) RNAs and non-coding RNAs all intermixed. To achieve this, we took great care to provide extra fields to the track that allow for easy identification and filtering of the type of transcript you are interested in.
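As a rough illustration of where such extra fields can come from, here is a minimal sketch of reading one record of a GFF3 file, the format of the NCBI source mentioned earlier. It assumes the standard nine tab-separated columns with `key=value` attribute pairs separated by semicolons. The example line, the `parse_gff3_line` helper and the returned field names are hypothetical and are not taken from the actual NCBI file or from our track-building code.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into its columns and an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols

    attributes = {}
    for pair in attrs.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        # GFF3 percent-encodes reserved characters inside attribute values.
        attributes[unquote(key)] = unquote(value)

    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Hypothetical record, for illustration only.
example = "chrExample\tExampleSource\tmRNA\t100\t900\t.\t+\t.\tID=rna1;Parent=gene1;Name=NM_006331"
record = parse_gff3_line(example)
print(record["type"], record["attributes"]["Name"])
```

This sketch skips comment lines, multi-valued attributes (comma-separated values) and the parent/child linking between gene, transcript and exon features, which is where most of the real edge cases live.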
Accordingly, we updated our Variant Classification algorithm in SVS 8.2.0 to annotate, by default, only non-predicted, protein-coding transcripts (those with NM_ transcript identifiers). This brings RefSeq in line with the Ensembl gene track, which has similarly been inclusive of incomplete and predicted transcripts.
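The default described above can be pictured as a filter on the RefSeq accession prefix. The sketch below is only an illustration and not the SVS implementation: the category labels and the `classify` and `default_annotation_set` helpers are assumptions, and every accession other than NM_006331 is made up. It relies on the RefSeq convention that NM_ and NR_ mark curated mRNA and non-coding RNA, while XM_ and XR_ mark their predicted counterparts.

```python
# RefSeq accession prefixes and the (assumed) labels used in this sketch.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "XM_": "mRNA predicted",
    "NR_": "ncRNA",
    "XR_": "ncRNA predicted",
}


def classify(accession):
    """Return a coarse transcript type based on the accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3], "other")


def default_annotation_set(accessions):
    """Keep only non-predicted, protein-coding transcripts (NM_ accessions)."""
    return [acc for acc in accessions if classify(acc) == "mRNA"]


# Hypothetical accessions, apart from NM_006331, which is discussed above.
transcripts = ["NM_006331", "XM_000001.1", "NR_000002.1", "XR_000003.1"]
print(default_annotation_set(transcripts))
```

Keeping the classification separate from the filter means the predicted and non-coding transcripts stay in the track and can still be selected explicitly when they are of interest.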
Here are the stats of the RefSeq Genes 105 transcript set by type:

| Transcript Type | Count |
| --- | --- |
| mRNA (NM_*) | 34,663 |
| mRNA Predicted (XM_*) | 30,077 |
| microRNA | 1,501 |
| ncRNA (rRNA, tRNA, other) | 7,090 |
| Pseudo, incomplete | 11,619 |
| Total | 84,950 |
While I could write many posts on the difficulties of dealing with the GFF file NCBI produced, it was worth the effort to handle the many edge cases which demanded special parsing to produce our latest default gene track. The extra annotations provided by NCBI have been cleaned up and hyperlinked to provide a platform for exploring a gene and its
