Team II Comparative Genomics Group
Revision as of 15:30, 14 April 2020
Team 2: Comparative Genomics
Team Members: Kara Keun Lee, Courtney Astore, Kristine Lacek, Ujani Hazra, Jayson Chao
Class Presentations
Introduction
What is Comparative Genomics?
Once genomes are fully assembled and annotated, outbreak analysis can begin via comparative genomics. Generally, metadata obtained from gene prediction and annotation can be used to map the relatedness of multiple isolates. Combined with epidemiological data, a given outbreak can be traced back to a particular source (patient zero) and tracked to determine which strains are outbreak isolates and which are sporadic cases. Furthermore, phenotypic features such as virulence, antibiotic resistance, and pathogenicity can be determined. Compiling these data allows recommendations to be made regarding human impact, treatment strategy, and management methods to limit further spread.
Our Data
- Our genomic data comes from 50 isolates of C. jejuni from an outbreak of foodborne illnesses. The genomes are assembled and fully annotated.
- Our epidemiological data include times, locations, and ingested foods of each case.
Pipeline Overview
Objectives
- Identify kinds of strains (outbreak vs. sporadic)
- Construct phylogeny demonstrating which isolates are related and which differ
- Determine the source of the outbreak
- Map virulence and antibiotic resistance features of outbreak isolates
- Compile recommendations for outbreak response and treatment
Overview of Techniques
When performing phylogenomics, there are many options by which one can classify similarities and differences across the genome. Our approach utilizes tools from three different techniques.
ANI
Definition
Average Nucleotide Identity (ANI) is a measure of nucleotide-level genomic similarity between the coding regions of two genomes. A value of 70% DDH (DNA-DNA hybridization, measured on 1 kb fragments of the genome) was proposed as the recommended standard for delineating species, and it is the gold standard for species definition based on hybridization experiments. In a study combining hybridization experiments with bioinformatic analysis of bacterial genomes, an ANI of 95% corresponded to 70% DDH.
How to calculate
Calculating ANI usually involves fragmenting the genome sequences, followed by nucleotide sequence search, alignment, and identity calculation. The original algorithm used BLAST as its search engine. Mirroring the hybridization experiment, one genome (the query genome, A) is cut into 1 kb fragments. Each 1 kb fragment is then searched against genome B using BLAST or another local alignment tool (usearch, for example), and the identity of each fragment is calculated. The average identity across those fragments is reported as the ANI.
A reciprocal search of genome B against genome A was later proposed as more reliable and accurate; this approach is also called orthologous ANI.
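The fragment, search, and average steps can be sketched in a few lines of Python. This is a toy illustration only: the ungapped sliding-window search stands in for BLAST, the function names are hypothetical, and averaging the two directions is a simplification assumed here rather than the reciprocal-best-hit logic that real orthologous ANI tools use.

```python
from statistics import mean


def fragments(seq, size=1000):
    """Cut the query genome into consecutive, non-overlapping fragments."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def best_identity(fragment, reference):
    """Best ungapped identity of `fragment` over every offset in `reference`.

    Stand-in for a BLAST-like local search; real tools allow gaps.
    """
    n = len(fragment)
    best = 0.0
    for start in range(len(reference) - n + 1):
        window = reference[start:start + n]
        matches = sum(a == b for a, b in zip(fragment, window))
        best = max(best, matches / n)
    return best


def one_way_ani(query, reference, size=1000):
    """Mean of the per-fragment best identities (query against reference)."""
    identities = [best_identity(f, reference) for f in fragments(query, size)]
    return mean(identities) if identities else 0.0


def two_way_ani(genome_a, genome_b, size=1000):
    """Assumed simplification: average of the two one-way values."""
    return (one_way_ani(genome_a, genome_b, size)
            + one_way_ani(genome_b, genome_a, size)) / 2
```

With real genomes the fragment size would stay at the 1 kb default; a much smaller `size` is only useful for trying the functions on short test strings.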
Tools and Algorithms
Many tools are available to calculate ANI; some are alignment-based while others are alignment-free. ANI values from different tools may differ slightly, but they are comparable.
Alignment based tools
ANIb and ANIm denote BLAST-based ANI and MUMmer-based ANI, respectively. Both are one-way ANI, without a reciprocal search. Popular software includes JSpecies, a Java-based ANI calculator. JSpecies has a user-friendly interface but is not suited to large datasets.
Reciprocal ANIb and ANIm are the most popular ANI calculation methods. Tools include ani.rb and OrthoANI, which are Ruby- and Java-based, respectively. Both can calculate pairwise ANI for datasets with thousands of genomes. ani.rb is BLAST-based, while OrthoANI can use three different search methods: BLAST, MUMmer, and usearch (usearch_local).
- gANI (genome-wide ANI): https://ani.jgi.doe.gov/html/download.php
gANI uses the high-performance similarity search tool NSimScan: the protein-coding genes of genomes A and B are compared at the nucleotide level. It reaches high speed via query aggregation, optimized bitwise operations in alignment computation, and avoidance of dynamic programming.
Alignment free tools
FastANI uses Mashmap: fragments of genome A are mapped to the reference genome (B) using Mashmap. Mashmap first indexes the reference genome and subsequently computes mappings as well as alignment identity estimates for each query fragment, one at a time. At the end of the Mashmap run, all the query fragments $f_1, f_2, \ldots, f_{\lfloor |A|/l \rfloor}$, where $l$ is the fragment length, are mapped to B. The results are saved in a set $M$ containing triplets of the form $\langle f, i, p \rangle$, where $f$ is the fragment id, $i$ is the identity estimate, and $p$ is the starting position where $f$ is mapped to B. The subset of $M$ (say $M_{forward}$) corresponding to the maximum identity mapping for each query fragment is then extracted. To further identify the reciprocal matches, each triplet $\langle f, i, p \rangle$ in $M_{forward}$ is "binned" based on its mapping position in the reference, with its value updated to $\langle f, i, \text{bin} \rangle = \langle f, i, \lfloor p/l \rfloor \rangle$. Through this step, fragments that are mapped to the same or nearby positions on the reference genome are likely to get the same bin value. Next, $M_{reciprocal}$ is formed by keeping only the maximum identity mapping for each bin. Finally, FastANI reports the mean identity of all the triplets in $M_{reciprocal}$.
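The forward and reciprocal filtering steps are easier to follow in code. The sketch below assumes the Mashmap mappings are already available as (fragment id, identity, position) triplets; the function name and the example numbers are hypothetical, and this is not FastANI's own implementation.

```python
from statistics import mean


def reciprocal_mean_identity(mappings, fragment_length):
    """mappings: iterable of (fragment_id, identity, position) triplets."""
    # M_forward: keep the maximum-identity mapping for each query fragment.
    forward = {}
    for frag, identity, pos in mappings:
        if frag not in forward or identity > forward[frag][1]:
            forward[frag] = (frag, identity, pos)

    # Bin each forward mapping by floor(p / l), then keep the best per bin
    # to obtain M_reciprocal.
    reciprocal = {}
    for frag, identity, pos in forward.values():
        bin_id = pos // fragment_length
        if bin_id not in reciprocal or identity > reciprocal[bin_id][1]:
            reciprocal[bin_id] = (frag, identity, bin_id)

    if not reciprocal:
        return None
    # Final value: mean identity over all triplets in M_reciprocal.
    return mean(identity for _, identity, _ in reciprocal.values())


# Made-up mappings: fragment 0 maps twice; fragments 1 and 2 share a bin.
example = [(0, 98.0, 0), (0, 95.0, 5000), (1, 97.0, 3100),
           (2, 90.0, 3500), (3, 99.0, 9000)]
print(reciprocal_mean_identity(example, fragment_length=3000))
```

In the example, the weaker mapping of fragment 0 is dropped in the forward step, and fragment 2 is dropped in the reciprocal step because fragment 1 has the higher identity in the same bin.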
MLST
MLST, or Multi-locus Sequence Typing, identifies a set of loci (housekeeping genes) in the genome and compares the sequence at each locus in an isolate against the known alleles for that locus. It estimates the relationships between bacteria based on allelic variation at specific loci rather than on their nucleotide sequences directly. MLST data can be used to investigate evolutionary relationships among bacteria. However, the sequence conservation of the housekeeping genes limits the discriminatory power of MLST in differentiating bacterial strains.
There are several types of MLST:
Whole-genome MLST (wgMLST): All loci of a given isolate compared to equivalent loci in other isolates (typing scheme based on a few thousand genes).
- Creates wgMLST tree (different styles exist)
- Minimum spanning tree = circles with sizes indicative of the frequency of each ST, with distances shown on the connecting lines
Core-genome MLST (cgMLST): Focused on only the core elements of the genomes of a group of bacteria (typing scheme based on a few hundred genes).
7-Gene MLST: Chooses 7 loci in the genome and compares all genomes at these 7 loci.
- Profile of alleles (“sequence type” or ST) by calling the alleles
- Genome assembly optional – there are assembly free methods
- Creates a phylogeny
Ribosomal MLST (rMLST): Based on 53 loci that code for ribosomal proteins present in most bacteria.
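For the 7-gene scheme, calling alleles and looking up the resulting profile can be sketched as below. The locus names follow the C. jejuni scheme on PubMLST, but the allele sequences, allele numbers, and ST labels are made up for illustration, and exact string matching is an assumed stand-in for the alignment or k-mer matching that real tools perform.

```python
LOCI = ("aspA", "glnA", "gltA", "glyA", "pgm", "tkt", "uncA")

# Hypothetical allele database: for each locus, sequence -> allele number.
TOY_ALLELES = {locus: {"ACGT": 1, "ACGA": 2} for locus in LOCI}

# Hypothetical profile table: allele profile -> sequence type label.
TOY_PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): "toy-ST-1",
    (2, 1, 1, 1, 1, 1, 1): "toy-ST-2",
}


def call_alleles(locus_sequences, allele_db):
    """Return the allele number per locus (None if missing or novel)."""
    return tuple(
        allele_db[locus].get(locus_sequences.get(locus)) for locus in LOCI
    )


def sequence_type(profile, profile_table):
    """Look up the ST for a complete profile; None if no ST matches."""
    return profile_table.get(profile)


# An isolate that differs from toy-ST-1 only at aspA.
isolate = {locus: "ACGT" for locus in LOCI}
isolate["aspA"] = "ACGA"
profile = call_alleles(isolate, TOY_ALLELES)
print(profile, sequence_type(profile, TOY_PROFILES))
```

A profile containing `None` never matches the table, which mirrors how a novel or missing allele leaves an isolate without an assigned ST.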
Database: PubMLST
PubMLST (Public databases for molecular typing and microbial genome diversity) for Campylobacter jejuni/coli (as of 17 March 2020):
- 98,017 isolates
- 50,138 genomes
- 1,286,733 alleles
Tool: stringMLST
stringMLST is a tool for detecting the MLST of an isolate directly from the genome sequencing reads.
- Predicts the ST of an isolate in a completely assembly- and alignment-free manner
- Downloads and builds databases from PubMLST using the most recent allele and profile definitions
- Faster than traditional MLST tools while maintaining high accuracy
Tool: mlst (https://github.com/tseemann/mlst)
Assembly-based MLST caller that scans contig files against traditional PubMLST typing schemes and uses BLASTN to align sequences to alleles
- Very fast; searches all databases on PubMLST to automatically detect the organism, then calculates the STs.
- Can build its own database, but also ships with a bundle of all available databases in its software repository, which are regularly updated
Tool: ARIBA
Assembly-based tool, primarily developed for identifying antimicrobial resistance (AMR)-associated genes and single nucleotide polymorphisms directly from short reads
- Provides built-in support for multi-locus sequence typing (MLST) using data from PubMLST
- Provides inbuilt support for PlasmidFinder and VFDB (Virulence Factor Databases)
- Can be used in the study of Virulence Profile and AMR features along with the results from the Functional Annotation group
SNP Typing
SNP stands for Single Nucleotide Polymorphism, meaning that at a given locus, isolates may carry one of two or three alternative bases. As SNPs accumulate through de novo mutations and are passed down through generations, comparing a given isolate's SNPs to those of other isolates and a reference genome allows the phylogenetic distance between samples to be estimated (1). Tools have been developed to compare bases position by position (SNP calling) and create matrices that quantify relatedness between samples based on shared SNPs.
Generalized Algorithm Overview:
- Pre-processing and read cleaning
- Mapping
- SNP calling against a reference genome
- Phylogeny generation based on SNP profiles
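The last two steps of this overview, SNP calling against a reference and building a relatedness matrix, can be illustrated with a minimal sketch. It assumes the sequences are already aligned to the same coordinates and of equal length, treats `N` as an uncalled base, and uses made-up isolate names and sequences; real pipelines call SNPs from mapped reads with quality filters.

```python
from itertools import combinations


def snp_positions(reference, sample):
    """Positions where the sample differs from the reference (N is skipped)."""
    return {
        i for i, (r, s) in enumerate(zip(reference, sample))
        if r != s and "N" not in (r, s)
    }


def snp_distance_matrix(samples):
    """samples: dict of name -> aligned sequence. Returns nested dict of counts."""
    names = sorted(samples)
    dist = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        # Count differing positions, ignoring any position with an N.
        d = sum(
            x != y
            for x, y in zip(samples[a], samples[b])
            if "N" not in (x, y)
        )
        dist[a][b] = dist[b][a] = d
    return dist


# Made-up aligned sequences for three hypothetical isolates.
toy_samples = {
    "iso1": "ACGTACGT",
    "iso2": "ACGTACGA",
    "iso3": "TCGTACGA",
}
matrix = snp_distance_matrix(toy_samples)
print(matrix["iso1"]["iso3"])
```

A distance matrix of this kind is what a distance-based tree method (neighbor joining, for example) would take as input in the phylogeny step.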
Tools to be tested:
Tool: kSNP3.0
- Optimal for situations where whole genome alignments don't work
- k-mer-based approaches are alignment-free and have a faster runtime
- Multiple kSNP versions have been created and thoroughly tested
Tool: Lyve-SET
- MSA-based approach (computationally expensive)
- Consistent performance according to literature
- Has higher sensitivity (Sn), higher specificity (Sp), and a higher average of Sn and Sp than kSNP
Tool: Parsnp
- Rapid Core Genome Multi-Alignment
- Output includes variant (SNP) calls, core genome phylogeny and multi-alignments
- Uses information provided by multi-alignments flanking SNP sites for QC
Table: SNP-based tool comparison
Results
ANI
MLST
SNP Typing
ParSNP Results
While ParSNP ran very quickly (~30 seconds) on our 10-isolate benchmarking pipeline, there was a disconnect between the calculated statistics and the subsequent tree generation (only 9 of 10 isolates were represented in the tree). Therefore, we decided this tool was not optimal for our data.
Figure: ParSNP calculated statistics (10 samples present with 1 reference genome) and ParSNP-generated phylogenetic tree (only 9 samples present with 1 reference genome)
kSNP3.1 Results
Determining k: Since kSNP is a k-mer-based tool, there is an optimal k-mer size k for a given dataset. To find it, we ran MakeFasta and Kchooser on our data. Our percent uniqueness plateaued at 95.8%, so we reran Kchooser with that plateau as the cutoff. The optimal value was determined to be k = 25.
Figure: Kchooser output before and after a cutoff was specified. Note percentage asymptote at approximately 0.958
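The idea behind choosing k by uniqueness can be sketched as follows: for each candidate k, measure the fraction of distinct k-mers that occur exactly once, and take the smallest k that reaches the cutoff. This is an assumed approximation of the concept, not Kchooser's actual implementation; the function names and the short test sequence are hypothetical.

```python
from collections import Counter


def unique_kmer_fraction(seq, k):
    """Fraction of distinct k-mers in `seq` that occur exactly once."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


def smallest_k_reaching(seq, cutoff, k_values):
    """First k in `k_values` whose unique fraction is at least `cutoff`."""
    for k in k_values:
        if unique_kmer_fraction(seq, k) >= cutoff:
            return k
    return None


# Toy sequence; a real run would use a whole genome and larger odd k values.
toy_genome = "ACGTACGTTT"
for k in (3, 5, 7):
    print(k, round(unique_kmer_fraction(toy_genome, k), 3))
print(smallest_k_reaching(toy_genome, cutoff=0.958, k_values=(3, 5, 7)))
```

The cutoff of 0.958 mirrors the plateau reported above; on real data the uniqueness curve rises with k and flattens, which is why a cutoff at the plateau is needed for the search to terminate on a sensible value.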
Figure: SNP count across sequences determined by kSNP3 tool.
Figure: SNP-based phylogeny annotated with locations of cases
As with the MLST-based tools, we found little correspondence between SNP-based clustering and case location.
Figure: SNP-based phylogeny results converged with MLST results
The 5 clades in our phylogenetic tree divide into the 4 Sequence Types determined by MLST analysis.
Outbreak Analysis Results
Outbreak Response
- Analyze date distribution / geographic outbreak plots
- Refer related cases to physicians for treatment
- Alert state labs of heightened related cases
Figure: Geographic Heat Map of Cases
Note that cases, while elevated in the Mid-Atlantic region, are not clustered in any one specific area. Only two pairs of the 9 affected states even border one another.
Figure: Date distribution
In our dataset, the first isolate was sampled in late June of 2018. Cases then continued for almost 11 months until the final sample was taken in mid-May of 2019. December 2018 was the month with the most reported cases.
Starting in January of 2019, a C. jejuni outbreak linked to pet-store puppies began. Because of a 3-state overlap between our outbreak and the puppy outbreak, we believe it is possible that a few of our isolates are attributable to the puppy outbreak and are not foodborne.
Figure: Frequency of different food types across samples.
Food types reported fell into 4 different food categories.
1. Dairy: cheese, cookie batter
2. Grocery: chocolate, chips, wheat bread
3. Meat: chorizo, grilled chicken, beef
4. Produce: apples, avocado, celery, lettuce, oranges, melons
The individual foods associated with the most cases were chorizo, beef, lettuce, and avocado.
Source of Outbreak
There is little correlation between locations and sequence types, and most sequence types have cases reported over the span of one year. We believe that there are many instances of sporadic strains but only one outbreak strain and source. This would explain how sister taxa in our phylogenetic tree only come from the same state 20% of the time.
Our outbreak strain is ST 2109. Its cases occurred only on the East Coast, over a span of 3 months. The outbreak source traces back to the Dairy category: cheese and cookie batter.
We found that our remaining strains, potentially linked to vegetables and meat, are sporadic given the wide variation of time, foods, and location, and do not classify as an outbreak.
These findings are not surprising, given that C. jejuni infections are frequently sporadic.
Human Impact
C. jejuni infection is one of the leading causes of gastroenteritis in the United States. Acute gastroenteritis is an inflammation of the intestinal lining, characterized by emesis, chills, abdominal cramping/pain, fever, and diarrhea.
However, C. jejuni can cause a multitude of other clinical manifestations including Meningitis, Miller-Fisher Syndrome, Periodontitis, Hepatitis, and infections of the bone or soft tissue.
Figure: Summary of clinical manifestations of C. jejuni infections according to Clinical Reviews of Epidemiology
Treatment Strategy
- Determination of antibiotics that will be most effective and ineffective from AMR profile
- Recommendation to health providers which antibiotics to avoid
AMR Gene Detected | Resistant to Drug/Class | Antibiotic Mechanism of Action |
APH2/3 | Aminoglycosides | Inhibition of 30S subunit |
OXA184/451 | Beta-lactamase inhibitors | Blocks cell wall synthesis |
SAT-4 | Streptothricin | Inhibition of 30S subunit |
cmeA/B/C/R | Ciprofloxacin, norfloxacin, cefotaxime, fusidic acid, erythromycin | Not a specific class; cmeABC is a multidrug efflux pump |
gyrA | Fluoroquinolone | DNA synthesis inhibitor |
tetO | Tetracycline | Inhibition of 30S subunit |
After eliminating the drugs and drug classes to which isolates could be resistant, we consulted published studies to assemble our final recommendation. OXA184 is the resistance gene present in our outbreak strain (ST 2109), so we particularly avoided Beta-lactamase inhibitors. We chose a class of drugs that inhibits the 50S subunit, specifically macrolides. Since the cmeABC efflux pump would confer resistance to erythromycin, we suggest 5-7 days of treatment with either azithromycin or clarithromycin.
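The elimination step can be expressed as a simple set difference. In the sketch below, the gene-to-drug mapping is transcribed from the table above, while the candidate list and function name are hypothetical; it is a plain string match meant to show the reasoning, not a clinical decision tool.

```python
# Resistance associations transcribed from the AMR table.
RESISTANCE = {
    "APH2/3": {"aminoglycosides"},
    "OXA184/451": {"beta-lactamase inhibitors"},
    "SAT-4": {"streptothricin"},
    "cmeA/B/C/R": {"ciprofloxacin", "norfloxacin", "cefotaxime",
                   "fusidic acid", "erythromycin"},
    "gyrA": {"fluoroquinolones"},
    "tetO": {"tetracyclines"},
}


def remaining_options(detected_genes, candidates):
    """Drop every candidate that a detected gene is associated with."""
    excluded = set().union(
        *(RESISTANCE.get(gene, set()) for gene in detected_genes)
    )
    return [drug for drug in candidates if drug not in excluded]


# Hypothetical candidate list spanning the classes discussed above.
candidates = ["erythromycin", "azithromycin", "clarithromycin",
              "tetracyclines", "fluoroquinolones", "aminoglycosides"]
print(remaining_options(RESISTANCE.keys(), candidates))
```

With every gene in the table treated as detected, only azithromycin and clarithromycin survive from this candidate list, matching the recommendation in the text.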
CDC Recommendations
- Thoroughly wash vegetables
- Double-check expiration dates of meat/dairy products
- Cook meat to the correct temperature
- Notify healthcare providers in 9 states of recommended antibiotics (Macrolides)
- Encourage sick individuals to seek medical treatment and stay home from school or work
Works Cited
1. Touchman, J. (2010). "Comparative Genomics". Nature Education Knowledge. 3 (10): 13.
2. Xia, X. (2013). Comparative Genomics. SpringerBriefs in Genetics. Heidelberg: Springer. doi:10.1007/978-3-642-37146-2. ISBN 978-3-642-37145-5.
3. Goris, J., Konstantinidis, K. T., Klappenbach, J. A., Coenye, T., Vandamme, P., & Tiedje, J. M. (2007). DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. International Journal of Systematic and Evolutionary Microbiology, 57, 81–91.
4. Konstantinidis, K. T., & Tiedje, J. M. (2005). Genomic insights that advance the species definition for prokaryotes. Proceedings of the National Academy of Sciences of the United States of America, 102, 2567–2572.
5. Arahal, D.R. (2014). Whole-genome analyses: average nucleotide identity. In: Methods in microbiology. Elsevier, pp. 103-122.
6. Richter, M., & Rosselló-Móra, R. (2009). Shifting the genomic gold standard for the prokaryotic species definition. Proceedings of the National Academy of Sciences of the United States of America, 106, 19126–19131.
7. Varghese, N.J., Mukherjee, S., Konstantinidis, K.T. & Mavrommatis, K. (2015) Microbial species delineation using whole genome sequences. Nucleic Acids Research, 43, 6761–6771.
8. Wayne, L. G., Brenner, D. J., Colwell, R. R., Grimont, P. A. D., Kandler, O., Krichevsky, M. I., Moore, L. H., Moore, W. E. C., Murray, R. G. E. & other authors (1987). International Committee on Systematic Bacteriology. Report of the ad hoc committee on reconciliation of approaches to bacterial systematics. Int J Syst Bacteriol 37, 463–464.
9. Jain, C., Dilthey, A., Koren, S., Aluru, S. & Phillippy, A. M. A fast approximate algorithm for mapping long reads to large reference databases. In International Conference on Research in Computational Molecular Biology (Springer, Hong Kong, 2017).
10. https://www.applied-maths.com/applications/mlst
11. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5472909/
12. https://pubmlst.org/campylobacter/
13. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5610716/
14. https://academic.oup.com/bioinformatics/article/33/1/119/2525695
15. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5857373/
16. Lee, T., Guo, H., Wang, X. et al. SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics 15, 162 (2014). https://doi.org/10.1186/1471-2164-15-162
17. Katz, Lee S et al. “A Comparative Analysis of the Lyve-SET Phylogenomics Pipeline for Genomic Epidemiology of Foodborne Pathogens.” Frontiers in microbiology vol. 8 375. 13 Mar. 2017, doi:10.3389/fmicb.2017.00375
18. Shea N Gardner, Tom Slezak, Barry G. Hall, kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome, Bioinformatics, Volume 31, Issue 17, 1 September 2015, Pages 2877–2878, https://doi.org/10.1093/bioinformatics/btv271
19. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6737581/#!po=11.3636
20. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5243249/
21. https://www.cdc.gov/foodsafety/outbreaks/investigating-outbreaks/index.html
22. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4870677/
23. https://www.genome.jp/kegg/annotation/br01600.html
24. https://card.mcmaster.ca/ontology/37153
25. https://www.orthobullets.com/basic-science/9059/antibiotic-classification-and-mechanism
26. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3462042/
27. https://cmr.asm.org/content/28/3/687
28. https://www.cdc.gov/campylobacter/outbreaks/puppies-12-19/index.html