Team II Comparative Genomics Group

Team 2: Comparative Genomics

Team Members: Kara Keun Lee, Courtney Astore, Kristine Lacek, Ujani Hazra, Jayson Chao

Class Presentations

File:Background.pdf

Introduction

What is Comparative Genomics?

Once genomes are fully assembled and annotated, outbreak analysis can begin via comparative genomics. Generally, metadata ascertained from gene prediction and annotation can be used to map the relatedness of multiple isolates. Combined with epidemiological data, a given outbreak can be mapped back to a particular source (patient zero) and tracked to determine which strains are outbreak isolates and which are sporadic cases. Furthermore, phenotypic features such as virulence, antibiotic resistance, and pathogenicity can be determined. Compilation of these data allows for recommendations to be made on behalf of human impact, treatment strategy, and management methods to address further spread.

Our Data

Our genomic data comes from 50 isolates of C. jejuni from an outbreak of foodborne illnesses. The genomes are assembled and fully annotated.
Our epidemiological data include times, locations, and ingested foods of each case.

Pipeline Overview

Objectives

Identify kinds of strains (outbreak vs. sporadic)

Construct phylogeny demonstrating which isolates are related and which differ

Determine the source of the outbreak

Map virulence and antibiotic resistance features of outbreak isolates

Compile recommendations for outbreak response and treatment

Overview of Techniques

When performing phylogenomics, there are many options by which one can classify similarities and differences across the genome. Our approach utilizes tools from three different techniques.

ANI

Definition

Average Nucleotide Identity (ANI) is a measure of nucleotide-level genomic similarity between the coding regions of two genomes. A value of 70 % DDH (DNA-DNA hybridization, 1 kb fragments of genome) was proposed as a recommended standard for delineating species and it is a golden standard for species definition based on hybridization experiment. A 95% percent ANI corresponds to 70 % DDH in a study combining experiment hybridization and bioinformatic analysis of bacterial genomes.

How to calculate

Calculating ANI usually involves the fragmentation of genome sequences, followed by nucleotide sequence search, alignment, and identity calculation. The original algorithm to calculate ANI used the BLAST program as its search engine. As it is done in the hybridization experiment, one genome (query genome) will be chopped in to 1 kb fragments when comparing two genomes using bioinformatic tools. Then each of those 1 kb fragments will be search against genome B using blast or other local alignment tools (search for example). Then the identity of each 1 kb fragment will be calculated. The average identity value of those fragments will be used as ANI.

A reciprocal search of genome B against genome A was then proposed to be more reliable and accurate, which is also called orthlogous ANI.

Tools and Algorithms

There are a bunch of tools to use to calculate ANI, some of which are alignment-based while others are not. ANI values using different tools may be a little bit different but they are comparable.

Alignment based tools

ANIb and ANIm (Jspecies: )

ANIb and ANIm represent blast-blast ANI and MUMer-based ANI. They are both one way ANI, without reciprocal search. Popular software includes JSpecies, java based ANI calculation. JSpecies is a friendly UI interface but cannot be used to calculate large datasets.

Reciprocal ANIb and ANIm (ani.rb: OrthANI: )

Reciprocal ANIb and ANIm are the most popular ANI calculation method. Tools such as ani.rb and OrthANI, which are ruby and java based, respectively. Both ani.rb and OrthANI ani.rb can be used to calculate pairwise ANI for datasets with thousands of genomes. ani.rb is blast-based while OrthANI can use 3 different method such as blast, MUMer and usearch (usearch_local).

gANI (genome wide ANI)(gANI: [ https://ani.jgi.doe.gov/html/download.php])

gANI use high performance similarity search tool NSimScan: protein-coding genes (A, B) were compared at the nucleotide level. It reaches high speed via query aggregation, use of optimized bitwise operations in alignment computing, and by avoidance of dynamic programming

Alignment free tools

fastANI (Mashmap based ANI) Tool:

fast ANI use Mashmap: fragments of genome A are mapped to the reference genome (B) using Mashmap. Mashmap first indexes the reference genome and subsequently computes mappings as well as alignment identity estimates for each query fragment, one at a time. At the end of the Mashmap run, all the query fragments f1;f2 1⁄4fbjAj=lc are mapped to B. The results are saved in a set M containing triplets of the form 〈f, i, p〉, where f is the fragment id, i is the identity estimate, and p is the starting position where f is mapped to B. The subset of M (say Mforward) corresponding to the maximum identity mapping for each query fragment is then extracted. To further identify the reciprocal matches, each triplet 〈f, i, p〉 in Mforward is ‘‘binned’’ based on its mapping position in the reference, with its value updated to hf ; i; bini 1⁄4 hf ; i; bp=lci. Through this step, fragments which are mapped to the same or nearby positions on the reference genome are likely to get equal bin value. Next, Mreciprocal filters the maximum identity mapping for each bin. Finally, FastANI reports the mean identity of all the triplets in Mreciprocal

MLST

MLST or Multi-locus Sequence Typing identifies a set of loci (housekeeping genes) in the genome and compares each locus in a genome against the set of loci. It estimates the relationships between bacteria based on allelic variations in specific loci than their nucleotide sequences. MLST data can be used to investigate evolutionary relationships among bacteria. However, the sequence conservation of the housekeeping genes limits the discriminatory power of MLST in differentiating bacterial strains.

There are several types of MLST:

Whole-genome MLST (wgMLST): All loci of a given isolate compared to equivalent loci in other isolates (typing scheme based on a few thousand genes).

Creates wgMLST tree (different styles exist)

Core-genome MLST (cgMLST): Focused on only the core elements of the genomes of a group of bacteria (typing scheme based on a few hundred genes).

7-Gene MLST: Chooses 7 loci in the genome and compare all genomes to these 7 loci.

Profile of alleles (“sequence type” or ST) by calling the alleles
Genome assembly optional – there are assembly free methods

Ribosomal MLST (rMLST): Based on 53 loci that code for ribosomal proteins present in most bacteria.

Database: PubMLST

PubMLST (Public databases for molecular typing and microbial genome diversity) for Campylobacter jejuni/coli (as of 17MAR2020):

98,017 isolates
50,138 genomes
1,286,733 alleles

Tool: stringMLST

stringMLST is a tool for detecting the MLST of an isolate directly from the genome sequencing reads.

Predicts the ST of an isolate in a complete assembly and alignment-free manner
Downloads and builds databases from pubMLST using the most recent allele and profile definitions
Faster algorithm compared to traditional MLST tools that maintain high accuracy

Tool: mlst

Assembly based MLST caller that scan contig files against traditional PubMLST typing schemes and uses BLASTN to align sequences to alleles

Very fast and searches all databases on pubMLST to automatically detect the organism, then calculates the STs.
Can build DB but also has bundle of all available databases in their software repository, which are regularly updated

Tool: ARIBA

Assembly based tool, primarily developed for identifying Anti-Microbial Resistance - associated genes and single nucleotide polymorphisms directly from short reads

Provides inbuilt support for and functionality for multi-locus sequence typing (MLST) using data from PubMLST
Provides inbuilt support for PlasmidFinder and VFDB (Virulence Factor Databases)
Can be used in the study of Virulence Profile and AMR features along with the results from the Functional Annotation group

SNP Typing

SNP stands for Single Nucleotide Polymorphism, meaning that certain alleles have two or three possibilities as to which base is at a given locus. As SNPs accumulate through de novo mutations and are passed down through generations, comparing a given isolate's SNPs to other isolates and a reference genome allow ascertainment of phylogenetic distance between samples(1). Tools have been developed to compare bases position by position (SNP-calling) and create matrices to compute relatedness between samples based on common SNPs.

Generalized Algorithm Overview:

Pre-processing and read cleaning
Mapping
SNP calling against a reference genome
Phylogeny generation based on SNP profiles

Tools to be tested:

Tool: kSNP3.0

Optimal for situations where whole genome alignments don't work
k-mer-based approaches are alignment-free and have a faster runtime
Multiple kSNP versions have been created and thoroughly tested

Tool: Lyve-SET

MSA-based approach (computationally expensive)
Consistent performance according to literature
Has a higher sensitivity, specificity, and average Sn and Sp than kSNP

Tool: Parsnp

Rapid Core Genome Multi-Alignment
Output includes variant (SNP) calls, core genome phylogeny and multi-alignments
Uses information provided by multi-alignments flanking SNP sites for QC

Table: SNP-based tool comparison

Results

ANI

MLST

MLST tools were tested and the best tool for our pipeline was mlst based on following reasons:

Takes in contig files from Genome Assembly output and does not require assembly
Time to run 50 samples were 2 minutes 24 second, while stringMLST (with assembly time) took 4 minutes 21 seconds
Do not need to build db and specify a specific scheme (has auto-search feature unlike others that require curation of specific database)

To compare outputs from mlst (7-gene MLST), we performed analysis on our samples also with chewBBACA (wgMLST/cgMLST)

Creates schemes and allele calls on complete or draft genomes resulting from de novo assemblies
Allele calling algorithm is based on Blast Score Ratio that can be run in multiprocessor settings

MLST Results

There were 4 types of ST (sequence types) identified in our 50 samples by mlst tool: ST 45, ST 829, ST 2109, and ST 50. ChewBBACA cgMLST results very closely reflected the mlst results with 4 distinctive clusters. ST 829 were composed of only C.coli strains (13, identified in ANI & gene prediction), while 37 C.jejuni strains made up of 3 other STs. The following was visualized using GrapeTree.

MLST Results with Epidemiological Data

Epidemiological data available to us included information on location, time, and up to 4 consumed food items.

ST vs. Time: All STs with the exception of ST 2109, had cases well distributed throughout the year (June 2018 - May 2019). ST 2109 cases only occurred from December 2018 to February 2019 (3 month span).

We created a Microreact project to visualize the time-series MLST data: https://microreact.org/project/TYUx1N5L2

ST vs. Location: All STs with the exception of ST 2109, had cases well distributed in all regions of US (9 states). ST 2109 cases were localized to the Eastern US (VA, MA, OH, FL).

ST vs. Food: All STs with the exception of ST 2109, had a wide variety of food associated to illness (ranging from produce to meat to grocery food items). ST 2109 cases also had some of these items, but it was the only ST linked to dairy foods (cookie batter and cheese).

MLST analysis with epi data indicates ST 2109 as the Campylobacter jejuni outbreak strains based on specific time frame and location and link to unique food source.

SNP Typing

ParSNP Results

While ParSNP ran very quickly (~30 seconds) on our 10 isolate benchmarking pipeline, there was a disconnect between calculated statistics and subsequent tree generation (only 9/10 isolates represented in tree) Therefore, this tool was decided not to be optimal for our data.

Figure: ParSNP calculates statistic (10 samples present with 1 reference genome) and ParSNP phylogenetic tree generated (only 9 samples present with 1 reference genome)

kSNP3.1 Results

Determining k: Since kSNP is a k-mer based tool, there exists an optimal size k for a specific dataset. To determine this value, we ran MakeFasta and Kchooser to determine the best k for our data. Our percent uniqueness plateued at 95.8%, so we reran with that plateu as the cutoff. Optimal k was determined to be k = 25

Figure: Kchooser output before and after a cutoff was specified. Note percentage asymptote at approximately 0.958

Figure: SNP count across sequences determined by kSNP3 tool.

Figure: SNP-based phylogeny annotated with locations of cases

Like the results found with the MLST-based tools, we found little convergence with SNP-typing and location.

Figure: SNP-based phylogeny results converged with MLST results

The 5 clades in our phylogenetic tree divide into the 4 Sequence Types determined by MLST analysis.

Outbreak Analysis Results

Outbreak Response

Analyze date distribution / geographic outbreak plots
Refer related cases to physicians for treatment
Alert state labs of heightened related cases

Figure: Geographic Heat Map of Cases

Note that cases, while heightened in Mid-Atlantic region, are not clustered to any one specific area. Only two pairs of the 9 affected states are even bordering one another.

Figure: Date distribution

Our specific dataset saw its first isolate sampled in late June of 2018. Cases then continued for almost 11 months until the final sample was taken in mid-May of 2019. December 2018 saw the month with the most reported cases.

Starting in January of 2019, a C. jejuni outbreak began linked to pet-store puppies. We believe that because of a 3 state overlap between our outbreak and the puppy outbreak, that it is possible that a few of our isolates may be attributed to the puppy outbreak and not are not foodbourne.

Figure: Frequency of different food types across samples.

Food types reported fell into 4 different food categories.

1. Dairy: cheese, cookie batter

2. Grocery: chocolate, chips, wheat bread

3. Meat: chorizo, grilled chicken, beef

4. Produce: apples, avocado, celery, lettuce, oranges, melons

Individual foods with the most cases associated with them were chorizo, beef, lettuce, and avacado.

Source of Outbreak

There is little correlation between locations and sequence types, and most sequence types have cases reported over the span of one year. We believe that there are many instances of sporadic strains but only one outbreak strain and source. This would explain how sister taxa in our phylogenetic tree only come from the same state 20% of the time.

Our outbreak strain is ST 2109. It had cases on the East Coast only over the span of 3 months. The outbreak source traces back to Dairy: Cheese, Cookie batter.

We found that our remaining strains, potentially linked to vegetables and meat, are sporadic given the wide variation of time, foods, and location, and do not classify as an outbreak.

These findings are not surprising, given that C. jejuni has a nature of being highly sporadic.

Human Impact

A C. jejuni infection is one of the leading causes of gastroentiritis in the United States. Acute Gastroentirits an inflammation of the intestinal lining, charachterized by emisis, chills, abdominal cramping/pain, fever, and diarrhea.

However, C. jejuni can cause a multitude of other clinical manifestations including Meningitis, Miller-Fisher Syndrome, Periodontitis, Hepatitis, and infections of the bone or soft tissue.

Figure: Summary of clinical manifestations of C. jejuni infections according to Clinical Reviews of Epidemiology

Treatment Strategy

Determination of antibiotics that will be most effective and ineffective from AMR profile
Recommendation to health providers which antibiotics to avoid

AMR Gene Detected	Resistant to Drug/Class	Method of Antibiotic
APH2/3	aminglycosides	Inhibition of 30s subunit
OXA184/451	Beta-lactamase inhibitors	Blocks cell wall synthesis
SAT-4	Streptothricin	Inhibition of 30s subunit
cmeA/B/C/R	ciprofloxacin, norfloxacin, cefotaxime, fusidic acid, erythromycin	not a specific class. cmeABC is a multidrug efflux pump
gryA	Flouroquinolone	DNA synthesis inhibitor
tetO	Tetracycline	30s Subunit Inhibitor

After eliminating the drugs and classes of drugs to which isolates could be resistant, we consulted published studies to assemble our final reccomendation. OXA184 is the resistant gene present in our outbreak strain (ST2109), so we particularly avoided Beta-lactamase inhibitors. We chose a class of drugs that would inhibit the 50s subunit, more specifically, Macrolides. Since a cmaABC efflux pump would resist erythromycin, we suggest 5-7 days of treatment with either azithromycin or clarithromycin.

CDC Recommendations

Thoroughly wash vegetables

Double-check expiration dates of meat/dairy products

Cook meat to the correct temperature

Notify healthcare providers in 9 states of recommended antibiotics (Macrolides)

Encourage sick individuals to seek medical treatment and stay home from school or work

Works Cited

1. Touchman, J. (2010). "Comparative Genomics". Nature Education Knowledge. 3 (10): 13.

2. Xia, X. (2013). Comparative Genomics. SpringerBriefs in Genetics. Heidelberg: Springer. doi:10.1007/978-3-642-37146-2. ISBN 978-3-642-37145-5.

3. Goris, J., Konstantinidis, K. T., Klappenbach, J. A., Coenye, T., Vandamme, P., & Tiedje, J. M. (2007). DNA-DNA hybridization values and their relationship to whole- genome sequence similarities. International Journal of Systematic and Evolutionary Mi- crobiology, 57, 81–91

4. Konstantinidis, K. T., & Tiedje, J. M. (2005). Genomic insights that advance the species definition for prokaryotes. Proceedings of the National Academy of Sciences of the United States of America, 102, 2567–2572.

5. Arahal, D.R. (2014). Whole-genome analyses: average nucleotide identity. In: Methods in microbiology. Elsevier, pp. 103-122.

6. Richter, M., & Rossello ́-Mo ́ra, R. (2009). Shifting the genomic gold standard for the prokaryotic species definition. Proceedings of the National Academy of Sciences of the United States of America, 106, 19126–19131.

7. Varghese, N.J., Mukherjee, S., Konstantinidis, K.T. & Mavrommatis, K. (2015) Microbial species delineation using whole genome sequences. Nucleic Acid Research, 43, 6761–6771.

8. Wayne, L. G., Brenner, D. J., Colwell, R. R., Grimont, P. A. D., Kandler, O., Krichevsky, M. I., Moore, L. H., Moore, W. E. C., Murray, R. G. E. & other authors (1987). International Committee on Systematic Bacteriology. Report of the ad hoc committee on reconciliation of approaches to bacterial systematics. Int J Syst Bacteriol 37, 463–464.

9. Jain, C., Dilthey, A., Koren, S., Aluru, S. & Phillippy, A. M. A fast approximate algorithm for mapping long reads to large reference databases. In International Conference on Research in Computational Molecular Biology (Springer, Hong Kong, 2017).

10. https://www.applied-maths.com/applications/mlst

11. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5472909/

12. https://pubmlst.org/campylobacter/

13. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5610716/

14. https://academic.oup.com/bioinformatics/article/33/1/119/2525695

15. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5857373/

16. Lee, T., Guo, H., Wang, X. et al. SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics 15, 162 (2014). https://doi.org/10.1186/1471-2164-15-162

17. Katz, Lee S et al. “A Comparative Analysis of the Lyve-SET Phylogenomics Pipeline for Genomic Epidemiology of Foodborne Pathogens.” Frontiers in microbiology vol. 8 375. 13 Mar. 2017, doi:10.3389/fmicb.2017.00375

18. Shea N Gardner, Tom Slezak, Barry G. Hall, kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome, Bioinformatics, Volume 31, Issue 17, 1 September 2015, Pages 2877–2878, https://doi.org/10.1093/bioinformatics/btv271

19. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6737581/#!po=11.3636

20. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5243249/

21. https://www.cdc.gov/foodsafety/outbreaks/investigating-outbreaks/index.html

22. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4870677/

23. https://www.genome.jp/kegg/annotation/br01600.html

24. https://card.mcmaster.ca/ontology/37153

25. https://www.orthobullets.com/basic-science/9059/antibiotic-classification-and-mechanism

26. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3462042/

27. https://cmr.asm.org/content/28/3/687

28. https://www.cdc.gov/campylobacter/outbreaks/puppies-12-19/index.html