Team III Comparative Genomics Group
Team 3: Comparative Genomics
- File:Comparative Genomics - Team 3 Background and Strategy.pptx.pdf
- File:Comparative Genomics - Team 3 Results.pptx.pdf
Team Members: Swetha Singu, Ruize Yang, Deepali Kundnani, Gulay Bengu Ulukaya, Yuhua Zhang, Jie Zhou
Highlights of Results
Species of bacterial pathogen under investigation:- Listeria monocytogenes
Isolates belonging to the outbreak cluster:- CGT3372, CGT3833, CGT3587, CGT3601, CGT3768, CGT3292, CGT3773, CGT3194, CGT3409, CGT3556, CGT3058, CGT3666
Food source of the outbreak:- Chicken salad or ingredients used to make the chicken salad
Timeline of the outbreak:- The outbreak strain was first seen in July 2019 in Connecticut and reoccurred in the same state in September 2019. In October 2019 it appeared in Michigan, Alabama and Connecticut, and in December 2019 Wisconsin was added to the list of affected states.
Specific recommendations to CDC:
- 1. Tracking the source: Knowing the food source alone is not enough. Patients should be contacted for details of the chicken salad they consumed (brand, grocery store or restaurant), and patients who did not mention chicken salad should be asked to confirm whether they ate it. At the same time, investigators should check for distributors of pre-prepared chicken salad, or of ingredients used in chicken salad, operating from Connecticut (since that state had two occurrences two months apart before the wider outbreak).
During this period, people should be advised to eat cooked meals rather than cold meals, especially meat of any sort (at this point we don't know whether the facility at the source processes only chicken or also other meat or products that could be contaminated).
- 2. Confirming the strain of outbreak: Once the possible sources of the outbreak are narrowed down, samples from them should be sequenced using WGS and merged with the current data to confirm the source.
- 3. Taking Action: After confirmation of the main source of contamination (it can be farm animals, meat storage, ingredients, etc.), and with the help of the FDA, all processes involving that source, place or manufacturing unit, including any downstream processes, should be ceased, audited and tested. After further confirmation, decontamination measures for Listeria should be undertaken. Distribution should be put on hold until the products clear testing for Listeria and other foodborne pathogens.
- 4. Treatment of patients: The antibiotic recommendation would be ampicillin with gentamicin (if needed), trimethoprim-sulfamethoxazole, penicillin and vancomycin, meropenem, or a macrolide if the patient is allergic to penicillin and trimethoprim.
Introduction
Background
Comparative genomics is a field of biological research in which different organisms are compared for their genomic features, such as the DNA sequence, genes, order of genes and other genomic structural landmarks. This comparative analysis can reveal many insights, such as disease outbreak strains, sporadic strains, evolutionary lineages and genetic variations, to name a few. Our group was given 50 different isolates for which genome assembly, gene prediction and functional annotation had been performed, and the species was identified as Listeria monocytogenes.
L. monocytogenes belongs to Clade 1 of the genus Listeria. L. monocytogenes includes 14 serotypes, and 95% of human illnesses are caused by serotypes 1/2a, 1/2b and 4b. Serotype 4b is most commonly associated with outbreaks. L. monocytogenes is ubiquitous in nature and is a hardy organism that can withstand a wide range of conditions including freezing, drying, heat, and relatively high levels of acid, salinity, and alcohol, which makes it a particular problem in ready-to-eat foods that are not cooked before eating.
An outbreak is a sudden or violent start of something unwelcome. When two or more people experience a similar illness from the same source, such as eating the same food, it is called an outbreak. It usually takes about 2-10 weeks to determine whether a person is part of a Listeria outbreak. When a person is sick, laboratory tests are performed on samples from patients, foods and the environment to look for related strains of bacteria. Clusters of closely related Listeria isolates are identified by comparing the sequences of a patient's sample with those of other Listeria-infected patients, and the cluster of samples with the same bacterial DNA fingerprint is identified as a set of possible outbreak strains. Whole genome sequencing (WGS) has proven to be a powerful sub-typing tool for foodborne pathogenic bacteria, including L. monocytogenes, since 2013. Based on the knowledge that bacteria with the same DNA fingerprint are more likely to be from the same source, epidemiologists use this cluster data to find matches to food isolates (using GenomeTrakr) and identify the common food or outbreak source. In the trace-back step, the identified food is recalled and people are warned.
In humans, listeriosis causes miscarriage and stillbirth in pregnant women, death in newborns, and sepsis and meningitis in older patients. It has a high fatality rate of up to 20% in high-risk groups such as the elderly, immunocompromised people and pregnant women. There are about 1600 reported invasive infections annually, and about 1 in 5 of those patients die of listeriosis.
Objectives
- Identify the outbreak strains from the sporadic strains for the given isolates.
- Analyze the source of the outbreak using the epidemiological data provided.
- Determine the virulence and antibiotic resistance profiles of the outbreak isolates.
- Recommend an outbreak response and the classes of antibiotics to prescribe and to avoid.
Information at hand
- Raw fastq, assembled files and gene prediction files.
- Gff files with all annotated genes from functional annotation team for both genome and plasmids.
- Epidemiological data of 49 isolates with source, foods consumed and location information.
The fastq and fasta files are used as input by various comparative genomic analysis tools in order to find clusters of similar sequences and identify the outbreak strain. Based on these clusters, the epidemiological data and annotated gene information are used to identify the source of the outbreak and the antimicrobial resistance genes.
Pipeline
Approaches
The genomic data can be exploited with many different bioinformatics methods; we used five bioinformatics tools operating at different levels to compare the 50 isolates.
ANI
Average Nucleotide Identity (ANI) is a measure of nucleotide-level genomic similarity between the coding regions of two genomes. ANI values of >95% indicate the same species. Alignment-based ANI is based on the pairwise alignment of genome stretches. Its reliability depends on the quantity and quality of the aligned fragments, and computing the alignments is very time-consuming. Alignment-free ANI avoids expensive sequence alignments and is 10 times faster than alignment-based ANI.
We used FastANI as our ANI tool because it is stable and fast. It uses Mashmap as its MinHash-based sequence mapping engine to compute the orthologous mappings and alignment identity estimates. Here is the fastANI formula:

Command line
fastANI --ql <query list> --rl <ref list> -o <output>
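As a rough illustration of how the output of the command above can be read, the sketch below parses fastANI's tab-separated output (query, reference, ANI, mapped fragments, total query fragments) and flags hits at or above the 95% species cutoff mentioned earlier. The function names and the example lines are made up for illustration and are not our results.

```python
import csv
import io

SPECIES_THRESHOLD = 95.0  # ANI (%) cutoff for "same species" quoted in the text


def read_fastani(handle):
    """Parse fastANI's tab-separated output into a list of dicts.

    Columns: query, reference, ANI, mapped fragments, total query fragments.
    """
    rows = []
    for query, ref, ani, mapped, total in csv.reader(handle, delimiter="\t"):
        rows.append({"query": query, "ref": ref, "ani": float(ani),
                     "mapped": int(mapped), "total": int(total)})
    return rows


def best_reference(rows):
    """For each query keep the reference with the highest ANI."""
    best = {}
    for row in rows:
        current = best.get(row["query"])
        if current is None or row["ani"] > current["ani"]:
            best[row["query"]] = row
    return best


# Made-up example lines, for illustration only (not our results).
example = io.StringIO(
    "isolateA.fasta\tref1.fasta\t99.95\t950\t980\n"
    "isolateA.fasta\tref2.fasta\t94.10\t870\t980\n"
    "isolateB.fasta\tref1.fasta\t93.20\t850\t990\n"
)
for query, hit in best_reference(read_fastani(example)).items():
    same = hit["ani"] >= SPECIES_THRESHOLD
    print(query, hit["ref"], hit["ani"], "same species" if same else "below cutoff")
```

Query/reference pairs that fastANI does not report simply never appear in the parsed rows, so a missing pair should be read as "no estimate", not as an ANI of zero.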
We used five genomes as reference genomes. Here are the results (fastANI does not report ANI values below 80%):

We can also visualize the result:

MLST
Multilocus sequence typing (MLST) refers to the systematic sequencing of seven housekeeping loci within the bacterial genome. Allelic variation at each locus is determined, and a sequence type is assigned by comparing the set of alleles to other isolate profiles in the database. It gives important information about the nucleotide divergence, the recombination rate, and the phylogenetic relationship among strains. MLST data are currently used not only in epidemiological investigations at global and local scales but also in studies of pathogen population dynamics, pathogenicity, and molecular evolution. As Whole Genome Sequencing becomes increasingly ubiquitous, the MLST concept has also been extended to include schemes with many hundreds or even thousands of loci with whole genome and core genome multilocus sequence typing.
We used the StringMLST tool to run the 7-housekeeping-gene MLST analysis with the PubMLST scheme available for Listeria monocytogenes, and the ChewBBACA tool to run cgMLST by determining the allelic variation of loci that exist in more than 95% of our isolates.

StringMLST relies on pattern matching using k-mers. Isolates are characterized by assigning, for each locus in the typing scheme, the specific allele that shows the maximum number of k-mer hits. The MLST scheme is retrieved from the PubMLST database. StringMLST k-merizes each locus-specific allele sequence and records the corresponding allele and locus for each k-mer. According to the allelic profile of the seven housekeeping genes, a sequence type is assigned to each isolate.
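The k-mer voting idea described above can be sketched in a few lines of Python. This is a toy illustration of the principle only, not StringMLST's actual implementation; the locus name, allele sequences, reads and function names are all invented.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_db(alleles, k):
    """Map each k-mer to the set of (locus, allele) pairs that contain it.

    alleles: {locus: {allele_id: sequence}}
    """
    db = defaultdict(set)
    for locus, variants in alleles.items():
        for allele_id, seq in variants.items():
            for kmer in kmers(seq, k):
                db[kmer].add((locus, allele_id))
    return db


def call_alleles(reads, db, k):
    """Return {locus: allele_id}, choosing the allele with the most k-mer hits."""
    hits = defaultdict(Counter)
    for read in reads:
        for kmer in kmers(read, k):
            for locus, allele_id in db.get(kmer, ()):
                hits[locus][allele_id] += 1
    return {locus: counts.most_common(1)[0][0] for locus, counts in hits.items()}


# Toy scheme: one hypothetical locus with two alleles differing by one base.
toy_alleles = {"locusX": {1: "ACGTACGTTAGC", 2: "ACGTACCTTAGC"}}
toy_db = build_kmer_db(toy_alleles, k=5)
toy_reads = ["GTACCTTAG", "ACGTACCTT"]
print(call_alleles(toy_reads, toy_db, k=5))
```

In a real scheme the resulting allele numbers for the seven loci would then be looked up in the profile table to obtain the sequence type; that lookup is omitted here.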
Get MLST scheme from PubMLST
stringMLST.py --getMLST -P <location/prefix of scheme> --species <species name>
Build database
stringMLST.py --buildDB --config <config file> -k <k-mer length> -P <prefix>
Run MLST analysis
stringMLST.py --predict -d <directory for samples> -p --prefix <prefix for the database> -k <k-mer size> -o <output file name>
Based on the StringMLST analysis, there are 5 distinct sequence types among our 50 samples. Listeria monocytogenes sequence types:
- 219 (1 sample)
- 397 (3 samples)
- 1 (18 samples)
- 37 (16 samples)
- 6 (12 samples)
ChewBBACA enables users to create and evaluate novel whole-genome or core-genome gene-by-gene typing schemas and to perform subsequent allele calling in bacterial strains of interest. chewBBACA performs the schema creation and allele calls on complete or draft genomes resulting from de novo assemblers. The alleles identified by chewBBACA correspond to potential coding sequences, giving information on the genetic variability among the isolates. Using ChewBBACA, we created a cgMLST allele matrix for our isolates and determined the allele variations based on the 540 loci common to >95% of the isolates.
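To make the "loci common to most isolates" step concrete, here is a minimal sketch that filters an allele matrix to loci called in at least a given fraction of isolates and then counts allele differences between two isolates. It is not chewBBACA code; the data layout, names and toy values are assumptions for illustration, and the cutoff is applied as "at least 95%".

```python
def core_loci(profiles, min_fraction=0.95):
    """Return loci with a called allele in at least min_fraction of isolates.

    profiles: {isolate: {locus: allele_id or None}}; None marks a missing call.
    """
    isolates = list(profiles)
    loci = sorted({locus for calls in profiles.values() for locus in calls})
    keep = []
    for locus in loci:
        called = sum(1 for iso in isolates if profiles[iso].get(locus) is not None)
        if called / len(isolates) >= min_fraction:
            keep.append(locus)
    return keep


def allele_distance(profile_a, profile_b, loci):
    """Count loci where both isolates have a call and the alleles differ."""
    diff = 0
    for locus in loci:
        a, b = profile_a.get(locus), profile_b.get(locus)
        if a is not None and b is not None and a != b:
            diff += 1
    return diff


# Hypothetical three-isolate example; locus names and alleles are invented.
toy = {
    "iso1": {"l1": 1, "l2": 4, "l3": 2},
    "iso2": {"l1": 1, "l2": 5, "l3": None},
    "iso3": {"l1": 1, "l2": 4, "l3": 2},
}
shared = core_loci(toy, min_fraction=0.95)
print(shared, allele_distance(toy["iso1"], toy["iso2"], shared))
```

Isolates with few or no allele differences across the shared loci are candidates for the same cluster.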

SNP-based Typing
- SNP is the abbreviation of Single Nucleotide Polymorphism, a variation in a single nucleotide that occurs at a specific position in the genome. SNP-based phylogenetic analysis is a comparative analysis that identifies and compares SNPs between isolate genomes. It measures the SNP variation between isolates and then constructs a tree from these comparisons to differentiate the isolates.
- There are two kinds of tools for the SNP-based approach. Alignment of whole microbial genome sequences has proven to be computationally intensive and not applicable to datasets of hundreds or thousands of genomes, and the choice of reference genome may affect the SNP result. The first kind of tool, such as Parsnp, RealPhy and kSNP, was developed as a solution to the problem of aligning large numbers of microbial genomes. The other kind, such as Lyve-SET, SNVPhyl and the CFSAN SNP Pipeline, was developed with the objective of creating high-quality SNP matrices for sequences from closely related pathogens.
- The tool we chose was kSNP3.0. kSNP takes a list of sequence file paths for the genomes as input, which saves space, and it is a k-mer-based method, which is less time-consuming and computationally demanding than alignment-based methods. Because it is k-mer based, the results have lower resolution, but we can still use them to cluster the isolates and compare the clusters with the MLST results to see if they are concordant.
- kSNP generates three kinds of trees: a parsimony tree, a maximum likelihood tree and a neighbor joining tree. Among the three, the parsimony tree has the highest accuracy, as stated in the user manual. The maximum likelihood tree generated by kSNP gives more evolutionary information even though it is less accurate than the parsimony tree. We therefore use the maximum likelihood tree generated by kSNP, together with information from the other analyses, to make a comprehensive analysis.
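The comparison step behind SNP-based typing can be illustrated with a small pairwise-distance sketch. It assumes an already built SNP matrix with one character per variable site for each isolate; it does not reproduce kSNP's k-mer-based SNP discovery or its tree building, and the isolate names and SNP strings are invented.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count positions where two equal-length SNP strings differ.

    Positions with a missing or ambiguous call in either string are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("SNP strings must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b)
               if a != b and a not in ignore and b not in ignore)


def pairwise_distances(snp_matrix):
    """Return {(isolate_1, isolate_2): distance} for every isolate pair."""
    return {(x, y): snp_distance(snp_matrix[x], snp_matrix[y])
            for x, y in combinations(sorted(snp_matrix), 2)}


# Invented SNP strings: one column per variable site, one row per isolate.
toy_matrix = {
    "isoA": "ACGTA",
    "isoB": "ACGTA",
    "isoC": "ATGCA",
    "isoD": "ATG-A",
}
for pair, dist in pairwise_distances(toy_matrix).items():
    print(pair, dist)
```

A distance matrix like this is the kind of input a neighbor joining tree is built from; parsimony and maximum likelihood trees work on the SNP alignment itself.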
Pan-genome analysis
The pan-genome is the set of all the genes found in the given samples. It can be divided into the core and accessory genomes. The core genome consists of the genes shared by all samples, which are usually essential for survival. The accessory genome consists of the genes that appear in some but not all of the samples, and it is related to the variation within the species.
The focus of pan-genome analysis is to compare the differences in protein-coding gene content between samples. The core steps typically involve collection of genome data, homology clustering based on multiple sequence alignment, and profiling of the core and accessory genomes. The profiling can include the number of genes in the genome, the frequency of genes, and the change in size of the pan- and core-genome as samples are added, which is shown in the profile plots.
The biological information that can be derived from the analysis includes phylogenetic distances, presence or absence of genes across the samples, and functional distribution of proteins.
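The core/accessory split described above reduces to simple set operations once each sample is represented by the set of gene clusters it contains. The sketch below assumes that representation (for instance, derived from a presence/absence matrix) and at least one sample; the cluster names are invented and the functions are not part of Roary or BPGA.

```python
def split_pan_genome(gene_sets):
    """Split a pan-genome into core and accessory gene clusters.

    gene_sets: {sample: set of gene-cluster names found in that sample}
    """
    samples = list(gene_sets.values())
    pan = set().union(*samples)          # every cluster seen in any sample
    core = set.intersection(*samples)    # clusters present in all samples
    accessory = pan - core               # clusters missing from at least one sample
    return pan, core, accessory


def gene_frequency(gene_sets):
    """Number of samples in which each gene cluster is present."""
    freq = {}
    for genes in gene_sets.values():
        for gene in genes:
            freq[gene] = freq.get(gene, 0) + 1
    return freq


# Invented cluster names for illustration only.
toy_samples = {
    "s1": {"geneA", "geneB", "geneC"},
    "s2": {"geneA", "geneB"},
    "s3": {"geneA", "geneB", "geneD"},
}
pan, core, accessory = split_pan_genome(toy_samples)
print(sorted(pan), sorted(core), sorted(accessory))
print(gene_frequency(toy_samples))
```

Recomputing the pan and core sizes while adding samples one at a time gives the points of a profile plot.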
Roary:
Roary has been the most widely used tool for streamlined pan-genome analysis of closely related genomes, usually samples of the same species. The input to Roary has to be annotated GFF files generated by PROKKA, one for each sample. Roary extracts the coding regions and converts them to protein sequences. The protein sequences are filtered, an all-against-all blastp is performed, and the results are used for clustering and then for downstream analysis. The outputs include the profiling of the pan- and core-genomes, gene frequency, a presence/absence matrix, the representative sequence of each cluster, and phylogenetic trees based on the core and accessory genomes.
BPGA:
BPGA takes the protein sequences of each sample as input. After pre-processing, BPGA generates the orthologous gene clusters, then the presence/absence matrix, and then runs some downstream analyses. BPGA has 7 functional modules. The outputs include the profiling of the pan-genome, the representative sequence for each gene cluster, the presence/absence matrix, atypical GC content, gene functions in the COG and KEGG databases, and the phylogenetic tree.
Results and Analysis
Correlation of clusters with different typing analysis

Rationale behind selection of the outbreak cluster and grouping of samples into different clusters:
- 1. Samples of an outbreak cluster are nearly identical to each other compared with sporadic strains of the same serotype. Among the techniques we tried, including wgMLST and clustering of annotations for every sample, SNP analysis provides the best resolution and places truly identical outbreak samples together as a cluster in the SNP analysis tree. The most important observation in the tree is the number of tip SNPs included in the label of every tip: tip nodes with as few as 0 SNP differences between two clades are clear evidence of an outbreak cluster.
- 2. When comparing the results of the different typing analyses, exactly the same grouping was found in the histogram of ANI scores of all samples against L. monocytogenes serotype 4b. A difference of 0.02% ANI cleanly separated the outbreak cluster from the sporadic clusters of the same serotype, which further confirmed the assignment of samples to the different clusters.
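One way to turn the "near-zero SNP difference" criterion into groups is single-linkage clustering on pairwise SNP distances, sketched below with a small union-find. This is an illustrative assumption about how such grouping could be automated, not the procedure we followed (we read the clusters off the tree); the distances and the 0-SNP threshold in the example are invented.

```python
def single_linkage_clusters(distances, isolates, max_snps=0):
    """Group isolates connected by pairwise distances <= max_snps.

    distances: {(isolate_1, isolate_2): SNP distance}
    """
    parent = {iso: iso for iso in isolates}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        if dist <= max_snps:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for iso in isolates:
        clusters.setdefault(find(iso), []).append(iso)
    return sorted(sorted(members) for members in clusters.values())


# Invented distances; a threshold of 0 SNPs is only an illustrative choice.
toy_distances = {
    ("isoA", "isoB"): 0, ("isoA", "isoC"): 0, ("isoB", "isoC"): 0,
    ("isoA", "isoD"): 35, ("isoB", "isoD"): 35, ("isoC", "isoD"): 35,
}
print(single_linkage_clusters(toy_distances, ["isoA", "isoB", "isoC", "isoD"]))
```

In practice the threshold would need to be chosen with the resolution of the SNP caller in mind, since a k-mer-based method may miss some differences.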
Food source and Outbreak locations
As you compare the different colored clusters, red (outbreak), orange (serotype similar to the outbreak) and green (serotype different from the outbreak), you can clearly see that chicken salad and peanuts appear only in the outbreak group (Figure 3A). Since peanuts were consumed by only one patient, they are less likely than chicken salad to be the cause of the outbreak. Chicken salad fits the profile of a Listeria outbreak source, as such outbreaks have mostly been caused by refrigerated meat products. Chicken salad goes through preprocessing before being packed and is made up of more than one ingredient, which increases the possibility of contamination.
Looking at the geographical distribution (Figure 3B) of the same color-coded clusters, the outbreak cluster seems to be prevalent in the eastern states. Since the orange cluster is similar to the outbreak cluster, it is quite possible that a location with sporadic strains similar to the outbreak cluster is where the outbreak started, which narrows down the source locations to Alabama and Connecticut (since the orange and red clusters coexist in these two states).
Timeline and location of clusters
Further analysis of the cluster locations over time (Figure 4) helped us conclude that the outbreak started in Connecticut. Looking at the cluster locations for the first four months, the red (outbreak) cluster is present only in Connecticut. In the later four months, you can observe the locations to which the outbreak spread.
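The timeline reasoning can be expressed as a small lookup of when each state first reported the outbreak strain. The events below are transcribed from the outbreak timeline given in "Highlights of Results" (month-level only); the function names are ours, and the sketch is just one possible way to tabulate the epidemiological data.

```python
from collections import defaultdict

# ((year, month), state) pairs transcribed from the outbreak timeline.
outbreak_events = [
    ((2019, 7), "Connecticut"),
    ((2019, 9), "Connecticut"),
    ((2019, 10), "Michigan"),
    ((2019, 10), "Alabama"),
    ((2019, 10), "Connecticut"),
    ((2019, 12), "Wisconsin"),
]


def first_seen(events):
    """Earliest (year, month) at which each state reported the outbreak strain."""
    earliest = {}
    for when, state in events:
        if state not in earliest or when < earliest[state]:
            earliest[state] = when
    return earliest


def states_by_month(events):
    """Map each (year, month) to the set of states reporting in that month."""
    by_month = defaultdict(set)
    for when, state in events:
        by_month[when].add(state)
    return dict(sorted(by_month.items()))


origin = min(first_seen(outbreak_events).items(), key=lambda item: item[1])
print("earliest state:", origin[0], origin[1])
for month, states in states_by_month(outbreak_events).items():
    print(month, sorted(states))
```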
Outbreak Analysis - Virulence Factors
We analyzed the .gff files of virulence factors annotated against VFDB [Virulence Factor Database] by the functional annotation team for both genomes and plasmids.
There were 40 genes annotated as virulence-factor-associated genes, of which 36 were found in the outbreak strains. They include: lapB, inlJ, oatA, hpt, prsA2, lspA, prfA, llsY, llsB, llsH, llsG, llsD, llsX, lpeA, plcA, plcB, actA, pdgA, vip, hly, inlF, inlA, inlB, inlC, clpE, inlP, mpl, clpP, inlK, iap/cwhA, fbpA, clpC, lntA, ami, lap, bsh. Most of these genes function as virulence enhancers, participate in the infection process or in translocation of the bacteria, or are involved in surface protein binding. Three genes are absent from the outbreak strains but present in some of the other isolates: llsP, which functions as an endopeptidase; gtcA, which is involved in decorating cell wall teichoic acid with galactose and glucose; and aut, which has autolysin amidase activity. These are not the major virulence genes of the bacterium. The plasmids of 7 outbreak strains and the genomes of all outbreak strains showed the presence of the lplA1 gene, a lipoate protein ligase associated with the switch of Listeria from saprophytism to virulence.
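A comparison like the one above can be sketched by collecting gene names from each isolate's GFF file and taking set differences between groups. The sketch assumes GFF3-style files whose ninth column carries `key=value` attributes and that the gene name sits under a `gene` or `Name` key; the actual attribute key in the annotation team's files may differ, and the example records are invented.

```python
import io


def genes_in_gff(handle, keys=("gene", "Name")):
    """Collect gene names from the attribute column (9th) of a GFF3 file."""
    genes = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a feature line
        attrs = dict(item.split("=", 1) for item in fields[8].split(";") if "=" in item)
        for key in keys:  # assumed attribute keys; adjust to the real files
            if key in attrs:
                genes.add(attrs[key])
                break
    return genes


def absent_from_all(group, others):
    """Genes found in at least one 'others' isolate but in no 'group' isolate."""
    in_group = set().union(*group.values())
    in_others = set().union(*others.values())
    return in_others - in_group


# Invented GFF records, for illustration only.
toy_gff = io.StringIO(
    "##gff-version 3\n"
    "contig1\tVFDB\tCDS\t10\t400\t.\t+\t0\tID=cds1;gene=hly\n"
    "contig1\tVFDB\tCDS\t500\t900\t.\t-\t0\tID=cds2;Name=inlA\n"
)
print(sorted(genes_in_gff(toy_gff)))
```

With one gene set per isolate, `absent_from_all(outbreak_sets, other_sets)` would list the genes missing from every outbreak isolate but present elsewhere.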
Outbreak Analysis - Antibiotic resistance
We analyzed the .gff files of antibiotic resistance genes from CARD [Comprehensive Antibiotic Resistance Database] obtained from the functional annotation team for both genomes and plasmids. All the outbreak strains and most of the other isolates had four antibiotic resistance genes: FosX, msrA, norB and Listeria monocytogenes mprF.

The drug resistance, resistance mechanism, AMR gene family and drug class of these four genes are provided in the table below.

Recommendation for Antibiotic
Antibiotic therapy is the treatment of choice, and ampicillin is generally preferred in treating confirmed cases of listeriosis. In cases of meningitis and endocarditis, and in patients with severely impaired T-cell function, most authorities recommend the addition of gentamicin to ampicillin for synergy. Penicillin can also be used, and in case of penicillin hypersensitivity, trimethoprim-sulfamethoxazole (TMP-SMX) is the treatment of choice. Those with allergy to both penicillins and sulfonamides could be treated with meropenem or given a single dose of vancomycin and then desensitized. Imipenem and meropenem have been used successfully to treat cases of listeriosis. In vitro, rifampin demonstrates good activity although it is not bactericidal.
Cephalosporins, commonly used in the treatment of bacterial meningitis, have limited activity against listeriae. Chloramphenicol has been shown to have unacceptable failure and relapse rates and should not be used to treat infections with L. monocytogenes. Ertapenem is considerably less active than imipenem and meropenem and should not be used. Erythromycin and tetracycline have been reported to be effective; however, antimicrobial resistance genes were identified in the outbreak group and in most of the isolates, indicating possible drug resistance to fosfomycin, tetracycline, erythromycin and fluoroquinolones.

References
Filliol I, et al. Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNA fingerprinting systems, and recommendations for a minimal standard SNP set. J. Bacteriol. 2006;188:759–772. doi: 10.1128/JB.188.2.759-772.2006.
Adam D. Leaché and Jamie R. Oaks, The Utility of Single Nucleotide Polymorphism (SNP) Data in Phylogenetics. Annual Review of Ecology, Evolution, and Systematics. 2017; Vol. 48:69-84. https://doi.org/10.1146/annurev-ecolsys-110316-022645
Shea N Gardner, Tom Slezak, Barry G. Hall, kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome, Bioinformatics, Volume 31, Issue 17, 1 September 2015, Pages 2877–2878, https://doi.org/10.1093/bioinformatics/btv271
Maiden, Martin C J. “Multilocus Sequence Typing of Bacteria.” Annual Review of Microbiology, U.S. National Library of Medicine, 2006, www.ncbi.nlm.nih.gov/pubmed/16774461.
Silva, Mickael, et al. “ChewBBACA: A Complete Suite for Gene-by-Gene Schema Creation and Strain Identification.” Microbial Genomics, Microbiology Society, Mar. 2018, www.ncbi.nlm.nih.gov/pmc/articles/PMC5885018/.
B-Ummi. “B-UMMI/ChewBBACA.” GitHub, 6 Apr. 2020, github.com/B-UMMI/chewBBACA.
Gupta, Anuj, et al. StringMLST: a Fast k-Mer Based Tool for Multi Locus Sequence Typing . 2016, jordan.biology.gatech.edu/pubs/gupta-bioinformatics-2016-supp.pdf.
Jordanlab. “Jordanlab/StringMLST.” GitHub, 4 Apr. 2019, github.com/jordanlab/stringMLST.
Salcedo, C & Arreaza, L & Alcalá, B & de la Fuente, Laura & Vazquez, Julio. (2003). Development of a Multilocus Sequence Typing Method for Analysis of Listeria monocytogenes Clones. Journal of clinical microbiology. 41. 757-62. 10.1128/JCM.41.2.757-762.2003.
Kim, Yeji, et al. “Current Status of Pan-Genome Analysis for Pathogenic Bacteria.” Current Opinion in Biotechnology, vol. 63, 2020, pp. 54–62., doi:10.1016/j.copbio.2019.12.001.
Page, Andrew J., et al. “Roary: Rapid Large-Scale Prokaryote Pan Genome Analysis.” Bioinformatics, vol. 31, no. 22, 2015, pp. 3691–3693., doi:10.1093/bioinformatics/btv421.
Chaudhari, Narendrakumar M., et al. “BPGA- an Ultra-Fast Pan-Genome Analysis Pipeline.” Scientific Reports, vol. 6, no. 1, 2016, doi:10.1038/srep24373.
Valentina Galata, Tobias Fehlmann, Christina Backes, Andreas Keller, PLSDB: a resource of complete bacterial plasmids, Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D195–D202
Hunt, Martin et al. “ARIBA: rapid antimicrobial resistance genotyping directly from sequencing reads.” Microbial genomics vol. 3,10 e000131. 4 Sep. 2017, doi:10.1099/mgen.0.000131
Annaleise Wilson et al. “Phenotypic and Genotypic Analysis of Antimicrobial Resistance among Listeria monocytogenes Isolated from Australian Food Production Chains”. Feb 9, 2018. Genes doi: 10.3390/genes9020080
Antibiotic therapy and drugs to use and avoid: http://antimicrobe.org/b111.asp
Clementine Henri et al “An Assessment of Different Genomic Approaches for Inferring Phylogeny of Listeria monocytogenes”Front. Microbiol., 29 November 2017 | https://doi.org/10.3389/fmicb.2017.02351
Yi Chen et al “Core Genome Multilocus Sequence Typing for Identification of Globally Distributed Clonal Groups and Differentiation of Outbreak Strains of Listeria monocytogenes” Appl Environ Microbiology, 2016 Oct 15 doi: 10.1128/AEM.01532-16
"Identification of acquired antimicrobial resistance genes", Zankari et al 2012, PMID: 22782487