
A Genomic Data Quality Framework: The 6Cs and Downstream Consequences

  • Writer: Emma Doughty
  • Sep 29
  • 4 min read

High-throughput sequencing generates huge volumes of genomic data - and fast - but the value of this data hinges entirely on what we can learn from it. Data is useless if it’s wrong, or even if we don’t know that it’s right. This genomic data quality control framework highlights six recurring issues we encounter with WGS data: completeness, continuity, collinearity, correctness, contamination, and concordance.



Why Data Quality Matters

Cartoon shows a worker dumping data from a bin/trash can into a dashboard while two men examine the dashboard with charts. Text reads, "DO WE TRUST THIS DATA?"
Do We Trust This Data? Cartoon from Piotr@Dataedo

Amongst bioinformaticians and other data analysts alike, dealing with data quality issues is so common that memes are made to joke about it. You can spend hours pulling your hair out over low-quality data, or thinking you’ve made the next brilliant discovery (only to realise you haven’t after you’ve excitedly told the world, of course!).


Beyond wasting time, resources and patience, issues with data quality can have serious consequences, like misleading decision-making. For example, low-quality genomes may contain errors that appear as SNPs, and in an outbreak investigation, these could obscure transmission links and lead to samples falsely being ruled out of the outbreak. Similarly, sample contamination could lead to the identification of a “new strain” that appears to be causing an outbreak but is really a pipetting error!


Genomic Data Quality Framework: The 6Cs and Their Consequences


There are six types of issues that can be identified with WGS data:

  • Completeness,

  • Continuity,

  • Collinearity,

  • Correctness,

  • Contamination, and

  • Concordance.



Genome quality infographic titled "Completeness", showing an incomplete chromosome illustration. Text beneath reads, "How much of the genome is present, including all regions of the chromosome and any plasmid(s)"

Completeness

Completeness looks at whether the whole genome has been captured in the sequence data, including the full chromosome(s) and any plasmids. Incompleteness causes downstream issues in two ways. 

  • First, when a genome is incomplete, the absence of a gene in the sequence data cannot be distinguished from the absence of a gene in the original genome. An incomplete genome could prevent the identification of antimicrobial resistance genes, for example, and lead to false predictions of antimicrobial sensitivity. 

  • Second, SNP comparisons across incomplete genomes underestimate diversity, making isolates look closer than they really are (since you can only find SNPs in the present parts of the genome). This could distort phylogenetic trees and even prevent ruling a sample out of an outbreak!
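To make the second point concrete, here is a minimal Python sketch using toy aligned sequences. The sequences, the `-`/`N` convention for missing positions, and the function names are all illustrative assumptions rather than part of any particular pipeline.

```python
def completeness(seq: str, missing: str = "-N") -> float:
    """Fraction of alignment positions that have a called base."""
    if not seq:
        return 0.0
    called = sum(1 for base in seq if base not in missing)
    return called / len(seq)


def snp_distance(a: str, b: str, missing: str = "-N") -> int:
    """Count differing positions, skipping positions missing in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a, b)
        if x not in missing and y not in missing and x != y
    )


# Toy data (hypothetical): the same isolate, fully and partially captured.
reference_like = "ACGTACGTACGT"
complete_isolate = "ACGAACGTTCGA"  # three differences from reference_like
partial_isolate = "ACGAACGT----"   # last four positions were not sequenced

full_distance = snp_distance(reference_like, complete_isolate)
partial_distance = snp_distance(reference_like, partial_isolate)

print(f"completeness of partial isolate: {completeness(partial_isolate):.2f}")
print(f"SNPs seen with the complete genome: {full_distance}")
print(f"SNPs seen with the partial genome:  {partial_distance}")
```

In this toy case two of the three real differences sit in the missing region, so the partial genome looks closer to the comparator than it really is.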



Genome continuity infographic shows a circular chromosome split into many contigs, and a bar split into contigs/segments. Text describes continuity as "How continuous the genome is, or conversely how much the genome is broken into contigs"

Continuity

Continuity reflects how fragmented the genome is. Each contig represents a stretch of genome sequence that the assembly algorithm can confidently reconstruct. A poor assembly may be broken into many hundreds or thousands of contigs, fragmenting genes and making it hard to identify whether a gene is present or absent. Additionally, this often makes it difficult to determine a gene’s genomic context: whether it is on a plasmid or the chromosome, which mobile genetic elements may mobilise it, and which other genes are located nearby. 
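Two commonly reported summaries of continuity are the number of contigs and the N50 (the contig length at which contigs of that size or larger hold at least half of the assembled bases). The sketch below is a minimal illustration; the contig lengths are made up, and the function name is an assumption rather than a call from any assembly tool.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


# Hypothetical assemblies of the same ~5 Mb genome.
contiguous = [4_800_000, 150_000, 50_000]
fragmented = [20_000] * 200 + [10_000] * 100

for name, contigs in [("contiguous", contiguous), ("fragmented", fragmented)]:
    print(f"{name}: {len(contigs)} contigs, N50 = {n50(contigs):,} bp")
```

Both toy assemblies contain the same total number of bases, so a completeness check based on length alone would not tell them apart; the contig count and N50 do.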






"COLLINEARITY" title with Circular genome graphic and a linear sequence which is not in order, shown by a mixed up gradient throughout the bar.

Text beneath reads, "Whether the genome has been reconstructed in the right order"

Collinearity

Even when an algorithm assembles sequence data into contigs, it may not faithfully reconstruct the right order of the genome. Collinearity describes whether the order of the genome is correct or instead misassembled. When issues with collinearity occur, genes may appear in a false context and mislead conclusions, for example about the mobile genetic elements associated with a gene. Misassemblies can also undermine comparisons between genome assemblies. 
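One simple way to picture a collinearity check is to take marker genes whose order is known from a trusted reference and count how often neighbouring markers in the assembly are not neighbours in the reference. The sketch below is only an illustration under strong assumptions: a single contig, each marker present exactly once, and no handling of the wrap-around point of a circular chromosome. The function name and marker data are hypothetical.

```python
def count_breakpoints(marker_order):
    """Count adjacent marker pairs that are not neighbours in the reference.

    marker_order holds reference ranks (0, 1, 2, ...) listed in the order the
    markers appear along the assembly. Steps of +1 or -1 count as collinear,
    so a contig that is simply reverse-complemented is not penalised.
    """
    return sum(
        1
        for left, right in zip(marker_order, marker_order[1:])
        if abs(right - left) != 1
    )


collinear = [0, 1, 2, 3, 4, 5]
reversed_contig = [5, 4, 3, 2, 1, 0]
misassembled = [0, 1, 4, 5, 2, 3]  # block 2-3 placed after block 4-5

for name, order in [
    ("collinear", collinear),
    ("reverse-complemented", reversed_contig),
    ("misassembled", misassembled),
]:
    print(f"{name}: {count_breakpoints(order)} breakpoint(s)")
```

A non-zero count does not prove a misassembly on its own, since genuine rearrangements relative to the reference look the same, so a result like this is a prompt for closer inspection.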








Titled, "CORRECTNESS," the graphic shows both a circular and linear sequence with errors shown as navy bases on the sequence and beneath reads, "How many of the nucleotide bases are accurate."

Correctness

Correctness means the nucleotide at each position is called accurately so the genome is faithfully reconstructed. When a genome is incorrect, errors appear as false SNPs. This might lead to overestimates of the number of SNPs between two genomes, making samples seem less closely related than they are, and potentially falsely ruling samples out of an outbreak (the opposite effect to completeness issues!). When SNPs are used for characterisation, such as genotyping or AMR prediction, errors can also mislead conclusions.
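A quick back-of-the-envelope calculation shows why small per-base error rates matter. Consensus accuracy is often expressed as a Phred-scaled quality Q, where the error probability is 10^(-Q/10). The sketch below multiplies that probability by genome length to estimate how many erroneous bases to expect; the 5 Mb genome size and the quality values are illustrative assumptions, and the function name is hypothetical.

```python
def expected_errors(genome_length: int, phred_quality: float) -> float:
    """Expected number of wrong bases for a given Phred-scaled consensus quality."""
    error_probability = 10 ** (-phred_quality / 10)
    return genome_length * error_probability


GENOME_LENGTH = 5_000_000  # assumed size, roughly that of a typical E. coli genome

for q in (30, 40, 50, 60):
    errors = expected_errors(GENOME_LENGTH, q)
    print(f"Q{q}: about {round(errors):,} erroneous bases expected")
```

If isolates within an outbreak differ by only a handful of true SNPs, then even a Q40 consensus (one error per 10,000 bases) could in principle add far more false differences than real ones, unless those errors are filtered out.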







Titled, "CONTAMINATION" in bold blue, logo on top right. 


Only half of the green genome is present, a blue contaminant genome is also depicted. The text beneath reads, "How much of the sequence data is from the target genome, rather than other sources"

Contamination

Contamination looks at the presence of DNA in the dataset that does not originate from the target organism or community that was sequenced. Contamination may include other species or genera, but could equally be a different genotype of the same species, or even another clone of the exact same genotype (e.g. both E. coli ST410 but tens of SNPs apart!). When contamination creeps in, analyses can misidentify the species, confound characterisation of a genome (like presence or absence of AMR genes), or even suggest a “new strain” that is nothing more than a mixture! Contaminants also introduce false genetic variability, causing strains to mis-cluster in a phylogenetic tree and confusing signals in outbreak investigations. 
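For contamination from other species, a first-pass check is to look at what fraction of reads a taxonomic classifier assigns to the target organism. The sketch below assumes you already have per-taxon read counts as a dictionary; the counts, the 1% reporting threshold, and the function names are all hypothetical and would need tuning for a real workflow.

```python
def target_fraction(read_counts: dict, target: str) -> float:
    """Fraction of all reads assigned to the target taxon."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads in read_counts")
    return read_counts.get(target, 0) / total


def flag_contaminants(read_counts: dict, target: str, min_fraction: float = 0.01):
    """Non-target taxa holding at least min_fraction of reads (threshold is an assumption)."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads in read_counts")
    return sorted(
        taxon
        for taxon, count in read_counts.items()
        if taxon not in (target, "unclassified") and count / total >= min_fraction
    )


# Hypothetical classifier output for an isolate expected to be E. coli.
counts = {
    "Escherichia coli": 940_000,
    "Klebsiella pneumoniae": 50_000,
    "Homo sapiens": 2_000,
    "unclassified": 8_000,
}

print(f"target fraction: {target_fraction(counts, 'Escherichia coli'):.1%}")
print("flagged:", flag_contaminants(counts, "Escherichia coli"))
```

A check like this cannot see same-species contamination, such as the mixed ST410 clones described above, because all of those reads are assigned to the target species. Detecting that usually relies on other signals, for example an excess of mixed base calls at variable sites.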




Titled "Concordance", the green genome depiction has labels, A-D, and a clipboard beside it shows A-E, with ticks for A-D showing they are concordant, and a cross next to E showing discordance. 

Text beneath reads, "Whether the information derived from the sequence data is consistent with other external information about the genome"

Concordance

Concordance is about whether genomic data aligns with other sources of evidence, such as laboratory phenotypes, epidemiological evidence, or even other methods of genome assessment. Sometimes, differences in methods can explain discordance (e.g. a species or AMR gene is not included in the database used to identify them), but when data is discordant without an explanation, confidence in these results collapses. For example, an isolate that is phenotypically resistant but lacks resistance genes in the sequence data raises concerns; or a known outbreak sample that clusters far from its peers points to sample mislabelling or deeper errors. 
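A concordance check can be as simple as lining up what the genome predicts against what the laboratory measured and listing the disagreements. The sketch below does this for resistance calls coded as "R" or "S"; the isolate names, the data, and the function name are hypothetical.

```python
def find_discordant(predicted: dict, observed: dict) -> dict:
    """Return isolates whose genotype-based prediction disagrees with the phenotype.

    Isolates lacking a phenotype result are skipped rather than counted as
    discordant.
    """
    discordant = {}
    for isolate, prediction in predicted.items():
        phenotype = observed.get(isolate)
        if phenotype is not None and phenotype != prediction:
            discordant[isolate] = (prediction, phenotype)
    return discordant


# Hypothetical results: genotype-based prediction vs. laboratory phenotype.
genotype_calls = {"iso1": "R", "iso2": "S", "iso3": "S", "iso4": "R"}
phenotype_calls = {"iso1": "R", "iso2": "R", "iso3": "S"}

for isolate, (prediction, phenotype) in find_discordant(genotype_calls, phenotype_calls).items():
    print(f"{isolate}: predicted {prediction}, observed {phenotype}")
```

Here `iso2` is the case described above: phenotypically resistant with no resistance gene detected. The list only says where to look; working out whether the cause is a database gap, an incomplete genome, or a sample mix-up takes follow-up.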





The value of the 6Cs in practice

The 6Cs (completeness, continuity, collinearity, correctness, contamination, and concordance) provide a structured way of understanding quality issues that may arise in genomic data. The cause of each can be mapped to issues in laboratory or bioinformatic processes, and each affects how useful the data is for addressing biological or epidemiological questions. 


Thankfully, the issues described by the 6Cs can be prevented through robust quality assurance practices, and quality control assessments can detect issues that slip through, creating a foundation of reliable, actionable genomics. For support with setting up a robust quality management system, please don't hesitate to get in contact.

