Skip to contents

Taxa and sample quality control

2024-04-03

This tutorial walks you through some quality checks when first diving into a dataset of taxonomic compositions. Note that this tutorial does not include sequence quality filtering, as we are assuming this was already done by for example DADA2. Here we will focus on filtering the dataset to keep only bacterial reads, inspect blanks, and removing samples with too few reads. How you handle these steps may depend on your dataset, but here’s a demonstration to give you an idea of your options.

Setting up

We need only two packages: tidytacos (of course) and the tidyverse set of packages.

Our data

The data for this tutorial is found in the tidytacos object ‘leaf’. This data comes from a study where the phyllosphere microbiome of plants is treated. After sampling, before DNA extraction a spike bacterium was added to all samples to serve as an internal standard to determine absolute abundances of the bacteria in our samples (as explained in Smets et al. 2016).

tacosum(leaf)
## n_samples    n_taxa   n_reads 
##        33      6054   4389396

Filter out undesired taxa

In this study, we choose to sequence the V4 region of the 16S rRNA gene to specifically study the bacterial fraction of our microbial community, hence filter_taxa is used to discard any non-bacterial reads such as mitochondria, chloroplasts, and archaea. Using tacosum before and after the filtering helps us understand that only 60% of our reads were bacterial. In phyllosphere samples it is not uncommon to find a large fraction of chloroplasts and/or mitochondria, plant cell organelles which contain a 16S rRNA gene, hence the substantial loss of reads in this step.

before <- tacosum(leaf)[3]

leaf <- leaf%>%
  filter_taxa(kingdom == "Bacteria")

tacosum(leaf)
## n_samples    n_taxa   n_reads 
##        33      5589   2634933
after<-tacosum(leaf)[3]
after/before
##   n_reads 
## 0.6002951

Depending on the database you used for assigning taxonomy, you may need to adapt this code and use for example:

leaf <- leaf%>%
  filter_taxa(class != "Chloroplast" | is.na(class))%>%
  filter_taxa(family != "Mitochondria"| is.na(family))

Read numbers and blanks

An important initial quality control is an investigation of the read numbers per sample. The add_total_count function will add total read counts per sample in the sample table. We can then visualize the distribution of these read counts using ggplot.

leaf <- leaf%>%
  add_total_count()

ggplot(leaf$samples, aes(x = log10(total_count), y = sample, color = Plant))+
    geom_point()

Unfortunately, one sample (bottom right corner) took up a lot of the reads of this sequencing run, decreasing sampling depth of all other samples. Nevertheless, most samples have more than 1000 reads, which we consider worthwhile. Blanks are on the low end of the spectrum in general, which is according to our expectation, but some blanks have quite some reads which requires further investigation.

Let’s look at the composition of our blanks. We use the filter_samples function to generate a tidytacos object with only the blanks. The tacoplot_stack function returns a nice stacked bar plot visualization of the most abundant taxa in the blanks.

leaf1 <- leaf%>%
  filter_samples(Plant == "Blank")

tacoplot_stack(leaf1)+
  geom_text(size = 3, aes(y=-0.04, label = Plot))+
  geom_text(size = 2.5, aes(y=-0.06, label = total_count))

Citrobacter is suspiciously prevalent and occurs in multiple blanks! After checking its sequence, this ASV is the one that we spiked into our samples and many of our blanks to determine absolute bacterial numbers from the relative abundances. Hence, in this case, the high abundance of Citrobacter in the blanks is a good thing. Besides Citrobacter none of the contaminants are consistent nor do the read numbers of the blanks without spike exceed 1000 per sample, which are all good signs for the quality of our data.

We still need to set a minimum read number that a sample needs to have to be included in our analyses. Samples with really low read counts are too likely not being representative of the sampled microbiome by poor amplification, contamination, and/or insufficient sequencing depth. Usually, a minimum read number cut-off is quite an arbitrary decision, hopefully somewhat informed by the nature of your data and results from the blanks. In this case however, we have spiked our samples and several of our blanks and we can determine a threshold that is actually informed by the amount of contaminants in the blanks!

Absolute abundances

We added equal amounts of a spike taxon (Citrobacter) to all of our samples and some of the blanks. This helps us calculate the absolute abundances of the 16S gene copies per sample using the function add_total_absolute_abundance. First we need to determine the spike taxon_id. Using add_prevalence will help us determine which taxa are present in all samples. After comparing taxonomy and/or sequence with the information we have of our spike, we can identify the spike taxon_id, which we need for the add_total_absolute_abundance function.

leaf <- leaf%>%
  add_prevalence(relative = T)

head(leaf$taxa)
## # A tibble: 6 × 10
##   kingdom  phylum     class order family genus species taxon taxon_id prevalence
##   <chr>    <chr>      <chr> <chr> <chr>  <chr> <chr>   <chr> <chr>         <dbl>
## 1 Bacteria Proteobac… Gamm… Ente… Enter… Citr… NA      TACG… t6           0.788 
## 2 Bacteria Proteobac… Gamm… Ente… Enter… Buch… NA      TACG… t8           0.0303
## 3 Bacteria Proteobac… Beta… Burk… Oxalo… Mass… aurea/… TACG… t9           0.364 
## 4 Bacteria Proteobac… Gamm… Ente… Enter… Buch… NA      TACG… t11          0.0606
## 5 Bacteria Proteobac… Alph… Sphi… Sphin… Sphi… NA      TACG… t12          0.515 
## 6 Bacteria Actinobac… Acti… Fran… Geode… Blas… NA      TACG… t14          0.485
leaf <- leaf%>%
  add_total_absolute_abundance(spike_taxon = "t6", spike_added = added_spike_copies)

ggplot(leaf$samples, aes(x = Plant, y = log10(total_absolute_abundance)))+
  geom_boxplot()+
  geom_point()

contaminants <- leaf$samples%>%
  filter(Plant == "Blank")%>%
  filter(!is.na(total_absolute_abundance))%>%
  summarize(mean = mean(total_absolute_abundance))

leaf <- leaf%>%
  mutate_samples(potential_contamination = contaminants$mean/total_absolute_abundance)%>%
  filter_samples(potential_contamination < 0.1)

tacosum(leaf)
## n_samples    n_taxa   n_reads 
##        16      5536   2551543

We see that the estimated biomass in the samples is well above the estimated biomass in the blanks, which is again a good sign! As we would like to minimize our chances to have more than 10% of contamination in the samples, we decided to remove samples that have less than 10x more biomass than the mean biomass in the blanks. This results in a tidytacos object without blanks and without 2 samples which were of such low biomass that we considered them too sensitive to contamination. An excellent starting point to start analyses.