| Title: | Sibyl |
|---|---|
| Description: | The Sibyl package provides a convenient way of testing a range of thresholds for rarefation, evaluating the effect of different values on ordination results, integrating with commonly used packages in microbial ecology. You provide phyloseq data with count data and sample information. |
| Authors: | Lorenzo Assentato [aut, cre] (ORCID: <https://orcid.org/0000-0003-3699-0483>), Magdalena Polder [aut] |
| Maintainer: | Lorenzo Assentato <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.0 |
| Built: | 2026-05-23 09:05:06 UTC |
| Source: | https://github.com/Headonpillow/Sibyl |
This function generates accumulation (rarefaction) curves for each sample in
a given phyloseq object.
accumulation_test(input, step = 5)accumulation_test(input, step = 5)
input |
A |
step |
A numeric value. The step size for drawing the accumulation curve. It influences the rarefaction curve calculation and the granularity of points plotted. Default = 5. |
It fits a general accumulation model using the Abundance Coverage Estimator (ACE) as an asymptote, and identifies the sequencing depth at which 75% of the ACE value is reached. It also produces a density plot showing the distribution of these 75% completion thresholds.
Sites that fail model fitting or quality-control criteria are excluded from downstream analyses, and their identities are returned for inspection.
A list containing:
accumulation_plot: A ggplot object with all sites faceted.
threshold_density: A ggplot density/histogram plot of the 75% ACE thresholds.
individual_plots: A list of ggplot objects, one per site.
nls_failed_sites: A character vector of site names for which the nonlinear
model failed to converge.
quality_failed_sites: A character vector of site names for which the model
converged but failed quality-control criteria (e.g. implausible threshold or
poor fit).
fitted_table: A data frame containing per-site model results and diagnostics,
including fitted parameters, threshold estimates, and quality-control flags.
library(Sibyl) # Creating a smaller subset of the data adults_sub <- phyloseq::subset_samples(adults, location=="VK3") # Running accumulation tests on a phyloseq object, higher step size reduces # execution time. accumulation_test(adults_sub, step=50)library(Sibyl) # Creating a smaller subset of the data adults_sub <- phyloseq::subset_samples(adults, location=="VK3") # Running accumulation tests on a phyloseq object, higher step size reduces # execution time. accumulation_test(adults_sub, step=50)
This is a phyloseq containing example data from
a 16S study performed on adult mosquitoes from Burkina Faso.
adultsadults
A phyloseq object with 34 samples and 5221 taxa.
A sample_data object is included, with the following columns:
A unique identifier for each sample.
The location where the sample was collected.
https://pmc.ncbi.nlm.nih.gov/articles/PMC4785398/
This function it's a plotting function for avg_distances generated from
test_threshold.
avg_pairwise_dist_plot(avg_distances)avg_pairwise_dist_plot(avg_distances)
avg_distances |
A dataframe of average distances obtained from |
A ggplot object showing average pairwise distances across thresholds.
library(Sibyl) # Creating a smaller subset of the data adults_sub <- phyloseq::subset_samples(adults, location=="VK3") result <- test_threshold(adults_sub, repeats = 10, t_min = 100, t_max = 500, t_step = 50, group = "location", verbose = FALSE) avg_pairwise_dist_plot(result$avg_distances$repeat_number_10)library(Sibyl) # Creating a smaller subset of the data adults_sub <- phyloseq::subset_samples(adults, location=="VK3") result <- test_threshold(adults_sub, repeats = 10, t_min = 100, t_max = 500, t_step = 50, group = "location", verbose = FALSE) avg_pairwise_dist_plot(result$avg_distances$repeat_number_10)
This is a phyloseq containing example data from
a 16S study performed on adult mosquitoes larvae from Ethiopia.
larvaelarvae
A phyloseq object with 54 samples and 722 taxa.
A sample_data object is included, with the following columns:
A unique identifier for each sample.
Environmental classification of sites where mosquitoes were collected
https://academic.oup.com/femsec/article/101/1/fiae161/7928135
This function performs repeated rarefaction on a phyloseq object,
computes ordination, and generates a PCoA-based visualization.
The same procedure is used from threshold_testing function when testing
a range of thresholds.
repeated_rarefaction( input, repeats = 50, threshold = 250, colorb = "sample_id", group = "sample_id", cloud = TRUE, ellipse = FALSE, cores = 2, ... )repeated_rarefaction( input, repeats = 50, threshold = 250, colorb = "sample_id", group = "sample_id", cloud = TRUE, ellipse = FALSE, cores = 2, ... )
input |
A |
repeats |
An integer. The number of times to repeat rarefaction. A value of 1 means no repeats. If too few repeats are selected it would be not possible to draw an ellipse around the group. |
threshold |
An integer. The threshold value to use for rarefaction. |
colorb |
A string. Column name in |
group |
A string. Column name in |
cloud |
A boolean. If |
ellipse |
A boolean. If |
cores |
An integer. Number of cores to use for parallel processing. |
... |
Additional arguments are reserved to internal use. |
A list containing (While also showing the plot directly):
repeats: Number of repeats.
df_consensus_coordinates: A data frame with coordinates of the median
points of the sample clouds.
df_all: A data frame of coordinates ordered by ordination number,
along with metatata.
plot: a ggplot object.
library(Sibyl) # Running this with cloud = TRUE and ellipse = TRUE will generate a plot # where the samples belonging to the same group will be colored similarly # and an ellipse will be drawn around the group. repeated_rarefaction(adults, repeats = 10, threshold = 250, group = "location", colorb = "location", cloud = TRUE, ellipse = TRUE) # We can run the function to highlight the spread of the single sample clouds # too, setting the groupb parameter to the sample_id. repeated_rarefaction(adults, repeats = 10, threshold = 250, group = "sample_id", colorb = "location", cloud = TRUE, ellipse = TRUE)library(Sibyl) # Running this with cloud = TRUE and ellipse = TRUE will generate a plot # where the samples belonging to the same group will be colored similarly # and an ellipse will be drawn around the group. repeated_rarefaction(adults, repeats = 10, threshold = 250, group = "location", colorb = "location", cloud = TRUE, ellipse = TRUE) # We can run the function to highlight the spread of the single sample clouds # too, setting the groupb parameter to the sample_id. repeated_rarefaction(adults, repeats = 10, threshold = 250, group = "sample_id", colorb = "location", cloud = TRUE, ellipse = TRUE)
This function uses the same steps of repeated_rarefaction in a repeated
fashion and summarizes the results across multiple thresholds.
test_threshold( input, repeats = 50, t_min = 50, t_max = 250, t_step = 5, group = "sample_id", cores = 2, verbose = TRUE, ... )test_threshold( input, repeats = 50, t_min = 50, t_max = 250, t_step = 5, group = "sample_id", cores = 2, verbose = TRUE, ... )
input |
A |
repeats |
An integer, or a vector of integers. The number of times to repeat rarefaction. A value of 1 means no repeats. If a vector, different rarefaction thresholds will be tested sequentially. |
t_min |
An integer. The minimum value for the threshold testing range. |
t_max |
An integer. The maximum value for the threshold testing range. |
t_step |
An integer. The step size for the threshold testing range. A value between 0 and 1 will cause the same threshold to be tested multiple times. |
group |
A string. Column name in |
cores |
An integer. Number of cores to use for parallel processing. |
verbose |
A logical. If |
... |
Additional arguments are reserved to internal use. |
Clustering performance is evaluated using the Calinski-Harabasz index. The function also calculates the average pairwise distance for each sample cloud across different rarefaction thresholds, to give a measure of sample concordance (how stable the sample is at different thresholds).
Higher index values are better, and the plateauing of the index values when
testing higher thresholds indicate that the clouds of repetitions are compact
enough to not being affected by increasing the threshold.
The Calinski-Harabasz value takes into account both within cluster and between
cluster distances, and in the scope of this function is calculated on the
group parameter.
A sudden drop in CH value means that samples might have been removed because
they did not have enough reads to reach the threshold. In this case a warning
is printed listing the samples which have been removed.
In avg_distances, if samples do not have enough reads to reach t_max value, their
APD are set to 0 after that threshold and a warning message is printed.
A list containing:
index_plot: A ggplot object showing a scatterplot of
Calinski-Harabasz index vs. rarefaction threshold.
ordination_plots: A list of lists containing the ordination plots for
tested threshold values. If multiple repeat values were used, each one
will be present in a different list.
avg_distances: A data frame (or list of data frames if using multiple
repeat values) with average pairwise distances of the cloud generated from
each sample repetition attempts (for each threshold).
library(Sibyl) result <- test_threshold(adults, repeats = 10, t_min = 100, t_max = 500, t_step = 50, group = "location", verbose = FALSE) result$index_plotlibrary(Sibyl) result <- test_threshold(adults, repeats = 10, t_min = 100, t_max = 500, t_step = 50, group = "location", verbose = FALSE) result$index_plot