The scRNA-seq data for the analysis of human lung tissue were obtained from GEO accession GSE122960, and the bulk RNA-seq data of purified AT2 and AM fractions were shared by the authors immediately upon request. As a gold standard, we used results from bulk RNA-seq of isolated AT2 cells and AM comparing IPF and healthy lungs (bulk). 6a) and plotting well-known markers of these two cell types (Fig. Generally, tests for marker detection, such as the wilcox method, are sufficient if type I error rate control is less of a concern than type II error rate control; in circumstances where the type I error rate is most important, methods like subject and mixed can be used. Developed by Paul Hoffman, Satija Lab and Collaborators. The general process for detecting genes then would be: Repeat for all cell clusters/types of interest, depending on your research questions. Cons:

In a study in which a treatment has the effect of altering the composition of cells, subjects in the treatment and control groups may have different numbers of cells of each cell type. Infinite p-values are set to a defined value of the highest -log(p) + 100. Applying themes to plots. Four of the methods were applications of the FindMarkers function in the R package Seurat (Butler et al., 2018; Satija et al., 2015; Stuart et al., 2019) with different options for the type of test performed: for the method wilcox, cell counts were normalized and log-transformed, and a Wilcoxon rank sum test was performed for each gene; for the method NB, cell counts were modeled using a negative binomial generalized linear model; for the method MAST, cell counts were modeled using a hurdle model based on the MAST software (Finak et al., 2015); and for the method DESeq2, cell counts were modeled using the DESeq2 software (Love et al., 2014).
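A minimal sketch of the wilcox approach described above, written in base R rather than through Seurat's FindMarkers: counts are normalized by library size, log-transformed and tested gene by gene with a Wilcoxon rank sum test. The toy count matrix, the group labels and the scale factor of 10,000 are assumptions chosen for illustration, not values from the study.

```r
# Toy count matrix: 3 genes x 8 cells (made-up values, for illustration only)
counts <- matrix(
  c(0, 1, 0, 2, 9, 12, 8, 15,
    5, 4, 6, 5, 5,  6, 4,  5,
    3, 0, 1, 2, 0,  1, 3,  2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:8))
)

# Hypothetical group labels: first four cells are controls, last four treated
group <- factor(rep(c("control", "treated"), each = 4))

# Normalize each cell to 10,000 total counts (assumed scale factor), then log1p
lib_size <- colSums(counts)
norm <- log1p(t(t(counts) / lib_size) * 1e4)

# Wilcoxon rank sum test for each gene, treating cells as the units of analysis
p_val <- apply(norm, 1, function(x) {
  wilcox.test(x[group == "control"], x[group == "treated"], exact = FALSE)$p.value
})

print(p_val)
```

Because every cell is treated as an independent observation, this kind of test ignores which subject a cell came from, which is the behaviour the subject and mixed methods are meant to address.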
Here is the volcano plot. I read before that we are not allowed to do the differential gene expression analysis using the integrated data. These analyses suggest that a naive approach to differential expression testing could lead to many false discoveries; in contrast, an approach based on pseudobulk counts has better FDR control. Here, we compare the performance of subject, wilcox and mixed to detect cell subtype markers of CD66+ and CD66- basal cells with bulk RNA-seq data from corresponding PCTs. Supplementary Figure S14 shows the results of marker detection for T cells and macrophages. a, Volcano plot of RNA-seq data from bulk hippocampal tissue from 8- to 9-month-old P301S transgenic and non-transgenic mice (Wald test). disease and intervention), (ii) variation between subjects, (iii) variation between cells within subjects and (iv) technical variation introduced by sampling RNA molecules, library preparation and sequencing. Single-cell RNA-sequencing (scRNA-seq) provides more granular biological information than bulk RNA-sequencing; bulk RNA-sequencing remains popular because its lower cost allows researchers to process more biological replicates and design more powerful studies. Comparison of methods for detection of CD66+ and CD66- basal cell markers from human trachea. We then compare multiple differential expression testing methods on scRNA-seq datasets from human samples and from animal models. In stage ii, we assume that we have not measured cell-level covariates, so that variation in expression between cells of the same type occurs only through the dispersion parameter $\sigma_{ij}^2$. Although the pseudobulk method is a simple approach to DS analysis, it has limitations. For each subject, gene counts are summed over all cells. (a) Volcano plots and (b) heatmaps of top 50 genes for 7 different DS analysis methods.
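The aggregation step mentioned above (summing gene counts over all cells of each subject) can be sketched in base R as follows. The count matrix and the subject labels are hypothetical, and this is only an illustration of the idea, not the authors' package.

```r
# Toy count matrix: 2 genes x 6 cells (made-up values)
counts <- matrix(
  c(1, 0, 2, 3, 0, 4,
    5, 5, 1, 0, 2, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("cell", 1:6))
)

# Hypothetical subject label for each cell (cells of one cell type only)
subject <- c("s1", "s1", "s1", "s2", "s2", "s2")

# rowsum() sums rows within groups, so transpose to cells x genes first,
# then transpose back to the usual genes x subjects layout
pseudobulk <- t(rowsum(t(counts), group = subject))

print(pseudobulk)
```

The resulting genes-by-subjects matrix has one column per subject, so it can be passed to a bulk RNA-seq method with subjects as the units of analysis.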
For example, consider a hypothetical gene having heterogeneous expression in CF pigs, where cells were either low expressors or high expressors, versus homogeneous expression in non-CF pigs, where cells were moderate expressors. These methods appear to form two clusters: the cell-level methods (wilcox, NB, MAST, DESeq2 and Monocle) and the subject-level method (subject), with mixed sharing modest concordance with both clusters. To better illustrate the assumptions of the theorem, consider the case when the size factor $s_{jc}$ is the same for all cells in a sample $j$, and denote the common size factor as $s_j^*$. Increasing sequencing depth can reduce technical variation and achieve more precise expression estimates, and collecting samples from more subjects can increase power to detect differentially expressed genes. Alternatively, batch correction methods have been proposed to remove inter-individual differences prior to DS analysis; however, this increases type I error rates and disturbs the rank-order of results, as explained in Zimmerman et al. Default is 0.25. The implementation provided in the Seurat function 'FindMarkers' was used for all seven tests. If subjects are composed of different proportions of types A and B, DS results could be due to different cell compositions rather than different mean expression levels. Aggregation technique accounting for subject-level variation in DS analysis. The number of UMIs for cell $c$ was taken to be the size factor $s_{jc}$ in stage 3 of the proposed model. As scRNA-seq studies grow in scope, due to technological advances making these studies both less labor-intensive and less expensive, biological replication will become the norm. We identified cell types, and our DS analyses focused on comparing expression profiles between large and small airways and between CF and non-CF pigs. Supplementary Table S2 contains performance measures derived from the ROC and PR curves.
Overall, the volcano plots for subject and mixed look similar, with a higher number of genes upregulated in the IPF group, while the wilcox method exhibits a much different shape with more genes highly downregulated in the IPF group. Next, I'm looking to visualize this as a volcano plot using the EnhancedVolcano package. The observed counts for the PCT study are analogous to the aggregated counts for one cell type in a scRNA-seq study. Finally, we discuss potential shortcomings and future work. First, a random proportion of genes, pDE, were flagged as differentially expressed.

# This can be changed with the `group.by` parameter
# Use community-created themes, overwriting the default Seurat-applied theme. Install ggmin
# with remotes::install_github('sjessa/ggmin')
# Seurat also provides several built-in themes, such as DarkTheme; for more details see
# Include additional data to display alongside cell names by passing in a data frame of
# information. Works well when using FetchData
## [1] "AAGATTACCGCCTT" "AAGCCATGAACTGC" "AATTACGAATTCCT" "ACCCGTTGCTTCTA"
# Now, we find markers that are specific to the new cells, and find clear DC markers
##                 p_val avg_log2FC pct.1 pct.2    p_val_adj
## FCER1A   3.239004e-69  3.7008561 0.800 0.017 4.441970e-65
## SERPINF1 7.761413e-36  1.5737896 0.457 0.013 1.064400e-31
## HLA-DQB2 1.721094e-34  0.9685974 0.429 0.010 2.360309e-30
## CD1C     2.304106e-33  1.7785158 0.514 0.025 3.159851e-29
## ENHO     5.099765e-32  1.3734708 0.400 0.010 6.993818e-28
## ITM2C    4.299994e-29  1.5590007 0.371 0.010 5.897012e-25
## [1] "selected"     "Naive CD4 T"  "Memory CD4 T" "CD14+ Mono"   "B"
## [6] "CD8 T"        "FCGR3A+ Mono" "NK"           "Platelet"
# LabelClusters and LabelPoints will label clusters (a coloring variable) or individual points
# Both functions support `repel`, which will intelligently stagger labels and draw connecting
# lines from the labels to the points or clusters
## Platform: x86_64-pc-linux-gnu (64-bit)
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
## [1] stats graphics grDevices utils datasets methods base
## [1] patchwork_1.1.2 ggplot2_3.4.1
## [3] thp1.eccite.SeuratData_3.1.5 stxBrain.SeuratData_0.1.1
## [5] ssHippo.SeuratData_3.1.4 pbmcsca.SeuratData_3.0.0
## [7] pbmcMultiome.SeuratData_0.1.2 pbmc3k.SeuratData_3.1.4
## [9] panc8.SeuratData_3.0.2 ifnb.SeuratData_3.1.0
## [11] hcabm40k.SeuratData_3.0.0 bmcite.SeuratData_0.3.0
## [13] SeuratData_0.2.2 SeuratObject_4.1.3

The subject method had the shortest average computation times, typically <1 min. 6b). Multiple methods and bioinformatic tools exist for initial scRNA-seq data processing, including normalization, dimensionality reduction, visualization, cell type identification, lineage relationships and differential gene expression (DGE) analysis (Chen et al., 2019; Hwang et al., 2018; Luecken and Theis, 2019; Vieth et al., 2019; Zaragosi et al., 2020). Step 4: Customise it!

# Calculate feature-specific contrast levels based on quantiles of non-zero expression.

Supplementary Table S1 shows performance measures derived from these curves. The volcano plot for the subject method shows three genes with adjusted P-value <0.05 (-log10(FDR) > 1.3), whereas the other six methods detected a much larger number of genes. (c) Volcano plots show results of three methods (subject, wilcox and mixed) used to identify CD66+ and CD66- basal cell marker genes. Further, if we assume that, for some constants $k_1$ and $k_2$, $C_j^{-1}\sum_c s_{jc} \to k_1$ and $C_j^{-1}\sum_c s_{jc}^2 \to k_2$ as $C_j \to \infty$, then the variance of $K_{ij}$ is $\mu_{ij} + \{\phi_i + o(1)\}\mu_{ij}^2$.
Here, we introduce a mathematical framework for modeling different sources of biological variation introduced in scRNA-seq data, and we provide a mathematical justification for the use of pseudobulk methods for DS analysis. The volcano plot that is being produced after this analysis is weird and seems not to be correct. In another study, mixed models were found to be superior alternatives to both pseudobulk and marker detection methods (Zimmerman et al., 2021). The genes detected by the subject and mixed methods have high inter-group (CF versus non-CF) and low intra-group (between-subject) variability, whereas the wilcox, NB, MAST, DESeq2 and Monocle methods tend to be sensitive to a highly variable gene expression pattern from the third CF pig. The data from pig airway epithelia underlying this article are available in GEO and can be accessed with GEO accession GSE150211. First, we present a statistical model linking differences in gene counts at the cellular level to four sources: (i) subject-specific factors (e.g. Second, we make a formal argument for the validity of a DS test with subjects as the units of analysis and discuss our development of a Bioconductor package that can be incorporated into scRNA-seq analysis workflows. Rows correspond to different proportions of differentially expressed genes, pDE, and columns correspond to different SDs of the (natural) log fold change. Subject-level gene expression scores were computed as the average counts per million for all cells from each subject. For each method, the computed P-values for all genes were adjusted to control the FDR using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995). In practice, often only one cutoff value for the adjusted P-value will be chosen to detect genes. I have scoured the web but I still cannot figure out how to do this. In this case, $C_j^{-1}\sum_c s_{jc} = s_j^*$ and $C_j^{-1}\sum_c s_{jc}^2 = s_j^{*2}$, and the theorem holds.
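As an illustration of the adjustment step described above, the following base R sketch applies the Benjamini-Hochberg procedure with p.adjust() and then detects genes at a single cutoff. The P-values, gene names and the 0.05 cutoff are made up for the example.

```r
# Hypothetical raw P-values for four genes
p_val <- c(geneA = 0.001, geneB = 0.02, geneC = 0.03, geneD = 0.5)

# Benjamini-Hochberg adjustment to control the FDR
p_adj <- p.adjust(p_val, method = "BH")

# Detect genes at one chosen cutoff for the adjusted P-value (assumed 0.05 here)
detected <- names(p_adj)[p_adj < 0.05]

print(p_adj)
print(detected)
```

With four genes, the adjusted values are 0.004, 0.04, 0.04 and 0.5, so geneA, geneB and geneC pass the cutoff.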
When only 1% of genes were differentially expressed, the mixed method had a larger area under the curve than the other five methods. For the AM cells (Fig. Results for alternative performance measures, including receiver operating characteristic (ROC) curves, TPRs and false positive rates (FPRs), can be found in Supplementary Figures S7 and S8. As the SD of the log fold change increases, the width of the distribution of effect sizes increases, so that the signal-to-noise ratio for differentially expressed genes is larger. In practice, this assumption is unlikely to be satisfied, but if we make modest assumptions about the growth rates of the size factors and numbers of cells per subject, we can obtain a useful approximation. Supplementary Figure S14(c-d) shows that generally the shapes of the volcano plots are more similar between the subject and mixed methods than the wilcox method. I keep receiving an error that says: "data must be a data.frame, or an object coercible by fortify(), not an S4 object with class ...". Therefore, as experiments that include biological replication become more common, statistical frameworks to account for multiple sources of biological variability will be critical, as recently described by Lähnemann et al.
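The TPR and FPR measures mentioned above can be computed at a single adjusted P-value cutoff as in this base R sketch. The truth labels, adjusted P-values and the 0.05 cutoff are hypothetical, standing in for a simulation in which the truly differentially expressed genes are known.

```r
# Hypothetical simulation truth: TRUE marks genes flagged as differentially expressed
truth <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

# Hypothetical adjusted P-values returned by some DS method for the same genes
p_adj <- c(0.001, 0.03, 0.20, 0.04, 0.30, 0.60, 0.80, 0.95)

# Genes called significant at the chosen cutoff (assumed 0.05)
called <- p_adj < 0.05

# True positive rate: detected true positives over all truly DE genes
tpr <- sum(called & truth) / sum(truth)

# False positive rate: false detections over all non-DE genes
fpr <- sum(called & !truth) / sum(!truth)

print(c(TPR = tpr, FPR = fpr))
```

Repeating this calculation over a grid of cutoffs traces out an ROC curve; here two of three true genes and one of five null genes are called, giving a TPR of 2/3 and an FPR of 0.2.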