(A) Data uploading

In this example, two data files stored in the server are used. One (the case) is “logged-colorectal-exhausted-allData.csv” which is the scRNA-seq data from exhausted CD8 T cells isolated from human colorectal cancers (“all” means the data are from multiple patients). The other (the control) is “logged-colorectal-exhausted-controlData.csv” which is the scRNA-seq data from non-exhausted CD8 T cells isolated from the nearby colorectal tissue. Since the data are log2-transformed, the “log2-transformed” button under the “What is the data format” is clicked.

Data uploading may take a while, from seconds to minutes, depending on the two files’ size. After uploading the two files, a TF list should be chosen. Instead of using the default human TF list reported by Bahrami et al 2015, the TF list used in the SCENIC package is chosen here.

After data uploading, the user can use the data summary function at the bottom of the “Main Menu” web page to view data. Genes can be ordered on nine attributes. By pressing the down-triangle button at “valueInCase”, genes are ordered on their expression values.

By pressing the down-triangle button at “foldChange”, genes are ordered on their fold changes. The result shows that CXCL13, HAVCR2, CCL3, and PDCD1 are genes with the largest fold changes.

By pressing the down-triangle button at “TF”, the user can identify all TFs.

By inputting the string “to” (it is capital insensitive, so “to” and “TO” have no difference) into the box at “Search”, genes containing the string “to” are listed. These genes include TOX (an important transcription factor controlling the exhaustion of CD8 T cells) and TOX2 (whose function remains unclear).

(B) Feature selection

Now is the second step – to choose a subset of genes for causal discovery. “Step 2: Feature Selection Parameters” helps the user to go into Feature Selection, which has three elements – determining candidate genes, choosing and inputting response variables (target genes), and running one or multiple feature selection algorithms to identify feature genes from candidate genes.

It is not reliable to select a few feature genes (e.g., 50) from too many candidate genes (e.g., 10000). Thus, it is advisable to filter out some genes. Multiple conditions can be used to filter candidate genes, including “gene expression threshold” (the mean expression values of genes), “cellInCase” (the percentage of cells in which genes are expressed), “varianceInCase” (the mean variances of genes), “foldChange” (the mean fold changes of genes), and “TF” (genes encoding TFs). In the data summary part, these conditions are used to examine the data. In “logged-colorectal-exhausted-allData.csv” there are 16881 genes. If the user assumes that (a) feature genes should be expressed in >= 60% of cells, (b) its log2-transformed value should be >= 0.15, (c) they should have fold change > 0.2, and they should be TFs, then, dragging the left and right buttons of “gene expression threshold” to 0 and 0.15, dragging the button of “cellInCase” to 0.6, dragging the button of “foldChange” to 0.2, and using the mouse and Backspace key to set “TF=Yes” (“TF=Yes No” stipulates that feature genes include both TF and non-TF, and “TF=No” stipulates that feature genes include only TFs). Whenever the conditions are redefined, the number under “max candidate gene number” shows the number of qualified candidate genes.

Dragging buttons here and in Step3: Causal Discovery Parameters using the mouse to an accurate value may be uneasy. After using the mouse to make a click, the user can use the “<-” key and “->” key (at the bottom-right of the keyboard) to manually move a button leftward or rightward. So, the preferable way to set/reset a parameter is to use the mouse to drag the button to a largely correct position, then use the “<-” and “->” keys to fine-tune the parameter value.

This example shows the use of the four recommended feature selection algorithms. First, the checkbox of Random Forest, Extra Trees, BAHSIC, and SHS are clicked. Then, the button under “Number of selected genes for each algorithm” is dragged to 80 to let each algorithm select 80 feature genes. Next, input “tox” to the box under “Input one or multiple response variables (target genes)” to select genes highly associated with the TOX gene. After doing these, click the “Run Feature Selection” button and the notification “Running feature selection may take several minutes” pops out.

Feature selection may take 1-3 minutes, longer if the data file is large, the candidate gene list is large, and multiple algorithms are chosen. Note that feature selection is applied to all cells, but causal discovery does not necessarily. After submitting a feature selection, here comes Feature Selection Result. A concordance heatmap is generated to graphically display feature selection results, clearly showing what genes are selected by what algorithms. The “concordance heatmap” of this feature selection is shown below; note that gene names on the right side cannot be clearly displayed. Three parameters control the graphic display. The first is “Minimal overlapped feature selection algorithms” whose default value is 1 (i.e., genes selected by any algorithm are feature genes). The other two parameters, “heatmap width” and “heatmap height” are for resizing the heatmap.

Now, dragging the button under “Minimal overlapped feature selection algorithms” to 2 and dragging the button under “heatmap height” to 1100, feature genes are reduced to 93, and their names can be displayed clearly. It is clear that many genes are co-selected by BAHSIC and SHS, and by Randomly Forest and Extra Trees, respectively. Those selected by the four are ranked at the top.

Below the heatmap, feature genes are displayed in the same way as genes are displayed after being uploaded. Similarly, the user can order genes upon any attribute. These operations help the user see whether feature genes are reasonable or whether expected feature genes are identified.

Since TOX is an important TF involved in CD8 T cell exhaustion, it may be interesting to see what genes have the lowest and highest fold changes, which suggest potential down- and up-regulation by TOX in the exhausted CD8 T cells. By pressing the up- and down-triangle at “foldchange”, the most down- and up-regulated feature genes are listed. Of note, PDCD1, MIR155HG, and CD27-AS1 belong to the most up-regulated ones.

By inputting the string “pdc” into the box at “Search”, genes containing the string “pdc” are searched out. In these feature genes, only PDCD1 contains “pdc”.

It is not necessary to run causal discovery for 93 feature genes. By dragging the button under “Minimal overlapped feature selection algorithms” to 3, it turns out that the four algorithms co-select 35 genes, including PDCD1, CXCL13, ENTPD1, TNFRSF9, and CCND2 that are important for CD8 T cell exhaustion.

Feature genes can be saved in three ways, by clicking the “Copy”, “CSV”, and “Excel” buttons under “Selected feature genes”. The “CSV” and “Excel” buttons allow the feature genes to be saved as a csv or Excel file, the “Copy” button allows the feature genes to be copied to an opened file (e.g., a Word, text, or Excel file).

(C) Causal discovery

To enter into “Step 3: Causal Discovery Parameters” from “Feature Selection Result”, the user should click “Main Menu”. Functions and parameters in “Step3: Causal Discovery Parameters” are used to make the causal discovery. To perform causal discovery using genes selected by the just finished feature selection, click “Yes” the “Adopt the just finished feature selection result.” Since multiple rounds of feature selection are often performed to identify meaningful feature genes, and the user may add or remove some genes, this example shows manually inputting feature genes. To do so, first click the “No” check box under “Accept the just finished feature selection result,” then input feature genes into the box under “If No is chosen, input feature genes.” Note that only gene names are inputted. If the saved or copied feature genes are in an Excel file, mark and copy gene names using the mouse and the Ctrl-C hotkey, and paste the genes into the box using the Ctrl-V hotkey. Ensure that in the input box all lines should be gene names, and the last line should not be empty. Otherwise, such bug as “is not in data summary” or “TOX CXCL13 is not in data summary” would be reported because such empty lines correspond to no genes. If the bug is reported, the user can paste feature genes into the Windows Notepad and then copy these genes from the Notepad to the input box to remove empty characters and the newline character.

Three causal discovery methods and nine conditional independence tests are included. The user can choose one causal discovery method with any number of conditional independence tests. A highly recommended combination is PC + dcc.gamma + dcc.perm. We use spike-in data to help ensure the reliability of causal discovery. Since the dataset “logged-colorectal-exhausted-allData.csv” was generated by Smart-seq2, “logged-CD74-macrophages-spike-in-Smart-seq2.csv” stored in the server is choen as the spike-in dataset.

Under “Set the alpha level for conditional independence test, ” the alpha level’s default value is 0.1; this example adopts this value. This dataset contains 840 cells; we just choose 300 cells by dragging the button under “Select the number of cells for causal discovery” to 300. Also, by clicking the check box of “Randomly sampling cells from the case”, this example chooses to randomly select the 300 cells from the dataset (data in the control file are not involved in causal discovery). Finally, the email address is filled.

When the causal discovery job is successfully submitted, a pop-up window emerges to report the submission and job ID. The result of the job will be received by email, and the user can also use the job ID to retrieve the result manually.

(D) Retrieve causal discovery results

By clicking “Causal Discovery Result”, the user enters the web page to manually retrieve the result of one or multiple previous jobs. Here we retrieve a previous job by inputting its job ID 20220519110952 into the box under “Job ID”.

In this web page, functions used in previous steps are also available. For example, the “data summary” function allows the user to display the genes of the original dataset.

The “concordance heatmap” function allows the user to review the feature selection results. Note that the three parameters (“Minimal overlapped feature selection algorithms”, “heatmap height”, and “heatmap width”) can still be used here to show feature genes selected by algorithms.

The “selected feature genes” function allows the user to display feature genes of the causal discovery task.

The “causal graph” function allows the user to redraw the causal network inferred by any algorithm. By clicking the check box at RCIT and pressing the button “Generate Causal Graph”, the causal network is drawn in the right-side area. This causal discovery was performed by only RCIT.

Two parameters help control the size and shape of the graph. The default value of “causal graph width” is 800. By dragging the button under “causal graph width” to 550, the network becomes narrower.

Finally, at the bottom of these sub-pages, the download buttons allow data and pictures to be downloaded as Excel files and PDF files, respectively.

The step-by-step process of an example

(A) Data uploading

(B) Feature selection

(C) Causal discovery

(D) Retrieve causal discovery results