Annotations of Data Summary Parameters

Parameters Description
Choose the dataset of a cell type to analyze To try a dataset already on the server, click the down-triangle icon to open the file list and select a file with the mouse. Otherwise, use the “Or: Upload the dataset of a cell type to analyze” function to upload a file. Note that (a) the file must be in .csv or .tsv format, and (b) rows are genes and columns are cells.
Choose the dataset of a control for the analysis The server provides several control data files, which can be chosen in the same way as described above. For example, “logged-liver-exhausted-allData.csv” and “logged-liver-exhausted-controlData.csv” are a pair of files that correspond to exhausted CD8 T cells (the “case”) and non-exhausted CD8 T cells (the “control”) in liver cancer. Otherwise, use the “Or: Upload the dataset of a control for the analysis” function to upload a file. The control dataset must also be in .csv or .tsv format, with rows as genes and columns as cells. Note that a control file is optional; if one is provided, fold changes of gene expression between the case and the control are computed, and the inferred causal network is color-mapped by fold change.
What is the data format Two data formats are currently accepted: log2-transformed values and z-scores. Select the button that matches the data. If the data are in neither format, select “Raw values”, and the data will be log2-transformed during upload.
Select a TF list to show TF genes Transcription factors (TFs) are important regulators of gene expression, and different researchers report different TF lists. The server provides three human TF lists and one mouse TF list, which can be viewed by clicking the down-triangle icon to open the file list. If your data were generated from human samples, the default TF list should be adequate. If your data were generated from mice, select the mouse TF list for more accurate TF annotation.
Show entries This section displays genes and their attributes. First, the number of genes displayed per page (10, 25, 50, or 100) can be chosen by clicking the downward arrow in Show entries. In addition to geneName, the attributes are valueInCase, valueInControl, cellsInCase, cellInControl, varianceInCase, varianceInControl, foldChange, and TF. The attributes valueInControl, cellInControl, and varianceInControl have values when a control is provided, and the attribute TF has values when a TF list is chosen. valueInCase and valueInControl are the mean expression values of a gene in the case and control files. cellsInCase and cellInControl are the percentages of cells in which a gene is expressed in the case and control files. varianceInCase and varianceInControl are the variances of a gene’s expression level in the case and control files. Second, to the right of each attribute, a pair of buttons (the down-triangle and up-triangle) sorts genes by ascending or descending value. The user can click either one to sort genes. This helps the user find genes with the largest or smallest values, variances, or fold changes, and genes expressed in the most or fewest cells. Third, to help the user find particular genes, a “Search” function is provided at the right side of the gene display section. Entering any string returns the genes whose names contain that string. Fourth, at the bottom of the gene display section are page numbers (1, 2, 3…) and the buttons “Previous” and “Next”, which let the user step through pages or jump to a particular page.
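As a rough illustration of how the attributes in the gene table relate to the expression matrix, the following Python sketch computes them for a tiny made-up dataset. It is not the server's implementation: the function names, the toy values, the rule that a gene counts as expressed in a cell when its value is greater than 0, the use of population variance, and the definition of fold change as the difference of mean log2 values are all assumptions made for the example.

```python
from statistics import mean, pvariance


def summarize_gene(values):
    """Summarize one gene (one matrix row) across cells.

    Assumptions: a gene is "expressed" in a cell when its value is > 0,
    and variance means the population variance.
    """
    expressed = sum(1 for v in values if v > 0)
    return {
        "value": mean(values),                     # mean expression
        "cells": 100.0 * expressed / len(values),  # % of cells expressing the gene
        "variance": pvariance(values),
    }


def fold_change(case_values, control_values):
    """Assumed definition for log2 data: mean(case) - mean(control)."""
    return mean(case_values) - mean(control_values)


# Made-up log2 expression values; rows are genes, columns are cells.
case = {"PDCD1": [0.0, 2.0, 4.0, 2.0], "TOX": [0.0, 0.0, 0.0, 4.0]}
control = {"PDCD1": [0.0, 0.0, 2.0, 0.0], "TOX": [0.0, 0.0, 0.0, 0.0]}

table = {}
for gene in case:
    in_case = summarize_gene(case[gene])
    in_control = summarize_gene(control[gene])
    table[gene] = {
        "valueInCase": in_case["value"],
        "valueInControl": in_control["value"],
        "cellsInCase": in_case["cells"],
        "cellInControl": in_control["cells"],
        "varianceInCase": in_case["variance"],
        "varianceInControl": in_control["variance"],
        "foldChange": fold_change(case[gene], control[gene]),
    }

for gene, row in table.items():
    print(gene, row)
```

Sorting this table by any column mirrors what the up-triangle and down-triangle buttons do in the interface.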

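The input requirements above (a .csv or .tsv file with genes as rows and cells as columns) and the “Raw values” option can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the first row is assumed to hold cell names and the first column gene names, and the pseudocount of 1 in the log2 transformation is an assumption, since the server's exact formula is not given here.

```python
import csv
import io
import math


def read_expression(text, filename):
    """Parse a genes-by-cells matrix from .csv or .tsv text.

    Returns (cell_names, {gene: [values...]}). Assumes the first row holds
    cell names and the first column holds gene names.
    """
    delimiter = "\t" if filename.endswith(".tsv") else ","
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    cell_names = rows[0][1:]
    matrix = {row[0]: [float(v) for v in row[1:]] for row in rows[1:]}
    return cell_names, matrix


def log2_transform(matrix, pseudocount=1.0):
    """log2(value + pseudocount) for every entry; the pseudocount is an assumption."""
    return {
        gene: [math.log2(v + pseudocount) for v in values]
        for gene, values in matrix.items()
    }


# Made-up raw values standing in for an uploaded file.
raw_text = "gene,cell1,cell2,cell3\nPDCD1,0,1,3\nTOX,7,0,1\n"
cells, raw = read_expression(raw_text, "example.csv")
logged = log2_transform(raw)

print(cells)
print(logged)
```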
Annotations of Feature Selection Parameters

Parameters Description
gene expression threshold If the data are log2-transformed, a frequently asked question is whether an up-regulated gene is associated with the up-regulation and down-regulation of other genes. If the data are z-scores, the question is whether an up- or down-regulated gene (one whose z-score has a large absolute value) is associated with the up-regulation and down-regulation of other genes. To identify such up- and/or down-regulated genes, it is often necessary to use this parameter to filter genes.
cellsInCase The range is 0% to 100%. To infer causal interactions among genes expressed in most cells, set a high percentage (e.g., 50% or even 70%); otherwise, a smaller percentage is acceptable. Because causal discovery is quite sensitive to sample heterogeneity, this parameter significantly influences feature selection and causal discovery.
varianceInCase When no gene is a clear candidate for the response variable (target gene), trying genes with large variance is an option. The underlying assumption is that these genes show high “activity” in the cells and are thus more likely to be the “causes” of some effects.
foldChange This parameter also helps identify the response variable (target gene), especially when cell-specific high or low expression is of interest.
Display candidate gene number Whenever any of the four parameters is changed, the number of candidate genes changes, and the user should press this button to display the new number. The button below can be dragged leftward or rightward to reduce the number of genes (by filtering out those that are expressed more sparsely) or to increase it.
Select feature selection algorithms (one or multiple) The user can choose one or multiple algorithms to perform feature selection. The number of ‘*’ indicates how strongly an algorithm is recommended: algorithms with ‘***’ are most recommended, and algorithms without any ‘*’ are least recommended.
Number of selected genes for each algorithm This determines how many genes each algorithm selects. If the user selects multiple algorithms and uses the intersection of their results as feature genes, a relatively large number (e.g., 90) should be set. Otherwise, the default value of 50 is appropriate. If fast causal discovery algorithms will be chosen later, 60-70 feature genes are acceptable; if slow causal discovery algorithms will be chosen, fewer than 50 feature genes are preferable to limit running time.
Input one or multiple response variables (target genes) Input the response variables for feature selection. The choice of response variables (target genes) critically determines causal discovery. To investigate the mechanisms of T cell exhaustion, “PDCD1” and “PDCD1, TOX” (multiple genes are separated by commas) are appropriate inputs, but in many scenarios it is unclear which genes are key. Thus, multiple rounds of feature selection are often performed with different target genes.
Minimal overlapped feature selection algorithms When N feature selection algorithms are run, the user can keep the feature genes shared by at least n (n <= N) algorithms. This parameter defines n, and the default value is 1. The value can be changed by dragging the button.
Heatmap width The concordance heatmap shows which feature genes are selected by which algorithms. If many algorithms are chosen, this parameter can be used to adjust the width of each column so that the full heatmap can be displayed.
Heatmap height If feature selection generates many feature genes, the font of the gene names may need to be adjusted. Dragging the button resets the height of the heatmap.
Show entries This function displays feature genes with their attributes. As mentioned above, the user can sort genes by their attributes or use “Search” to find specific genes. To the left of Show entries are three buttons: “Copy”, “CSV”, and “Excel”. Pressing “CSV” or “Excel” saves the feature genes (with their attribute values) as a CSV or Excel file. Pressing “Copy” copies the feature data; to paste them, open a Word, Excel, or text file, right-click, and choose “Paste”.
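The effect of the four filtering parameters (gene expression threshold, cellsInCase, varianceInCase, and foldChange) on the candidate gene number can be sketched as below. This is a minimal illustration, not the server's code: the function name, the gene names, and the values are hypothetical, and treating every threshold as a lower bound (with fold change compared by absolute value) is an assumption.

```python
def candidate_genes(genes, min_value=0.0, min_cells=0.0,
                    min_variance=0.0, min_abs_fold_change=0.0):
    """Return the names of genes that pass all four thresholds.

    Assumption: each threshold is a lower bound, and fold change is
    compared by absolute value so that both directions are kept.
    """
    return [
        g["geneName"]
        for g in genes
        if g["valueInCase"] >= min_value
        and g["cellsInCase"] >= min_cells
        and g["varianceInCase"] >= min_variance
        and abs(g["foldChange"]) >= min_abs_fold_change
    ]


# Hypothetical gene table using the attribute names shown in the interface.
genes = [
    {"geneName": "geneA", "valueInCase": 3.2, "cellsInCase": 80.0,
     "varianceInCase": 1.5, "foldChange": 2.1},
    {"geneName": "geneB", "valueInCase": 2.5, "cellsInCase": 65.0,
     "varianceInCase": 0.9, "foldChange": -1.4},
    {"geneName": "geneC", "valueInCase": 6.0, "cellsInCase": 99.0,
     "varianceInCase": 0.1, "foldChange": 0.0},
    {"geneName": "geneD", "valueInCase": 0.4, "cellsInCase": 5.0,
     "varianceInCase": 0.3, "foldChange": 0.2},
]

loose = candidate_genes(genes, min_cells=50.0)
strict = candidate_genes(genes, min_cells=70.0, min_variance=1.0)

print(len(loose), loose)
print(len(strict), strict)
```

Raising any threshold can only shrink the candidate list, which is why the displayed number should be refreshed after each change.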

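The “Minimal overlapped feature selection algorithms” parameter can be pictured as a vote count across algorithms. The sketch below assumes that a gene is kept when at least n of the N algorithms selected it; the function name, algorithm labels, and gene names are hypothetical.

```python
from collections import Counter


def shared_feature_genes(selections, min_algorithms=1):
    """Genes selected by at least `min_algorithms` of the algorithms.

    `selections` maps an algorithm label to the list of genes it selected.
    """
    counts = Counter(
        gene for genes in selections.values() for gene in set(genes)
    )
    return sorted(gene for gene, n in counts.items() if n >= min_algorithms)


# Hypothetical results from N = 3 feature selection algorithms.
selections = {
    "algorithm1": ["geneA", "geneB", "geneC"],
    "algorithm2": ["geneB", "geneC", "geneD"],
    "algorithm3": ["geneC", "geneE"],
}

union = shared_feature_genes(selections, 1)         # default n = 1
majority = shared_feature_genes(selections, 2)
intersection = shared_feature_genes(selections, 3)  # n = N

print(union)
print(majority)
print(intersection)
```

With n = 1 the result is the union of all selections, and with n = N it is their intersection, which is why a larger number of selected genes per algorithm is advisable when the intersection is used.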
Annotations of Causal Discovery Parameters

Parameters Description
Adopt the just finished feature selection result If the “Yes” button is pressed, the feature genes selected by the most recent feature selection are used as the input to causal discovery. Otherwise, press the “No” button to tell the pipeline that the most recent feature selection result should be discarded.
If No is chosen, input feature genes This function allows the user to input feature genes. It also allows the user to do causal discovery using previously organized feature genes without performing a feature selection process. Note that the input should contain only gene names, one per line.
Select causal discovery algorithms (one or multiple) The DCC and HSIC algorithms are recommended. If the feature gene set is large, RCIT is recommended.
Use spike-in data or not A spike-in dataset is a small external dataset generated using the same protocol as the dataset to be analyzed. Using spike-in data can help check whether the inferred causal network is reliable, similar to using RNA spike-ins to evaluate the quality of single-cell RNA sequencing. If genes and their interactions in the spike-in dataset are effectively differentiated from genes and their interactions in the main dataset, the causal inference should be quite reliable.
Available spike-in data in this server Two datasets are available: one was generated using Smart-seq2 and the other using 10X Genomics. They can serve as the spike-in dataset for Smart-seq2 data and 10X Genomics data, respectively.
Or upload a spike-in data The user can also build a spike-in dataset and upload it if the scRNA-seq data were generated by a more specialized protocol.
Set the alpha level for conditional independence test This parameter is the significance threshold for the test, analogous to the p-value cutoff used to declare statistical significance. Typical values are 0.1 or 0.05.
Select the number of cells for causal discovery When the dataset is large, it is neither necessary nor possible to use all of the cells. Normally, for Smart-seq2 data, 300-400 cells are enough, and for 10X Genomics data, 600-700 cells are enough. The button can be dragged leftward or rightward to set the cell number.
Select how a subset of cells is sampled There are two ways to select a subset of cells. One is to select cells in which feature genes are mostly expressed (these cells will contain fewer missing values); the other is to select cells randomly. If the user applies the same causal discovery algorithms to a set of feature genes multiple times, “Randomly sampling” should be chosen.
Input an email address for receiving the results This is necessary because a causal discovery process takes hours or longer.
Submit job to cluster Pressing this button submits the causal discovery task and opens a small window that reports the job ID. The job ID can be used to retrieve the result. The task may be queued if other tasks are currently running.
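The two ways of choosing a subset of cells described under “Select how a subset of cells is sampled” can be sketched as follows. The function names and toy data are hypothetical, and counting a feature gene as expressed in a cell when its value is greater than 0 is an assumption; the server may rank cells differently.

```python
import random


def most_expressed_cells(matrix, cell_names, feature_genes, n_cells):
    """Pick the cells in which the most feature genes are expressed (> 0).

    Ties keep the original column order because Python's sort is stable.
    """
    def n_expressed(j):
        return sum(1 for gene in feature_genes if matrix[gene][j] > 0)

    ranked = sorted(range(len(cell_names)), key=n_expressed, reverse=True)
    return [cell_names[j] for j in ranked[:n_cells]]


def random_cells(cell_names, n_cells, seed=None):
    """Pick cells uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(cell_names, n_cells)


# Made-up matrix: rows are feature genes, columns are cells.
cell_names = ["c1", "c2", "c3", "c4"]
matrix = {
    "geneA": [1.0, 0.0, 2.0, 0.0],
    "geneB": [3.0, 0.0, 1.0, 2.0],
    "geneC": [0.5, 0.0, 0.0, 0.0],
}
features = ["geneA", "geneB", "geneC"]

top2 = most_expressed_cells(matrix, cell_names, features, 2)
rand2 = random_cells(cell_names, 2, seed=0)

print(top2)
print(rand2)
```

The first strategy always returns the same cells for a given input, whereas random sampling with different seeds yields different subsets, which is what makes repeated runs on the same feature genes informative.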