
Adding Seurat version of the PBMC clustering tutorial #5491

Open · wants to merge 49 commits into base: main

Conversation

MarisaJL (Collaborator)

This is the Seurat version of the Clustering 3K PBMCs with Scanpy tutorial for single-cell analysis.

@MarisaJL MarisaJL requested a review from a team as a code owner October 31, 2024 13:30
pavanvidem (Member)

@MarisaJL I will review it soon.

pavanvidem (Member) left a comment


Reviewed and tested up to the preprocessing steps.


Representing the matrix with these three files is convenient for sharing the data, but not for processing them. Different single cell analysis packages have attempted to solve the problem of storage and analysis by inventing their own formats, which has led to the proliferation of many different 'standards' in the scRNA-seq package ecosystem.

## Seurat Object

Suggested change
## Seurat Object
## SeuratObject


Please call it either by its data structure name, "SeuratObject", or in plain language as "Seurat object" (with a lower-case o).

> - *"Include features detected in at least this many cells"*: `3`
> - *"Include cells where at least this many features are detected"*: `200`
> - *"Calculate percentage of mito genes in each cell"*: `No`
>

Please include the selection of genes.tsv and barcodes.tsv in the parameters.

> 4. Check that the format is `RDS`
{: .hands_on}

We can't look at the RDS file directly as it is designed for computers to read, rather than humans, but the Seurat tools will now be able to interact with the data.

Please add an inspect step here. A question on the number of features and cells filtered out with our thresholds should give an idea of what happened in the last step.
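For context, outside Galaxy this kind of inspection is only a couple of lines of R. The sketch below is illustrative; the file name is assumed, and `<N>`/`<M>` stand in for the actual numbers reported:

```r
library(Seurat)

# Load the RDS file written by the Seurat Create tool (hypothetical file name)
pbmc <- readRDS("seurat_object.rds")

# Printing a Seurat object reports how many features and cells remain after
# the min.cells = 3 / min.features = 200 filters, e.g.
# "An object of class Seurat
#  <N> features across <M> samples within 1 assay"
pbmc
```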

>
> #### Distortion of population heterogeneity during variance estimation or principal components analysis
>
> When we cluster single cell data, we're looking for the biggest differences between groups of cells. If we have lots of low-quality cells, then the biggest differences we'll see might be these differences in quality, rather than something more biologically interesting. Differences in quality can have a big impact because low-quality cells often have low total RNA counts. When we perform scaling normalisation on these cells during preprocessing, this can make the variances for the genes they do express much bigger than for other cells. When we select the most variable genes in the dataset, we'll end up picking the ones expressed by low-quality cells. We'll use these in our PCA analysis, likely ending up with top principal components based on cell quality rather than biology, making it harder to detect the differences we're actually interested in.

Suggested change
> When we cluster single cell data, we're looking for the biggest differences between groups of cells. If we have lots of low-quality cells, then the biggest differences we'll see might be these differences in quality, rather than something more biologically interesting. Differences in quality can have a big impact because low-quality cells often have low total RNA counts. When we perform scaling normalisation on these cells during preprocessing, this can make the variances for the genes they do express much bigger than for other cells. When we select the most variable genes in the dataset, we'll end up picking the ones expressed by low-quality cells. We'll use these in our PCA analysis, likely ending up with top principal components based on cell quality rather than biology, making it harder to detect the differences we're actually interested in.
> When we cluster single cell data, we're looking for the biggest differences between groups of cells. If we have lots of low-quality cells, then the biggest differences we'll see might be these differences in quality, rather than something more biologically interesting. Differences in quality can have a big impact because low-quality cells often have low total RNA counts. When we perform scaling and normalisation on these cells during preprocessing, this can make the variances for the genes they do express much bigger than for other cells. When we select the most variable genes in the dataset, we'll end up picking the ones expressed by low-quality cells. We'll use these in our PCA analysis, likely ending up with top principal components based on cell quality rather than biology, making it harder to detect the differences we're actually interested in.

>
> #### Misidentification of upregulated genes
>
> Another problem that arises when we apply our preprocessing steps to small, low quality cells is that the genes we detect in them can appear to be strongly upregulated. Since we didn't detect many other genes in these cells, even a small difference in the number of transcripts detected can end up becoming much larger after normalisation. For example, contaminating transcripts may be present in all cells at low but constant levels. With the increased scaling normalization in low-quality cells, the small counts for these transcripts may become large normalized expression values, so we might think we've found a population of cells where these genes are upregulated.

Suggested change
> Another problem that arises when we apply our preprocessing steps to small, low quality cells is that the genes we detect in them can appear to be strongly upregulated. Since we didn't detect many other genes in these cells, even a small difference in the number of transcripts detected can end up becoming much larger after normalisation. For example, contaminating transcripts may be present in all cells at low but constant levels. With the increased scaling normalization in low-quality cells, the small counts for these transcripts may become large normalized expression values, so we might think we've found a population of cells where these genes are upregulated.
> Another problem that arises when we apply our preprocessing steps to small, low quality cells is that the genes we detect in them can appear to be strongly upregulated. Since we didn't detect many other genes in these cells, even a small difference in the number of transcripts detected can end up becoming much larger after normalisation. For example, contaminating transcripts may be present in all cells at low but constant levels. With the increased scaling and normalization in low-quality cells, the small counts for these transcripts may become large normalized expression values, so we might think we've found a population of cells where these genes are upregulated.
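For illustration, the inflation described above is easy to reproduce with a few lines of R. This is a sketch using made-up counts and LogNormalize-style scaling to 10,000 counts per cell:

```r
# The same raw count of a contaminating transcript in two cells
counts <- c(good_cell = 2, low_quality_cell = 2)
totals <- c(good_cell = 5000, low_quality_cell = 500)  # total RNA per cell

# LogNormalize-style scaling: counts per 10,000, then log1p
normalised <- log1p(counts / totals * 10000)
normalised
# good_cell ~ 1.6, low_quality_cell ~ 3.7: the same raw count looks much more
# highly "expressed" in the low-quality cell after normalisation
```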

Comment on lines +290 to +291
> > 2. A threshold of 5% should work for this dataset. As before, the violin plot shows that the majority of our cells are below this threshold, but in this case the cells above it had low total RNA counts, which suggests these could be damaged cells that had lost a lot of their other RNAs.
> > Although there are few standards to guide us in single cell analysis, you will see the same 5% threshold for mitochondrial content used in many studies. It often works well, but it can filter out energetic cell types such as muscle cells, so we shouldn't apply this threshold without considering whether it works for our dataset. Some studies don't filter on mitochondrial reads at all!

Suggested change
> > 2. A threshold of 5% should work for this dataset. As before, the violin plot shows that the majority of our cells are below this threshold, but in this case the cells above it had low total RNA counts, which suggests these could be damaged cells that had lost a lot of their other RNAs.
> > Although there are few standards to guide us in single cell analysis, you will see the same 5% threshold for mitochondrial content used in many studies. It often works well, but it can filter out energetic cell types such as muscle cells, so we shouldn't apply this threshold without considering whether it works for our dataset. Some studies don't filter on mitochondrial reads at all!
> > 2. A threshold of 5% should work for this dataset. As before, the violin plot shows that the majority of our cells are below this threshold, but in this case the cells above it had low total RNA counts, which suggests these could be damaged cells that had lost a lot of their other RNAs. Although there are few standards to guide us in single cell analysis, you will see the same 5% threshold for mitochondrial content used in many studies. It often works well, but it can filter out energetic cell types such as muscle cells, so we shouldn't apply this threshold without considering whether it works for our dataset. Some studies don't filter on mitochondrial reads at all!
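For reference, the plain-Seurat equivalent of applying this threshold would look roughly like the sketch below (object name assumed; in Galaxy this is done through the tool parameters instead):

```r
# Keep cells with less than 5% mitochondrial reads
# (assumes percent.mt has already been computed, e.g. with PercentageFeatureSet)
pbmc <- subset(pbmc, subset = percent.mt < 5)
```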

Comment on lines +151 to +152
><hands-on-title>Create a Seurat Object</hands-on-title>
> 1. {% tool [Seurat Create](toolshed.g2.bx.psu.edu/repos/iuc/seurat_create/seurat_create/5.0+galaxy0) %} with the following parameters:

Suggested change
><hands-on-title>Create a Seurat Object</hands-on-title>
> 1. {% tool [Seurat Create](toolshed.g2.bx.psu.edu/repos/iuc/seurat_create/seurat_create/5.0+galaxy0) %} with the following parameters:
> <hands-on-title>Create a Seurat Object</hands-on-title>
>
> 1. {% tool [Seurat Create](toolshed.g2.bx.psu.edu/repos/iuc/seurat_create/seurat_create/5.0+galaxy0) %} with the following parameters:

> - *"Method for normalization"*: `LogNormalize`
>
> 2. {% tool [Seurat Preprocessing](toolshed.g2.bx.psu.edu/repos/iuc/seurat_preprocessing/seurat_preprocessing/5.0+galaxy0) %} with the following parameters:
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of **Seurat Preprocessing** {% icon tool %})

Suggested change
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of **Seurat Preprocessing** {% icon tool %})
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of previous **Seurat Preprocessing** {% icon tool %})

> - *"Output list of most variable features"*: `Yes`
>
> 3. {% tool [Seurat Preprocessing](toolshed.g2.bx.psu.edu/repos/iuc/seurat_preprocessing/seurat_preprocessing/5.0+galaxy0) %} with the following parameters:
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of **Seurat Preprocessing** {% icon tool %})

Suggested change
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of **Seurat Preprocessing** {% icon tool %})
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of previous **Seurat Preprocessing** {% icon tool %})

> - *"Genes to calculate residual features for"*: `all genes`
> - *"How to set variable features"*: `set number of variable features`
> - *"Output list of most variable features"*: `Yes`
> - *"Variable(s) to regress out"*: `percent.mt`

Isn't this different from the normal scaling step? In the normal scaling step no variable was regressed out, but in SCTransform percent.mt is regressed out.
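For reference, in plain Seurat code the difference flagged here would look roughly like the sketch below; whether `percent.mt` should be regressed out in both branches is exactly the point to clarify:

```r
# Standard workflow: normalisation, variable features and scaling as separate
# steps, with no variable regressed out during scaling
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize")
pbmc <- FindVariableFeatures(pbmc, nfeatures = 2000)
pbmc <- ScaleData(pbmc)

# SCTransform branch as written in the tutorial: percent.mt is regressed out
pbmc_sct <- SCTransform(pbmc, vars.to.regress = "percent.mt")
```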

pavanvidem (Member)

@MarisaJL the tutorial is well made! Some complex topics were explained in easy language. One key improvement I would suggest for this tutorial is to establish a clearer separation between log normalization and SCTransform preprocessing. There is a "choose your own tutorial" option, but it is applied only in the initial step. In later hands-on steps, both preprocessing methods are run together with side-by-side comparisons, which could be confusing for beginners trying to learn the analysis from scratch. Tracking two different analyses and managing two separate SeuratObjects may feel overwhelming at this stage.

It would be nice if the "choose your own tutorial" option carried through the entire tutorial. This way, users could follow a single workflow from start to finish before diving into comparisons. I really like your explanations and comparisons. Moving the comparisons to a dedicated section toward the end might also be beneficial. This final section could include guidance on which preprocessing method to use in specific scenarios.

Does that make sense? Sorry, I know it sounds like a lot of restructuring.

MarisaJL (Collaborator, Author) commented Nov 7, 2024

Thanks @pavanvidem - that does make sense! Since this is a tutorial designed for beginners, it's probably best to keep the two options separate. I think it is possible to have the CYOA option follow on between steps, so I'll try to restructure it.

pavanvidem (Member) left a comment

Tested the tutorial and it works perfectly! The content looks great!!

> 1. {% tool [Seurat Visualize](toolshed.g2.bx.psu.edu/repos/iuc/seurat_plot/seurat_plot/5.0+galaxy0) %} with the following parameters:
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of **Seurat Run Dimensional Reduction** {% icon tool %})
> - *"Method used"*: `Determine dimensionality with 'ElbowPlot'`
> - *"Number of dimensions to plot standard deviation for"*: If you ran the separate preprocessing steps then enter `30` here, if you used SCTransfrom then enter `50`

Either change 50 to 30, or upload a new elbow plot for SCTransform with 50 PCs.
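For reference, the Galaxy parameter presumably maps onto the `ndims` argument of Seurat's `ElbowPlot`; a minimal sketch (object names assumed):

```r
# Plot the standard deviation of the first 30 (or 50) principal components
ElbowPlot(pbmc, ndims = 30)
ElbowPlot(pbmc_sct, ndims = 50)
```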

> - *"Output list of top genes"*: `Yes`
>
{: .hands_on}


Maybe rename this output because it is used in many following steps.

Creating a Seurat Object in R would require two steps: first, we would read in our data, in this case using the `Read10X` function; second, we would turn it into a Seurat Object using the `CreateSeuratObject` function. On Galaxy, we can perform both steps with a single tool. The `CreateSeuratObject` function also generates some QC metrics and performs basic filtering of the data.

><hands-on-title>Create a Seurat Object</hands-on-title>
> 1. {% tool [Seurat Create](toolshed.g2.bx.psu.edu/repos/iuc/seurat_create/seurat_create/5.0+galaxy0) %} with the following parameters:

The barcodes file is in txt format. Currently, the tool does not support txt files as input.
I am trying to add support here: galaxyproject/tools-iuc#6539. So please use the galaxy1 version for this step.
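For context, the two R steps described above (`Read10X` followed by `CreateSeuratObject`), which the Galaxy tool wraps into one, look roughly like this sketch; the data directory name is hypothetical and the thresholds mirror the hands-on parameters:

```r
library(Seurat)

# Step 1: read the matrix.mtx / genes.tsv / barcodes.tsv trio into a sparse matrix
counts <- Read10X(data.dir = "pbmc3k_data/")

# Step 2: build the Seurat object, filtering as in the hands-on step
pbmc <- CreateSeuratObject(counts = counts,
                           min.cells = 3,      # features detected in >= 3 cells
                           min.features = 200) # cells with >= 200 features
```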

> - *"UMAP implementation to run"*: `uwot`
> - *"Run UMAP on dimensions, features, graph or KNN output"*: `dims`
> - *"Number of dimensions from reduction to use as input"*: If you ran separate preprocessing steps, leave this as `10`, if you used SCTransform then change it to `30`
>

Please rename the output. This output is used later for FindAllMarkers and then FindMarkers. Users might select the wrong input if it is not renamed.
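For reference, these parameters correspond roughly to Seurat's `RunUMAP` call (a sketch; object name assumed):

```r
# uwot implementation on the first 10 PCs (30 for the SCTransform branch)
pbmc <- RunUMAP(pbmc, dims = 1:10, umap.method = "uwot")
```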

Although there is a lot of information here, all we need to know for now is that the markers listed for each cluster are the genes that were expressed more by these cells than by any of the other clusters. We can search online for these genes to get an idea of what types of cells are in our clusters.

> <question-title></question-title>
> 1. Are the top genes associated with PCs 1-3 in our list of markers? Which clusters are they markers for?

Maybe add a question on how to select the markers for a particular cluster, then answer it with the Filter tool on column 7.
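For reference, in R the same selection is just a subset on the `cluster` column of the `FindAllMarkers` table; the data frame name below is assumed:

```r
# markers is the data frame returned by FindAllMarkers()
cluster2_markers <- markers[markers$cluster == 2, ]
head(cluster2_markers)
```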

>
> 1. Click on the {% icon galaxy-pencil %} pencil icon of the file we renamed as `DE Markers` (this was the CSV output from `FindAllMarkers`), then select {% icon galaxy-chart-select-data %} Datatypes in the central panel. Choose the second option, `Convert to Datatype`, and make sure `tabular (using `Convert CSV to tabular`)` is selected in the drop-down menu before pressing the `Create Dataset` button. This will create a new, tabular version of the dataset at the top of your history - make sure that this is the version you use in the next step.
>
> 2. {% tool [Table Compute](toolshed.g2.bx.psu.edu/repos/iuc/table_compute/table_compute/1.2.4+galaxy0) %} with the following parameters:

When I tested it, the empty header line did not affect the heatmap. I think a simple Cut tool step should be enough here.
