
Adding Seurat version of the PBMC clustering tutorial #5491

Open · wants to merge 49 commits into base: main

Conversation

MarisaJL (Collaborator)

This is the Seurat version of the Clustering 3K PBMCs with Scanpy tutorial for single-cell analysis.

@MarisaJL MarisaJL requested a review from a team as a code owner October 31, 2024 13:30
pavanvidem (Member)

@MarisaJL I will review it soon.

pavanvidem (Member) left a comment


Reviewed and tested up to the preprocessing steps.


Representing the matrix with these three files is convenient for sharing the data, but not for processing them. Different single cell analysis packages have attempted to solve the problem of storage and analysis by inventing their own formats, which has led to the proliferation of many different 'standards' in the scRNA-seq package ecosystem.

## Seurat Object

Suggested change
## Seurat Object
## SeuratObject


Please call it either by its data structure name, "SeuratObject", or in plain language as "Seurat object" (with a lower-case o).

> - *"Include features detected in at least this many cells"*: `3`
> - *"Include cells where at least this many features are detected"*: `200`
> - *"Calculate percentage of mito genes in each cell"*: `No`
>

Please include the selection of genes.tsv and barcodes.tsv in the parameters.

> 4. Check that the format is `RDS`
{: .hands_on}

We can't look at the RDS file directly as it is designed for computers to read, rather than humans, but the Seurat tools will now be able to interact with the data.

Please add an inspect step here. A question on the number of features and cells filtered out with our thresholds should give an idea of what happened in the last step.
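For context, outside Galaxy this kind of inspection is only a couple of lines of R. The sketch below is illustrative; the file name is assumed, and `<N>`/`<M>` stand in for the actual numbers reported:

```r
library(Seurat)

# Load the RDS file written by the Seurat Create tool (hypothetical file name)
pbmc <- readRDS("seurat_object.rds")

# Printing a Seurat object reports how many features and cells remain after
# the min.cells = 3 / min.features = 200 filters, e.g.
# "An object of class Seurat
#  <N> features across <M> samples within 1 assay"
pbmc
```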

>
> #### Distortion of population heterogeneity during variance estimation or principal components analysis
>
> When we cluster single cell data, we're looking for the biggest differences between groups of cells. If we have lots of low-quality cells, then the biggest differences we'll see might be these differences in quality, rather than something more biologically interesting. Differences in quality can have a big impact because low-quality cells often have low total RNA counts. When we perform scaling normalisation on these cells during preprocessing, this can make the variances for the genes they do express much bigger than for other cells. When we select the most variable genes in the dataset, we'll end up picking the ones expressed by low-quality cells. We'll use these in our PCA analysis, likely ending up with top principal components based on cell quality rather than biology, making it harder to detect the differences we're actually interested in.

Suggested change
> When we cluster single cell data, we're looking for the biggest differences between groups of cells. If we have lots of low-quality cells, then the biggest differences we'll see might be these differences in quality, rather than something more biologically interesting. Differences in quality can have a big impact because low-quality cells often have low total RNA counts. When we perform scaling normalisation on these cells during preprocessing, this can make the variances for the genes they do express much bigger than for other cells. When we select the most variable genes in the dataset, we'll end up picking the ones expressed by low-quality cells. We'll use these in our PCA analysis, likely ending up with top principal components based on cell quality rather than biology, making it harder to detect the differences we're actually interested in.
> When we cluster single cell data, we're looking for the biggest differences between groups of cells. If we have lots of low-quality cells, then the biggest differences we'll see might be these differences in quality, rather than something more biologically interesting. Differences in quality can have a big impact because low-quality cells often have low total RNA counts. When we perform scaling and normalisation on these cells during preprocessing, this can make the variances for the genes they do express much bigger than for other cells. When we select the most variable genes in the dataset, we'll end up picking the ones expressed by low-quality cells. We'll use these in our PCA analysis, likely ending up with top principal components based on cell quality rather than biology, making it harder to detect the differences we're actually interested in.

>
> #### Misidentification of upregulated genes
>
> Another problem that arises when we apply our preprocessing steps to small, low quality cells is that the genes we detect in them can appear to be strongly upregulated. Since we didn't detect many other genes in these cells, even a small difference in the number of transcripts detected can end up becoming much larger after normalisation. For example, contaminating transcripts may be present in all cells at low but constant levels. With the increased scaling normalization in low-quality cells, the small counts for these transcripts may become large normalized expression values, so we might think we've found a population of cells where these genes are upregulated.

Suggested change
> Another problem that arises when we apply our preprocessing steps to small, low quality cells is that the genes we detect in them can appear to be strongly upregulated. Since we didn't detect many other genes in these cells, even a small difference in the number of transcripts detected can end up becoming much larger after normalisation. For example, contaminating transcripts may be present in all cells at low but constant levels. With the increased scaling normalization in low-quality cells, the small counts for these transcripts may become large normalized expression values, so we might think we've found a population of cells where these genes are upregulated.
> Another problem that arises when we apply our preprocessing steps to small, low quality cells is that the genes we detect in them can appear to be strongly upregulated. Since we didn't detect many other genes in these cells, even a small difference in the number of transcripts detected can end up becoming much larger after normalisation. For example, contaminating transcripts may be present in all cells at low but constant levels. With the increased scaling and normalization in low-quality cells, the small counts for these transcripts may become large normalized expression values, so we might think we've found a population of cells where these genes are upregulated.
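For illustration, the inflation described above is easy to reproduce with a few lines of R. This is a sketch using made-up counts and LogNormalize-style scaling to 10,000 counts per cell:

```r
# The same raw count of a contaminating transcript in two cells
counts <- c(good_cell = 2, low_quality_cell = 2)
totals <- c(good_cell = 5000, low_quality_cell = 500)  # total RNA per cell

# LogNormalize-style scaling: counts per 10,000, then log1p
normalised <- log1p(counts / totals * 10000)
normalised
# good_cell ~ 1.6, low_quality_cell ~ 3.7: the same raw count looks much more
# highly "expressed" in the low-quality cell after normalisation
```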

Comment on lines +290 to +291
> > 2. A threshold of 5% should work for this dataset. As before, the violin plot shows that the majority of our cells are below this threshold, but in this case the cells above it had low total RNA counts, which suggests these could be damaged cells that had lost a lot of their other RNAs.
> > Although there are few standards to guide us in single cell analysis, you will see the same 5% threshold for mitochondrial content used in many studies. It often works well, but it can filter out energetic cell types such as muscle cells, so we shouldn't apply this threshold without considering whether it works for our dataset. Some studies don't filter on mitochondrial reads at all!

Suggested change
> > 2. A threshold of 5% should work for this dataset. As before, the violin plot shows that the majority of our cells are below this threshold, but in this case the cells above it had low total RNA counts, which suggests these could be damaged cells that had lost a lot of their other RNAs.
> > Although there are few standards to guide us in single cell analysis, you will see the same 5% threshold for mitochondrial content used in many studies. It often works well, but it can filter out energetic cell types such as muscle cells, so we shouldn't apply this threshold without considering whether it works for our dataset. Some studies don't filter on mitochondrial reads at all!
> > 2. A threshold of 5% should work for this dataset. As before, the violin plot shows that the majority of our cells are below this threshold, but in this case the cells above it had low total RNA counts, which suggests these could be damaged cells that had lost a lot of their other RNAs. Although there are few standards to guide us in single cell analysis, you will see the same 5% threshold for mitochondrial content used in many studies. It often works well, but it can filter out energetic cell types such as muscle cells, so we shouldn't apply this threshold without considering whether it works for our dataset. Some studies don't filter on mitochondrial reads at all!
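For reference, the plain-Seurat equivalent of applying this threshold would look roughly like the sketch below (object name assumed; in Galaxy this is done through the tool parameters instead):

```r
# Keep cells with less than 5% mitochondrial reads
# (assumes percent.mt has already been computed, e.g. with PercentageFeatureSet)
pbmc <- subset(pbmc, subset = percent.mt < 5)
```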

Comment on lines +151 to +152
><hands-on-title>Create a Seurat Object</hands-on-title>
> 1. {% tool [Seurat Create](toolshed.g2.bx.psu.edu/repos/iuc/seurat_create/seurat_create/5.0+galaxy0) %} with the following parameters:

Suggested change
><hands-on-title>Create a Seurat Object</hands-on-title>
> 1. {% tool [Seurat Create](toolshed.g2.bx.psu.edu/repos/iuc/seurat_create/seurat_create/5.0+galaxy0) %} with the following parameters:
> <hands-on-title>Create a Seurat Object</hands-on-title>
>
> 1. {% tool [Seurat Create](toolshed.g2.bx.psu.edu/repos/iuc/seurat_create/seurat_create/5.0+galaxy0) %} with the following parameters:

> - *"Method for normalization"*: `LogNormalize`
>
> 2. {% tool [Seurat Preprocessing](toolshed.g2.bx.psu.edu/repos/iuc/seurat_preprocessing/seurat_preprocessing/5.0+galaxy0) %} with the following parameters:
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of **Seurat Preprocessing** {% icon tool %})

Suggested change
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of **Seurat Preprocessing** {% icon tool %})
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of previous **Seurat Preprocessing** {% icon tool %})

> - *"Output list of most variable features"*: `Yes`
>
> 3. {% tool [Seurat Preprocessing](toolshed.g2.bx.psu.edu/repos/iuc/seurat_preprocessing/seurat_preprocessing/5.0+galaxy0) %} with the following parameters:
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of **Seurat Preprocessing** {% icon tool %})

Suggested change
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of **Seurat Preprocessing** {% icon tool %})
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of previous **Seurat Preprocessing** {% icon tool %})

> - *"Genes to calculate residual features for"*: `all genes`
> - *"How to set variable features"*: `set number of variable features`
> - *"Output list of most variable features"*: `Yes`
> - *"Variable(s) to regress out"*: `percent.mt`

Isn't this different from the normal scaling step? In the normal scaling step no variable was regressed out, but in SCTransform percent.mt is regressed out.
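For reference, in plain Seurat code the difference flagged here would look roughly like the sketch below; whether `percent.mt` should be regressed out in both branches is exactly the point to clarify:

```r
# Standard workflow: normalisation, variable features and scaling as separate
# steps, with no variable regressed out during scaling
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize")
pbmc <- FindVariableFeatures(pbmc, nfeatures = 2000)
pbmc <- ScaleData(pbmc)

# SCTransform branch as written in the tutorial: percent.mt is regressed out
pbmc_sct <- SCTransform(pbmc, vars.to.regress = "percent.mt")
```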

pavanvidem (Member)

@MarisaJL the tutorial is well made! Some complex topics were explained in easy language. One key improvement I would suggest for this tutorial is to establish a clearer separation between log normalization and SCTransform preprocessing. There is a "choose your own tutorial" option, but it is applied only in the initial step. In later hands-on steps, both preprocessing methods are run together with side-by-side comparisons, which could be confusing for beginners trying to learn the analysis from scratch. Tracking two different analyses and managing two separate SeuratObjects may feel overwhelming at this stage.

It would be nice if the "choose your own tutorial" option carried through the entire tutorial. This way, users could follow a single workflow from start to finish before diving into comparisons. I really like your explanations and comparisons. Moving the comparisons to a dedicated section toward the end might also be beneficial. This final section could include guidance on which preprocessing method to use in specific scenarios.

Does that make sense? Sorry, I know it sounds like a lot of restructuring.

MarisaJL (Collaborator, Author) commented Nov 7, 2024

Thanks @pavanvidem - that does make sense! Since this is a tutorial designed for beginners, it's probably best to keep the two options separate. I think it is possible to have the CYOA option follow on between steps, so I'll try to restructure it.

pavanvidem (Member) left a comment

Tested the tutorial and it works perfectly! The content looks great!!

> 1. {% tool [Seurat Visualize](toolshed.g2.bx.psu.edu/repos/iuc/seurat_plot/seurat_plot/5.0+galaxy0) %} with the following parameters:
> - {% icon param-file %} *"Input file with the Seurat object"*: `rds_out` (output of **Seurat Run Dimensional Reduction** {% icon tool %})
> - *"Method used"*: `Determine dimensionality with 'ElbowPlot'`
> - *"Number of dimensions to plot standard deviation for"*: If you ran the separate preprocessing steps then enter `30` here, if you used SCTransfrom then enter `50`

Either change 50 to 30, or upload a new elbow plot for SCTransform with 50 PCs.
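For reference, the Galaxy parameter presumably maps onto the `ndims` argument of Seurat's `ElbowPlot`; a minimal sketch (object names assumed):

```r
# Plot the standard deviation of the first 30 (or 50) principal components
ElbowPlot(pbmc, ndims = 30)
ElbowPlot(pbmc_sct, ndims = 50)
```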

> - *"Output list of top genes"*: `Yes`
>
{: .hands_on}


Maybe rename this output because it is used in many following steps.

Creating a Seurat Object in R would require two steps: first, we would read in our data, in this case using the `Read10X` function; second, we would turn it into a Seurat Object using the `CreateSeuratObject` function. On Galaxy, we can perform both steps with a single tool. The `CreateSeuratObject` function also generates some QC metrics and performs basic filtering of the data.

><hands-on-title>Create a Seurat Object</hands-on-title>
> 1. {% tool [Seurat Create](toolshed.g2.bx.psu.edu/repos/iuc/seurat_create/seurat_create/5.0+galaxy0) %} with the following parameters:

The barcodes file is in txt format. Currently, the tool does not support txt files as input.
I am trying to add support here: galaxyproject/tools-iuc#6539. So please use the galaxy1 version for this step.
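For context, the two R steps described above (`Read10X` followed by `CreateSeuratObject`), which the Galaxy tool wraps into one, look roughly like this sketch; the data directory name is hypothetical and the thresholds mirror the hands-on parameters:

```r
library(Seurat)

# Step 1: read the matrix.mtx / genes.tsv / barcodes.tsv trio into a sparse matrix
counts <- Read10X(data.dir = "pbmc3k_data/")

# Step 2: build the Seurat object, filtering as in the hands-on step
pbmc <- CreateSeuratObject(counts = counts,
                           min.cells = 3,      # features detected in >= 3 cells
                           min.features = 200) # cells with >= 200 features
```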

> - *"UMAP implementation to run"*: `uwot`
> - *"Run UMAP on dimensions, features, graph or KNN output"*: `dims`
> - *"Number of dimensions from reduction to use as input"*: If you ran separate preprocessing steps, leave this as `10`, if you used SCTransform then change it to `30`
>

Please rename the output. This output is used later for FindAllMarkers and then FindMarkers. Users might select the wrong input if it is not renamed.
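For reference, these parameters correspond roughly to Seurat's `RunUMAP` call (a sketch; object name assumed):

```r
# uwot implementation on the first 10 PCs (30 for the SCTransform branch)
pbmc <- RunUMAP(pbmc, dims = 1:10, umap.method = "uwot")
```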

Although there is a lot of information here, all we need to know for now is that the markers listed for each cluster are the genes that were expressed more by these cells than by any of the other clusters. We can search online for these genes to get an idea of what types of cells are in our clusters.

> <question-title></question-title>
> 1. Are the top genes associated with PCs 1-3 in our list of markers? Which clusters are they markers for?

Maybe add a question on how to select the markers for a particular cluster, then answer it with the Filter tool on column 7.
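For reference, in R the same selection is just a subset on the `cluster` column of the `FindAllMarkers` table; the data frame name below is assumed:

```r
# markers is the data frame returned by FindAllMarkers()
cluster2_markers <- markers[markers$cluster == 2, ]
head(cluster2_markers)
```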

>
> 1. Click on the {% icon galaxy-pencil %} pencil icon of the file we renamed as `DE Markers` (this was the CSV output from `FindAllMarkers`), then select {% icon galaxy-chart-select-data %} Datatypes in the central panel. Choose the second option, `Convert to Datatype`, and make sure `tabular (using `Convert CSV to tabular`)` is selected in the drop-down menu before pressing the `Create Dataset` button. This will create a new, tabular version of the dataset at the top of your history - make sure that this is the version you use in the next step.
>
> 2. {% tool [Table Compute](toolshed.g2.bx.psu.edu/repos/iuc/table_compute/table_compute/1.2.4+galaxy0) %} with the following parameters:

When I tested it, the empty header line did not affect the heatmap. I think a simple Cut tool step should be enough here.
