## Demultiplexing run command

`SampleSheet.csv` is the sample sheet for the run, including the updated indices for each sample. `--use-bases-mask` is adjusted to match the parameters of each sequencing run.

`bcl2fastq --no-lane-splitting -p 16 --output-dir ./Seqs --sample-sheet SampleSheet.csv --use-bases-mask y*n,I8,I8,y*n`

## Fastp run command

For each sample, one of the following commands is run, depending on whether the reads are paired-end or single-end. Fastq files are cleaned and filtered using the fastp parameters below.

Paired-end reads:

```
fastp \
  --in1 ${SAMPLE}_R1.fastq.gz \
  --in2 ${SAMPLE}_R2.fastq.gz \
  --out1 synced_clt_${SAMPLE}_R1.fastq.gz \
  --out2 synced_clt_${SAMPLE}_R2.fastq.gz \
  --length_required 35 \
  --cut_front_window_size 1 \
  --cut_front_mean_quality 13 \
  --cut_front \
  --cut_tail_window_size 1 \
  --cut_tail_mean_quality 13 \
  --cut_tail \
  --adapter_fasta ${SCRIPT_DIR}/adapters.fa \
  -w ${CPUS} -y -r \
  -j ${SAMPLE}_fastp.json \
  --correction
```

Single-end reads:

```
fastp \
  --in1 ${SAMPLE}_R1.fastq.gz \
  --out1 clt_${SAMPLE}_R1.fastq.gz \
  --length_required 35 \
  --cut_front_window_size 1 \
  --cut_front_mean_quality 13 \
  --cut_front \
  --cut_tail_window_size 1 \
  --cut_tail_mean_quality 13 \
  --cut_tail \
  --adapter_fasta ${SCRIPT_DIR}/adapters.fa \
  -w ${CPUS} -y -r \
  -j ${SAMPLE}_fastp.json
```

## STAR alignment run command

One of the following commands is run to align either paired-end or single-end reads.
`STARREF` is set as an input variable before the command is called and points to the reference folder.

Paired-end reads:

```
STAR \
  --twopassMode Basic \
  --readFilesCommand zcat \
  --runThreadN ${CPUS} \
  --runMode alignReads \
  --genomeDir ${STARREF} \
  --readFilesIn \
  synced_clt_${SAMPLE}_R1.fastq.gz \
  synced_clt_${SAMPLE}_R2.fastq.gz \
  --outSAMtype BAM Unsorted \
  --outSAMstrandField intronMotif \
  --outFileNamePrefix synced_clt_${SAMPLE}_ \
  --outTmpDir ./tmp/star \
  --outFilterIntronMotifs RemoveNoncanonical \
  --outReadsUnmapped Fastx
```

Single-end reads:

```
/bin/time -v STAR \
  --twopassMode Basic \
  --readFilesCommand zcat \
  --runThreadN ${CPUS} \
  --runMode alignReads \
  --genomeDir ${STARREF} \
  --readFilesIn clt_${SAMPLE}_R1.fastq.gz \
  --outSAMtype BAM Unsorted \
  --outSAMstrandField intronMotif \
  --outFileNamePrefix clt_${SAMPLE}_ \
  --outTmpDir ./tmp/star \
  --outFilterIntronMotifs RemoveNoncanonical \
  --outReadsUnmapped Fastx
```

## Subread featureCounts

To quantify gene counts, one of the following commands is run on either paired-end or single-end reads.

Paired-end reads:

```
featureCounts \
  -p \
  -s ${STRANDED} \
  -T ${CPUS} \
  -t exon \
  -g gene_name \
  -a ${STARREF}genes.gtf \
  -o ${SAMPLE}_featCountsU.txt \
  synced_clt_${SAMPLE}_Aligned.out.bam \
  &> ${SAMPLE}_countU.log
```

Single-end reads:

```
featureCounts \
  -s ${STRANDED} \
  -T ${CPUS} \
  -t exon \
  -g gene_name \
  -a ${STARREF}genes.gtf \
  -o ${SAMPLE}_featCountsU.txt \
  clt_${SAMPLE}_Aligned.out.bam \
  &> ${SAMPLE}_countU.log
```

## sambamba run command

One of the following commands is run to sort the resulting BAM files.

Paired-end reads:

```
sambamba sort \
  -t ${CPUS} \
  -m 2G \
  --tmpdir='./tmp' \
  -o synced_clt_${SAMPLE}.bam \
  synced_clt_${SAMPLE}_Aligned.out.bam
```

Single-end reads:

```
sambamba sort \
  -t ${CPUS} \
  -m 2G \
  --tmpdir='./tmp' \
  -o clt_${SAMPLE}_R1.bam \
  clt_${SAMPLE}_Aligned.out.bam
```

## Salmon run command

Salmon is run using one of the following commands to provide transcript-level quantification.
`STARREF` is set as an input variable before the command is called and points to the reference folder.

Paired-end reads:

```
salmon quant \
  -i ${STARREF}/salmon/ \
  -l A \
  --seqBias \
  --gcBias \
  --posBias \
  -1 synced_clt_${SAMPLE}_R1.fastq.gz \
  -2 synced_clt_${SAMPLE}_R2.fastq.gz \
  -p ${CPUS} \
  -o salmon_${SAMPLE} \
  &> ${SAMPLE}_salmon.log
```

Single-end reads:

```
salmon quant \
  -i ${STARREF}/salmon/ \
  -l A \
  --seqBias \
  --gcBias \
  --posBias \
  -r clt_${SAMPLE}_R1.fastq.gz \
  -p ${CPUS} \
  -o salmon_${SAMPLE} \
  &> ${SAMPLE}_salmon.log
```

# Submitting data to GEO

NOTE: When submitting human data, you must conform to the human subject guidelines (https://www.ncbi.nlm.nih.gov/geo/info/faq.html#patient).

We recommend submitting both the raw data (fastq files) and the DESeq2 counts matrix (`deSeq2_counts.txt`).

# FAQ

## General

### How many replicates should I include in my experiment?

As a rule of thumb, we recommend at least 3 biological replicates for any experiment that will include a differential analysis. We also recommend speaking with the Biostatistics department, who are better placed to discuss any power calculations you might want to consider before designing your experiment.

### What are the limitations of only doing 1 biological replicate?

If you are unable to perform additional replicates (due to cost, mice, or other factors), we will be unable to characterize the biological variation that exists within your conditions and therefore will not be able to compute statistical significance for your study. You will be unable to publish these findings, and you will drastically increase the likelihood of chasing false-positive findings. Generally, we recommend a minimum of 3 biological replicates for an RNA-Seq experiment. We also recommend contacting the URMC Biostatistics Department to discuss statistical power.

### Can you modify the RNA-Seq pipeline to meet our experimental questions?

Yes, we often modify our pipelines to meet the needs of individual experiments.
Depending on the amount of effort required, there may be an additional service charge to cover the time we spend making these project-specific changes.

### Can I have your scripts?

No. Our bioinformatics scientists devote a lot of time and intellectual property to designing efficient pipelines. We are happy to share the tools and parameters that we use to enable reproducible work, but we will not share our custom scripts and pipelines.

### My delivery links don't work anymore, can you help?

For security reasons, delivery links are only live for 7-10 days. We recommend that you download all the data we deliver within this window. If extenuating circumstances kept you from downloading the data in time, links can be remade upon request.

## Bulk RNASeq

### Why do I not see enrichR results?

If you aren’t seeing enrichR results in your StarFeature counts report, it is because there are not enough differentially expressed upregulated or downregulated genes for a specific comparison. To reduce false positives when calculating significantly enriched pathways, enrichR needs at least 50 genes to be differentially expressed.

### What are the salmon results?

Salmon uses a different alignment algorithm, which lets us make better use of reads that would otherwise be discarded due to multimapping in the STAR-featureCounts workflow. Salmon maps those reads at the transcript level rather than to the whole gene, so you will see an inflated number of total features. We also report which gene each transcript belongs to, so that the reports are comparable. You may see differential expression within the same gene, but that depends on which transcripts are up- or downregulated. We also provide a summary of the transcript types (protein-coding, lncRNA, etc.).

### When do I want to use transcript level quantifications?
While most investigators find gene counts sufficient for their experiments, there are specific cases where salmon may be valuable. In theory, transcript-level quantifications can be an accurate representation of expression and biological changes between conditions. For example, gene-level expression may increase or decrease while the change is actually driven by a non-functional transcript.

### Can I remove a sample from the analysis?

Although certain samples may appear to be outliers, this may be due to biological variation. Thus, we don’t recommend removing samples from an analysis based on clustering alone. If there is an experimental reason that may cause technical variation, such as poor sample quality, you may want to consider removing those samples from the analysis.

### Can I open my RNA-Seq delivered text files in Excel?

While we do not recommend working with these files in Excel, you can view your data in Excel by importing from a text file. We have a tutorial to walk you through how to safely work with gene symbols within Excel.

### Can the GRC re-analyze my RNA-Seq experiment?

Yes. Please email us and provide the PI and submission date associated with the project. Project re-analysis is charged an hourly service fee to cover our bioinformaticians' time.

### Why is my RNASeq data showing a weak knockdown of my gene of interest despite being validated with qRT-PCR?

One explanation for discrepancies in knockdown expression between qRT-PCR and RNA-Seq data is the expression and alignment of a non-functional transcript. We recommend loading the aligned files into a genome browser to look at how the reads align to the gene.

# Other information

Questions? Contact `urgenomics@urmc.rochester.edu`

Authors: GRC staff

Updated: 10/27/2022