Bioinformatics - Building Genome Index and Aligning with STAR

Alex Soupir

14,485 views

Published on Jul 14, 2020
I apologize for the delay in getting the next video out - have had some things keeping me busy for the last few weeks. In this video (I struggle a bit!) I talk through the use of STAR RNA-aligner for creating a genome index using the genome fasta and the GTF annotation of the Mus musculus folder that we previously downloaded, as well as running the alignment with the trimmed reads and the new index.
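The index-building step described above can be sketched roughly as follows. This is a dry run that only prints the STAR command rather than the exact command from the video: the output folder `genomeIndex`, the thread count, the read length, and the FASTA path are assumptions (the GTF path is the one that appears in a commenter's log further down), and `--sjdbOverhang` follows the STAR manual's read-length-minus-one guideline.

```bash
# Dry run: prints the indexing command instead of executing it.
# All values below are assumptions for illustration, not taken from the video.
genome_dir="genomeIndex"   # hypothetical output folder for the index
fasta="Mus_musculus/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa"   # assumed location
gtf="Mus_musculus/NCBI/build37.2/Annotation/Genes/genes.gtf"
threads=8                  # assumed
read_length=100            # assumed; --sjdbOverhang is read length minus 1

index_command() {
    echo "STAR --runMode genomeGenerate --runThreadN ${threads}" \
         "--genomeDir ${genome_dir} --genomeFastaFiles ${fasta}" \
         "--sjdbGTFfile ${gtf} --sjdbOverhang $((read_length - 1))"
}

index_command
```

Copying the printed line into the terminal (after adjusting the paths) would start the actual index build.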
I ran into some issues with the Windows Subsystem for Linux (WSL), where I had to increase the ulimit. I'm not sure whether this is also an issue on a full Linux machine or VM, or whether it is just a WSL problem. Other than that, I use cat and parallel to run through all the SRR IDs automatically: unzipping, aligning, then zipping again. If running WSL, run `ulimit -n 100000`, and make sure you have at least 32 GB of RAM to accommodate the genome index.
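A minimal sketch of the unzip, align, zip loop follows. It is a dry run that prints the commands for each ID, and it uses a plain `while read` loop in place of GNU parallel so that it runs without extra tools. The folder names (`trimmedReads`, `genomeIndex`, `aligned`), the `_1`/`_2` file suffixes, the thread count, and the two sample IDs are assumptions, not taken from the video.

```bash
# On WSL, the description suggests raising the open-file limit first:
#   ulimit -n 100000

# Dry run: print the three steps for one SRR ID (hypothetical file layout).
align_commands() {
    id="$1"
    echo "gunzip trimmedReads/${id}_1.fastq.gz trimmedReads/${id}_2.fastq.gz"
    echo "STAR --runThreadN 8 --genomeDir genomeIndex --readFilesIn trimmedReads/${id}_1.fastq trimmedReads/${id}_2.fastq --outFileNamePrefix aligned/${id}_ --outSAMtype BAM SortedByCoordinate"
    echo "gzip trimmedReads/${id}_1.fastq trimmedReads/${id}_2.fastq"
}

# Two made-up IDs stand in for the contents of SRR_Acc_List.txt.
printf 'SRR0000001\nSRR0000002\n' | while read -r id; do
    align_commands "$id"
done
```

With a real ID list, the loop input would come from `cat SRR_Acc_List.txt` instead of `printf`.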
STAR: ultrafast universal RNA-seq aligner:
www.ncbi.nlm.nih.gov/pmc/arti...
STAR Manual:
raw.githubusercontent.com/ale...
Mus musculus NCBI build 37.2 download (Large file size):
ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/NCBI/build37.2/Mus_musculus_NCBI_build37.2.tar.gz
Not a fan of doing plugs for "Like, Share, and Subscribe" but if you could, I would greatly appreciate it. Thanks!
Project Github: github.com/ACSoupir/Bioinform...
I'm not an expert in bioinformatics, and have only really done basic bacterial genome assembly, some RNA-seq, and a microbiome analysis. There may be better ways of doing things, but this is how I do it. These videos are learning experiences for everyone.
Image at the beginning on the bottom left is modified from AllGenetics.EU.
Please consider contributing to my Patreon where I may do merch and gather ideas for future content:
/ alexsoupir
Howto & Style

Comments • 26

@CookieNikie 3 years ago
This helped me so much, thank you!
@geocarvalhont 1 year ago
Miss u here Alex! Thank you!
@royalfoxgaming 3 years ago
This helped a lot! Thanks!!
@coldlip 3 years ago
very nice channel !
@erpampa94 3 years ago
thank you!!!
@avp300 2 years ago
Hi Alex, thank you for the video. I have paired-end fastq files for 29 samples (58 files in total), and I am supplying them through a manifest.tsv file as per the STAR manual. But I am only getting one combined BAM output file of 66 GB. I could use a for loop and supply the file prefix in --outFileNamePrefix, but I think it would then read the files one by one rather than as paired-end reads. Can you please help? Thanks.
@tinacole1450 3 years ago ⁺¹
Hey Alex. What is your operating system? I am looking to run bowtie2 on a Windows-based system.
@lawrencemckinney7464 3 years ago ⁺¹
How do you feed in an SRR_Acc_List.txt when the PE reads in my trimmedReads folder have a naming convention of SRRXXXXXXX.qc.1.fq/SRRXXXXXXX.qc.2.fq ?
@alexsoupir 3 years ago ⁺¹
It depends on what you're trying to do, but the SRR_Acc_List.txt file should contain what is generally referred to as the "base" name of each file - that is, the name without the folder or file extension. Since the list doesn't include any extension, you can use the curly brackets with whatever extension you want - {}.qc.1.fq and {}.qc.2.fq, for example.
When working with the paired-end files here, read 1 and read 2 go in different places in the command, and those places are always the same, so hard-coding the extension in the parallel call is fine. This allows us to work with both files of a pair at one time!
@lawrencemckinney7464 3 years ago
@alexsoupir This was helpful. I was able to create a script that piped my sample ID numbers into the {} you described above. Thanks Alex!
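The base-name idea from this thread can be sketched as below. The temporary demo folder and the two made-up IDs are assumptions for illustration; only the `.qc.1.fq` / `.qc.2.fq` naming comes from the question. The loop rebuilds both mate paths from each base name, which is what `{}` does inside a parallel command such as `parallel "... {}.qc.1.fq {}.qc.2.fq"`.

```bash
# Hypothetical demo folder standing in for trimmedReads/
demo_dir=$(mktemp -d)
touch "$demo_dir/SRR0000001.qc.1.fq" "$demo_dir/SRR0000001.qc.2.fq" \
      "$demo_dir/SRR0000002.qc.1.fq" "$demo_dir/SRR0000002.qc.2.fq"

# Print one base name per sample: strip the folder and the ".qc.1.fq" suffix.
list_base_names() {
    for f in "$1"/*.qc.1.fq; do
        basename "$f" .qc.1.fq
    done
}

list_base_names "$demo_dir" > "$demo_dir/SRR_Acc_List.txt"

# Rebuild both mate paths from each base name, as {} would inside parallel.
while read -r base; do
    echo "$demo_dir/${base}.qc.1.fq $demo_dir/${base}.qc.2.fq"
done < "$demo_dir/SRR_Acc_List.txt"
```

Because the extensions are hard-coded around the base name, read 1 and read 2 of the same sample always travel together.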
@abhisekdey6428 3 years ago
How much RAM is required for indexing the human genome?
@alexsoupir 3 years ago ⁺¹
Unfortunately, a fair amount by personal computer standards. I think I could do it on my old computer that had 16 GB, but then you have to specify something in the STAR indexing command (I don't remember what it is). I think the best bet would be having at least 32 GB, or 64 GB to be safe.
@dhwanidholakia3175 1 year ago
Hi Alex, can you help me with the name of the tool for CPU utilization? Does it work in Linux in real time?
@alexsoupir 1 year ago
Good question. On Linux, CPU utilization comes from the command 'top', while on Windows I use the Task Manager. Both of these are "real-time" (a slight delay of a second perhaps, but relatively quick to update).
Alternatively, some Linux versions have a program or app called "System" which will show the CPU and RAM usage (looks more cool than the Windows version).
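As a small sketch of the `top` suggestion: batch mode (`-b`) with a single iteration (`-n 1`) prints one snapshot instead of the interactive screen, which is handy for logging while STAR runs. The `cpu_count` helper name is made up for this example, and the snippet assumes a Linux-style `top`.

```bash
# Number of logical CPUs currently online (helper name is hypothetical).
cpu_count() {
    getconf _NPROCESSORS_ONLN
}

echo "Logical CPUs: $(cpu_count)"

# One non-interactive snapshot of overall CPU and memory usage, if top is installed.
if command -v top >/dev/null 2>&1; then
    top -b -n 1 | head -n 5 || true
fi
```

Running plain `top` without the flags gives the continuously updating view described in the reply.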
@sultankdp 1 year ago
Please teach me how to open . Ssj file
@zafaran4089 2 years ago
Hi, so why do we need to index a genome? Why is it important? Please could you help me, I'm new to bioinformatics.
@alexsoupir 2 years ago ⁺⁵
Hi, Zafaran.
Think of it as a way of finding where your read will be in the genome - instead of checking every single location in the genome, you index the genome so that you have shorter regions whose locations you already know. It's much like Google: when you search for something, they aren't checking every website for every search. Indexing decreases the number of searches you have to do and the computational time needed to "align" the whole transcriptome (or find where each read aligns).
You could *sort of* think of it like looking in the dictionary, where you go to where the first letter of the word matches, then the second, and so on until you reach the right one. It's not a perfect analogy, but it cuts down on your search time compared with checking every word in the dictionary until you find what you want.
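To make the dictionary analogy concrete, here is a toy sketch. It is not how STAR works internally (STAR builds a suffix array); it swaps in a much simpler lookup table of short sequences (k-mers), built once over a made-up 16-base "genome". Looking up the first few bases of a read then returns candidate positions directly instead of scanning the whole sequence. It needs bash 4 or newer for associative arrays.

```bash
#!/usr/bin/env bash
# Toy example only: made-up sequence, tiny k, and a hash table instead of a suffix array.
genome="ACGTTGCAACGTAGCT"
k=4

# Build the index once: k-mer -> space-separated 0-based start positions.
declare -A index
for ((i = 0; i <= ${#genome} - k; i++)); do
    kmer="${genome:i:k}"
    index[$kmer]="${index[$kmer]:+${index[$kmer]} }$i"
done

# Look up the first k bases of a read; prints candidate positions or "none".
lookup() {
    local seed="${1:0:k}"
    echo "${index[$seed]:-none}"
}

lookup "ACGTAG"   # prints: 0 8
lookup "GGGGAA"   # prints: none
```

The read `ACGTAG` gets two candidate positions from a single lookup; only those two spots need to be checked in full, and the real match is at position 8.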
@kyrgyzsanjar 6 months ago
64 cores CPU, that's impressive! May I ask what was the cost of building such a machine?
@alexsoupir 6 months ago
Oofta, it was way too much. The CPU was $3500 a few years ago. Not sure what the cost would be now, but back when living was cheaper (everyone remembers when things were half the cost just...2 years ago...) I think the whole computer was $4000, since I only needed a motherboard, CPU, cooler, and RAM.
Since then, if I weren't using cloud computing, I would definitely go used (the GPUs and power supplies I've bought since were used). So much cheaper.
@abdelrahmanmahany133 6 months ago
The indexing process never completes successfully. It gave me warnings like this:
WARNING: while processing pGe.sjdbGTFfile=Mus_musculus/NCBI/build37.2/Annotation/Genes/genes.gtf: no gene_id for line:
X RefSeq exon 132930061 132930132 . - . transcript_id "rna31311"; tss_id "TSS1704";
Then I uninstalled STAR and installed a different version. It worked for a while, but when it finished, it finished without success, and this was in the log file:
WARNING: while processing pGe.sjdbGTFfile=Mus_musculus/NCBI/build37.2/Annotation/Genes/genes.gtf: no gene_id for line:
MT RefSeq exon 15356 15422 . - . geneID "TrnP"; transcript_id "rna32375"; tss_id "TSS8091";
Any thoughts?
@alexsoupir 6 months ago ⁺¹
Looks like these are just warnings about genes that don't have a 'gene_id', which is strange. You would have to interrogate the GTF file to see whether there are rows that do have 'gene_id'; if so, these warnings could very well be ignored as long as the genome is still being indexed. Looks like you're using the same reference I had, so it's strange that I didn't get the warning.
If the genome does index (it looks like it should, since these are warnings, not errors), then you can carry on as long as there aren't a lot of warnings and your expected genes of interest aren't among them. Alternatively, you might have to provide more flags to tell STAR what to use. See the bottom of page 6 and top of page 7 here:
physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf
@abdelrahmanmahany133 6 months ago
@alexsoupir Thanks for the support. You are right, those were warnings, not errors. I saw them in the log file because the command didn't finish successfully, and I finally found the reason: my RAM is just 24 GB, so I had to limit the command with --genomeSAsparseD 3 and --limitGenomeGenerateRAM 20000000000. With those, the indexing process finally succeeded.
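The two flags from this comment can be sketched as a small helper. `--limitGenomeGenerateRAM` takes a value in bytes, and `--genomeSAsparseD` makes the suffix array sparser, which lowers RAM use at some cost in mapping speed. The helper names and the idea of passing a GB budget are assumptions for illustration; the values 3 and 20 GB are the ones the commenter reported for a 24 GB machine.

```bash
# Convert a RAM budget in GB to bytes (helper names are hypothetical).
gb_to_bytes() {
    echo $(( $1 * 1000000000 ))
}

# Print the extra flags to append to a "STAR --runMode genomeGenerate ..." command.
low_ram_flags() {
    echo "--genomeSAsparseD 3 --limitGenomeGenerateRAM $(gb_to_bytes "$1")"
}

low_ram_flags 20   # prints: --genomeSAsparseD 3 --limitGenomeGenerateRAM 20000000000
```

Leaving a few GB below the machine's total, as the commenter did, keeps some headroom for the operating system.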
