Bioinformatics - Assembling, Annotating, and QA for Bacterial Genomes!

Alex Soupir

Views: 13,021

Published on Nov 7, 2020
Howdy everyone!
Today I'm working through genome sequencing of a bacterial isolate that we found. The pipeline starts off like any other sequencing work, with QC and then trimming. After that we shift to using a new tool called Unicycler to assemble contiguous sequences de novo (no reference). After creating the assembly, we annotate it and check the assembly quality.
The SRA accession and tools are listed below!
Whole Genome Sequencing SRA Search:
www.ncbi.nlm.nih.gov/sra/?ter...
Unicycler Github (rrwick):
github.com/rrwick/Unicycler#m...
Prokka: Rapid Prokaryotic Genome Annotation:
academic.oup.com/bioinformati...
QUAST: Quality Assessment Tool for Genome Assemblies:
academic.oup.com/bioinformati...
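As a rough sketch of how the steps described above chain together, the Python snippet below builds (but does not run) one command line per stage. The file names, output directory, and sample prefix are placeholders, and Trimmomatic with a 4-base window at Q20 is an assumption: the description does not name the trimmer or the settings used in the video.

```python
from pathlib import Path


def build_pipeline(r1, r2, outdir="wgs_out", sample="isolate"):
    """Return the pipeline stages as (name, argv) pairs without running them.

    All paths and the sample prefix are hypothetical placeholders.
    """
    out = Path(outdir)
    trimmed = out / "trimmed"
    p1, u1 = trimmed / "R1.paired.fastq.gz", trimmed / "R1.unpaired.fastq.gz"
    p2, u2 = trimmed / "R2.paired.fastq.gz", trimmed / "R2.unpaired.fastq.gz"
    # Unicycler writes its final assembly as assembly.fasta in its output directory.
    assembly = out / "unicycler" / "assembly.fasta"
    return [
        # FastQC expects the -o directory to exist already.
        ("qc", ["fastqc", r1, r2, "-o", str(out / "fastqc")]),
        # Assumed trimmer and thresholds (not confirmed by the video description).
        ("trim", ["trimmomatic", "PE", r1, r2,
                  str(p1), str(u1), str(p2), str(u2),
                  "SLIDINGWINDOW:4:20", "MINLEN:36"]),
        # Short-read-only de novo assembly from the paired, trimmed reads.
        ("assemble", ["unicycler", "-1", str(p1), "-2", str(p2),
                      "-o", str(out / "unicycler")]),
        ("annotate", ["prokka", "--outdir", str(out / "prokka"),
                      "--prefix", sample, str(assembly)]),
        ("assess", ["quast.py", str(assembly), "-o", str(out / "quast")]),
    ]


if __name__ == "__main__":
    for name, argv in build_pipeline("reads_1.fastq.gz", "reads_2.fastq.gz"):
        print(f"{name}: {' '.join(argv)}")
```

To actually execute a stage, each argv list could be passed to subprocess.run(argv, check=True) once the tools are installed.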
I hope that these videos are helpful to others who are analyzing data themselves without knowing where to start. If you do find these helpful or interesting please like and subscribe to see more to come!
Please consider contributing to my Patreon where I may do merch, gather ideas for future content, and have further discussions:
/ alexsoupir
Howto & Style

Comments • 34

@chirasmitanayak8337 2 years ago
Thank you very much for sharing such informative tutorials. Your videos really helped me a lot. Thanks once again.
@AliRaza-lo5tg 3 years ago
The whole series is really great. Thank you. I would love to see Roary or any other pan-genomics tool used to find the core, accessory, and unique genes. Flower plots and a tutorial on Circos would be a great choice if you find it suitable.
@alexsoupir 3 years ago
Thanks for the kind words!
The pan genome sounds really interesting to dive into for sure. I've never looked at that before so I would have to do some searching to find out how to do that myself. It would be great to do for sure!
@muhammadakmal1414 9 months ago
Thank you for this amazing and simple tutorial. I wonder which tool you recommend to scaffold the obtained bacterial contigs into one contig.
@carliesaline7498 3 years ago
Thank you for the videos! They're helpful as I learn bioinformatics to use for my undergrad thesis. Has anyone told you your voice sounds like John Krasinski's? Just a few sentences here and there.
@shathaomar1516 1 year ago
Hi, thank you so much. Does that mean that QUAST is used for checking the quality of the assembly and the Prokka annotation? How do you interpret the quality of the Prokka annotation? Do you have a video about QUAST in more detail?
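For context on what an assembly quality check measures: QUAST summarizes assembly-level statistics such as contig count, total length, and N50. The sketch below computes those three numbers from FASTA text in plain Python. It is only an illustration of the metrics, not QUAST's own code, and the function names are made up for this example.

```python
def read_fasta_lengths(lines):
    """Yield the length of each sequence found in FASTA-formatted lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downward, first reaches half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def summarize(lines):
    """Return contig count, total length, and N50 for FASTA-formatted lines."""
    lengths = list(read_fasta_lengths(lines))
    return {"contigs": len(lengths), "total": sum(lengths), "n50": n50(lengths)}


if __name__ == "__main__":
    toy = [">c1", "ACGTAC", ">c2", "GG"]
    print(summarize(toy))
```

With a real assembly, the lines could come from open("assembly.fasta"). Fewer contigs and a larger N50 generally indicate a more contiguous assembly.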
@siddikisiddikisiddik 2 years ago
interesting indeed!
@phuongdoan827 2 years ago
Thank you so much for the excellent videos. The command "conda search unicycler" shows "-bash: conda: command not found" on my HPC server. Would you please tell me why, and how I can fix it? I would really appreciate it.
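The "command not found" message means the shell could not find an executable named conda on its PATH. A small check like the one below reports which of the tools from the video are visible in the current environment. The tool list is taken from the description, and how conda is made available on a given HPC system (a local install, a site-specific module, and so on) is outside what this sketch can tell you.

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    tools = ["conda", "unicycler", "prokka", "quast.py"]
    for tool, path in find_tools(tools).items():
        print(f"{tool}: {path or 'not found on PATH'}")
```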
@robertoluarterodriguez5278 1 year ago
Can Prokka be used as a kind of BLAST? (In my case I have a de novo assembly of a bacterial consortium and I'm searching for ways to get the annotations.)
@thesparrowtalks5019 3 years ago
Keep up the good work
@alexsoupir 3 years ago ⁺¹
Thanks, Palani Kumar! Hope they are interesting haha.
@thesparrowtalks5019 3 years ago ⁺¹
@@alexsoupir Of course they are. I'm from a third-world country. Unluckily, I have neither a proper bioinformatics professor to teach me nor good computing resources. My only hope is people like you who are ready to share their knowledge on the internet. You may not know how impactful your resources are. Someday you will see how many lives have changed.
@alexsoupir 3 years ago ⁺¹
@@thesparrowtalks5019 I'm glad to hear I can be of help! A good professor is sometimes hard to come by, and even then they may not focus on a certain area. Learning together and from each other is something that makes this area of study great - it doesn't always take a lot to pick up topics or concepts, so the more people feeding the conversation, the more that's learned and the better the experience.
If I can ever help with more, feel free to ask!
@thesparrowtalks5019 3 years ago ⁺¹
@@alexsoupir 💯
@user-zl4rp4cj2o 3 months ago
How many read files do we have to download for a single WGS run of a single bacterium?
@allimy3662 3 years ago
Thank you sooo much! I am wondering if you plan to teach us how to make a chart of read density distribution around specific regions like splice sites, promoters, and UTRs using SAMtools, BEDTools, etc.
@alexsoupir 3 years ago ⁺¹
Hey! We could look at read distributions around splice sites using the RNA-seq data we worked with and the Integrative Genomics Viewer (IGV). There are some neat things that can be seen with alternative events when you look at the junctions!
As far as using SAMtools or another program like that, I haven't done that before so I'd have to read up and see how to do that. I've used deepTools before (I think that's what it's called) to plot coverage, but that was for ChIP-seq, which is already only getting reads around a specific region. I'll look and see what I can find!
@allimy3662 3 years ago ⁺¹
@@alexsoupir Thanks, I don't know whether we can get the read distribution using the same analysis regardless of RNA-seq, ChIP-seq, and PAS-seq.
@alexsoupir 3 years ago
The read distributions that I'm thinking of are more like what proportion of the whole genome the reads come from. So for something like whole-genome sequencing we would expect pretty even coverage, with RNA-seq there should be some areas with higher coverage and some with lower coverage, and with ChIP-seq there should be a lot of reads coming from a small proportion of the genome since it's usually very targeted.
I'll see what else I can find because it would be interesting!
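The idea of asking what proportion of the genome the reads come from can be made concrete with a toy calculation. The sketch below takes alignments as (start, end) intervals, builds per-base depth, and reports breadth of coverage. The coordinates are invented for illustration; in practice they would come from a BAM file via a tool such as SAMtools or deepTools rather than from hand-written tuples.

```python
def depth_per_base(genome_length, alignments):
    """Per-base read depth from (start, end) alignments (0-based, end-exclusive)."""
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        # Clamp each alignment to the genome boundaries.
        s = min(max(start, 0), genome_length)
        e = min(max(end, 0), genome_length)
        if e > s:
            diff[s] += 1
            diff[e] -= 1
    depth, running = [], 0
    for change in diff[:genome_length]:
        running += change
        depth.append(running)
    return depth


def breadth(depth, min_depth=1):
    """Fraction of positions covered by at least min_depth reads."""
    return sum(d >= min_depth for d in depth) / len(depth)


if __name__ == "__main__":
    even = [(0, 5), (5, 10)]              # whole-genome-like: spread across the genome
    targeted = [(2, 4), (2, 4), (3, 5)]   # ChIP-seq-like: piled up in one region
    for label, reads in (("even", even), ("targeted", targeted)):
        d = depth_per_base(10, reads)
        print(label, d, breadth(d))
```

Even coverage gives a breadth near 1 with flat depth, while targeted data gives a small breadth with high depth in a few places.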
@kgeehomie6091 2 years ago
I have a question regarding the assembly procedure. But before that, I have to mention that this tutorial is P.E.R.F.E.C.T. Even for someone getting in touch with bacterial annotation for the first time, this presentation was well structured and very easy to follow. At 24:30 there were 118 contigs. Is it mandatory to close these "gaps" in order to have one genome at the end? I am asking because with viral genomes (I know they are smaller and much easier to handle), possible gaps after the assembly were closed by other software (GapFiller, if I am not mistaken). Should we do this step on bacterial assemblies too, or is it not necessary?
@alexsoupir 2 years ago
Really good question that I don't have a "correct" answer for. Ideally, it would be continuous, right? Long-read sequencing can add scaffolds, and the Nanopore base caller has come a long way in the last few years. Alternatively, PCR with extended elongation time *may* be able to bridge those areas. I have never heard of GapFiller but will certainly look.
Personally, I think long-read sequencing should be the starting point to cover potential repeat regions, followed by high-coverage short reads for mutation frequency or consensus.
@kgeehomie6091 2 years ago
@@alexsoupir Thank you so much for the prompt response. I see the point. Apart from GapFiller, I read in an article (the title has totally slipped my mind😅) that they mapped the regions ± 500 bp of each contig to the corresponding region of the two closest genomes to overcome the issue of the "gaps." That is why I asked if it is the "rule" to have one genome at the end rather than multiple contigs.
@abdelrahmanmahany133 5 months ago
Thanks. When running Unicycler I got 124 contigs, not 118 as in the video. What could be the reason? I installed the most recent packages, not the specific versions used in the tutorial.
@alexsoupir 5 months ago ⁺¹
That could be a reason. One hopes that as time goes on, there is more and more refinement of tools. If the same versions of everything were used, it might come down to internal random seeds that determine where contig building begins, so that it arrives at a different 'optimal' final product. I can't think of another example that's easy to understand outside of K-means clustering, where random starting locations are generated at the beginning of the fit.
If the final product differs from run to run, there might actually be a way to leverage this to improve the build of our genome and decrease the contig number. Let's say we run it 10 times and get slightly different contig numbers and sizes each time. We can then align all of our contigs and see if any contigs from different runs are able to bridge gaps.
We can start thinking one step further and see what pops up. What happens if you align the longest contigs from different builds? Are they identical? Are there differences? If there are differences, can we produce a consensus by building the genome over and over and seeing what changes? The great thing with some of this stuff is that if/when there's a deviation from 'typical', that leaves room to get creative!
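A first pass at the "are the longest contigs identical?" question could look like the sketch below. It is deliberately crude: it only tests exact equality, allowing for the contig being reported on the opposite strand, and it does not handle circular contigs that start at a different position or contigs that differ by a few bases. A real comparison would use an aligner. The toy sequences are invented.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest(contigs):
    """Return the longest contig from a list of sequences."""
    return max(contigs, key=len)


def same_sequence(a, b):
    """True if a equals b or the reverse complement of b (case-insensitive)."""
    a, b = a.upper(), b.upper()
    return a == b or a == reverse_complement(b)


if __name__ == "__main__":
    run1 = ["AAACCCGGGT", "ACG"]
    run2 = ["ACCCGGGTTT", "TT"]   # same longest contig, opposite strand
    run3 = ["AAACCCGGGA", "ACG"]  # differs at the last base
    print(same_sequence(longest(run1), longest(run2)))
    print(same_sequence(longest(run1), longest(run3)))
```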
@abdelrahmanmahany133 5 months ago
@@alexsoupir Great. I got it now. Thanks a lot.
@hassanramadan358 5 months ago
Can I contact you, Dr. Abdelrahman? @@abdelrahmanmahany133
@wkfw1274 3 years ago ⁺¹
Do the untrimmed reads give a better assembly result? Does "untrimmed reads" mean the reads with adapters? Thank you
@alexsoupir 3 years ago
Not necessarily. There **may** be adapter in the untrimmed reads. Trimming shifts the overall quality of the reads used to assemble higher, which provides a higher confidence genome assembly. So while not trimming may provide fewer longer contiguous sequences, the quality may be lower.
Does this make sense? We trim by quality threshold, so the assembly from trimmed reads will be higher confidence that the correct base pair is called. But by trimming we are losing some of the "bad" data that, when used, allows the program to create the longer sequences.
@wkfw1274 3 years ago
@@alexsoupir Thank you for the explanation. Maybe we can filter out reads with low quality and keep the high-quality ones at full length.
@alexsoupir 3 years ago
@@wkfw1274 Of course, that is an option. You would need to set the minimum length to the full length of your reads. However, even doing that you will remove far more reads (if still using a sliding window): if a read has a high-quality start and middle but the end trails off below your threshold, then instead of keeping, say, 75 bp of your 100 bp read, you're keeping 0 bp (a lot of discarded data!).
I don't think there is a right or wrong way if you can justify what you do. I often do a sliding window because it allows me to keep the high quality parts of a read while removing the bad part. As with everything there's likely some optimization that can be done!
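The trade-off described above can be shown with a toy read. The sketch below uses a simplified sliding-window rule (cut at the first window whose mean quality drops below the threshold), which is in the spirit of, but not identical to, what trimming tools implement. The window size, threshold, and quality values are assumptions chosen for illustration.

```python
def sliding_window_cut(quals, window=4, min_avg=20):
    """Return how many leading bases to keep: cut at the start of the first
    window whose mean quality is below min_avg."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return start
    return len(quals)


def kept_bases(quals, min_len, window=4, min_avg=20):
    """Bases retained after trimming, or 0 if the trimmed read is shorter than min_len."""
    kept = sliding_window_cut(quals, window, min_avg)
    return kept if kept >= min_len else 0


if __name__ == "__main__":
    # A 100 bp read: 75 high-quality bases followed by a low-quality tail.
    read = [35] * 75 + [10] * 25
    print(kept_bases(read, min_len=36))   # trimmed read is kept
    print(kept_bases(read, min_len=100))  # full-length requirement discards it
```

With a modest minimum length, most of the read survives. Requiring the full 100 bp throws the whole read away, which is the loss of data the reply is pointing at.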
@hiagosilva7619 1 year ago
Are you using Linux? Can you do it on Windows? Sorry for this type of question. I am new to this. Is it a software interface where you run Linux commands?
@alexsoupir 1 year ago ⁺¹
Hey Hiago. I am indeed using Windows. There's a terminal tool called "Windows Subsystem for Linux" that allows you to run Linux commands on your Windows computer. Super handy.
You'll have to run a couple of commands in Windows PowerShell but thankfully it's made pretty easy. Feel free to Google "turn on windows subsystem for linux" and you'll find easy methods.
@Eron589 1 year ago
@@alexsoupir Thank you so much! That will be very helpful! Your tutorial was also very helpful! Best wishes!
