Bioinformatics - SRA Download, QC, and Trimming

  • Published on Jun 23, 2020
  • In this video I run through the project that these videos are based on: downloading raw fastq files, running quality control checks, and trimming. I also cover how to use parallel to process all of the files in one command rather than running a command for each.
    Conda packages used: fastqc, multiqc, sra-tools (fastq-dump), parallel, trimmomatic
    My GitHub markdown with the project (still sort of a work in progress):
    github.com/ACSoupir/Bioinform...
    My GitHub link for this video in particular.
    github.com/ACSoupir/Bioinform...
    Publication which the data is from:
    pubmed.ncbi.nlm.nih.gov/26372...
    Publication's SRA Run Selector for the read files used:
    trace.ncbi.nlm.nih.gov/Traces...
    JHU's Center for Computational Biology (mouse genome and annotation):
    ccb.jhu.edu/software/tophat/ig...
    Not a fan of doing plugs for "Like, Share, and Subscribe" but if you could, I would greatly appreciate it. Thanks!
    Image at the beginning on the bottom left is modified from AllGenetics.EU.
    Please consider contributing to my Patreon where I may do merch and gather ideas for future content:
    / alexsoupir
  • Howto & Style
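The "one command for all files" idea from the description can be sketched roughly as follows. This is a minimal sketch, not the exact commands from the video: the file and folder names (`accessions.txt`, `raw`, `qc`) and the job count are placeholders, and `build_download_cmds` is a hypothetical dry-run helper that only prints commands.

```shell
# Sketch only: accessions.txt, raw/ and qc/ are placeholder names.

# Dry run: print the fastq-dump command for every accession in a list file.
build_download_cmds() {
  list_file=$1
  out_dir=$2
  while IFS= read -r acc || [ -n "$acc" ]; do
    [ -z "$acc" ] && continue          # skip blank lines
    printf 'fastq-dump --gzip --split-files --outdir %s %s\n' "$out_dir" "$acc"
  done < "$list_file"
}

# Real run: download and QC every accession without typing one command per file.
# Requires sra-tools, GNU parallel, fastqc and multiqc on PATH.
run_all() {
  mkdir -p raw qc
  # :::: makes GNU parallel read its arguments from a file, one per line
  parallel -j 4 fastq-dump --gzip --split-files --outdir raw {} :::: "$1"
  # ::: takes the arguments from the command line instead
  parallel -j 4 fastqc --outdir qc {} ::: raw/*.fastq.gz
  multiqc qc --outdir qc
}
```

Running `build_download_cmds accessions.txt raw` first is a cheap way to check the list before `run_all accessions.txt` starts any downloads.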

Comments • 39

  • @jacobi_official8590
    @jacobi_official8590 2 years ago +1

    Thanks for this wonderful video. For someone who isn't familiar with trimmomatic, this is really helpful.

  • @Finger_Lock_
    @Finger_Lock_ 2 years ago +1

    Thank you so much for making this video. It is very helpful.

  • @madhavanjn
    @madhavanjn 3 years ago +4

    Bioinformatics data is mostly unstructured, and we face many problems while working on bioinformatics projects. We really appreciate your initiative and your efforts to share your knowledge on this site. Keep going, all the best 👍💯.

  • @noureddineIDALI
    @noureddineIDALI 1 year ago

    Thank you!

  • @windflow9373
    @windflow9373 3 years ago

    Thank you SO much for this explanation! And if the per-tile sequence quality shows red, how do I trim the bad tiles?

  • @szecek
    @szecek 2 years ago +2

    LEADING:3 doesn't mean "remove the first 3 bases". It means "remove bases from the 5' end for as long as their quality is below 3". To remove the first 3 bases regardless of quality, you should use HEADCROP.

    • @alexsoupir
      @alexsoupir 2 years ago +2

      Thank you for catching this! Important distinction.

    • @szecek
      @szecek 2 years ago +1

      @alexsoupir No worries. I really appreciate your efforts in making these videos. The best I've found on the subject. 👌
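The distinction made in this thread can be illustrated with a toy model. The sketch below is not Trimmomatic's own code. It only mimics the two steps on a list of per-base Phred scores passed as arguments, and the function names `leading` and `headcrop` are made up for the illustration.

```shell
# Toy model of two Trimmomatic steps, applied to a list of Phred scores.

# LEADING:<q> drops bases from the 5' end while their quality is below q.
leading() {
  threshold=$1
  shift
  while [ "$#" -gt 0 ] && [ "$1" -lt "$threshold" ]; do
    shift                              # discard one low-quality leading base
  done
  echo "$*"
}

# HEADCROP:<n> drops the first n bases regardless of their quality.
headcrop() {
  n=$1
  shift
  while [ "$n" -gt 0 ] && [ "$#" -gt 0 ]; do
    shift                              # discard one base unconditionally
    n=$((n - 1))
  done
  echo "$*"
}

leading 3 2 2 30 35 2     # prints: 30 35 2   (two bases below Q3 removed)
leading 3 30 35 40        # prints: 30 35 40  (nothing removed)
headcrop 3 30 35 40 38    # prints: 38        (first three bases removed)
```

In a real Trimmomatic call the corresponding steps are written `LEADING:3` and `HEADCROP:3`, so a read that starts with good-quality bases passes through `LEADING:3` unchanged.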

  • @adetayoaborisade9346
    @adetayoaborisade9346 3 years ago

    thanks

  • @marwatawfik3956
    @marwatawfik3956 3 years ago +1

    Many thanks, Alex, that is quite helpful :)
    Regarding Trimmomatic, is it possible to access the adapter sequences when I am using a cluster (HPC) where Trimmomatic is already installed? I mean, where can I get the adapter sequences (library) provided by Trimmomatic?
    Also, do LEADING and TRAILING give the number of bases to remove, or the quality-score threshold below which bases are removed? The manual is really unclear on this.

    • @alexsoupir
      @alexsoupir 3 years ago +1

      Hello, Marwa!
      It is possible to access the adapter sequence files on an HPC, but you may have to talk with whoever installed it. If trimmomatic is a module (loaded with `module load trimmomatic` or something like that), there is usually a common area where the modules are installed. Alternatively, if you installed it like I did with `conda install trimmomatic` or `conda install -c bioconda trimmomatic`, you can look in your miniconda3 folder (I think in bin?) for the trimmomatic data folder, where the adapter files should be.
      If all of that fails, you can certainly google the adapter sequences and put them in a file within your project folder, which is easier to use. Then you don't need to worry about digging through folders to find it! :)

    • @marwatawfik3956
      @marwatawfik3956 3 years ago

      @alexsoupir Thanks
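For the "digging through folders" part of this thread, a search can do the digging. The sketch below assumes the adapter FASTA files sit in a folder named `adapters` somewhere below the install location, which is how Trimmomatic distributions are usually laid out. The exact path on a given cluster or conda environment is not known here, and `find_adapters` is a hypothetical helper name.

```shell
# List adapter FASTA files found in any "adapters" folder below a directory.
find_adapters() {
  find "$1" -type f -path '*/adapters/*.fa' 2>/dev/null | sort
}

# Typical use inside an activated conda environment:
#   find_adapters "$CONDA_PREFIX"
# or from the top of a miniconda3 install:
#   find_adapters "$HOME/miniconda3"
```

The path that comes back (for example one ending in `TruSeq3-PE.fa`) is what goes into the `ILLUMINACLIP:` step, or it can be copied into the project folder as suggested above.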

  • @janasalamonova5301
    @janasalamonova5301 4 years ago

    Great! Keep going :) Also, I would appreciate it if the text were a little bit bigger; it is quite hard to see :)

    • @janasalamonova5301
      @janasalamonova5301 4 years ago

      Or...actually it is alright :D Maybe I had something wrong with my internet connection so it was kind of blurry

    • @alexsoupir
      @alexsoupir 4 years ago +1

      I can try to make it a little bigger, for sure. It gets kind of full when the code gets really long. Thanks for the feedback! Sometimes YouTube will serve a 'standard definition' video, which makes it tough to see what's on the screen. I'm new to the tutorial stuff, but hopefully it is helpful! Haha

  • @o.renishii
    @o.renishii 8 months ago

    Hi there! Very helpful video! One (probably stupid) question: my data have not yet been submitted to SRA. How can I process them using the Ubuntu terminal?

    • @alexsoupir
      @alexsoupir 6 months ago

      If your data isn't in the SRA, you can skip the SRA download step and continue with the typical workflow. The SRA is just a nice place for getting data that is available to everyone (human subjects data is the exception; for that you would need controlled access). If you already have the data, you can create the text file with the sample keys and go on to QC and trimming.
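Creating the text file with the sample keys from files already on disk could look roughly like this. It is a sketch under one assumption: paired files are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`, so adjust the pattern if your files are named differently. `make_sample_list` and `samples.txt` are placeholder names.

```shell
# Print one sample key per line for paired files named
# <sample>_1.fastq.gz / <sample>_2.fastq.gz in the given folder.
make_sample_list() {
  for f in "$1"/*_1.fastq.gz; do
    [ -e "$f" ] || continue            # no matches: the glob stays literal
    basename "$f" _1.fastq.gz          # strip the folder and the suffix
  done
}

# Typical use:
#   make_sample_list raw > samples.txt
```

The resulting list can then drive the QC and trimming steps in place of the SRR accession list.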

  • @adetayoaborisade9346
    @adetayoaborisade9346 3 years ago +2

    When I try to download the Mus musculus reference genome, I get a "login incorrect" warning.

    • @ricardoandreslabandalucero9855
      @ricardoandreslabandalucero9855 9 months ago

      Me too.

    • @alexsoupir
      @alexsoupir 6 months ago

      I have seen a couple of comments about that. You could try getting it from a different source. I found this with a quick Google search; you could look there for the files needed:
      www.ncbi.nlm.nih.gov/grc/mouse

  • @abdelrahmanmahany133
    @abdelrahmanmahany133 5 months ago

    Does sra-toolkit need special settings? It keeps giving me a connection error when I try to download the SRR files.

    • @alexsoupir
      @alexsoupir 5 months ago +1

      So I was playing around recently and noticed the conda one doesn't work right anymore. I need to figure out how to fix that and make an update video.
      I would recommend following this for the time being:
      github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit
      Certainly not as easy and clean, but I think this was how I ended up getting it to work.

    • @abdelrahmanmahany133
      @abdelrahmanmahany133 5 months ago +1

      @alexsoupir Thanks a lot. It worked: I installed sra-tools from the link you provided and uninstalled the conda version. Thanks for the help.

  • @hebaalmaghrbi6800
    @hebaalmaghrbi6800 1 year ago

    Does anyone know of a publication that applied what he's doing here to a specific disease? Help me out on this, guys.

    • @alexsoupir
      @alexsoupir 6 months ago

      What kind of information are you looking for? Specific pipeline or tools used?

  • @freezingtolerance7493
    @freezingtolerance7493 1 year ago

    Hello, Alex. I am trying to download the mouse genome using your Linux code, but I get an error message like: "Logging in as igenome ... Login incorrect." Could you give some advice on how to download it?

    • @o.renishii
      @o.renishii 8 months ago

      I'm having the same problem, as well. Any suggestions as to why this is happening? Am I missing a package or something?

    • @alexsoupir
      @alexsoupir 6 months ago

      I wonder if they put in a sign-in requirement? It's interesting that it would do that. Alternatively, you could download through a browser and then move the files to the project folder. As with the human genome, I wouldn't be surprised if there are multiple places online to get the mouse genome and accompanying files like the GTF.
      Could try digging around here for the needed files:
      www.ncbi.nlm.nih.gov/grc/mouse

  • @shobhitashah1524
    @shobhitashah1524 3 years ago

    Can you make a similar video for the Windows operating system?

    • @alexsoupir
      @alexsoupir 3 years ago +1

      Hey Shobhita Shah! My latest video is how to get Linux working on a Windows computer if you want to check it out. It's called "Windows Subsystem for Linux" and as long as you have administrator abilities it should work. Hope it's helpful!

  • @adetayoaborisade9346
    @adetayoaborisade9346 3 years ago

    How do I circumvent that?

  • @hassankarim9451
    @hassankarim9451 2 years ago

    Hello Alex, I have found this video very useful regarding my work. I am facing some problems. Can you please help me to solve this issue? Thanks
    After adding the text list of SRR accessions, I could not split these files. I am facing the following problems.
    user@user:~/Downloads$ fasterq-dump --gzip --split-file SRR8885553
    2022-04-21T01:14:44 fasterq-dump.2.9.1 err: param unknown while parsing argument list within application support module - Unknown argument '--gzip'
    2022-04-21T01:14:44 fasterq-dump.2.9.1 err: param unknown while parsing argument list within application support module - Unknown argument '--split-file'
    2022-04-21T01:14:44 fasterq-dump.2.9.1 err: ArgsMakeAndHandle() -> RC(rcApp,rcArgv,rcParsing,rcParam,rcUnknown)

    • @alexsoupir
      @alexsoupir 2 years ago

      Hello, Hassan!
      May I ask which operating system you are working on? It seems as though the function fastq-dump isn't happy with a few of the arguments.
      The first place I would look is whether you have the sratools package installed through conda or through some other route on Linux or macOS. With gzip not being recognized, this is particularly interesting, because both Mac and Linux should have it since they are built on similar Unix bases.
      I have been super busy lately with my own projects, so I'll admit I haven't used these commands in a while, but that's where I would start. The split-file error also makes me think sratools isn't in your software execution path.
      If you type `fastq-dump --help`, does it print anything out or does it give an error? Are --gzip and --split-files listed as options if it does print out the help menu?
      90% of bioinformatics is troubleshooting errors. That's the fun part for me! Let me know if you're able to solve your issue or if you have other questions.
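The help-menu check suggested in this reply can be scripted. The sketch below uses a hypothetical helper, `supports_flag`, that looks for an option as a whole word in a tool's `--help` output. Note that the command in the question is `fasterq-dump`, which is a different program from `fastq-dump`: it has no `--gzip` option, and the splitting option is spelled `--split-files`, with a trailing "s".

```shell
# Succeed if the tool's --help text lists the option as a whole word,
# so that --split-file does not match --split-files.
supports_flag() {
  "$1" --help 2>&1 | grep -qwF -e "$2"
}

# Typical use (requires sra-tools on PATH):
#   supports_flag fastq-dump   --gzip        && echo "fastq-dump can gzip"
#   supports_flag fasterq-dump --gzip        || echo "compress afterwards instead"
#   supports_flag fasterq-dump --split-files && echo "use --split-files"
```

With `fasterq-dump`, one workable pattern is `fasterq-dump --split-files SRR8885553` followed by `gzip SRR8885553_*.fastq`; with `fastq-dump`, `--gzip --split-files` can be passed directly.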

    • @hassankarim9451
      @hassankarim9451 2 years ago

      @alexsoupir Thank you very much for your response.
      I am using Linux (Ubuntu 20.04). I typed 'fastq-dump' and found information about "split-files", but did not find anything about gzip. I have installed the SRA Toolkit and also updated it today.
      I don't know why, but I am having problems following the steps in your video. Now I am having a problem with the installation of multiqc. I got the following output:
      PackagesNotFoundError: The following packages are not available from current channels:
      - multiqc
      Current channels:
      - repo.anaconda.com/pkgs/main/linux-64
      - repo.anaconda.com/pkgs/main/noarch
      - repo.anaconda.com/pkgs/r/linux-64
      - repo.anaconda.com/pkgs/r/noarch
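The channel list in this error contains only the Anaconda defaults, while multiqc is distributed through the bioconda channel (which in turn relies on conda-forge). A small check along these lines can confirm what is missing. `missing_channels` is a hypothetical helper that reads a channel listing on standard input.

```shell
# Print each recommended channel that is absent from a listing on stdin.
missing_channels() {
  listing=$(cat)
  for ch in conda-forge bioconda; do
    printf '%s\n' "$listing" | grep -q "$ch" || echo "$ch"
  done
}

# Typical use (requires conda on PATH):
#   conda config --show channels | missing_channels
# One-off install that names the channels explicitly:
#   conda install -c conda-forge -c bioconda multiqc
```

If both names are printed, either pass the channels on the command line as shown or add them permanently with `conda config --add channels`.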