Using FastQC to check the quality of high throughput sequence

  • Published on 29 Sep 2024

Comments • 43

  • @simonandrews5604
    @simonandrews5604 12 years ago +3

    Unix command line instructions are in the INSTALL.txt file linked from the project's download page under the section 'Running FastQC as part of a pipeline'.
    Basically you just do: fastqc [some fastq files]. Other command line options can be found by running fastqc --help
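
For anyone scripting this, the command-line call above can be wrapped roughly as follows. This is a minimal sketch rather than anything taken from INSTALL.txt: `build_fastqc_command` and `run_fastqc` are hypothetical helpers, the file names are placeholders, and the `--outdir` and `--threads` options should be checked against `fastqc --help` for your version.

```python
import shutil
import subprocess


def build_fastqc_command(fastq_files, outdir=None, threads=None):
    """Build the argument list for a non-interactive FastQC run."""
    command = ["fastqc"]
    if outdir is not None:
        # Assumed option name; confirm with `fastqc --help`.
        command += ["--outdir", str(outdir)]
    if threads is not None:
        # Assumed option name; confirm with `fastqc --help`.
        command += ["--threads", str(threads)]
    command += [str(path) for path in fastq_files]
    return command


def run_fastqc(fastq_files, outdir=None, threads=None):
    """Run FastQC if it is on the PATH and return the command that was used."""
    command = build_fastqc_command(fastq_files, outdir, threads)
    if shutil.which("fastqc") is None:
        raise FileNotFoundError("fastqc was not found on the PATH")
    subprocess.run(command, check=True)
    return command


# Placeholder file names, only to show the shape of the command.
print(build_fastqc_command(["sample_R1.fastq.gz", "sample_R2.fastq.gz"]))
```

Building the command separately from running it makes it easy to log or inspect the exact call a pipeline would make before anything is executed.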

  • @simonandrews5604
    @simonandrews5604 12 years ago +2

    The Kmer content plot was added after I'd made the initial video. It's really an extension of the overrepresented sequences search which allows you to find partial sequences which are overrepresented but which may not appear at a fixed position within each read (read-through into adapters is the obvious example). I should get around to updating the video with the more recent changes, but everything which was in the original video is still valid.
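
The idea of a partial sequence turning up at different positions can be illustrated with a toy k-mer scan. This is a simplified sketch, not FastQC's actual Kmer module: `kmer_start_positions` is a hypothetical helper, and the reads and the adapter-like fragment are invented for illustration.

```python
from collections import defaultdict


def kmer_start_positions(reads, k):
    """Map every k-mer to the list of start offsets at which it occurs."""
    positions = defaultdict(list)
    for read in reads:
        for start in range(len(read) - k + 1):
            positions[read[start:start + k]].append(start)
    return positions


# Invented reads: the same fragment begins at a different offset in each one.
FRAGMENT = "AGATCGGAAG"
reads = [
    "ACGTACGTTT" + FRAGMENT,
    "GGCATT" + FRAGMENT + "AGCA",
    "TTG" + FRAGMENT + "CCTAGTC",
]

positions = kmer_start_positions(reads, k=7)
# The 7-mer is present in every read, but never at the same offset.
print("AGATCGG starts at offsets:", positions["AGATCGG"])
```

A whole-read comparison would treat these three reads as unrelated, whereas the k-mer view shows the shared fragment regardless of where it starts.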

  • @BabrahamBioinf
    @BabrahamBioinf 13 years ago +2

    @queencake14 RNA-seq often shows biased composition in the first 10 bases or so due to the 'random' hexamers used to prime the library not actually being all that random. You can also see high duplication levels since you need to oversequence highly expressed transcripts in order to see the more lowly expressed ones.
    The Kmer plot helps you spot unusual enrichment of sequences which don't sit at the same position in every read, e.g. adapter sequences which start at a variable point.
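
The composition bias described above is the kind of thing a per-position base tally makes visible. The following is a rough sketch under stated assumptions: `base_fractions_by_position` is a hypothetical helper, the four reads are invented, and real data would need handling for N calls and unequal read lengths.

```python
from collections import Counter


def base_fractions_by_position(reads):
    """Return, for each read position, the fraction of A, C, G and T calls."""
    length = min(len(read) for read in reads)
    fractions = []
    for pos in range(length):
        column = Counter(read[pos] for read in reads)
        total = sum(column.values())
        fractions.append({base: column[base] / total for base in "ACGT"})
    return fractions


# Invented reads; position 0 is dominated by A, as a stand-in for priming bias.
toy_reads = ["ACGT", "AGGT", "ATCA", "CCGA"]
for pos, fraction in enumerate(base_fractions_by_position(toy_reads)):
    print(pos, fraction)
```

In an unbiased library each position would sit close to the overall base composition, so a skew confined to the first few positions points at the priming step rather than the sample.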

  • @tberben
    @tberben 12 years ago +1

    Extremely informative video. I have a question concerning what I think is a newer version of FastQC: I get an extra tab named "Kmer content", but I'm not sure what it represents?

  • @rpsivan2000
    @rpsivan2000 9 years ago +4

    nicely explained.
    thanks

  • @divided_by_dia446
    @divided_by_dia446 several months ago

    At base sequence quality, what does the scale from 0 to 38.0 represent in the graph? I know it is a quality score, but is it a percentage? How can the values be compared against each other, and how are they calculated in the first place?

    • @simonandrews5604
      @simonandrews5604 several months ago

      It's a PHRED score, which is a mathematical transformation of a p-value: it says how likely the called base is to be incorrect. The scale is logarithmic, so PHRED 20 is a 1% error rate, PHRED 30 is a 0.1% error rate and PHRED 40 is a 0.01% error rate. The value comes from the measurement of the signal:noise ratio on the sequencer when it first calls the base.
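
The transformation behind those numbers is the standard Phred relation Q = -10 · log10(p), where p is the probability that the base call is wrong. A minimal sketch follows; the function names are just illustrative.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


for q in (20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Phred {q}: {p:.4%} chance the base call is wrong")
```

Because the scale is logarithmic, every 10 points means a tenfold drop in the error rate, which is why scores are compared as differences rather than read as percentages.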

    • @divided_by_dia446
      @divided_by_dia446 several months ago

      @simonandrews5604 Thank you :)

  • @maheshmathe4393
    @maheshmathe4393 5 years ago +1

    Hi. Could you give the link to the good or bad files used in the video?

    • @BabrahamBioinf
      @BabrahamBioinf 5 years ago

      They're all on the site, just not linked. Put the name of the file from the report after the project URL and you can get the data. For example the URL for the first one would be www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short.txt
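
The rule in this reply (file name from the report appended to the project URL) can be written as a one-liner. In this sketch `example_file_url` is a hypothetical helper, and the `https` scheme is an assumption, since the reply gives the address without one.

```python
# Project URL exactly as given in the reply above.
PROJECT_URL = "www.bioinformatics.babraham.ac.uk/projects/fastqc/"


def example_file_url(file_name, scheme="https"):
    """Append a file name from a FastQC report to the project URL."""
    # The scheme is an assumption; the original address has none.
    return f"{scheme}://{PROJECT_URL}{file_name}"


print(example_file_url("good_sequence_short.txt"))
```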

  • @keenviewer
    @keenviewer 12 years ago +1

    Excellent video - very clearly explained. As 'tberben' asks it would be interesting to hear your explanation of the "Kmer content" module. I personally find this the most challenging module to interpret.

  • @janebond3263
    @janebond3263 4 years ago

    How do I install FastQC through the console?
    I have the interactive version, as you do. I'm asking because I don't know how to use cutadapt from the interactive FastQC. Sorry, I am new to this.

  • @kaixinsjtu
    @kaixinsjtu 9 years ago +3

    Thanks for the interpretation.

  • @pratitiankola8031
    @pratitiankola8031 1 year ago

    Hello, such a great video and explanation. I have a question: in the FastQC results, the overrepresented sequences module still shows a red cross even after trimming the adapters (the ones listed under overrepresented sequences). Why is that? On the other hand, the per base quality plot shows clean data with no interquartile range and no SD box. Could you please let me know? Thank you.

    • @BabrahamBioinf
      @BabrahamBioinf 1 year ago +1

      Trimming won't necessarily remove overrepresented sequences - it depends where they come from. They could be something unrelated to adapters such as rRNA sequences, or they could be truncated adapters which are missing their first few bases, in which case the trimming program won't find them. It's also possible that the trimming program wasn't given the correct adapter sequence and therefore didn't remove everything it could. It's difficult to know without seeing what's in your report.
      It's fairly common for the per base quality plot to show no interquartile range. That's simply because Illumina sequencers often generate very consistent Phred scores, such that more than 50% of the base calls in a given cycle end up with the same Phred value, and thus the lower and upper quartile values are the same. You even see some cases where there are no whiskers because all Phred scores from the 10th to the 90th percentile are the same, though this is rarer.
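
The collapsing box can be reproduced with a few made-up numbers. This is a sketch only: the nearest-rank `percentile` helper is an assumption and may not match how FastQC computes its percentiles, and both score lists are invented.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, for integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def box_summary(scores):
    """Return the 10th, 25th, 50th, 75th and 90th percentiles of the scores."""
    return {pct: percentile(scores, pct) for pct in (10, 25, 50, 75, 90)}


# Invented cycle where 80% of the base calls share one Phred value.
mostly_uniform = [20] * 10 + [30] * 10 + [37] * 80
# Invented cycle where 95% of the base calls share one Phred value.
almost_all_uniform = [30] * 5 + [37] * 95

print(box_summary(mostly_uniform))
print(box_summary(almost_all_uniform))
```

In the first list the lower and upper quartiles are both 37, so the box has no height while a lower whisker remains; in the second list every percentile from the 10th to the 90th is 37, so the whiskers disappear too.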

  • @ignasijs9734
    @ignasijs9734 3 years ago

    Hi, the win/linux link to download the installation program doesn't work

  • @german4162
    @german4162 2 years ago

    Doing an analysis with Nanopore and this video was the most useful I could find. The outputs are slightly different now but still comparable. Thanks.

  • @kmbeyagala
    @kmbeyagala 10 years ago

    Thanks for the nice video. I have Illumina data and I do not know how to check read quality using FastQC. I am using MobaXterm Personal Edition v7.0. Any ideas on how to proceed?

  • @gregorylupton2330
    @gregorylupton2330 1 year ago

    Great video. Thank you.

  • @siarheimanakov
    @siarheimanakov 12 years ago

    And how do you run the program from a Unix-style command line? I can't find instructions anywhere on the website :(

  • @mariliadesouzacosta4763
    @mariliadesouzacosta4763 6 years ago +1

    THANK YOU SO MUCH!!!! HOLY TUTORIAL!!!!

  • @khushalsinghsolanki4857
    @khushalsinghsolanki4857 8 years ago

    With my Illumina NextSeq 500 transcriptome data, the raw data shows a normal sequence length distribution (151 bp) in FastQC, whereas data processed with NGS QC Toolkit shows a warning in the sequence length distribution module, with sequences from 50 to 151 bp long.
    How?

    • @BabrahamBioinf
      @BabrahamBioinf 8 years ago

      +KHUSHAL SOLANKI Were you feeding it a soft-clipped BAM file by any chance? Soft clipping puts annotations into the file to say where it thinks the ends of the adapters are, but doesn't actually remove the sequence. Since FastQC is a raw sequence analysis tool it doesn't look at this, but other tools might. If it's from fastq files then I don't know (and one of them must be wrong!)
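
To make the soft-clipping point concrete: in a SAM/BAM record the clipped bases are marked with `S` in the CIGAR string but are still present in the stored sequence. The sketch below uses a hypothetical `read_lengths` helper and invented CIGAR strings; it is not a claim about what either tool actually does internally.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that consume bases of the stored read sequence.
CONSUMES_READ = set("MIS=X")


def read_lengths(cigar):
    """Return (stored length, length once soft-clipped bases are ignored)."""
    ops = [(int(count), op) for count, op in CIGAR_OP.findall(cigar)]
    stored = sum(count for count, op in ops if op in CONSUMES_READ)
    soft_clipped = sum(count for count, op in ops if op == "S")
    return stored, stored - soft_clipped


# Invented example: a 151-base read with its last 50 bases soft clipped.
print(read_lengths("101M50S"))
# Invented example: a 151-base read with no clipping at all.
print(read_lengths("151M"))
```

A tool that reads only the stored sequence would report 151 bases for both records, while a tool that honours the clipping annotation would report 101 for the first, which would produce a mixed length distribution.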

  • @TheGummysoup
    @TheGummysoup 6 years ago

    Sorry for my question, I am a beginner, but when I try to run FastQC on my sequencing data it only gets to 95% and I can't get the .html report :( :(

    • @BabrahamBioinf
      @BabrahamBioinf 6 years ago

      Sorry to hear it's not working for you. Are you using the latest version of the software, 0.11.7? When it stops, is there anything written to the console from which it was launched? The best way to follow this up would be to open a bug at github.com/s-andrews/fastqc/issues and we can track it from there.

  • @nitandressa1
    @nitandressa1 12 years ago

    Very nice video!! Thanks a lot!! I am new to NGS, so I have a question: why is it bad to have similar sequences in the fastq file? Wouldn't it be normal to have more than one copy of the same read? I have looked at some differential expression analysis pipelines and they remove reads with low copies. Thanks in advance!

  • @missing409
    @missing409 5 years ago

    well explained, great video and great program but your website is currently down!!

    • @BabrahamBioinf
      @BabrahamBioinf 5 years ago +1

      Thanks! Our IT are very aware of the network problems and lots of people are running around trying to fix them :-)

  • @seulalee2024
    @seulalee2024 3 years ago

    I still have overrepresented sequences that say NO HIT after removing adapters. I BLASTed these sequences and they come back as long non-coding RNA. Do I need to remove these overrepresented sequences, or is it enough just to remove the adapters and go on to the next step? Thank you

    • @simonandrews5604
      @simonandrews5604 3 years ago +1

      Other common sources of overrepresented sequences would be things like ribosomal RNAs, telomere/centromere repeats, polyA and stuff like that. These are all biologically derived sequences rather than technical contamination (i.e. adapters), so it's not as clear cut whether they should be removed. My normal course of action would be to leave these biologically overrepresented sequences in place initially, and then possibly do additional filtering after mapping / quantitation, but a lot of this depends on the nature of your library and scientific question.

    • @seulalee2024
      @seulalee2024 3 years ago

      @simonandrews5604 thank you so much for the answer. Appreciate it

  • @sheetal_soul
    @sheetal_soul 4 years ago

    Please provide a video on installing version 0.11.9. I have been trying for a long time but have been unable to do it. Please help.

    • @simonandrews5604
      @simonandrews5604 4 years ago

      Nothing has changed in the way the program is installed or run between the version in the video and the latest version. One of the modules has been removed and there have been some more minor changes to other modules, but nothing major is different. If you're having problems getting the program to install or run then please report this as a bug on our GitHub page and we can track it down with you.

  • @simonandrews5604
    @simonandrews5604 12 years ago

    You generally have to sequence to very high fold coverage to see significant numbers of duplicated reads, though it's normal to see many overlapping reads. If you have an enriched library with huge read depth then you might expect to see duplication, but in most libraries high duplication levels are more likely to have a technical source, i.e. PCR duplication. You can only really tell the difference by looking at the pattern of mapped reads, but the FastQC report should give you a clue.
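
As a rough illustration of what a duplication level measures, exact-match duplicates can be counted directly. This is a simplified sketch with a hypothetical `duplication_fraction` helper and invented reads; it is not FastQC's own duplication estimate, and it cannot tell PCR duplicates from genuinely abundant sequences.

```python
from collections import Counter


def duplication_fraction(reads):
    """Fraction of reads that are extra copies of a sequence already seen."""
    counts = Counter(reads)
    extra_copies = sum(count - 1 for count in counts.values())
    return extra_copies / len(reads)


# Invented reads: one sequence appears three times, the others once each.
toy_reads = ["ACGTAC", "ACGTAC", "ACGTAC", "TTGACA", "CCAGTT"]
print(duplication_fraction(toy_reads))
```

Note that overlapping reads which start at different positions are different strings, so they do not count as duplicates here, matching the distinction drawn above.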

  • @queencake14
    @queencake14 13 years ago +2

    Thank you! Amazingly helpful demonstration! :)
    Two questions: 1) What differences would you expect for RNA-seq data (you mention GC content fluctuations?) 2) I'm seeing Kmer content in my FastQC report as well - what is this a measure of?

  • @yuwan
    @yuwan 9 years ago

    Thanks for this clear video.

  • @ellhar1
    @ellhar1 9 years ago

    What a fantastic demo video, well done.

  • @sand8683
    @sand8683 3 years ago

    Thank you so much!

  • @chuskihouse8776
    @chuskihouse8776 3 years ago

    What is a kmer?

  • @healthymadness8607
    @healthymadness8607 2 years ago

    Thank you!

  • @asap2334
    @asap2334 4 years ago

    you have saved me so much time thank you

    • @BabrahamBioinf
      @BabrahamBioinf 4 years ago

      You're welcome; glad we could help

  • @jesstilla
    @jesstilla 10 years ago +1

    thank you! I very much appreciate this clear video!