How to select and remove individuals in PLINK

  • Published on 30 Sep 2024

Comments • 18

  • @mohammadj.shamim9342
    @mohammadj.shamim9342 2 years ago +2

    Thank you so much. Could you please make a tutorial on how to overwrite phenotypes? For example, let's say we have the ped and fam files with no phenotypes. We do some research and retrieve data for a specific trait. Now we want to include those phenotypes in the ped file for association analysis. How can we do it? My understanding is that we can use the --pheno pheno.txt option, but I want these phenotypes to be embedded in the files written with --out.

    • @GenomicsBootCamp
      @GenomicsBootCamp 2 years ago +1

      Hi, yes, I will do this relatively soon. For now, so you can proceed: if you have more than one phenotype, you need to use the combination of --pheno and --pheno-name (or --mpheno). I prefer --pheno-name.
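The question above is about getting phenotypes into the output files themselves. A minimal sketch of the idea, using toy files: the file names, the trait name `weight`, and the prefix `mydata` are all hypothetical. The awk step only illustrates what "embedding" means for column 6 of a .fam file; the PLINK call at the end uses the --pheno and --pheno-name options named in the reply and is skipped unless PLINK and a matching file set are present.

```shell
# Toy .fam: FID IID father mother sex phenotype (-9 = missing)
printf '%s\n' \
  'ALP goat1 0 0 2 -9' \
  'ALP goat2 0 0 1 -9' \
  'BOE goat3 0 0 2 -9' > toy.fam

# Hypothetical phenotype file with a header line (needed for --pheno-name)
printf '%s\n' \
  'FID IID weight' \
  'ALP goat1 41.5' \
  'BOE goat3 38.2' > pheno.txt

# Plain-awk join: look up each FID/IID pair and overwrite column 6
awk 'NR == FNR { if (FNR > 1) ph[$1 SUBSEP $2] = $3; next }
     { key = $1 SUBSEP $2; if (key in ph) $6 = ph[key]; print }' \
  pheno.txt toy.fam > toy_pheno.fam

cat toy_pheno.fam

# The same idea with PLINK (assumed prefix "mydata"; skipped if not available)
if command -v plink >/dev/null 2>&1 && [ -f mydata.bed ]; then
  plink --bfile mydata --pheno pheno.txt --pheno-name weight \
        --make-bed --out mydata_pheno
fi
```

In the toy output, goat2 keeps -9 because it has no row in the phenotype file.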

  • @MyMasaka
    @MyMasaka 1 year ago +1

    Can this be applied to human GWAS?

    • @GenomicsBootCamp
      @GenomicsBootCamp 1 year ago

      Hi, this is just about the management of individuals in your PLINK data set, so it can be applied irrespective of the follow-up steps. So yes, even if the follow-up is a human GWAS.

  • @shumuyebelayteklebrhan8466
    @shumuyebelayteklebrhan8466 3 years ago +1

    It is a really important lecture and very helpful for us, but I have one question: why don't you use HWE for genome quality control when you are doing PCA? Thank you!

    • @GenomicsBootCamp
      @GenomicsBootCamp 3 years ago +2

      Skipping the HWE filter is not related to PCA, but to the fact that the data contain multiple, in our case MANY, populations.
      SNPs are expected to be close to Hardy-Weinberg equilibrium within a population, but could be very different between populations. If there are multiple populations in a single data set, the genotype frequencies are all over the place and likely not close to HWE. PLINK QC then mistakenly identifies these SNPs as problematic and removes them.
      So, to avoid the incorrect deletion of a large number of SNPs, I decided to skip HWE. A more correct solution is in the "How to compute Fst from SNP genomic data" video, where HWE is checked for each breed. The consequences of wrong decisions are shown in the "What happened to my results? | Consequences ..." video.
      Fortunately, the genetic distances upon which the PCA is based are quite robust, so in my opinion the differences compared to a per-breed HWE check are minimal for the PCA plot.
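A per-breed check like the one mentioned in the reply could start from one keep list per breed. The sketch below is an assumption about how that might be set up, not the script from the referenced video: it uses a toy .fam file, treats the family ID column as the breed code, and only runs PLINK (with the hypothetical prefix `mydata`) if it is installed.

```shell
# Toy .fam with two breeds in the family-ID column
printf '%s\n' \
  'ALP goat1 0 0 2 -9' \
  'ALP goat2 0 0 1 -9' \
  'BOE goat3 0 0 2 -9' > toy_breeds.fam

# One keep list (FID IID) per breed, written to keep_<FID>.txt
awk '{ print $1, $2 > ("keep_" $1 ".txt") }' toy_breeds.fam

# Per-breed HWE report; skipped unless plink and a file set "mydata" exist.
# Add a species flag such as --chr-set if your data need one.
if command -v plink >/dev/null 2>&1 && [ -f mydata.bed ]; then
  for list in keep_*.txt; do
    breed=${list#keep_}
    breed=${breed%.txt}
    plink --bfile mydata --keep "$list" --hardy --out "hwe_$breed"
  done
fi
```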

  • @fakharunnisa2178
    @fakharunnisa2178 2 years ago +1

    In this playlist, you are not giving the link to the commands?

    • @GenomicsBootCamp
      @GenomicsBootCamp 2 years ago

      This is one of the oldest videos on the channel; at that time I probably did not upload the scripts. In any case, in this video it is more important to understand the logic of the data handling, which is based on a few central keywords and the other files they require.

  • @sumeetdeshmukh9786
    @sumeetdeshmukh9786 3 years ago +1

    Hi, thank you for all your tutorials, they help a lot. I was wondering how to remove duplicate sample IDs, keeping only one instance.

    • @GenomicsBootCamp
      @GenomicsBootCamp 3 years ago

      Hmm, good question! I seem to remember that it is possible to remove duplicates based on their IBD value, but I cannot find it...
      From your question, I assume that the entire IDs are the same and that you already tried the --remove option, which probably gets rid of both occurrences (I did not try this, so perhaps check that first).
      If it is only 1-2 cases, you might also do a manual edit and remove the entire lines of the duplicates in the .ped file, similar to what is demonstrated in the video... Not super safe, but it gets the job done.
      If there are more duplicates, or you want to play it safe, I suggest checking out the --filter option. Here you assign e.g. 1 to individuals you want to keep and 0 to those you don't, and then specify "--filter yourFileName.txt 1".
      I am not sure how this behaves though, so feedback would be much appreciated.

    • @sumeetdeshmukh9786
      @sumeetdeshmukh9786 3 years ago +1

      @@GenomicsBootCamp Thank you for the quick response. I basically solved this issue by renaming the sample IDs with the following command:
      awk '$2 in a {$2=$2 "_" ++a[$2]}{a[$2];print}' .fam > output
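To see what a renaming step of this kind does, here is a small sketch on a toy .fam file. It is a rewritten variant of the one-liner above, not the exact command: the counter is parenthesised for readability and the file names are made up. Like the original, it keys on the individual ID (column 2) only, so an ID that repeats in a different family is renamed as well.

```shell
# Toy .fam in which the individual ID "goat1" appears three times
printf '%s\n' \
  'ALP goat1 0 0 2 -9' \
  'ALP goat2 0 0 1 -9' \
  'ALP goat1 0 0 2 -9' \
  'BOE goat1 0 0 2 -9' > dup.fam

# First occurrence keeps its ID; later ones become <ID>_1, <ID>_2, ...
awk '{ id = $2
       if (id in count) $2 = id "_" (++count[id])
       else count[id] = 0
       print }' dup.fam > dup_renamed.fam

cat dup_renamed.fam
```

On the toy input, the second column of the result reads goat1, goat2, goat1_1, goat1_2.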

    • @GenomicsBootCamp
      @GenomicsBootCamp 3 years ago +1

      @@sumeetdeshmukh9786 Great! Good solution as well!

    • @sumeetdeshmukh9786
      @sumeetdeshmukh9786 3 years ago +1

      @@GenomicsBootCamp Hi, I was wondering if you have, or could suggest, any method for imputation and for quality checks after imputation?

    • @GenomicsBootCamp
      @GenomicsBootCamp 3 years ago

      @@sumeetdeshmukh9786 Hi! To be honest, imputation is among the topics I am not too comfortable with. Because of this, I cannot suggest any "best" solution.
      As for the choice of software, you cannot go wrong with any of the established programs. These are in regular use, so serious inaccuracies are unlikely. (Imputation comes with inaccuracies for rare alleles, but this is unavoidable.)
      As for quality after imputation, I know there are multiple methods, comparing allele or genotype status. Here you can play a bit with the metrics: e.g. if AB is imputed instead of AA, you can count this as 1 whole error, or "just" as a 0.5 error, since one of the A alleles was imputed correctly... So there are options here as well. I suggest looking up a review paper on this.
      A paper including our group (I was not involved; it should be open access): www.sciencedirect.com/science/article/pii/S0022030215003021
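The two ways of counting an AB-for-AA mistake (1 whole error versus 0.5) can be made concrete with a toy comparison. This is only an illustrative sketch: it assumes biallelic genotypes coded with the letters A and B, and a made-up two-column file of true and imputed genotypes. It is not a metric taken from the linked paper.

```shell
# Toy comparison: column 1 = true genotype, column 2 = imputed genotype
printf '%s\n' \
  'AA AA' \
  'AA AB' \
  'AB BB' \
  'BB AA' > geno_compare.txt

# Count A alleles in each genotype; the difference is the number of wrong alleles.
# whole: any mismatch counts as 1 error; half: each wrong allele counts as 0.5
awk '{ t = $1; i = $2
       ta = gsub(/A/, "A", t)
       ia = gsub(/A/, "A", i)
       d = ta - ia; if (d < 0) d = -d
       if (d > 0) whole++
       half += d / 2
       n++ }
     END { printf "genotype_error_rate %.3f\n", whole / n
           printf "allele_error_rate %.3f\n", half / n }' \
  geno_compare.txt > imputation_errors.txt

cat imputation_errors.txt
```

For these four toy genotypes, three are wrong at the genotype level (0.750), while the allele-level count gives 0.5 + 0.5 + 1 = 2 errors over 4 genotypes (0.500).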

  • @atheerattar8576
    @atheerattar8576 2 years ago

    Hello, thank you so much for your content on PLINK! It's been really helpful, but can you give examples of how to work directly in PLINK, or whether SAS can be used similarly to R? Thanks

    • @GenomicsBootCamp
      @GenomicsBootCamp 2 years ago

      To my knowledge there is no graphical user interface for PLINK. So if you want to work "directly" in it, you can open it on the command line and copy-paste text there. But this is exactly the same as what we do via R.
      Running PLINK from SAS: my knowledge of SAS is limited, but it should be possible. Somehow... Maybe check the "Unnamed Pipes" documentation: documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/hostwin/n16puwsro9pakqn1jamy1vwyaqx6.htm#p0mqhnu5jewym3n0z3gvvqfe82aq

  • @shirihoshen5921
    @shirihoshen5921 1 year ago

    Thank you! Your channel has opened a whole world for me. I used to skim genetics papers out of interest and now I am learning PLINK and R, and am starting to be able to get deeper into this!
    I finally managed to add breed names to a FAM file which has more than 5,000 dogs using --update-ids, and then pull out the data from the breeds that interested me.
    ChatGPT has also been a help. Chatty is bad with PLINK, but has helped me with R. It has given me a lot of bad advice, but enough good advice to be overall useful.
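The --update-ids step described in this comment could look roughly like the sketch below. Everything here is a toy assumption: the file names, the breed lookup table, and the choice to store the breed in the family-ID column. The one fixed part is the four-column layout that --update-ids reads (old FID, old IID, new FID, new IID). The PLINK calls are skipped unless PLINK and a `dogs` file set exist.

```shell
# Toy .fam where the family ID is only a placeholder
printf '%s\n' \
  'fam1 dog1 0 0 1 -9' \
  'fam2 dog2 0 0 2 -9' \
  'fam3 dog3 0 0 2 -9' > dogs.fam

# Hypothetical lookup table: individual ID, breed
printf '%s\n' \
  'dog1 Beagle' \
  'dog2 Collie' \
  'dog3 Beagle' > breeds.txt

# Build the --update-ids input: old FID, old IID, new FID (breed), new IID
awk 'NR == FNR { breed[$1] = $2; next }
     ($2 in breed) { print $1, $2, breed[$2], $2 }' \
  breeds.txt dogs.fam > update_ids.txt

# Keep list for one breed of interest, using the new family IDs
awk '$3 == "Beagle" { print $3, $4 }' update_ids.txt > keep_beagle.txt

# Two separate runs: first rename, then subset (skipped if plink is absent)
if command -v plink >/dev/null 2>&1 && [ -f dogs.bed ]; then
  plink --bfile dogs --dog --update-ids update_ids.txt --make-bed --out dogs_breed
  plink --bfile dogs_breed --dog --keep keep_beagle.txt --make-bed --out beagles
fi
```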

  • @kush2613
    @kush2613 1 year ago

    I understand it's an old video, but the script
    read_tsv("ADAPTmap_genotypeTOP_20160222_full.fam", col_names = F) %>%
    select(x1,x2) %>%
    filter(x1 == "ALP" | x1 == "BOE" | x1 == "BOEx") %>%
    write_delim("individualSubset.txt", col names = FALSE)
    shows an error like "unexpected symbol in: ...".
    Could you please provide the script for this example, if available?
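Two things in the script as quoted would stop R: `col names` contains a space, which R reports as an unexpected symbol (the argument is `col_names`), and readr names unlabelled columns `X1`, `X2` in upper case, so `select(x1,x2)` would fail next. Note also that `read_tsv` assumes a tab-delimited file. As a cross-check that does not depend on R, the same subset file can be sketched with awk. The toy .fam below stands in for the real file, and the PLINK call with the prefix `mydata` is an assumption that only runs if PLINK is installed.

```shell
# Toy stand-in for the .fam file named in the comment
printf '%s\n' \
  'ALP goat1 0 0 2 -9' \
  'ANG goat2 0 0 1 -9' \
  'BOE goat3 0 0 2 -9' \
  'BOEx goat4 0 0 2 -9' > toy_adapt.fam

# Keep FID and IID for the three breed codes used in the script
awk '$1 == "ALP" || $1 == "BOE" || $1 == "BOEx" { print $1, $2 }' \
  toy_adapt.fam > individualSubset.txt

cat individualSubset.txt

# Subset the genotype data with the list (skipped if plink is absent).
# Add a species flag such as --chr-set if your data need one.
if command -v plink >/dev/null 2>&1 && [ -f mydata.bed ]; then
  plink --bfile mydata --keep individualSubset.txt --make-bed --out subset
fi
```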