Reading FASTA files in python3 : Tut2

Shad Arf

Views: 29,271

Published on 3 Dec 2024

Comments • 46

@rutujasawant6646 2 years ago ⁺¹
I was taking too long to open FASTA file in Python. Your video was very useful! thank you!!
@ShadArfMohammed 2 years ago
Awesome :D I am happy you found it useful :)
@nardineharrab 2 years ago
Hello! Can you help me, please? I'm stuck solving this exercise!
@BernadeVries 3 years ago ⁺¹
I was really stuck, but I solved it thanks to this video :D
@swatijadhav6628 1 year ago ⁺³
How do you get the number 103 for slicing?
@ShadArfMohammed 1 year ago
I know this reply is very late, though :D
This can be done in several ways. You can simply write a function that counts the characters from the first one, ">", up to the first newline character, "\n". You can also iterate through the sequences in your FASTA file and print the number of letters in each first line (if needed), the headers on separate lines, and then their corresponding DNA sequences if required. Below is an example function to help you do it:
```
def read_fasta_file(filename):
    # Initialize variables to store the header and sequence
    header = None
    sequence = ""
    # Open the FASTA file for reading
    with open(filename, 'r') as file:
        for line in file:
            line = line.strip()  # Remove leading/trailing whitespace
            # Check if the line starts with ">"
            if line.startswith('>'):
                # If a sequence was already read, yield the header and sequence
                if header and sequence:
                    yield header, sequence
                # Reset sequence for the next entry
                sequence = ""
                # Store the header for the current entry
                header = line
            else:
                # Append the line to the sequence
                sequence += line
    # Yield the last header and sequence
    if header and sequence:
        yield header, sequence


def count_letters_in_first_line_of_fasta(filename):
    # Read the FASTA file and get the first header and sequence
    for header, sequence in read_fasta_file(filename):
        # Count the number of letters in the first sequence
        return len(sequence)


# Example usage:
filename = "DNA_sequence.fasta"  # Replace with the path to your FASTA file
letter_count = count_letters_in_first_line_of_fasta(filename)
print(f"Number of letters in the first sequence: {letter_count}")

# show DNA sequence alone:
for header, sequence in read_fasta_file(filename):
    print("header is : ", header)
    print("your sequence is: ", sequence)
```
@qamarkhan6917 4 years ago ⁺²
Excellent Sir G. Thank you
@ShadArfMohammed 4 years ago
Thanks for your comment and I'm glad you liked the materials.
Best,
@piano-fe4bv 3 years ago ⁺¹
Such a good video - Thank you very much!
A quick question: Can you use .strip("\n") rather than .replace("\n","") at time 3:58? I was working on a piece of code where I had to remove \n's from a string of DNA, but the code only worked when I used the *replace function*, not when I used the strip function. Do you mind explaining why this is the case? Thank you
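@ShadArfMohammed 3 years ago
Hi,
Thanks for your comment,
yeah sure,
strip() only removes whitespace (including the newline character "\n") from the beginning and end of a string, so newlines in the middle of the sequence stay in place.
replace() removes every occurrence, wherever it is in the string.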
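A minimal sketch of the difference, using a made-up three-line DNA string rather than the file from the video. `strip("\n")` only trims newlines at the two ends of the string, while `replace("\n", "")` removes every one of them:

```python
# A small multi-line DNA string, as it might look after file.read()
# (the sequence itself is invented for illustration)
dna = "ATGC\nGGTA\nCCAT\n"

stripped = dna.strip("\n")        # trims newlines at the ends only
replaced = dna.replace("\n", "")  # removes every newline

print(repr(stripped))
print(repr(replaced))
```

This is why `strip` appears to "not work" on a sequence read from a multi-line FASTA file: the inner newlines are untouched.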
@keityfarfan1472 4 years ago ⁺²
Hi Shad! For example, what if you want to find the coding sequence exon by exon, like in this gene:
CDS join(2514..2609,6836..7316,8043..8420,13095..13186,
13371..13564,13637..13698,14051..14109)
What is the best way to find all the exons in a Jupyter notebook?
Thanks!!!
@ShadArfMohammed 4 years ago
Hi Keity,
there are several methods that you can apply to find all exons based on the data you have:
1) use regular expressions --> I have a list of tutorials on that.
2) use a dictionary, then use a nested loop or list comprehension to iterate through the dictionary and the raw data.
3) alternatively, use pipelines like ("Automated Sanger Analysis Pipeline") as in the paper below:
doi: 10.7171/jbt.16-2704-005
they have developed a pipeline in Python for such problems.
If those prove difficult to apply to your problem, kindly send me the details of your query and I will try my best to help you out.
Best,
Shad.
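As one possible illustration of the regular-expression route, here is a rough sketch that pulls the coordinate pairs out of a `join(...)` string and slices a sequence with them. The function names `parse_join_location` and `extract_exons` are made up for this example, and the sketch assumes plain forward-strand locations only (no `complement(...)`, no `<` or `>` partial markers):

```python
import re


def parse_join_location(location):
    """Return (start, end) pairs, 1-based and inclusive, from a
    GenBank-style location such as 'join(2514..2609,6836..7316)'."""
    return [(int(start), int(end))
            for start, end in re.findall(r"(\d+)\.\.(\d+)", location)]


def extract_exons(sequence, location):
    """Slice each exon out of `sequence` (a plain string)."""
    # GenBank coordinates are 1-based and inclusive; Python slices are
    # 0-based and end-exclusive, so only the start is shifted by one.
    return [sequence[start - 1:end]
            for start, end in parse_join_location(location)]


location = ("join(2514..2609,6836..7316,8043..8420,13095..13186,"
            "13371..13564,13637..13698,14051..14109)")
coords = parse_join_location(location)
print(coords)

# Toy demonstration on a short invented sequence
toy = "AACCGGTTAACC"
toy_exons = extract_exons(toy, "join(1..4,9..12)")
print(toy_exons)
```

Joining the returned pieces with `"".join(...)` would give the full coding sequence for a forward-strand gene.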
@nardineharrab 2 years ago ⁺¹
Hello! Please, I need your help, I can't solve this problem.
I have to find the recA sequence of E. coli in FASTA format. How can I do it, please?
@ShadArfMohammed 1 year ago
Hi Nardine,
navigate to NCBI and type:
"recA DNA recombination/repair protein RecA [ Escherichia coli str. K-12 substr. MG1655 ]"
Below is other information that you might need:
Accession Code: NC_000913.3
Range : (2822708..2823769, complement)
Gene ID : 947170
You should be able to get it with that in hand.
@rozettaify 3 years ago ⁺²
How do I convert FASTQ to FASTA?
@ShadArfMohammed 3 years ago
# Below is a script adapted from one written by Aditya Ambati
# to convert FASTQ to FASTA format.
# Usage: python script.py <infile> <outfile> [1 to gzip the output]
import gzip
import sys


def main(infile, fasta, gz):
    # If the input is gzipped, open it with the gzip library, else normally
    if infile.endswith('.gz'):
        print('detected a .gz file')
        fastq = gzip.open(infile, 'rt')
    else:
        fastq = open(infile, 'r')
    # If the gz argument is '1', write out a .gz file, else a plain file
    if gz == '1':
        outfile = gzip.open(fasta + '.gz', 'wt')
    else:
        outfile = open(fasta, 'w')
    line_id = 1        # position within the 4-line FASTQ record
    fastas = 0         # number of records written
    fasta_length = 0   # total length of all sequences written
    with fastq, outfile:
        for line_n, line in enumerate(fastq, start=1):
            if line_n % 10000 == 0:
                print('processed lines', line_n)
            if line_id == 1:
                # Header line: swap the leading '@' for '>'
                if not line.startswith('@'):
                    print('are you sure this is a fastq ??')
                else:
                    outfile.write('>' + line[1:])
                    line_id += 1
            elif line_id == 2:
                # Sequence line: copy it to the output
                outfile.write(line)
                fasta_length += len(line.strip())
                fastas += 1
                line_id += 1
            elif line_id == 3:
                # '+' separator line: skip it
                line_id += 1
            else:
                # Quality line: skip it and start the next record
                line_id = 1
    average = fasta_length / fastas if fastas else 0.0
    print('FASTA records written', fastas,
          'average length of fasta sequences', average)


if __name__ == '__main__':
    gz = sys.argv[3] if len(sys.argv) > 3 else '0'
    main(sys.argv[1], sys.argv[2], gz)
@ananyabhardwaj15 several months ago ⁺¹
hey may I know how to convert a txt file to a fasta file?
@ShadArfMohammed several months ago
Hi, if you are on a Windows machine, use the following steps:
1. Open the .txt file:
- Locate and open your .txt file containing the DNA sequence in a text editor (e.g., Notepad).
2. Edit the sequence file as follows:
- Add a header line at the top that begins with a ">" symbol, followed by an identifier (e.g., >Your_DNA_Sequence1).

The file should then look like this when opened:

>Your_DNA_Sequence1
ATGCTAGCTAGCTAGCTAGC
3. Save it as a .fasta file:
- Go to File > Save As.
- Choose the location, type the file name, and add .fasta as the extension (e.g., my_sequence.fasta).
- Set the file type to All Files (*.*).
- Click Save.
Below is an example:
If your .txt file originally looks like:
ATGCTAGCTAGCTAGCTAGC
After adding the header, it should be:
>Your_DNA_Sequence1
ATGCTAGCTAGCTAGCTAGC
If you do this as explained, your DNA sequence is now formatted as a .fasta file, which you can use in most bioinformatics tools and programming languages.
cheers
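The same steps can also be done from Python instead of a text editor. This is only a sketch: the helper name `txt_to_fasta` and the 60-character line width are assumptions made for this example, and it expects the .txt file to hold nothing but the sequence:

```python
import tempfile
from pathlib import Path


def txt_to_fasta(txt_path, fasta_path, identifier, width=60):
    """Write the sequence in `txt_path` to `fasta_path` under a '>' header."""
    raw = Path(txt_path).read_text()
    # Drop all whitespace (spaces, tabs, newlines) so only letters remain
    sequence = "".join(raw.split())
    # Wrap the sequence into fixed-width lines
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    Path(fasta_path).write_text(f">{identifier}\n" + "\n".join(lines) + "\n")


# Example usage in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    txt = Path(tmp) / "my_sequence.txt"
    fasta = Path(tmp) / "my_sequence.fasta"
    txt.write_text("ATGCTAGCTAGCTAGCTAGC\n")
    txt_to_fasta(txt, fasta, "Your_DNA_Sequence1")
    result = fasta.read_text()

print(result)
```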
@ananyabhardwaj15 several months ago ⁺¹
@ShadArfMohammed thank you so much 💓
Kindly make more videos it's really helpful
@ShadArfMohammed several months ago
@ananyabhardwaj15 You are very welcome, I am happy it was helpful for you.
Thanks for your interest in the tutorials I make. Sure thing, I am planning to come back to making more videos soon :)
@amanmalik2269 4 years ago ⁺³
How did you know that it was 103 characters?
@ShadArfMohammed 4 years ago ⁺¹
1) Use the readline() method to read the first line of the FASTA file on its own.
2) Use len() to count the number of characters in that first line.
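A small sketch of those two steps. It uses an in-memory stand-in for the opened file and an invented header, so the number it prints will differ from the 103 in the video:

```python
import io

# Stand-in for open("your_file.fasta"); header and sequence are made up
fasta_text = ">seq1 hypothetical example header\nATGCGTACGT\nTTGACA\n"

with io.StringIO(fasta_text) as handle:
    first_line = handle.readline()  # header line, including its trailing "\n"
    rest = handle.read()            # everything after the header

header_length = len(first_line)     # number of characters to slice off
print(header_length)

# Slicing the whole text at header_length leaves only the sequence lines
sequence = fasta_text[header_length:].replace("\n", "")
print(sequence)
```

Note that `len(first_line)` counts the newline at the end of the header as well, which is what makes the slice start exactly at the first base.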
@rusbiology3460 4 years ago
@ShadArfMohammed Hi! How do I remove the first line (to find the total number of nucleotides) and then print that line again, but with the answer? This is the rosalind.info problem Computing GC Content.
@ShadArfMohammed 4 years ago
There are multiple ways of doing so:
1) use the readline() method.
2) use the Biopython package.
3) convert the DNA to uppercase and mark the non-DNA text (the header) as lowercase, then count the lowercase characters. After that, delete the filtered lowercase header.
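For a task like this, one way to keep the header separate from the sequence without Biopython is sketched below. The names `parse_fasta` and `gc_content` and the two records are invented for illustration, and the output format is not necessarily the one Rosalind expects:

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_content(sequence):
    """Return the GC percentage of a DNA string."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return 100.0 * (sequence.count("G") + sequence.count("C")) / len(sequence)


# Made-up records standing in for the lines of a FASTA file
example = [">seq_A", "ATGCGC", "GGTA", ">seq_B", "ATATAT"]

for header, sequence in parse_fasta(example):
    # Print the header again, followed by the answer for that record
    print(header)
    print(f"{gc_content(sequence):.6f}")
```

With a real file, `parse_fasta(open("your_file.fasta"))` would work the same way, since a file object is also an iterable of lines.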
@rusbiology3460 4 years ago ⁺¹
thanks for such a quick response! unfortunately i am not allowed to use the biopython package
@edoardotaccaliti7301 3 years ago ⁺¹
How can I download a file from NCBI as you did? I can't figure it out, because I think you did not show it in the previous video.
@ShadArfMohammed 3 years ago ⁺¹
The following steps will let you download files like I did:
1- In Google, search for "NCBI nucleotide", then choose the Nucleotide database link.
2- Type your query.
3- Select the desired results by ticking the little square on the left side of each result item.
4- In the upper right corner there is an option called "Send to":
click on it, then choose File.
5- Another box will open at this step; click on Format and choose FASTA.
6- Click on Create File and a file will be downloaded to your machine.
@dergermanconnect 3 years ago ⁺¹
How can you use this code to extract exons or regions from many sequences at once?
@ShadArfMohammed 3 years ago
Option 1) write a for loop that goes through the data and singles out the exons in each of the sequences in the file.
Option 2) write a recursive function to extract the exons in each sequence in your file.
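A rough sketch of the for-loop option. It assumes you already have exon coordinates for each record; the `exon_coords` dictionary, the record names and the sequences below are all hypothetical:

```python
def read_fasta_text(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Hypothetical exon coordinates (1-based, inclusive) for each record
exon_coords = {
    "gene1": [(1, 4), (9, 12)],
    "gene2": [(3, 6)],
}

# Made-up multi-record FASTA text standing in for the file contents
fasta_text = ">gene1\nAACCGGTT\nAACC\n>gene2\nTTGGCCAA\n"

exons = {}
for header, sequence in read_fasta_text(fasta_text).items():
    # Shift only the start by one: Python slices are 0-based, end-exclusive
    exons[header] = [sequence[start - 1:end]
                     for start, end in exon_coords.get(header, [])]

print(exons)
```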
@kamransaleem1669 4 years ago ⁺¹
Great work, thanks and love
@ShadArfMohammed 4 years ago
Thanks Dear Kamran Saleem, tune in for more :)
@davidmoreno2827 1 year ago ⁺¹
THANKS
@keityfarfan1472 5 years ago ⁺¹
Excellent! Thank you!!
@ShadArfMohammed 5 years ago
Thanks for the comment, I am happy that you found the tutorial useful. Tune in for more.
@keityfarfan1472 4 years ago ⁺¹
I tried with this:
exon1 = c[2513:2609]
exon1
exon2 = c[6835:7316]
exon2
Is this a good way to obtain the CDS for each exon?
Thanks!
@ShadArfMohammed 4 years ago
Sure thing, and well done. It is a good way to practice retrieving the CDS.
@mrinalsubash8358 2 years ago
Hi! Could you post an algorithm where I could use stdin to read a FASTA file?
@ShadArfMohammed 1 year ago
Hi there,
I normally use Biopython for this. If you like, you can use the following method:
```
def read_fasta_file(filename):
    # Initialize variables to store the header and sequence
    header = None
    sequence = ""
    # Open the FASTA file for reading
    with open(filename, 'r') as file:
        for line in file:
            line = line.strip()  # Remove leading/trailing whitespace
            # Check if the line starts with ">"
            if line.startswith('>'):
                # If a sequence was already read, yield the header and sequence
                if header and sequence:
                    yield header, sequence
                # Reset sequence for the next entry
                sequence = ""
                # Store the header for the current entry
                header = line
            else:
                # Append the line to the sequence
                sequence += line
    # Yield the last header and sequence
    if header and sequence:
        yield header, sequence


def count_letters_in_first_line_of_fasta(filename):
    # Read the FASTA file and get the first header and sequence
    for header, sequence in read_fasta_file(filename):
        # Count the number of letters in the first sequence
        return len(sequence)


# Example usage:
filename = "DNA_sequence.fasta"  # Replace with the path to your FASTA file
letter_count = count_letters_in_first_line_of_fasta(filename)
print(f"Number of letters in the first sequence: {letter_count}")

# show DNA sequence alone:
for header, sequence in read_fasta_file(filename):
    print("header is : ", header)
    print("your sequence is: ", sequence)
```
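Since the question was about standard input specifically, here is a hedged sketch of the same idea that reads from any file-like object, with `sys.stdin` as the default. The names `read_fasta` and `main` and the demo record are assumptions made for this example:

```python
import io
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs from a file-like object such as sys.stdin."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def main(handle=None):
    # Fall back to standard input when no handle is passed
    handle = sys.stdin if handle is None else handle
    for header, sequence in read_fasta(handle):
        print("header is : ", header)
        print("sequence length: ", len(sequence))


# Demonstration with an in-memory stand-in for stdin (made-up record)
demo = io.StringIO(">demo_record\nATGC\nGGTA\n")
records = list(read_fasta(demo))
print(records)
```

To read real standard input, call `main()` at the bottom of the script and pipe or redirect a file into it, for example `python your_script.py < DNA_sequence.fasta` (the script name is a placeholder).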
@corbindavies9102 2 years ago ⁺¹
I'm stuck on a section when I run the code and it comes back with no file or directory but I copied everything exactly and went over it a bunch of times. Pls help 😂
@ShadArfMohammed 2 years ago
Hi,
Sure thing, kindly paste your code here in the comment, or in the discussion panel of my channel, I will help you out.
best
@corbindavies9102 2 years ago
@ShadArfMohammed
Thank you
x = open(r"M:\documents\SeqD\nandseq.fasta","r")
a = x.read()
x.close()
a
@corbindavies9102 2 years ago
It's identical to my file and I've been on it all day. I'm on a university computer, so I'm not sure if it's that, but yup.
@michellecastro5311 3 years ago
Thank you SO MUCH!!!!!!!!!!!!
@isaacerickson2076 4 years ago ⁺¹
So quiet. Turn up your mic next time, please. I can barely hear you.
@ShadArfMohammed 4 years ago
I'm sorry you've had a bad experience with the tutorial. I actually stopped making more content because of such technical reasons, as I live in a country where high-tech, high-quality devices are extremely difficult to obtain.
@gbo1wms 3 years ago ⁺³
@ShadArfMohammed it's not even bad dude, don't worry
