У меня есть следующий входной файл:
NbV1Ch01 transdecoder gene 802292 803490 . + . ID=STRG.1;Name=ORF%20type%3A5prime_partial%20len%3A222%20%28%2B%29%2Cscore%3D55.19%2CXP_009619919.1%7C95.4%7C9.7e-113%2Cperoxidase%7CPF00141.23%7C3.1e-67%2CBaculo_PEP_C%7CPF04513.12%7C0.049%2CBaculo_PEP_C%7CPF04513.12%7C5.7e%2B02
NbV1Ch01 transdecoder mRNA 802292 803490 . + . ID=STRG.1.1.p1;Parent=STRG.1;Name=ORF%20type%3A5prime_partial%20len%3A222%20%28%2B%29%2Cscore%3D55.19%2CXP_009619919.1%7C95.4%7C9.7e-113%2Cperoxidase%7CPF00141.23%7C3.1e-67%2CBaculo_PEP_C%7CPF04513.12%7C0.049%2CBaculo_PEP_C%7CPF04513.12%7C5.7e%2B02
NbV1Ch01 transdecoder exon 802292 802491 . + . ID=STRG.1.1.p1.exon1;Parent=STRG.1.1.p1
NbV1Ch01 transdecoder CDS 802294 802491 . + 0 ID=cds.STRG.1.1.p1;Parent=STRG.1.1.p1
NbV1Ch01 transdecoder exon 802781 802946 . + . ID=STRG.1.1.p1.exon2;Parent=STRG.1.1.p1
NbV1Ch01 transdecoder CDS 802781 802946 . + 0 ID=cds.STRG.1.1.p1;Parent=STRG.1.1.p1
NbV1Ch01 transdecoder exon 803048 803490 . + . ID=STRG.1.1.p1.exon3;Parent=STRG.1.1.p1
NbV1Ch01 transdecoder CDS 803048 803349 . + 2 ID=cds.STRG.1.1.p1;Parent=STRG.1.1.p1
NbV1Ch01 transdecoder three_prime_UTR 803350 803490 . + . ID=STRG.1.1.p1.utr3p1;Parent=STRG.1.1.p1
Приведенный выше входной файл использовался с этим сценарием:
from BCBio import GFF
in_file = "data/stringtie.gff3"
out_file = "data/stringtieFix.gff3"
in_handle = open(in_file)
with open(out_file, "w") as out_handle:
for rec in GFF.parse(in_handle):
print("Chromosome name: ", rec.id)
for feature in rec.features:
geneNum = feature.qualifiers.get("ID")[0].split('.')[1]
print('STRG' + geneNum)
feature.qualifiers["ID"] = 'STRG' + geneNum
print("Parent",feature.qualifiers.get("Parent"))
del feature.qualifiers["Name"]
del feature.sub_features[0].qualifiers["Name"]
for (x, subfeature) in enumerate(feature.sub_features):
sub_parts = subfeature.qualifiers["ID"][0].split('.')
print(sub_parts[0] + geneNum + "." + 't' + str(x+1))
subfeature.qualifiers["ID"] = sub_parts[0] + geneNum + "." + 't' + str(x+1)
for subfeature2 in subfeature.sub_features:
checkID = subfeature2.qualifiers["ID"][0]
if checkID.startswith('cds.STRG'):
checkID = checkID.replace('cds.STRG', 'STRG') + '.cds'
sub_parts = checkID.split('.')
print(sub_parts[0] + geneNum + "." + 't' + str(x+1) + "." + sub_parts[4])
subfeature2.qualifiers["ID"] = sub_parts[0] + geneNum + "." + 't' + str(x+1) + "." + sub_parts[4]
GFF.write([rec], out_handle)
in_handle.close()
В результате я получил:
##gff-version 3
##sequence-region NbV1Ch01 1 803490
NbV1Ch01 transdecoder gene 802292 803490 . + . ID=STRG1
NbV1Ch01 transdecoder mRNA 802292 803490 . + . ID=STRG1.t1;Parent=STRG.1,STRG1
NbV1Ch01 transdecoder exon 802292 802491 . + . ID=STRG1.t1.exon1;Parent=STRG.1.1.p1,STRG1.t1
NbV1Ch01 transdecoder CDS 802294 802491 . + 0 ID=STRG1.t1.cds;Parent=STRG.1.1.p1,STRG1.t1
NbV1Ch01 transdecoder exon 802781 802946 . + . ID=STRG1.t1.exon2;Parent=STRG.1.1.p1,STRG1.t1
NbV1Ch01 transdecoder CDS 802781 802946 . + 0 ID=STRG1.t1.cds;Parent=STRG.1.1.p1,STRG1.t1
NbV1Ch01 transdecoder exon 803048 803490 . + . ID=STRG1.t1.exon3;Parent=STRG.1.1.p1,STRG1.t1
NbV1Ch01 transdecoder CDS 803048 803349 . + 2 ID=STRG1.t1.cds;Parent=STRG.1.1.p1,STRG1.t1
NbV1Ch01 transdecoder three_prime_UTR 803350 803490 . + . ID=STRG1.t1.utr3p1;Parent=STRG.1.1.p1,STRG1.t1
Как можно удалить STRG.1
и STRG.1.1.p1
из Parent
feautere?
Заранее спасибо