BCBio GFF, модифицирующий родительскую функцию - PullRequest
0 голосов
/ 29 января 2020

У меня есть следующий входной файл:

NbV1Ch01    transdecoder    gene    802292  803490  .   +   .   ID=STRG.1;Name=ORF%20type%3A5prime_partial%20len%3A222%20%28%2B%29%2Cscore%3D55.19%2CXP_009619919.1%7C95.4%7C9.7e-113%2Cperoxidase%7CPF00141.23%7C3.1e-67%2CBaculo_PEP_C%7CPF04513.12%7C0.049%2CBaculo_PEP_C%7CPF04513.12%7C5.7e%2B02
NbV1Ch01    transdecoder    mRNA    802292  803490  .   +   .   ID=STRG.1.1.p1;Parent=STRG.1;Name=ORF%20type%3A5prime_partial%20len%3A222%20%28%2B%29%2Cscore%3D55.19%2CXP_009619919.1%7C95.4%7C9.7e-113%2Cperoxidase%7CPF00141.23%7C3.1e-67%2CBaculo_PEP_C%7CPF04513.12%7C0.049%2CBaculo_PEP_C%7CPF04513.12%7C5.7e%2B02
NbV1Ch01    transdecoder    exon    802292  802491  .   +   .   ID=STRG.1.1.p1.exon1;Parent=STRG.1.1.p1
NbV1Ch01    transdecoder    CDS 802294  802491  .   +   0   ID=cds.STRG.1.1.p1;Parent=STRG.1.1.p1
NbV1Ch01    transdecoder    exon    802781  802946  .   +   .   ID=STRG.1.1.p1.exon2;Parent=STRG.1.1.p1
NbV1Ch01    transdecoder    CDS 802781  802946  .   +   0   ID=cds.STRG.1.1.p1;Parent=STRG.1.1.p1
NbV1Ch01    transdecoder    exon    803048  803490  .   +   .   ID=STRG.1.1.p1.exon3;Parent=STRG.1.1.p1
NbV1Ch01    transdecoder    CDS 803048  803349  .   +   2   ID=cds.STRG.1.1.p1;Parent=STRG.1.1.p1
NbV1Ch01    transdecoder    three_prime_UTR 803350  803490  .   +   .   ID=STRG.1.1.p1.utr3p1;Parent=STRG.1.1.p1

Приведенный выше входной файл использовался с этим сценарием:

from BCBio import GFF

in_file = "data/stringtie.gff3"
out_file = "data/stringtieFix.gff3"
in_handle = open(in_file)

with open(out_file, "w") as out_handle:
    for rec in GFF.parse(in_handle):
        print("Chromosome name: ", rec.id)
        for feature in rec.features:
            geneNum = feature.qualifiers.get("ID")[0].split('.')[1]
            print('STRG' + geneNum)
            feature.qualifiers["ID"] = 'STRG' + geneNum
            print("Parent",feature.qualifiers.get("Parent"))
            del feature.qualifiers["Name"]
            del feature.sub_features[0].qualifiers["Name"]

            for (x, subfeature) in enumerate(feature.sub_features):
                sub_parts = subfeature.qualifiers["ID"][0].split('.')
                print(sub_parts[0] + geneNum + "." + 't' + str(x+1))
                subfeature.qualifiers["ID"] = sub_parts[0] + geneNum + "." + 't' + str(x+1)

                for subfeature2 in subfeature.sub_features:
                    checkID = subfeature2.qualifiers["ID"][0]
                    if checkID.startswith('cds.STRG'):
                        checkID = checkID.replace('cds.STRG', 'STRG') + '.cds'
                    sub_parts = checkID.split('.')
                    print(sub_parts[0] + geneNum + "." + 't' + str(x+1) + "." + sub_parts[4])
                    subfeature2.qualifiers["ID"] = sub_parts[0] + geneNum + "." + 't' + str(x+1) + "." + sub_parts[4]

        GFF.write([rec], out_handle)

in_handle.close()

В результате я получил:

##gff-version 3
##sequence-region NbV1Ch01 1 803490
NbV1Ch01    transdecoder    gene    802292  803490  .   +   .   ID=STRG1
NbV1Ch01    transdecoder    mRNA    802292  803490  .   +   .   ID=STRG1.t1;Parent=STRG.1,STRG1
NbV1Ch01    transdecoder    exon    802292  802491  .   +   .   ID=STRG1.t1.exon1;Parent=STRG.1.1.p1,STRG1.t1
NbV1Ch01    transdecoder    CDS 802294  802491  .   +   0   ID=STRG1.t1.cds;Parent=STRG.1.1.p1,STRG1.t1
NbV1Ch01    transdecoder    exon    802781  802946  .   +   .   ID=STRG1.t1.exon2;Parent=STRG.1.1.p1,STRG1.t1
NbV1Ch01    transdecoder    CDS 802781  802946  .   +   0   ID=STRG1.t1.cds;Parent=STRG.1.1.p1,STRG1.t1
NbV1Ch01    transdecoder    exon    803048  803490  .   +   .   ID=STRG1.t1.exon3;Parent=STRG.1.1.p1,STRG1.t1
NbV1Ch01    transdecoder    CDS 803048  803349  .   +   2   ID=STRG1.t1.cds;Parent=STRG.1.1.p1,STRG1.t1
NbV1Ch01    transdecoder    three_prime_UTR 803350  803490  .   +   .   ID=STRG1.t1.utr3p1;Parent=STRG.1.1.p1,STRG1.t1

Как можно удалить STRG.1 и STRG.1.1.p1 из Parent feautere?

Заранее спасибо

...