I saved your example FASTA and TSV data as example.fasta and example.tsv. Here are the contents of the input files (the script below expects the columns of example.tsv to be tab-separated):
$ cat example.fasta
>Prevalence_Sequence_ID:1|ARO:3003072|RES:mphL|Protein Homolog Model
MTTLKVKQLANKKGLNILEDS
>gb|ARO:3004145|RES:AxyZ|Achromobacter_insuavis_AXX-A_
MARKTKEESQRTRDRILDAAEHVFLSKG
>Prevalence_Sequence_ID:31298|ARO:3000777|RES:adeF|Protein Homolog Model
MDFSRFFIDRPIFAAVLSILIFI
$ cat example.tsv
ARO:3003072 mphL mphL is a chromosomally-encoded macrolide phosphotransferases that inactivate 14- and 15-membered macrolides such as erythromycin, clarithromycin, azithromycin.
ARO:3004145 AxyZ AxyZ is a transcriptional regulator of the AxyXY-OprZ efflux pump system.
ARO:3000777 adeF AdeF is the membrane fusion protein of the multidrug efflux complex AdeFGH.
# Requires Biopython, which must be installed in your environment (e.g. `pip install biopython`)
from Bio.SeqIO.FastaIO import SimpleFastaParser as sfp

# read the TSV into a dict: ARO ID -> remaining columns joined with spaces
with open("example.tsv") as tsvdata:
    tsv_data = {line.strip().split("\t")[0]: " ".join(line.strip().split("\t")[1:])
                for line in tsvdata}

# read the input FASTA record by record and write each record to a separate file as it is read
with open("example_out.fasta", "w") as outfasta:
    with open("example.fasta") as infasta:
        for header, seq in sfp(infasta):
            aro = header.strip().split("|")[1]  # the ARO ID is the second "|"-separated field of the header
            # look up the ARO ID in the dict and replace it if found; otherwise leave the header unchanged
            header = header.replace(aro, tsv_data.get(aro, aro))
            outfasta.write(">{0}\n{1}\n".format(header, seq))
Here are the contents of the output file:
$ cat example_out.fasta
>Prevalence_Sequence_ID:1|mphL mphL is a chromosomally-encoded macrolide phosphotransferases that inactivate 14- and 15-membered macrolides such as erythromycin, clarithromycin, azithromycin.|RES:mphL|Protein Homolog Model
MTTLKVKQLANKKGLNILEDS
>gb|AxyZ AxyZ is a transcriptional regulator of the AxyXY-OprZ efflux pump system.|RES:AxyZ|Achromobacter_insuavis_AXX-A_
MARKTKEESQRTRDRILDAAEHVFLSKG
>Prevalence_Sequence_ID:31298|adeF AdeF is the membrane fusion protein of the multidrug efflux complex AdeFGH.|RES:adeF|Protein Homolog Model
MDFSRFFIDRPIFAAVLSILIFI
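
If you want to try the lookup-and-replace step on a single header without Biopython, something like the following may help. This is only a minimal sketch and not part of the script above: the function name annotate_header and the explicit "ARO:" prefix check are my own assumptions. Unlike the script, it replaces the matching field itself rather than calling str.replace on the whole header, so a header with no ARO field passes through unchanged instead of raising an IndexError.

```python
def annotate_header(header, annotations):
    """Return header with its ARO field replaced by the matching annotation.

    Hypothetical helper for illustration only; `annotations` maps ARO IDs
    (e.g. "ARO:3004145") to the text that should replace them.
    """
    fields = header.strip().split("|")
    # Replace only fields that look like ARO IDs and have an annotation;
    # every other field is kept exactly as it was.
    fields = [
        annotations.get(field, field) if field.startswith("ARO:") else field
        for field in fields
    ]
    return "|".join(fields)


# One annotation taken from the example.tsv shown above
annotations = {
    "ARO:3004145": "AxyZ AxyZ is a transcriptional regulator of the AxyXY-OprZ efflux pump system.",
}

print(annotate_header("gb|ARO:3004145|RES:AxyZ|Achromobacter_insuavis_AXX-A_", annotations))
print(annotate_header("header_without_aro_field", annotations))
```

The first call should print the same annotated header as the second record in example_out.fasta (without the leading ">"), and the second call should print its input unchanged.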