pandas read_csv reads one column as NaN for nucleotide sequences - PullRequest
0 votes
/ January 4, 2019

I am trying to read a file of nucleotide sequences in which each record consists of an identifier and a sequence. By default, each sequence is wrapped onto a new line after every 70 nucleotides.

The input file (seq.txt) looks like this.

seqgb_AY741213_Organism_Influenza_A_virus__A_blackbird_Hunan_1_2004_H5N1___Strain_Name_A_blackbird_Hunan_1_2004_Segment_4_Subtype_H5N1_Host_Blackbird,
ATGGAGAAAATAGTGCTTCTTCTTGCAATAGTCAGTCTTGTTAAAAGTGATCAGATTTGCATTGGTTACC
ATGCAAACAACTCGACAGAGCAGGTTGACACAATAATGGAAAAGAACGTTACTGTTACACATGCTCAAGA
CGTACTGGACAAGACACACAACGGGAACACTCAGTTTGAGGCCGTTGGAAGGGAATTTAATAACTTAGAA
AGGAGAATAGAAAATTTAAACAAGAAGATGGAGGACGGATTCCTAGATGTCTGGACTTATAATGCTGAAC
TTCTGGTTCTCATGGAAAATGAGAGAACTCTAGACTTTCATGACTCAAATGTCAAGAACCTTTACGAAAA
GGTCCGACTACAACTTAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTTCTATCACAAATGT
GATAATGAATGTATGGAAAGTGTAAGAAACGGAACGTATGACTACCCGCAGTATTCAGAAGAAGCAAGAC
TAAACAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAGGAACTTACCAAATACTGTCAATTTATTC
AACAGTGGCGAGTTCCCTAGCACTGGCAATCATGGTAGCTGGTCTATCTTTATGGATGTGCTCCAATGGA
TCGTTACAATGCAGAATTTGCATTTGA


seqgb_EU676325_Organism_Influenza_A_virus__A_brown-head_gull_Thailand_vsmu-4_2008_H5N1___Strain_Name_A_brown-head_gull_Thailand_vsmu-4_2008_Segment_4_Subtype_H5N1_Host_Brown-Headed_Gull,
TTTAGCAAAAGGCAGGGGTATATCTGTCAAAATGGAGAAAATAGTGCTTCTTTTTGCAATAGTCAGTCTT
GTTAAAAGTGATCAGATTTGCATTGGTTACCATGCAAACAACTCGACAGAGCAGGTTGACACAATAATGG
AAAAGAACGTTACTGTTACACATGCCCAAGACATACTGGAAAAGACACACAACGGGAAGCTCTGCGATCT
AGATGGAGTGAAGCCTCTAATTTTGAGAGATTGTAGTGTAGCTGGATGGCTCCTCGGAAACCCAATGTGT
GACGAATCTCCAATGGGGGCGATAAACTCTAGTATGCCATTCCACAATATACACCCTCTCACCATCGGGG
AATGCCCCAAATATGTGAAATCAAACAGATTAGTCCTTGCGACTGGGCTCAGAAATAGCCCTCAAAGAGA
GAGAAGAAGAAAAAAGAGAGGATTATTTGGAGCTATAGCAGGTTTTATAGAGGGAGGATGGCAGGGAATG
GTAGATGGTTGGTATGGGTACCACCATGAACTTCTGGTTCTCATGGAAAATGAGAGAACTCTAGACTTTC
ATGACTCAAATGTCAAGAACCTTTACGACAAGGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGG
TAACGGTTGTTTCGAGTTCTATCATAAATGTGATAATGAATGTATGGAAAGTGTAAGAAACGGAACGTAT
GACTACCCACAGTATTCAGAAGAAGCAAGACTAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAA
TAGGAATTTACCAAATACTGTCAATTTATTCTACAGTGGCGAGTTCCCTAGCACTGGCAATCATGGTAGC
TGGTCTATCCTTATGGATGTGCTCCAATGGGTCGTTACAATGCAGAATTTGCATTTAAATTTGTGAGTTC
AGATTGAG


seqgb_EF178528_Organism_Influenza_A_virus__A_brown-headed_gull_Thailand_VSMU-28-SPK_2005_H5N1___Strain_Name_A_brown-headed_gull_Thailand_VSMU-28-SPK_2005_Segment_4_Subtype_H5N1_Host_Brown-Headed_Gull,
AGCAAAAGCAGGGGTATAATCTGTCAAAATGGAGAAAATAGTGCTTCTTTTTGCAATAGTCAGTCTTGTT
AAAAGTGATCAGATTTGCATTGGTTACCATGCAAACAACTCGACAGAGCAGGTTGACACAATAATGGAAA
AGAACGTTACGAATGATGCAATCAACTTCGAGAGTAATGGAAATTTCATTGCTCCAGAGTATGCATACAA
AATTGTCAAGAAAGGGGACTCAACAATTATGAAAAGTGAATTGGAATATGGTAACTGCAACACCAAGTGT
CAAACTCCAATGGGGGCGATAAACTCAAGGTCAACTCGATCATTGACAAAATGAACACTCAGTTTGAGGC
CGTTGGAAGGGAATTTAACAACTTAGAAAGGAGAATAGAGAATTTAAACAAGAAGATGGAAGACGGGTTC
CTAGATGTCTGGACTTATAATGCTGAACTTCTGGTTCTCCTGGAAAATGAGAGAACTCTAGACTTTCATG
ACTCAAATGTCAAGAACCTTTACGACAAGGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAA
CGGTTGTTTCGAGTTCTATCATAAATGTGATAATGAATGTATGGAAAGTGTAAGAAACGGAACGTATGAC
TACCCACAGTATTCAGAAGAAGCAAGACTAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAG
GAATTTACCAAATACTGTCAATTTATTCTACAGTGGCGAGTTCCCTAGCACTGGCAATCATGGTAGCTGG
TCTATCCTTATGGATGTGCTCCAATGGGTCGTTACAATGCAGAATTTGCATTTAAATTTGTGAGTTCAGA
T


seqgb_CY091790_Organism_Influenza_A_virus__A_chicken_Ampenan_BBVD-282_2007_H5N1___Strain_Name_A_chicken_Ampenan_BBVD-282_2007_Segment_4_Subtype_H5N1_Host_Chicken,
TCAATCTGTCAAAATGGAGAAAATAGTGCTTCTTCTTGCAATAGTCAGTCTTGTTAAAAGTGATCAGATT
TGCATTGGTTACCATGCAAACAATTCAACAGAGCAGGTTGACACAATAATGGAAAAGAACGTTACTGTTA
CACATGCCCAAGACATACTGGAAAAGGGAAAATGAGAGAACTCTAGACTTTCATGACTCAAATGTTAAGA
ACCTCTACGACAAGGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTT
CTATCACAAATGTGATAATGAATGTATGGAAAGTATAAGAAACGGAACGTATAACTACCCGCAGTATTCA
GAAGAAGCAAGATTAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAGGAACTTACCAAATAC
TGTCGATTTATTCAACAGTGGCGAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATCTTTATGGAT
GTGCTCCAATGGATCGTTACAATGCAGAATTTGCATTTAAATTTGTGAGTTCAGATTGTAGTTAAA


seqgb_KT216634_Organism_Influenza_A_virus__A_chicken_Anhui_MG08_2008_H9N2___Strain_Name_A_chicken_Anhui_MG08_2008_Segment_4_Subtype_H9N2_Host_Chicken,
AGCAAAAGCAGGGGAATTTCACAACCACTCAAGATGGAGACAGTATCACTAATAAATATACTACTAGTAG
TAACAGTAAGCAATGCAGATAAAATCTGCATCGGCTATCAATCAACAAATTCCACAGAAACTGTAGACAC
ACTAACAGAAAACAATGTCCCTGTGATTGTAATTGCAATGGGGTTTGCTGCCTTCTTGTTCTGGGCCATG
TCCAATGGGTCTTGCAGATGCAACATTTGTATATAATTGGCAAAAACACCCTTGTTTCTACT


seqgb_KY005855_Organism_Influenza_A_virus__A_chicken_Anhui_MZ33_2016_H5N6___Strain_Name_A_chicken_Anhui_MZ33_2016_Segment_4_Subtype_H5N6_Host_Chicken,
ATGGAGAAAATAGTGCTTCTTCTTGCAGTGGTTAGCCTTGTTAAAGGTGATCAGATTTGCATTGGTTACC
ATGCAAACAACTCGACTGAGCAGGTTGACACGATAATGGAAAAAAACGTCACTGTTACACATGCTCAAGA
CATACTAGAAAGGAATATGGCAATTGCAACACCAAATGTCAAACTCCAATAGGGGCGATAAACTCTAGTA
TGCCATTCCACAATATACACCCTCTCACTATCGGGGAGTGCCCCAAATATGTGAAATCAAACAAATTAGT
CCTTGCGACTGGGCTCAGAAATAGTCGAATCCACCCAAAAGGCAATAGATGGAGTTACCAATAAGGTCAA
CTCGATAATTGACAAAATGAACACTCAGACGGATTCCTAGATGTCTGGACTTATAATGCTGAACTTTTAG
TTCTCATGGAAAATGAGAGAACTCTAGATTTCCATGACTCAAATGTCAAGAACCTTTATGACAAAGTCCG
ACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAATGGTTGTTTCGAGTTCTATCACAAATGTGATAAT
GAATGTATGGAAAGTGTGAGGAATGGGACGTATGACTACCCCCAGTATTCAGAAGAAGCAAGATTAAAAA
GGGAAGAAATAAGCGGAGTGAAATTGGAATCAATAGGAACTTACCAAATACTGTCAATTTATTCAACAGT
GGCGGGTTCCCTAGCACTGGCAATCATTGTGGCTGGTCTATCTTTATGGATGTGCTCCAATGGGTCGTTA
CAATGCAGAATTTGCATTTAA


seqgb_KY005863_Organism_Influenza_A_virus__A_chicken_Anhui_MZ34_2016_H5N6___Strain_Name_A_chicken_Anhui_MZ34_2016_Segment_4_Subtype_H5N6_Host_Chicken,
ATGGAGAAAAGAAGAACGATGCATACCCAACAATAAAAATGAGCTACAATAACACCAATAGGGAAGATCT
TTTGATACTGTGGGGGATTCATCATTCCAATAATGCAGAAGAGCAGACAAATCTCTATAAAAACCCAACC
ACCTATGTTTCCGTTGGGACATCAACATTAAACCAGAGAGTGGTGCCAAAAATAGCTACTAGATCCCAAG
TAAACGGGCAAAGTGGAAGAATGGATTTCTTCTGGACAATTTTAAAACCGGATGATGCAATCCACTTCGA
GAGTAATGGAAATTTTATTGCTCCAGACTATCGGGGAGTGCCCCAAATATGTGAAATCAAACAAATTAGT
CCTTGCGACTGGGCTCAGAAATAGTCCTCTAAGAGAAAGAAGAAGAAAAAGAGGATTATTTGGAGCCATA
GCAGGGTTTATAGAGGGAGGATGGCAAGGAATGGTAGATGGTTGGTATGGGTACCACCATAGCAATGCAC
AAGGGAGTGGGTATGCTGCAGACAGAGAATCCACCCAAAAGGCAATAGATGGAGTTACCAATAAGGTCAA
CTCGATAATTGACAAAATGAACACTCAATTTGAGGCCGTTGGAAGGGAATTTAATAACTTAGAACGGAGA
ATAGAGAATTTAAATAAGAAAATGGAAGACGGATTCCTAGATGTCTGGACTTATAATGCTGAACTTTTAG
TTCTCATGGAAAATGAGAGAACTCTAGATTTCCATGACTCAAATGTCAAGAACCTTTATGACAAAGTCCG
ACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAATGGTTGTTTCGAGTTCTATCACAAATGTGATAAT
GAATGTATGGAAAGTGTGAGGAATGGGACGTATGACTACCCCCAGTATTCAGAAGAAGCAAGATTAAAAA
GGGAAGAAATAAGCGGAGTGAAATTGGAATCAATAGGAACTTACCAAATACTGTCAATTTATTCAACAGT
GGCGGGTTCCCTAGCACTGGCAATCATTGTGGCTGGTCTATCTTTATGGATGTGCTCCAATGGGTCGTTA
CAATGCAGAATTTGCATTTAA


seqgb_CY091815_Organism_Influenza_A_virus__A_chicken_Badung_BBVD-277_2007_H5N1___Strain_Name_A_chicken_Badung_BBVD-277_2007_Segment_4_Subtype_H5N1_Host_Chicken,
TCAATCTGTCAAAATGGAGAAAATAGTGCTTCTTCTTGCAATAGTCAGTCTTGTTAAAAGTGATCAGATT
TGCATTGGTTACCATGCAAACAATTCAACAGAGCAGGTTGACACAATAATGGAAAAGAACGTTACTGTTA
CACATGCCCAAGACATACTGGAAAAGACACACAACGGGAAGCTCTGTGATCTAGATGGAGTGAAGCCTCT
AATTTTAAGAGATTGTAGTGTAGCTGGATGGCTCCTCGGGAACCCAATGTGTGATGAATTCATCAATGTA
CCGGAATGGTCTTACATAGTGGAGAACAGGGGTGAGCTCAGCATGTCCATACCTGGGAACGCCCTCCTTT
TTTAGAAATGTGGTATGGCTTATCAAAAAGAACAGTACATACCCAACAATAAAAAGAAGCTACAATAATA
CCAACCAAGAAGATCTTTTGGTACTGTGGGGGATTCACCATCCTAATGATGCGGCAGAGCAAACGAGGCT
ATATCAAAATCCAATCACCTATATTTCCGTTGGGACATCAACACTGAACCAGAGATTGGTACCAAAAATA
GCTACCAGAACAAGGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTT
CTATCACAAATGTGATGATGAATGTATGGAAAGTGTAAGAAATGGGACGTATAACTACCCGCAGTATTCA
GAAGAAGCAAGATTAAAAAGAGGGGAAATAAGTGGGGTAAAATTGGAATCAATAGGAATTTACCAAATAC
TGTCAATTTATTCAACAGTAGCGAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATCTTTATGGAT
GTGCTCCAATGGATCGTTACAATGCAGAATTTGCATTTAAATTTGTGAGTTTAGATTGTAGTTAAA


seqgb_CY091816_Organism_Influenza_A_virus__A_chicken_Badung_BBVD-288_2007_H5N1___Strain_Name_A_chicken_Badung_BBVD-288_2007_Segment_4_Subtype_H5N1_Host_Chicken,
TCAATCCGTCAAAATGGAGAAAATAGTGCTTCTTCTTGCAATAGCCAGTCTTGTTAAAGGTGATCAGATT
TGCATTGGTTACCATGCAAACAATTCAACAGAGCAGGTTGACACAATAATGGAAAAGAACGTTACTGTTA
CACATGCCCAAGACATACTGGAAAAGGCACACAACGGGAAGCTCTGTGATCTAGATGGAGTGAAGCCTCT
AATTTTAAGAGATTGTAGTGTAGCCGGATGGCTCCTCGGGAACCCAATGTGTGACGAATTCATCAATGTA
CCGGAATGGTCTTACATAGTGGAGAACAGGGGTGAGCTCAGCATGTCCATACCTGGGAACGCCCTCCTTT
TTTAGAAATGTGGTATGGCTTATCAAAAAGAACAGTACATACCCAACAATAAAAAGAAGCTACAATAATA
CCAACCAGGAAGATCTTTTGGTACTGTGGGGGATTCACCATCCTAATGATGCGGCTGAGCAAACGAAGCT
ATATCAAAATCCAACCACCTATATTTCCGTTGGGACATCAACACTAAATCAGAGATTGGTACCAAAAATA
GCTACTAGATCCAAAGTAAACGGACAAAGTGGAAGGATGGAGTTCTTCTGGACAATTTTAAAACCCAATG
ATGCAATCAACTTCGAGAGTAATGGAAATTTCATTGCTCCAGAATATGCCTACAAAATTGTCAAGAAAGG
GGACTCAGCAATTATGAAAAGTGAATTGGAATATGGCAACTGCAACACCAAATGTCAAACTCCAATGGGG
GCGATAAACTTGTGATGATGAATGTATGGAAAGTGTAAGAAATGGGACGTATAACTACCCGCAGTATTCA
GAAGAAGCAAGATTAAAAAGAGAGGAAATAAGTGGGGTAAAATTGGAATCAATAGGAATTTACCAAATAC
TGTCAATTTATTCAACAGTGGCGAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATCTTTATGGAT
GTGCTCCAATGGATCATTACAATGCAGAATTTGCATTTAAATTTGTGAGTTTAGATTGTAGTTAAA


seqgb_CY091819_Organism_Influenza_A_virus__A_chicken_Badung_BBVD-328_2007_H5N1___Strain_Name_A_chicken_Badung_BBVD-328_2007_Segment_4_Subtype_H5N1_Host_Chicken,
TCAATCTGTCAAAATGGAGAAAATAGTGCTTCTTCTTGCAATAGCCAGTCTTGTTAAAGGTGATCAGATT
TGCATTGGTTACCATGCAAACAATTCAACAGAGCAGGTTGACACAATAATGGAAAAGAACGTTACTGTTA
CACATGCCCAAGACATACTAGAAAAGGCACACAACGGGAAGCTCTGTGATCTAGATGGAGTGAAGCCTCT
AATTTTAAGAGATTGTAGTGTAGCCGAGCAGAATAAACCATTTTGAGAAAATTCAGATCATCCCCAAAAG
TTCTTGGTCCGACCATGAAGCCTCGTCAGGGGTGAGCTCAGCATGTCCATACCTGGGAACGCCCTCCTTT
TTTAGAAATGTGGTATGGCTTATCAAAAAGAACAGTACATACCCAACAATAAAAAGAAGCTACAATAATA
CCAACCAGGAAGATCTTTTGGTACTGTGGGGGATCCACCATCCTAATGATGCGGCTGAGCAAACGAAGCT
ATATCAAAATCCAACCACCTATATTTCCGTTGGGACATCAACACTAAATCAGAGATTGGTACCAAAAATA
GCTACTAGATCCAAAGTAAACGGACAAAGTGGAAGGATGGAGTTCTTCTGGACAATTTTAAAACCCAATG
ATGCAATCAACTTCGAGAGTAATGGAAATTTCATTGCTCCAGAATATGCCTACAAAATTGTCAAGAAAGG
GGACTCAGCAATTATGAAAAGTGAATTGGAATATGGCAACTGCAACACCAAATGTCAAACTCCAATGGGG
GCGATAAACTCTAGTATGCCATTCCACAACATACACCCTCTCACCATCGGGGAATGCCCCAAATATGTGA
AATCAAACAGATTAGTCCTTGCGACTGGGCTCAGAAATAGCCCCCAAAGAGAGAGAAGAAGAAAAAAGAG
AGGACTATTTGGAGCTATAGCAGGTTTTATAGAGGGTGGATGGCAGGGAATGGTAGATGGTTGGTATGGG
TACCACCATAGCAATGAGCAAGGGAGTGGGTACGCTGCAGACAAAGAATCCACTCAAAAGGCAATAGATG
GAGTCACCAATAAGGTCAATTCGATCATTGACAAAATGAACACTCAGTTTGAGGCCGTTGGAAGGGAATT
TAATAACTTAGAAAGGAGAATAGAGACTTAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTT
CTATCACAAATGTGATGATGAATGTATGGAAAGTGTAAGAAATGGGACGTATAACTACCCGCAGTATTCA
GAAGAAGCAAGATTAAAAAGAGAGGAGATAAGTGGGGTAAAATTGGAATCAATAGGAATTTACCAAATAC
TGTCAATTTATTCAACAGTGGCGAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATTTTTATGGAT
GTGCTCCAATGGATCATTACAATGCAGAATTTGCATTTAAATTTGTGAGTTTAGATTGTAGTTAAA


seqgb_CY091820_Organism_Influenza_A_virus__A_chicken_Badung_BBVD-342_2007_H5N1___Strain_Name_A_chicken_Badung_BBVD-342_2007_Segment_4_Subtype_H5N1_Host_Chicken,
TCAATCCGTCAAAATGGAGAAAATAGTGCTTCTTCTTGCAATAGCCAGTCTTGTTAAAGGTGATCAGATT
TGCATTGGTTACCATGCAAACAATTCAACAGAGCAGGTTGACACAATAATGGAAAAGAACGTTACTGTTA
CACATGCCCAAGACATACTGGAAAAGGCACACAACGGGAAGCTCTGTGATCTAGATGGGGTGAAGCCTCT
AATTTTAAGAGATTGTAGTGTAGCCGTTATAGAGGGTGGATGGCAGGGAATGGTAGATGGTTGGTATGGG
TACCACCATAGCAATGAGCAAGGGAGTGGGTACGCTGCAGACAAAGAATCCACTCAAAAGGCAATAGATG
GAGTCACCAATAAGGTCAACTCGATTATTGACAAAATGAACACTCAGTTTGAGGCCGTTGGAAGGGAATT
TAATAACTTAGAAAGGAGAATAGAGAATTTAAACAAGAAGATGGAAGACGGATTCCTAGATGTCTGGACT
TATAATGCTGAACTTCTGGTTCTCATGGAAAATGAGAGAACTTTAGACTTTCATGACTCAAATGTTAAGA
ACCTCTACGACAAAGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTT
CTATCACAAATGTGATGATGAATGTATGGAAAGTGTAAGAAATGGGACGTATAACTACCCGCAGTATTCA
GAAGAAGCAAGATTAAAAAGAGAGGAAATAAGTGGGGTAAAATTGGAATCAATAGGAATTTACCAAATAC
TGTCAATTTATTCAACAGTGGCGAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATCTTTATGGAT
GTGCTCCAATGGATCATTACAATGCAGAATTTGCATTTAAATTTGTGAGTTTAGATTGTAGTTAAA


seqgb_GQ122391_Organism_Influenza_A_virus__A_chicken_Bali_UT2091_2005_H5N1___Strain_Name_A_chicken_Bali_UT2091_2005_Segment_4_Subtype_H5N1_Host_Chicken,
ATGGAGAAAATAGTGCTTCTTCTTGCAACAGTCAGTCTTGTTAAAAGTGATCAGATTTGCATTGGTTACC
ATGCAAACAATTCAACAGAGCAGGTTGACACAATAATGGAAAAGAACGTTACTGTTACACATGCCCAAGA
CATACTGGAAAAAACACACAACGGGAATGGCAGGGAATGGTAGATGGTTGGTATGGGTACCACCATAGCA
ATGAGCAGGGGAGTGGGTACGCTGCAGACAAAGAATCCACTCAAAAGGCAATAGATGGAGTCACCAATAA
GGTCAACTCAATCATTGACAAAATGAACACTCAGTTTGAGGCCGTTGGAAGGGAATTTAATAACTTAGAA
AGGAGAATAGAGAATTTAAACAAGAAGATGGAAGACGGATTTCTAGATGTCTGGACTTATAATGCCGAAC
TTCTGGTTCTCATGGAAAATGAGAGAACTCTAGACTTTCATGACTCAAATGTTAAGAACCTCTACGACAA
GGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTTCTATCACAAATGT
GATAATGAATGTATGGAAAGTATAAGAAACGGAACGTATAACTACCCGCAGTATTCAGAAGAAGCAAGAT
TAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAGGAACTTACCAAATACTGTCAATTTATTC
AACAGTGGCGAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATCTTTATGGATGTGCTCCAATGGA
TCGTTACAATGCAGAATTTGCATTTAA


seqgb_GQ122392_Organism_Influenza_A_virus__A_chicken_Bali_UT2092_2005_H5N1___Strain_Name_A_chicken_Bali_UT2092_2005_Segment_4_Subtype_H5N1_Host_Chicken,
ATGGAGAAAATAGTGCTTCTTCTTGCAACAGTCAGTCTTGTTAAAAGTGATCAGATTTGCATTGGTTACC
ATGCAAACAATTCAACAGAGCAGGTTGCCCTCAAAGAGAGAGAAGAAGAAAAAAGAGAGGACTATTTGGA
GCTATAGCAGGTTTTATAGAGGGAGGATGGCAGGGAATGGTAGATGGTTGGTATGGGTATCACCATAGCA
ATGAGCAGGGGAGTGGGTACGCTGCAGACAAAGAATCCACTCAAAAGGCAATAGATGGAGTCACCAATAA
GGTCAACTCAATCATTGACAAAATGAACACTCAGTTTGAGGCCGTTGGAAGGGAATTTAATAACTTAGAA
AGGAGAATAGAATGGAAAATGAGAGAACTCTAGACTTTCATGACTCAAATGTTAAGAACCTCTACGACAA
GGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTTCTATCACAAATGT
GATAATGAATGTATGGAAAGTATAAGAAACGGAACGTATAACTACCCGCAGTATTCAGAAGAAGCAAGAT
TAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAGGAACTTACCAAATACTGTCAATTTATTC
AACAGTGGCGAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATCTTTATGGATGTGCTCCAATGGA
TCGTTACAATGCAGAATTTGCATTTAA


seqgb_DQ083551_Organism_Influenza_A_virus__A_chicken_Bangkok_Thailand_CU-3_04_H5N1___Strain_Name_A_chicken_Bangkok_Thailand_CU-3_04_Segment_4_Subtype_H5N1_Host_Chicken,
ATGGAGAAAATAGTGCTTCTTTTTGCAATAGTCAGTCTTGTTAAAAGTGATCAGATTTGCATTGGTTACC
ATGCAAACAACTCGACAGAGCAGGTTGACACAATAATGGAAAAGAACGTTACTGTTACACATGCCCAAGA
CATACTGGAAAAGACTTTCATTGCTCCAGAATATGCATACAAAATTGTCAAGAAAGGGGACTCAACAATT
ATGAAAAGTGAATTGGAATATGGTAAATGGCAGGGAATGGTAGATGGTTGGTATGGGTACCACCATAGCA
ATGAGCAGGGGAGTGGGTACGCTGCAGACAAAGAATCCACTCAAAAGGCAATAGATGGAGTCACCAATAA
GGTCAACTCGATCATTGACAAAATGAACACTCAGTTTGAGGCCGTTGGAAGGGAATTTAACAACTTAGAA
AGGAGAATAGAAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTTCTATCATAAATGT
GATAATGAATGTATGGAAAGTGTAAGAAACGGAACGTATGACTACCCGCAGTATTCAGAAGAAGCAAGAC
TAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAGGAATTTACCAAATACTGTCAATTTATTC
TACAGTGGCGAGTTCCCTAGCACTGGCAATCATGGTAGCTGGTCTATCCTTATGGATGTGCTCCAATGGG
TCGTTACAATGCAGAATTTGCATTTAAATTTG


seqgb_CY091797_Organism_Influenza_A_virus__A_chicken_Bangli_BBVD-245_2007_H5N1___Strain_Name_A_chicken_Bangli_BBVD-245_2007_Segment_4_Subtype_H5N1_Host_Chicken,
TCAATCTGTCAAAATGGAGAAAATAGTGCTTCTTCTTGCAATAGCCAGTCTTGTTAAAGGTGATCAGATT
TGCATTGGTTACCATGCAAACAATTCAACAGAGCAGGTTGACACAATAATGGAAAAGAACGTTACTGTTA
CACATGCCCAATTAGTCCTTGCGACTATTGACAAAATGAACACTCAGTTTGAGGCCGTTGGAAGGGAATT
TAATAACTTAGAAAGGAGAATAGAGAATTTAAACAAGAAGATGGAAGACGGATTCCTAGATGTCTGGACT
TATAATGCTGAACTTCTGGTTCTCATGGAAAATGAGAGAACTTTAGACTTTCATGACTCAAATGTTAAGA
ACCTCTACGACAAAGTCCGACTACAGCTTAGGGATAATGCAAAGGAGTTGGGTAACGGTTGTTTCGAGTT
CTATCACAAATGTGATGATGAATGTATGGAAAGTGTAAGAAATGGGACGTATAACTACCCGCAGTATTCA
GAAGAAGCAAGATTAAAAAGAGAGGAAATAAGTGGGGTAAAATTGGAATCAATAGGAATTTACCAAATAC
TGTCAATTTATTCAACAGTGGCGAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATCTTTATGGAT
GTGCTCCAATGGATCATTACAATGCAGAATTTGCATTTAAATTTGTGAGTTTAGATTGTAGTTAAA


seqgb_CY091801_Organism_Influenza_A_virus__A_chicken_Bangli_BBVD-562_2007_H5N1___Strain_Name_A_chicken_Bangli_BBVD-562_2007_Segment_4_Subtype_H5N1_Host_Chicken,
TCAATCTGTCATTCGAGAGTAATGGAGGGCTCAGAAATAGCCCCCAAAGAGAGAGAAGAAGAAAAAAGAG
AGGACTATTTGGAGCTATAGCAGGTTTTATAGAGGGTGGATGGCAGGGAATGGTAGATGGTTGGTATGGG
TACCACCATAGCAATGAGCAAGGGAGTGGGTACGCTGCAGACAAAGAATCCACTCAAAAGGCAATAAATG
GAGTCACCAATAAGGTCAACTCGATCATTGACAAAATGAACACTCAGTTTGAGGCCGTTGGAAGGGAATT
TAATAACTTAGAAAGGAGAATAGAGAATTTAAACAAGAAGATGGAAGACGGATTCCTAGATGTCTGGACT
TATAATGCTGAACTTCTGGTTCTCATGGAAAATGAGAGAACTTTAGACTTTCATGACTCAAATGTTAAGA
ACCTCTACGACAAGGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTT
CTATCACAAATGTGATGATGAATGTATGGAAAGTGTAAGAAATGGGACGTATAACTACCCGCAGTATTCA
GAAGAAGCAAGATTAAAAAGAGAGGAGATAAGTGGGGTAAAATTGGAATCAATAGGAATTTACCAAATAC
TGTCAATTTATTCAACAGTGGCGAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATTTTTATGGAT
GTGCTCCAATGGATCATTACAATGCAGAATTTGCATTTAAATTTGTGAGTTTAGATTGTAGTTAAA


seqgb_CY091803_Organism_Influenza_A_virus__A_chicken_Bangli_BBVD-575_2007_H5N1___Strain_Name_A_chicken_Bangli_BBVD-575_2007_Segment_4_Subtype_H5N1_Host_Chicken,
TCAATCCGTCAGAGCTATAGCAGGTTTTATAGAGGGTGGATGGCAGGGAATGGTAGATGGTTGGTATGGG
TACCACCATAGCAATGAGCAAGGGAGTGGGTACGCTGCAGACAAAGAATCCACTCAAAAGGCAATAGATG
GAGTCACCAATAAGGTCAACTCGATCATTGACAAAATGAACACTCAGTTTGAGGCCGTTGGAAGGGAATT
TAATAACTTAGAAAGGAGAATAGAGAATTTAAACAAGAAGATGGAAGACGGATTCTTAGATGTCTGGACT
TATAATGCTGAGCTTCTGGTTCTCATGGAAAATGAGAGAACTTTAGACTTTCATGACTCAAATGTTAAGA
ACCTCTACGACAAAGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTT
CTATCACAAATGTGATGATGAATGTATGGAAAGTGTAAGAAATGGGACGTATAACTACCCGCAGTATTCA
GAAGAAGCAAGATTAAAAAGAGAGGAAATAAGTGGGGTAAAATTGGAATCAATAGGAATTTACCAAATAC
TGTCAATTTATTCAACAGTGGCGAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATCTTTATGGAT
GTGCTCCAATGGATCATTACAGTGCAGAATTTGCATTTAAATTTGTGAGTTTAGATTGTAGTTAAA


seqgb_GQ122399_Organism_Influenza_A_virus__A_chicken_Banten_UT6025_2006_H5N1___Strain_Name_A_chicken_Banten_UT6025_2006_Segment_4_Subtype_H5N1_Host_Chicken,
ATGGAGAAAATAGTGCTTCTTCTTGCAATAGTCAGTCTTGTTAAAAGTGATCAGATTTGCATTGGTTACC
ATGCAAACAATCAGGGCTCAGAAAGGATGGCAGGGAATGGTAGATGGTTGGTATGGGTACCATCATAGCA
ATGAGCAGGGGAGTGGGTACGCTGCAGACAAAGAATCCACTCAAAAGGCAATAGATGGAGTCACCAATAA
GGTCAACTCAATCATTGACAAAATGAACACTCAGTTTGAGGCCGTTGGAAGGGAATTTAATAACTTAGAA
AGGAGAATAGAGAATTTAAACAAGAAGATGGAAGACGGATTTCTAGATGTCTGGACTTATAATGCCGAAC
TTCTGGTTCTCATGGAAAATGAGAGAACTCTAGACTTTCATGACTCAAATGTTAAGAACCTCTATGACAA
GGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTTCTATCACAAATGT
GATAATGGATGTATGGAAAGTATAAGAAACGGAACGTATAACTACCCGCAGTATTCAGAAGAAGCAAGAT
TAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAGGAACTTATCAAATACTGTCAATTTATTC
AACAGTGGCGAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATCTTTATGGATGTGTTCCAATGGA
TCGTTACAATGCAGAATTTGCATTTAA


seqgb_CY091789_Organism_Influenza_A_virus__A_chicken_Buleleng_BBVD-545b_2007_H5N1___Strain_Name_A_chicken_Buleleng_BBVD-545b_2007_Segment_4_Subtype_H5N1_Host_Chicken,
TCAATCCGTCAAAATGGAGAAAATAGTGCTTCTTCTTGCAATAGCCAGTCTTGTTAAAGGTGATCAGATT
TGCATTGGTTACCATGAAAAGTGAATTGGAATATGGCAACTGCAACACCAAATGTCAAACTCCAATGGGG
GCGATAAACTCTAGTATGCCATTCCATGGGTACGCTGCAGACAAAGAATCCACTCAAAAGGCAATAGATG
GAGTCACCAATAAGGTCAACTCGATCATTGACAAAATGAACACTCAGTTTGAGGCCGTTGGAAGGGAATT
TAATAACTTAGAAAGGAGAATAGAGAATTTAAACAAGAAGATGGAAGACGGATTCCTAGATGTCTGGACT
TATAATGCTGAACTTCTGGTTCTCATGGAAAATGAGAGAACTCTAGACTTTCATGACTCAAATGTTAAGA
ACCTCTACGACAAAGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTT
CTATCACAAATGTGATGATGAATGTATGGAAAGTGTAAGAAATGGGACGTATAACTACCCGCAGTATTCA
GAAGAAGCAAGATTAAAAAGAGAGGAAATAAGTGGGGTAAAATTGGAATCAATAGGAATTTACCAAATAC
TGTCAATTTATTCAACAGTGGCGAGTTCCCTAGCACTGGCAATCATGATGGCTGGTCTATCTTTATGGAT
GTGCTCCAATGGATCATTACAATGCAGAATTTGCATTTAAATTTGTGAGTTTAGATTGTAGTTAAA


seqgb_HQ200590_Organism_Influenza_A_virus__A_chicken_Cambodia_047LC3_2005_H5N1___Strain_Name_A_chicken_Cambodia_047LC3_2005_Segment_4_Subtype_H5N1_Host_Chicken,
AGCAAAAGCAGGGGTTTAATCTGTCAAAATGGAGAAAATAGTGCTTCTTTTTGCGATAGTCAGTCTTGTT
AAAAGTGATCAGATGGGACTCAACAATTATGAAAAGTGAATTGGAATATGGTAACTGCAACACCAAGTGT
CAAACTCCAATGGGGGCGATAAACTCCAATGAGCAGGGGAGTGGGTACGCTGCAGACAAAGAATCCACTC
AAAAGGCTATAGATGGAGTCACCAATAAGGTCAACTCGATCATTGACAAAATGAACACTCAGTTTGAGGC
CGTTGGAAGGGAATTTAACAACTTAGAAAGGAGAATAGAGAATTTAAACAAGAAGATGGAAGACGGGTTC
CTAGATGTCTGGACTTATAATGCTGAACTTCTGGTTCTCATGGAAAATGAGAGAACTCTAGACTTCCATG
ACTCAAATGTCAAGAACCTTTACGACAAGGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAA
CGGTTGTTTCGAGTTCTATCACAAATGTGATAATGAATGTATGGAAAGTGTGAGAAACGGAACGTATGAC
TACCCGCAGTATTCAGAAGAAGCAAGATTAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAG
GAATTTACCAAATACTGTCAATTTATTCTACAGTGGCGAGTTCCCTAGCACTGGCAATCATGGTAGCTGG
TCTATCCTTATGGATGTGCTCCAATGGGTCGTTACAATGCAGAATTTGCATTTAAATTTGTGAGTTCAGA
TTGTAGTTAAAAACACCCTTGTTTCTACT


seqgb_HQ200554_Organism_Influenza_A_virus__A_chicken_Cambodia_047LC3b_2005_H5N1___Strain_Name_A_chicken_Cambodia_047LC3b_2005_Segment_4_Subtype_H5N1_Host_Chicken,
AGCAAAAGCAGGGGTTTAATCTGTCAAAATGGAGAAAATAGTGCTTCTTTTTGCGATAGTCAGTCTTGTT
AAAAGTGATCAGATTTGCATTGGTTACCATGCAAACAACTCAACAGAGCAGGTTGACACAATAATGGAAA
AGAACGTTACTGTTACACATGCCCAAGACATACTGGAAAAGACACATAACGGGAAGCTCTGCGATCTAGA
TGGAGTGAAGCCTCTAATTTTGAGAGATTGTAGTGTAGCTGGATGGCTCCTCGGAAACCCAATGTGTGAC
GAATTCATCAATGTGCCGGAATGGTCGAGCTATAGCAGGTTTTATAGAGGGAGGATGGCAGGGAATGGTA
GATGGTTGGTATGGGTACCACCATAGCAATGAGCAGGGGAGTGGGTACGCTGCAGACAAAGAATCCACTC
AAAAGGCTATAGATGGAGTCACCAATAAGGTCAACTCGATCATTGACAAAATGAACACTCAGTTTGAGGC
CGTTGGAAGGGAATTTAACAACTTAGAAAGGAGAATAGAGAATTTAAACAAGAAGATGGAAGACGGGTTC
CTAGATGTCTGGACTTATAATGCTGAACTTCTGGTTCTCATGGAAAATGAGAGAACTCTAGACTTCCATG
ACTCAAATGTCAAGAACCTTTACGACAAGGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAA
CGGTTGTTTCGAGTTCTATCACAAATGTGATAATGAATGTATGGAAAGTGTGAGAAACGGAACGTATGAC
TACCCGCAGTATTCAGAAGAAGCAAGATTAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAG
GAATTTACCAAATACTGTCAATTTATTCTACAGTGGCGAGTTCCCTAGCACTGGCAATCATGGTAGCTGG
TCTATCCTTATGGATGTGCTCCAATGGGTCGTTACAATGCAGAATTTGCATTTAAATTTGTGAGTTCAGA
TTGTAGTTAAAAACACCCTTGTTTCTACT

seqgb_EU620652_Organism_Influenza_A_virus__A_chicken_Thailand_NS-339_2008_H5N1___Strain_Name_A_chicken_Thailand_NS-339_2008_Segment_4_Subtype_H5N1_Host_Chicken,
AGCAAAAGCAGGGGTCTGATCTGTCAAAATGGAGAAAATAGTGCTTCTTTTTGCAATAGTCAGTCTTGTT
AAAAGTGATCAAATTTGCATTGGTATAAGGTCAACTCGATAATTGACAAAATGAACACTCAGTTTGAGGC
CGTTGGAAGGGAATTTAACAACTTAGAAAGGAGAATAGAGAATTTAAACAAGAAGATGGAAGACGGGTTC
CTGGATGTCTGGACTTATAATGCTGAACTTCTGGTTCTCATGGAAAATGAGAGAACTCTAGACTTTCATG
ACTCAAATGTCAAGAACCTTTACGACAAGGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAA
CGGCTGTTTCGAGTTCTATCATAAATGTGATAATGAATGTATGGAAAGTGTGAGAAACGGAACGTATGAC
TACCCGCAGTATTCAGAAGAAGCAAAACTAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAG
GAATTTACCAAATACTGTCAATTTATTCTACAGTGGCAAGTTCCCTAGCACTGGCAATCATGGTAGCTGG
TCTATCCTTATGGATGTGCTCCAATGGGTCATTACAATGCAGAATTTGCATTAAATTGGAGTCA


seqgb_EU850416_Organism_Influenza_A_virus__A_chicken_Thailand_NS-341_2008_H5N1___Strain_Name_A_chicken_Thailand_NS-341_2008_Segment_4_Subtype_H5N1_Host_Chicken,
ATGGAGAAAATAGTGCTTCTTTTTGCAATAGTCAGTCTTGTTAAAAGTGATCAGATTTGCATTGGTTACC
ATGCAAACAACTCGACAGAGCAGGTTCTCACCATCGGGGAATGCCCCAAATATGTGAAATCAAATAGATT
AGTCCTTGCGACTGGGCTCAGAAATAGCCCTCAAAGAGAGAGAAGAAGAAAAAAGAGAGGATTATTTGGA
GCTATAGCAGGTTTTATAGAGGGAGGATGGCAGGGAATGGTAGATGGTTGGTATGGGTACCACCATAGCA
ATGAGCAGGGGAGTGGGTACGCTGCAGACAAAGAATCCACTCAAAAGGCAATAGATGGAGTCACCAATAA
GGTCAACTCGATAATTGACAAAATGAACACTCAGTTTGAGGCCGTTGGAAGGGAATTTAACMACTTAGAA
AGGAGGATAGAGAATTTAAACAAGAAGATGGAAGACGGGTTCCTAGATGTCTGGACTTATAATGCTGAAC
TTCTGGTTCTCATGGAAAATGAGAGAACTCTAGACTTTCATGACTCAAATGTCAAGAACCTTTACGACAA
GGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGCTGTTTCGAGTTCTATCATAAATGT
GATAATGAATGTATGGAAAGTGTGAGAAACGGAACGTATGACTACCCGCAATATTCAGAAGAAGCAAAAC
TAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAGGAATTTACCAAATACTGTCAATTTATTC
TACAGTGGCAAGTTCCCTAGCACTGGCAATCATGGTAGCTGGTCTATCCTTATGGATGTGCTCCAATGGG
TCATTACAATGCAGAATTTGCATTTAAATTG


seqgb_DQ999880_Organism_Influenza_A_virus__A_chicken_Thailand_PC-168_2006_H5N1___Strain_Name_A_chicken_Thailand_PC-168_2006_Segment_4_Subtype_H5N1_Host_Chicken,
ATGGAGAGAATAGTGCAGGGATAATGCAAAGGAGCTGGGTAACGGTTGTTTCGAGTTCTATCATAAGTGT
GATAATGAATGTATGGAAAGTGTGAGAAACGGAACGTATGACTACCCGCAGTATTCAGAAGAAGCAAAAC
TAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAGGAATTTACCAAATACTGTCAATTTATTC
TACAGTGGCGAGTTCCCTAGCACTGGCAATCATGGTAGCTGGTCTATCCTTATGGATGTGCTCCAATGGG
TCGTTACAATGCAGAATTTGCATTAAATTG

I wrote this code:

import pandas as pd
import numpy as np
data = pd.read_csv('seq.txt',  sep=',',delim_whitespace = True, names=["id", "seq"], skip_blank_lines = True, index_col=False) # , dtype='unicode' 
dataframe = pd.DataFrame(data)
print(dataframe)

And the output:

                                                    id  seq
0    seqgb_AY741213_Organism_Influenza_A_virus__A_b...  NaN
1    ATGGAGAAAATAGTGCTTCTTCTTGCAATAGTCAGTCTTGTTAAAA...  NaN
2    ATGCAAACAACTCGACAGAGCAGGTTGACACAATAATGGAAAAGAA...  NaN
3    CGTACTGGACAAGACACACAACGGGAAGCTCTGCGAGCTAGATGGA...  NaN
4    TGTAGTGTAGCTGGATGGCTCCTCGGAAACCCAATGTGTGACGAAT...  NaN
5    ACATAGTAGAGAAGGCCAGTCCAGCCAATGACCTCTGTTACCCAGG...  NaN
6    GAAACACCTATTGAGCAGAATAAACCATTTTGAGAAAATTCAGATC...  NaN
7    CATGAAGCCTCATCAGGGGTGAGCTCAGCATGTCCATACCAGGGGA...  NaN
8    TATGGCTTATCAAAAAGAACAGTGCATACCCAACAATAAAGAGGAG...  NaN
9    TCTTTTGGTACTGTGGGGGATTCACCATCCTAATGATGCGGCAGAG...  NaN
10   ACCACCTATATTTCCGTTGGAACATCAACACTAAACCAGAGATTGG...  NaN
11   AAGTAAATGGGCAAAGTGGAAGAATGGAGTTCTTCTGGACAATTTT...  NaN
12   CGAGAGTAATGGAAATTTCATTGCTCCAGAATATGCATACAAAATT...  NaN
13   ATGAAAAGTGAATTGGAATATGGTAACTGCAACACCAAGTGTCAAA...  NaN
14   GTATGCCATTCCACAACATACACCCTCTCACCATCGGGGAATGCCC...  NaN
15   AGTCCTTGCGACAGGGCTCAGAAATAGCCCTCAAAGAGAGAGAAGA...  NaN
16   GCTATAGCAGGGTTTATAGAGGGAGGATGGCAGGGAATGGTAGATG...  NaN
17   ATGAGCAGGGGAGTGGATACGCTGCAGACAAAGAATCCACTCAAAA...  NaN
18   GGTCAACTCGATCATTGACAAAATGAACACTCAGTTTGAGGCCGTT...  NaN
19   AGGAGAATAGAAAATTTAAACAAGAAGATGGAGGACGGATTCCTAG...  NaN
20   TTCTGGTTCTCATGGAAAATGAGAGAACTCTAGACTTTCATGACTC...  NaN
21   GGTCCGACTACAACTTAGGGATAATGCAAAGGAGCTGGGTAACGGT...  NaN
22   GATAATGAATGTATGGAAAGTGTAAGAAACGGAACGTATGACTACC...  NaN
23   TAAACAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAGGAAC...  NaN
24   AACAGTGGCGAGTTCCCTAGCACTGGCAATCATGGTAGCTGGTCTA...  NaN
25                         TCGTTACAATGCAGAATTTGCATTTGA  NaN
26   seqgb_EU676325_Organism_Influenza_A_virus__A_b...  NaN
27   TTTAGCAAAAGGCAGGGGTATATCTGTCAAAATGGAGAAAATAGTG...  NaN
28   GTTAAAAGTGATCAGATTTGCATTGGTTACCATGCAAACAACTCGA...  NaN
29   AAAAGAACGTTACTGTTACACATGCCCAAGACATACTGGAAAAGAC...  NaN
..                                                 ...  ...
598  GATAATGAATGTATGGAAAGTGTGAGAAACGGAACGTATGACTACC...  NaN
599  TAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAGGAAT...  NaN
600  TACAGTGGCAAGTTCCCTAGCACTGGCAATCATGGTAGCTGGTCTA...  NaN
601                    TCATTACAATGCAGAATTTGCATTTAAATTG  NaN
602  seqgb_DQ999880_Organism_Influenza_A_virus__A_c...  NaN
603  ATGGAGAGAATAGTGCTTCTTTTTGCAATAGTCAGTCTTGTTAAAA...  NaN
604  ATGCAAACAACTCGACAGAGCAGGTTGACACAATAATGGAAAGGAA...  NaN
605  CATACTGGAAAAGACACACAACGGGAAGCTCTGCGATCTAGATGGA...  NaN
606  TGTAGTGTAGCTGGATGGCTCCTCGGAAACCCAATGTGTGACGAAT...  NaN
607  ACATAGTGGAGAAGGCCAATCCAGTCAATGACCTCTGTTACCCAGG...  NaN
608  GAAACACCTATTGAGCAGAATAAACCATTTTGAGAAAATTCAGATC...  NaN
609  CATGAAGCCTCATTAGGGGTGAGCTCAGCATGTCCATACCTGGGAA...  NaN
610  TATGGCTTATCAAAAAGAACAGTACATACCCAACAATAAAGAGGAG...  NaN
611  TCTTTTGGTACTGTGGGGGATTCACCATCCTAATGATGCGGCAGAG...  NaN
612  ACCACCTATATTTCTGTTGGGACATCAACACTAAACCAGAGATTGG...  NaN
613  AAGTAAACGGGCAAAGTGGAAGGATGGAGTTCTTCTGGACAATTTT...  NaN
614  CGAGAGTAATGGAAATTTCATTGCTCCAGAATATGCATACAAAATT...  NaN
615  ATGAAAAGTGAATTGGAATATGGTAACTGCAACACCAAGTGTCAAA...  NaN
616  GTATGCCATTCCACAATATACACCCTCTCACTATCGGGGAATGCCC...  NaN
617  AGTCCTTGCGACTGGGCTCAGAAATAGCCCTCAAAGAGAGAGAAGA...  NaN
618  GCTATAGCAGGTTTTATAGAGGGGGGATGGCAGGGAATGGTAGATG...  NaN
619  ATGAGCAGGGGAGTGGGTACGCTGCAGACAAAGAATCCACTCAAAA...  NaN
620  GGTCAACTCGATAATTGACAAAATGAACACTCAGTTTGAGGCCGTT...  NaN
621  AGGAGAATAGAGAATTTAAACAAGAAGATGGAAGACGGGTTCCTAG...  NaN
622  TTCTGGTTCTCATGGAAAATGAGAGAACCCTAGACTTTCATGACTC...  NaN
623  GGTCCGACTACAGCTTAGGGATAATGCAAAGGAGCTGGGTAACGGT...  NaN
624  GATAATGAATGTATGGAAAGTGTGAGAAACGGAACGTATGACTACC...  NaN
625  TAAAAAGAGAGGAAATAAGTGGAGTAAAATTGGAATCAATAGGAAT...  NaN
626  TACAGTGGCGAGTTCCCTAGCACTGGCAATCATGGTAGCTGGTCTA...  NaN
627                     TCGTTACAATGCAGAATTTGCATTAAATTG  NaN

[628 rows x 2 columns]

How can I strip the line breaks that occur inside a single sequence using pandas? Thanks in advance!

Answers [ 3 ]

0 votes
/ January 4, 2019

You can read the file manually and convert it to a pandas DataFrame with something like:

import pandas as pd

with open('seq.txt', 'r') as fp:
    lines = fp.readlines()

data = {'id': [], 'seq': []}
sequence = ''

for line in lines:
    if line[0] == '\n':
        if len(sequence) != 0:
            data['seq'].append(sequence)
            sequence = ''
        # skip empty lines
        continue
    if ',' in line:
        data['id'].append(line.split(',')[0])
    else:
        # concatenate lines with sequences
        sequence += line.strip()

# add on last sequence
if len(sequence) != 0:
    data['seq'].append(sequence)

# create dataframe
df = pd.DataFrame(data)
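
The loop above relies on a blank line to close each record. As a rough sketch of a variant (not the answerer's code), the same idea can key on the identifier line instead, assuming every identifier line ends with a comma and no sequence line does. The function name parse_records and the miniature input are made up for illustration.

```python
import io
import pandas as pd

def parse_records(handle):
    """Collect (id, sequence) pairs; a line ending in ',' starts a new record."""
    ids, seqs, chunks = [], [], []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # blank separator lines carry no data
        if line.endswith(','):
            if ids:
                seqs.append(''.join(chunks))  # close the previous record
            ids.append(line[:-1])  # drop the trailing comma
            chunks = []
        else:
            chunks.append(line)  # one wrapped piece of the current sequence
    if ids:
        seqs.append(''.join(chunks))  # close the last record
    return pd.DataFrame({'id': ids, 'seq': seqs})

# Hypothetical miniature input in the same layout as seq.txt;
# seq_B and seq_C are deliberately not separated by a blank line.
sample = io.StringIO("seq_A,\nATGG\nAGAA\n\n\nseq_B,\nTTTA\nGC\nseq_C,\nAC\n")
df = parse_records(sample)
```

With a real file, the call would be something like parse_records(open('seq.txt')). Keying on the comma means a missing or extra blank line between records should not shift identifiers and sequences out of step.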

0 votes
/ January 4, 2019

You can use .read() to manipulate the text file first and then convert the resulting list into a DataFrame.

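import pandas as pd

with open("seq.txt") as f:
    arr = f.read()
# records are separated by two blank lines; each id ends with ",\n"
arr = [i.split(",\n") for i in arr.split("\n\n\n")]

df = pd.DataFrame(arr, columns=["id", "seq", "ss"]).drop(columns=["ss"])
# remove the line breaks left inside each wrapped sequence
df["seq"] = df["seq"].str.replace("\n", "")
df.head()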


There was a stray third column of None values that would not go away, so I dropped it.
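
A possible explanation for that third column, offered as a guess from the excerpt in the question: one pair of records there appears to be separated by a single blank line rather than two, so splitting on "\n\n\n" leaves two records in one chunk. The sketch below reproduces this on a made-up miniature input and shows a split on any run of blank lines instead.

```python
import re
import pandas as pd

# Hypothetical miniature input: seq_B and seq_C are separated by one blank
# line instead of two.
text = "seq_A,\nATGG\nAGAA\n\n\nseq_B,\nTTTA\n\nseq_C,\nAC\n"

# Splitting on exactly two blank lines leaves seq_B and seq_C in one chunk,
# so that chunk yields three fields instead of two.
naive = [chunk.split(",\n") for chunk in text.split("\n\n\n")]

# Splitting on any run of blank lines keeps one record per chunk.
rows = []
for chunk in re.split(r"\n\s*\n", text.strip()):
    seq_id, _, body = chunk.partition(",")
    # body.split() breaks on the embedded newlines; joining removes them
    rows.append((seq_id.strip(), "".join(body.split())))

df = pd.DataFrame(rows, columns=["id", "seq"])
```

Here naive[1] has three elements, which is what forces the extra column, while df has one row per record with the line breaks removed from each sequence.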

0 votes
/ January 4, 2019

Almost by definition, line breaks are a significant part of CSV files, so pandas' read_csv has no way to ignore them. Your best bet is to strip the line breaks manually, for example:

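import pandas as pd
import re

with open("seq.txt", "r") as myfile:
    text = myfile.read()

# each record is "<id>,\n<wrapped sequence>"; records are separated by blank lines
records = [r.split(',', 1) for r in re.split(r'\n\s*\n', text.strip())]
# remove the line breaks (and any other whitespace) inside each field
data = [[re.sub(r'\s+', '', field) for field in r] for r in records]
df = pd.DataFrame(data, columns=["id", "seq"])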
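
To illustrate the point about line breaks, here is a small sketch on a made-up input. It first shows read_csv turning every physical line into its own row with an empty seq column, then regroups the lines with pandas string and groupby operations. It assumes every identifier line ends with a comma and is followed by at least one sequence line; the variable names are illustrative only.

```python
import io
import pandas as pd

# Hypothetical miniature input in the same layout as seq.txt.
text = "seq_A,\nATGG\nAGAA\n\n\nseq_B,\nTTTA\n"

# read_csv treats every non-blank line as a row, so the sequence pieces land
# in "id" and "seq" has nothing to hold.
raw = pd.read_csv(io.StringIO(text), sep=",", names=["id", "seq"], index_col=False)

# Regroup: an identifier line starts a new group, and the sequence lines that
# follow it are concatenated.
lines = pd.Series(text.splitlines()).str.strip()
lines = lines[lines != ""]
is_header = lines.str.endswith(",")
group = is_header.cumsum()  # 1 for the first record, 2 for the second, ...

ids = lines[is_header].str.rstrip(",").reset_index(drop=True)
seqs = (
    lines[~is_header]
    .groupby(group[~is_header])
    .agg("".join)
    .reset_index(drop=True)
)
df = pd.DataFrame({"id": ids, "seq": seqs})
```

For a real file, text would come from open("seq.txt").read(). If some identifier had no sequence lines at all, ids and seqs would no longer line up, so that case would need separate handling.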