How do I parse relatively organized but undelimited text using Python? - PullRequest
0 votes
/ July 11, 2019

I am trying to extract data from a text file formatted as shown below. It contains a list of surgeries, and for each case I need: the patient name, the start time (time1), the end time (time2), the procedure type, and the surgeon's name. Here is the raw text; obviously, the real patient and surgeon names have been replaced with placeholders:

Run on: 10/07/19 - 1444                                                       Hospital                                                        PAGE 1

Run by: H                                                          Final Slate For: 11/07/19 THU                                                   

PIR        Patient Name                     R/L/B   Proposed Procedure                                          Surgeon                            Path Reg'd      Dur
POR Time   Unit Number   PHN                                                                                    Assist                             Bld Req'd     PIR-POR
Pri        DOB           Age/S                                                                                                                     Med Imaging
Loc        Bed Type                                                                                                                                Req'd Staff
Ward


OR Room - 1                                           Room End Time: 1730          Anaesthetist: S,A T                                            
OHS 0900-2000                                               
0745       patient 1                             Replace Root and Ascending                                              surgeon1   GENERAL                
1305       RC02654289   96985693                        Aorta/Hemiarch (Tissue), Amputate Left                                                   4 UNITS                
3A         21/12/1943     75/M                            Atrial Appendage                                                                         Perfusionist           
SDA        ICU                                                                                                                                                            
RC-T2S    
 Weeks on Waitlist:  5   (36 days)                                                                                                                                  320
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1400       patient2                           Coronary Artery Bypass Graft                                            surgeon2   GENERAL                
1730       RC00968458   906854959                                                                                                                 SCREEN                 
2B         18/06/1958     61/M                                                                                                                     Perfusionist           
INPT       ICU                                                                                                                                                            
RC-T2S    
 Weeks on Waitlist:  2   (17 days)                                                                                                                                  210
                                                  Other Comments:   DM Type 2                                                                      

Run on: 10/07/19 - 1444                                                      Hospital                                                        PAGE 2

Run by: H                                                         Final Slate For: 11/07/19 THU                                                   

PIR        Patient Name                     R/L/B   Proposed Procedure                                          Surgeon                            Path Reg'd      Dur
POR Time   Unit Number   PHN                                                                                    Assist                             Bld Req'd     PIR-POR
Pri        DOB           Age/S                                                                                                                     Med Imaging
Loc        Bed Type                                                                                                                                Req'd Staff
Ward


OR Room - 2                                           Room End Time: 1825          Anaesthetist: K,N S                                             
OHS 0900-1930                                               
0745       Patient3                          Aortic Valve Replacement (Mechanical)                                   Surgeon3   GENERAL                
1205       RC00584564   9095681571                                                                                                                 4 UNITS                
3A         13/04/1955     64/F                                                                                                                     Perfusionist           
SDA        ICU                                                                                                                                                            
RC-T2S    
 Weeks on Waitlist: 14   (98 days)                                                                                                                                  260
                                                  Other Comments:   DM Type 2                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------


I need the output to look something like this:

patient1 | time1 | time2 | procedure1 | surgeon1
patient2 | time1 | time2 | procedure2 | surgeon2
.
.
.

1 Answer

0 votes
/ July 11, 2019

I went through the code and fixed it. This should work:

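import re

# Read the whole input file into one string
with open('input.txt') as inputFile:
    inputText = inputFile.read()

# 1st alternative: start time, patient name, procedure, surgeon (all on one line).
# 2nd alternative: a bare 4-digit time at the start of a line (used as the end time).
regx = r'^(\d{4})\s{2,}(\D+?)(?=\s{2,})\s{2,}(\D+?)(?=\s{2,})\s{2,}(\D+?)(?=\s{2,})|(^\d{4})'

parsedText = re.findall(regx, inputText, flags=re.M)

rows = []

# Build one row per case; the end time comes from the time-only match that follows it
for line in parsedText:
    if line[0]:
        rows.append(list(line))
    elif rows:  # ignore a time-only match that appears before any case line
        rows[-1][-1] = line[-1]

# Write the rows as pipe-separated text
with open('output.txt', 'w') as outputFile:
    for row in rows:
        outputFile.write("{} | {} | {} | {} | {}\n".format(row[1], row[0], row[4], row[2], row[3]))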

You can find the regular expression I used, along with an explanation, here: https://regex101.com/r/mHWcTD/1

1st Alternative ^(\d{4})\s{2,}(\D+?)(?=\s{2,})\s{2,}(\D+?)(?=\s{2,})\s{2,}(\D+?)(?=\s{2,})
    ^ asserts position at start of a line
    1st Capturing Group (\d{4}) # captures the start time
        \d{4} matches a digit (equal to [0-9])
        {4} Quantifier: matches exactly 4 times
    \s{2,} matches any whitespace character (equal to [\r\n\t\f\v ])
        {2,} Quantifier: matches between 2 and unlimited times, as many times as possible, giving back as needed (greedy)
    2nd Capturing Group (\D+?) # captures the patient name
        \D+? matches any character that's not a digit (equal to [^0-9])
            +? Quantifier: matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
    Positive Lookahead (?=\s{2,})
        Assert that the Regex below matches
            \s{2,} matches any whitespace character (equal to [\r\n\t\f\v ])
                {2,} Quantifier: matches between 2 and unlimited times, as many times as possible, giving back as needed (greedy)
    \s{2,} matches any whitespace character (equal to [\r\n\t\f\v ])
        {2,} Quantifier: matches between 2 and unlimited times, as many times as possible, giving back as needed (greedy)
    3rd Capturing Group (\D+?) # captures the operation details
        \D+? matches any character that's not a digit (equal to [^0-9])
            +? Quantifier: matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
    Positive Lookahead (?=\s{2,})
        Assert that the Regex below matches
            \s{2,} matches any whitespace character (equal to [\r\n\t\f\v ])
                {2,} Quantifier: matches between 2 and unlimited times, as many times as possible, giving back as needed (greedy)
    \s{2,} matches any whitespace character (equal to [\r\n\t\f\v ])
        {2,} Quantifier: matches between 2 and unlimited times, as many times as possible, giving back as needed (greedy)
    4th Capturing Group (\D+?) # captures the surgeon's name
        \D+? matches any character that's not a digit (equal to [^0-9])
            +? Quantifier: matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
    Positive Lookahead (?=\s{2,})
        Assert that the Regex below matches
            \s{2,} matches any whitespace character (equal to [\r\n\t\f\v ])
                {2,} Quantifier: matches between 2 and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Alternative (^\d{4})
    5th Capturing Group (^\d{4}) # captures the end time
        ^ asserts position at start of a line
        \d{4} matches a digit (equal to [0-9])
        {4} Quantifier: matches exactly 4 times
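
To make the two alternatives more concrete, here is a minimal sketch that runs the same pattern on two shortened lines written in the style of the sample input. The GAP constant and the shortened lines are my own illustration, not part of the real file; only the pattern itself comes from the script above.

```python
import re

# Same pattern as in the script above.
regx = r'^(\d{4})\s{2,}(\D+?)(?=\s{2,})\s{2,}(\D+?)(?=\s{2,})\s{2,}(\D+?)(?=\s{2,})|(^\d{4})'

# Assumption for illustration: columns are separated by 7 spaces.
GAP = " " * 7

# A case line (start time, patient, procedure, surgeon, anaesthesia type) ...
case_line = GAP.join(["0745", "Morgan Freeman", "Replace Root and Ascending",
                      "Dr. Henry Cavail", "GENERAL"])
# ... followed by the line that begins with the end time.
end_line = GAP.join(["1305", "RC02654289", "96985693", "4 UNITS"])

sample = case_line + "\n" + end_line + "\n"

# findall returns one 5-tuple per match, one slot per capturing group.
matches = re.findall(regx, sample, flags=re.M)
for m in matches:
    print(m)
```

The case line fills the first four groups and leaves the fifth empty, while the time-only line fills just the fifth group. That is why the loop in the script checks line[0] to decide whether a match starts a new row or supplies the end time of the previous one.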

Sample input:

Run on: 10/07/19 - 1444                                                       Hospital                                                        PAGE 1

Run by: H                                                          Final Slate For: 11/07/19 THU                                                   

PIR        Patient Name                     R/L/B   Proposed Procedure                                          Surgeon                            Path Reg'd      Dur
POR Time   Unit Number   PHN                                                                                    Assist                             Bld Req'd     PIR-POR
Pri        DOB           Age/S                                                                                                                     Med Imaging
Loc        Bed Type                                                                                                                                Req'd Staff
Ward


OR Room - 1                                           Room End Time: 1730          Anaesthetist: S,A T                                            
OHS 0900-2000                                               
0745       Morgan Freeman                             Replace Root and Ascending                                              Dr. Henry Cavail   GENERAL                
1305       RC02654289   96985693                        Aorta/Hemiarch (Tissue), Amputate Left                                                   4 UNITS                
3A         21/12/1943     75/M                            Atrial Appendage                                                                         Perfusionist           
SDA        ICU                                                                                                                                                            
RC-T2S    
 Weeks on Waitlist:  5   (36 days)                                                                                                                                  320
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1400       Alicia Cuthbart                           Coronary Artery Bypass Graft                                            Dr. Denzel Washington   GENERAL                
1730       RC00968458   906854959                                                                                                                 SCREEN                 
2B         18/06/1958     61/M                                                                                                                     Perfusionist           
INPT       ICU                                                                                                                                                            
RC-T2S    
 Weeks on Waitlist:  2   (17 days)                                                                                                                                  210
                                                  Other Comments:   DM Type 2                                                                      

Run on: 10/07/19 - 1444                                                      Hospital                                                        PAGE 2

Run by: H                                                         Final Slate For: 11/07/19 THU                                                   

PIR        Patient Name                     R/L/B   Proposed Procedure                                          Surgeon                            Path Reg'd      Dur
POR Time   Unit Number   PHN                                                                                    Assist                             Bld Req'd     PIR-POR
Pri        DOB           Age/S                                                                                                                     Med Imaging
Loc        Bed Type                                                                                                                                Req'd Staff
Ward


OR Room - 2                                           Room End Time: 1825          Anaesthetist: K,N S                                             
OHS 0900-1930                                               
0745       John van-Damn                          Aortic Valve Replacement (Mechanical)                                   Dr. Bon Jovi   GENERAL                
1205       RC00584564   9095681571                                                                                                                 4 UNITS                
3A         13/04/1955     64/F                                                                                                                     Perfusionist           
SDA        ICU                                                                                                                                                            
RC-T2S    
 Weeks on Waitlist: 14   (98 days)                                                                                                                                  260
                                                  Other Comments:   DM Type 2                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Sample output:

Morgan Freeman | 0745 | 1305 | Replace Root and Ascending | Dr. Henry Cavail
Alicia Cuthbart | 1400 | 1730 | Coronary Artery Bypass Graft | Dr. Denzel Washington
John van-Damn | 0745 | 1205 | Aortic Valve Replacement (Mechanical) | Dr. Bon Jovi
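
One caveat: the pattern uses \D for the name, procedure and surgeon fields, so a field that contains a digit (for example the placeholder "patient 1" in the question) will not match the first alternative. If that matters for your data, a possible alternative is to split each line on runs of two or more spaces instead of using one large regex. The sketch below is a different technique from the script above, and it rests on assumptions: a case starts on a line beginning with a 4-digit time, the R/L/B column is empty (as in the sample), and the next line beginning with a 4-digit time holds the end time. The name parse_slate is hypothetical.

```python
import re

# A line of interest starts with a 4-digit time followed by a column gap.
TIME_LINE = re.compile(r'^\d{4}\s{2,}')


def parse_slate(text):
    """Return [patient, start, end, procedure, surgeon] lists (hypothetical helper)."""
    rows = []
    awaiting_end = False
    for line in text.splitlines():
        if not TIME_LINE.match(line):
            continue  # headers, separators, DOB lines, etc.
        # Columns are separated by two or more spaces; single spaces stay inside a field.
        fields = re.split(r'\s{2,}', line.strip())
        if awaiting_end:
            rows[-1][2] = fields[0]  # end time of the case started on the previous time line
            awaiting_end = False
        elif len(fields) >= 4:
            start, patient, procedure, surgeon = fields[:4]
            rows.append([patient, start, "", procedure, surgeon])
            awaiting_end = True
    return rows


# Illustrative input only; GAP and these lines are assumptions, not the real file.
GAP = " " * 7
sample = "\n".join([
    GAP.join(["0745", "patient 1", "Replace Root and Ascending", "surgeon1", "GENERAL"]),
    GAP.join(["1305", "RC02654289", "96985693", "4 UNITS"]),
    GAP.join(["3A", "21/12/1943", "75/M"]),
    GAP.join(["1400", "patient2", "Coronary Artery Bypass Graft", "surgeon2", "GENERAL"]),
    GAP.join(["1730", "RC00968458", "906854959", "SCREEN"]),
])

result = parse_slate(sample)
for row in result:
    print(" | ".join(row))
```

Like the regex version, this only picks up the first line of a procedure that wraps onto the following lines, and it would shift the columns if the R/L/B field were filled in, so check it against your real file before relying on it.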

...