Question

I need to read several big files (from 50k to 100k lines), structured in groups separated by empty lines. Each group starts with the same pattern "No.999999999 dd/mm/yyyy ZZZ". Here's some sample data.

No.813829461  16/09/1987  270
Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)
C.N.P.J./C.I.C./N INPI : 16404287000155
Procurador: MARCELLO DO NASCIMENTO

No.815326777  28/12/1989  351
Tit.SIGLA SISTEMA GLOBO DE GRAVACOES AUDIO VISUAIS LTDA (BR/RJ)
C.N.P.J./C.I.C./NºINPI : 34162651000108
Apres.: Nominativa ; Nat.: De Produto
Marca: TRIO TROPICAL
Clas.Prod/Serv: 09.40
*DEFERIDO CONFORME RESOLUÇÃO 123 DE 06/01/2006, PUBLICADA NA RPI 1829, DE 24.01.01.
Procurador: WALDEMAR RODRIGUES PEDRA

No.900148764  01/01/2007  LD3
Tit.TIARA BOLSAS E CALÇADOS LTDA
Procurador: Marcia Ferreira Gomes
*Escritório: Marcas Marcantes e Patentes Ltda
*Exigência Formal não respondida Satisfatoriamente, Pedido de Registro de Marca considerado inexistente, de acordo com Art. 157 da LPI
*Protocolo da Petição de cumprimento de Exigência Formal: 810080140197

I've written some code that parses it accordingly. Is there anything I could improve, for readability or performance? Here's what I have so far:

import re, pprint

class Despacho(object):
    """
    Class to parse each line, applying the regexp and storing the results
    for future use
    """
    regexp = {
        re.compile(r'No.([\d]{9})  ([\d]{2}/[\d]{2}/[\d]{4})  (.*)'): lambda self: self._processo,
        re.compile(r'Tit.(.*)'): lambda self: self._titular,
        re.compile(r'Procurador: (.*)'): lambda self: self._procurador,
        re.compile(r'C.N.P.J./C.I.C./N INPI :(.*)'): lambda self: self._documento,
        re.compile(r'Apres.: (.*) ; Nat.: (.*)'): lambda self: self._apresentacao,
        re.compile(r'Marca: (.*)'): lambda self: self._marca,
        re.compile(r'Clas.Prod/Serv: (.*)'): lambda self: self._classe,
        re.compile(r'\*(.*)'): lambda self: self._complemento,
    }

    def __init__(self):
        """
        'complemento' is the only field that can be multiple in a single registry
        """
        self.complemento = []

    def _processo(self, matches):
        self.processo, self.data, self.despacho = matches.groups()

    def _titular(self, matches):
        self.titular = matches.group(1)

    def _procurador(self, matches):
        self.procurador = matches.group(1)

    def _documento(self, matches):
        self.documento = matches.group(1)

    def _apresentacao(self, matches):
        self.apresentacao, self.natureza = matches.groups()

    def _marca(self, matches):
        self.marca = matches.group(1)

    def _classe(self, matches):
        self.classe = matches.group(1)

    def _complemento(self, matches):
        self.complemento.append(matches.group(1))

    def read(self, line):
        for pattern in Despacho.regexp:
            m = pattern.match(line)
            if m:
                Despacho.regexp[pattern](self)(m)


def process(rpi):
    """
    read data and process each group
    """
    rpi = (line for line in rpi)
    group = False

    for line in rpi:
        if line.startswith('No.'):
            group = True
            d = Despacho()        

        if not line.strip() and group: # empty line - end of block
            yield d
            group = False

        d.read(line)


arquivo = open('rm1972.txt') # file to process
for desp in process(arquivo):
    pprint.pprint(desp.__dict__)
    print('--------------')

nosklo · Answer 1 · 27 January 2009

That's pretty good. Below are some suggestions; let me know if you like them:

import re
import pprint
import sys

class Despacho(object):
    """
    Class to parse each line, applying the regexp and storing the results
    for future use
    """
    #used a dict with the keys instead of functions.
    regexp = {
        ('processo', 
         'data', 
         'despacho'): re.compile(r'No.([\d]{9})  ([\d]{2}/[\d]{2}/[\d]{4})  (.*)'),
        ('titular',): re.compile(r'Tit.(.*)'),
        ('procurador',): re.compile(r'Procurador: (.*)'),
        ('documento',): re.compile(r'C.N.P.J./C.I.C./N INPI :(.*)'),
        ('apresentacao',
         'natureza'): re.compile(r'Apres.: (.*) ; Nat.: (.*)'),
        ('marca',): re.compile(r'Marca: (.*)'),
        ('classe',): re.compile(r'Clas.Prod/Serv: (.*)'),
        ('complemento',): re.compile(r'\*(.*)'),
    }

    def __init__(self):
        """
        'complemento' is the only field that can be multiple in a single registry
        """
        self.complemento = []


    def read(self, line):
        for attrs, pattern in Despacho.regexp.items():
            m = pattern.match(line)
            if m:
                for groupn, attr in enumerate(attrs):
                    # special case complemento:
                    if attr == 'complemento':
                        self.complemento.append(m.group(groupn + 1))
                    else:
                        # set the attribute on the object
                        setattr(self, attr, m.group(groupn + 1))

    def __repr__(self):
        # defines object printed representation
        d = {}
        for attrs in self.regexp:
            for attr in attrs:
                d[attr] = getattr(self, attr, None)
        return pprint.pformat(d)

def process(rpi):
    """
    read data and process each group
    """
    #Useless line, since you're doing a for anyway
    #rpi = (line for line in rpi)
    group = False

    for line in rpi:
        if line.startswith('No.'):
            group = True
            d = Despacho()        

        if not line.strip() and group: # empty line - end of block
            yield d
            group = False

        d.read(line)

def main():
    arquivo = open('rm1972.txt') # file to process
    for desp in process(arquivo):
        print(desp) # can print directly here.
        print('-' * 20)
    return 0

if __name__ == '__main__':
    main()
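
One caveat with the `process` loop above: a record is only yielded when a blank line follows it, so the final group is lost if the file does not end with an empty line. A minimal sketch of one way around that, swapping in `itertools.groupby` to split the input on blank lines. The helper name `split_groups` is a hypothetical choice, and the sample lines are written to match the regexps above; neither is part of the original code.

```python
from itertools import groupby


def split_groups(lines):
    """Yield lists of consecutive non-blank lines (one list per record)."""
    # groupby clusters consecutive lines by whether they are blank or not
    for is_blank, chunk in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield [line.rstrip('\n') for line in chunk]


# Sample input (assumption); note that there is no trailing blank line.
sample = [
    'No.813829461  16/09/1987  270\n',
    'Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)\n',
    '\n',
    'No.815326777  28/12/1989  351\n',
    'Marca: TRIO TROPICAL\n',
]

groups = list(split_groups(sample))
print(len(groups))  # both records are produced
```

Each list could then be fed line by line to `Despacho.read`.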

Dave Swersky · Answer 2 · 27 January 2009

It would be easier to help if you had a specific problem. Performance will depend largely on the efficiency of the particular regex engine you are using. 100K lines in a single file doesn't sound that big, but again, it all depends on your environment.

I use Expresso in my .NET development to test expressions for accuracy and performance. A quick Google search turned up Kodos, a GUI tool for developing Python regular expressions.
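
To get a rough number for your own environment rather than guessing, here is a minimal sketch using the standard `timeit` module. The 100,000 synthetic lines and the choice of a single pattern are assumptions made for illustration; real timings depend on the machine, the Python version, and the actual data.

```python
import re
import timeit

# One of the patterns from the question, compiled once up front.
pattern = re.compile(r'No.([\d]{9})  ([\d]{2}/[\d]{2}/[\d]{4})  (.*)')

# Synthetic input (assumption): 100,000 header-like lines.
lines = ['No.%09d  16/09/1987  270' % i for i in range(100000)]


def match_all():
    """Apply the compiled pattern to every line and count the hits."""
    return sum(1 for line in lines if pattern.match(line))


# Time a single pass over all lines.
elapsed = timeit.timeit(match_all, number=1)
print('matched %d lines in %.3f s' % (match_all(), elapsed))
```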

akaihola · Answer 3 · 28 January 2009

Another version, with a single combined regular expression:

#!/usr/bin/python

import re
import pprint
import sys

class Despacho(object):
    """
    Class to parse each line, applying the regexp and storing the results
    for future use
    """
    #used a single regexp with named groups instead of a dict of patterns.
    regexp = re.compile(
        r'No.(?P<processo>[\d]{9})  (?P<data>[\d]{2}/[\d]{2}/[\d]{4})  (?P<despacho>.*)'
        r'|Tit.(?P<titular>.*)'
        r'|Procurador: (?P<procurador>.*)'
        r'|C.N.P.J./C.I.C./N INPI :(?P<documento>.*)'
        r'|Apres.: (?P<apresentacao>.*) ; Nat.: (?P<natureza>.*)'
        r'|Marca: (?P<marca>.*)'
        r'|Clas.Prod/Serv: (?P<classe>.*)'
        r'|\*(?P<complemento>.*)')

    simplefields = ('processo', 'data', 'despacho', 'titular', 'procurador',
                    'documento', 'apresentacao', 'natureza', 'marca', 'classe')

    def __init__(self):
        """
        'complemento' is the only field that can be multiple in a single
        registry
        """
        self.__dict__ = dict.fromkeys(self.simplefields)
        self.complemento = []

    def parse(self, line):
        m = self.regexp.match(line)
        if m:
            gd = dict((k, v) for k, v in m.groupdict().items() if v)
            if 'complemento' in gd:
                self.complemento.append(gd['complemento'])
            else:
                self.__dict__.update(gd)

    def __repr__(self):
        # defines object printed representation
        return pprint.pformat(self.__dict__)

def process(rpi):
    """
    read data and process each group
    """
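    d = None

    for line in rpi:
        if line.startswith('No.'):
            if d is not None:
                yield d
            d = Despacho()
        if d is not None: # skip any lines before the first record
            d.parse(line)
    if d is not None: # do not yield None for an empty input
        yield d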

def main():
    arquivo = open('rm1972.txt') # file to process
    for desp in process(arquivo):
        print(desp) # can print directly here.
        print('-' * 20)

if __name__ == '__main__':
    main()
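
To see why the combined pattern works: in an alternation of named groups, only the groups of the branch that matched are set, and all the others are `None`. A short sketch with a trimmed-down pattern (three of the branches above; the sample lines are illustrative assumptions):

```python
import re

# A reduced version of the combined pattern: three alternatives only.
regexp = re.compile(
    r'Tit.(?P<titular>.*)'
    r'|Marca: (?P<marca>.*)'
    r'|\*(?P<complemento>.*)')

results = []
for line in ['Tit.TIARA BOLSAS', 'Marca: TRIO TROPICAL', '*DEFERIDO']:
    m = regexp.match(line)
    # keep only the named groups that actually took part in the match
    matched = dict((k, v) for k, v in m.groupdict().items() if v is not None)
    # m.lastgroup is the name of the last group that matched
    results.append((m.lastgroup, matched))

for name, matched in results:
    print(name, matched)
```

Since each branch here has a single group, `m.lastgroup` alone would be enough to tell which kind of line was seen; branches with several groups (such as the `No.` header) still need `groupdict()`.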

Sridhar Iyer · Answer 4 · 27 January 2009

I wouldn't use regular expressions here. If you know that your lines will start with fixed strings, why not check for those strings and write the logic around them?

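filename = 'rm1972.txt'  # file to process
records = []
record = {}
curr_index = None
with open(filename) as f:
    for line in f:
        if line[0:3] == 'No.':
            curr_index = 'No'
            record['No'] = line[3:]
        # ....
        # ... (checks for the other fixed prefixes are elided)
        elif line.strip() == '':
            # store the record and start a new one
            if record:
                records.append(record)
            record = {}
        elif curr_index is not None:
            # append line to the last index in the record; this is when
            # the record overflows to the next line
            record[curr_index] = record[curr_index] + "\n" + line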

Treat the code above as pseudocode only.
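
A runnable sketch of the same idea, assuming the fixed prefixes seen in the question. The `PREFIXES` table, the `parse_records` function, and the field names are hypothetical choices made for illustration, and the sample input (including the wrapped `Tit.` line) is contrived.

```python
# Hypothetical table mapping fixed line prefixes to field names.
PREFIXES = [
    ('No.', 'processo'),
    ('Tit.', 'titular'),
    ('Procurador: ', 'procurador'),
    ('Marca: ', 'marca'),
]


def parse_records(lines):
    """Yield one dict per blank-line-separated record, without regexps."""
    record, current = {}, None
    for line in lines:
        line = line.rstrip('\n')
        if not line.strip():
            # blank line: the record is complete
            if record:
                yield record
            record, current = {}, None
            continue
        for prefix, field in PREFIXES:
            if line.startswith(prefix):
                current = field
                record[field] = line[len(prefix):]
                break
        else:
            # no known prefix: treat the line as overflow of the last field
            if current is not None:
                record[current] += '\n' + line
    if record:
        # last record when the input has no trailing blank line
        yield record


sample = [
    'No.900148764  01/01/2007  LD3\n',
    'Tit.TIARA BOLSAS E\n',
    'CALCADOS LTDA\n',
    '\n',
    'No.815326777  28/12/1989  351\n',
    'Marca: TRIO TROPICAL\n',
]
records = list(parse_records(sample))
print(records)
```

Note that the `No.` value is kept here as one unsplit string; splitting it into number, date, and code would need an extra `split()` step.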

Kiv · Answer 5 · 27 January 2009

Overall it looks good, but why do you have this line:

rpi = (line for line in rpi)

You can already iterate over the file object without this intermediate step.
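
A quick way to convince yourself, sketched with an in-memory `io.StringIO` standing in for the real file (an assumption made so the example is self-contained; a file opened with `open` can be iterated line by line in the same way):

```python
import io

# In-memory stand-in for an opened text file (assumption for the demo).
rpi = io.StringIO(u'No.813829461  16/09/1987  270\nTit.SUZANO\n\n')

# Iterating the object directly yields one line at a time.
direct = [line for line in rpi]

rpi.seek(0)  # rewind so the same data can be read again
# The extra generator expression yields exactly the same lines.
wrapped = [line for line in (line for line in rpi)]

print(direct == wrapped)
```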

Extracting information from large structured text files