Question

I have a Python script that extracts email addresses from large documents. The script uses all of the RAM on my machine and locks it up to the point where I have to restart it. I was wondering whether there is a way to limit this, or perhaps even have the script pause after it finishes reading one file and produce some output. Any help would be great, thank you.

#!/usr/bin/env python

# Extracts email addresses from one or more plain text files.
#
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
# Twitter @Critical24 - DefensiveThinking.io 


from optparse import OptionParser
import os.path
import re

regex = re.compile((r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
        return f.read().lower()  # Case is lowered to prevent regex mismatches.

def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Removing lines that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))

import os
not_parseble_files = ['.txt', '.csv']
for root, dirs, files in os.walk('.'):  # This recursively searches all subdirectories for files
    for file in files:
        _, file_ext = os.path.splitext(file)  # Here we get the extension of the file
        file_path = os.path.join(root, file)
        if file_ext in not_parseble_files:  # We make sure the extension is not in the banned list 'not_parseble_files'
            print("File %s is not parseble" % file_path)
            continue  # This one continues the loop to the next file
        if os.path.isfile(file_path):
            for email in get_emails(file_to_str(file_path)):
                print(email)

tobias_k · Answer 1 · 14 September 2018

It looks like you are reading files of up to 8 GB into memory using f.read(). Instead, you could try applying the regular expression to each line of the file, without ever holding the whole file in memory.

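def get_emails_from_file(filename):  # hypothetical name for the enclosing function
    with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
        # yield keeps the file open until all lines have been consumed
        for line in f:
            for email in re.findall(regex, line.lower()):
                if not email[0].startswith('//'):
                    yield email[0]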

However, this may take a very long time. Also, I have not checked your regular expression for possible problems.
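
As a rough, self-contained sketch of the line-by-line approach: the generator below keeps the file open while it is being consumed and holds only one line in memory at a time. The names iter_emails, scan_files, and demo are hypothetical, and SIMPLE_EMAIL is a deliberately simplified pattern used for illustration only, not the regex from the question. The per-file summary written to stderr is one possible way to get some output after each file finishes.

```python
import os
import re
import sys
import tempfile

# Simplified pattern for illustration only; NOT the regex from the question.
SIMPLE_EMAIL = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")


def iter_emails(path):
    """Yield addresses one at a time; only the current line is held in memory."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            for match in SIMPLE_EMAIL.finditer(line.lower()):
                yield match.group(0)


def scan_files(paths):
    """Print every match, then a short per-file summary on stderr."""
    total = 0
    for path in paths:
        count = 0
        for email in iter_emails(path):
            print(email)
            count += 1
        # Flushing makes the summary appear as soon as the file is done.
        print(f"{path}: {count} address(es)", file=sys.stderr, flush=True)
        total += count
    return total


def demo():
    """Run iter_emails on a small temporary file and return the matches."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sample.log")
        with open(path, "w", encoding="utf-8") as f:
            f.write("Contact: Alice@Example.com\n"
                    "no address here\n"
                    "bob@test.org, carol@test.org\n")
        return list(iter_emails(path))
```

With this shape, memory use is bounded by the longest line rather than by the file size, so a file with no line breaks at all would still be read in one piece.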

Umer · Answer 2 · 14 September 2018

I think you should try the resource module:

import resource
# megs: the desired address-space limit in megabytes
resource.setrlimit(resource.RLIMIT_AS, (megs * 1048576, resource.RLIM_INFINITY))
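
A slightly fuller sketch of this idea, assuming Python 3 on a Unix-like system (the resource module is not available on Windows). The helper names clamp_soft_limit and limit_address_space are hypothetical. The sketch keeps the existing hard limit and lowers only the soft limit, so allocations beyond the cap would typically fail with MemoryError instead of exhausting the machine.

```python
import resource


def clamp_soft_limit(requested_bytes, hard_limit):
    """Return a soft limit that does not exceed the current hard limit."""
    if hard_limit == resource.RLIM_INFINITY:
        return requested_bytes
    return min(requested_bytes, hard_limit)


def limit_address_space(max_mib):
    """Cap this process's virtual address space at roughly max_mib MiB."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    soft = clamp_soft_limit(max_mib * 1024 * 1024, hard)
    # Only the soft limit is lowered; the hard limit is left untouched.
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    return soft


# Typical use (illustrative; 2048 is an arbitrary example value):
#
#     limit_address_space(2048)
#     try:
#         run_extraction()   # hypothetical entry point of the script
#     except MemoryError:
#         print("memory limit reached")
```

A limit like this only stops the process from taking down the machine; the script still needs to avoid reading whole files into memory if it is to finish.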

Python script uses all RAM
