Question

I need to write a function get_specified_words(filename) that returns a list of lowercase words from a text file. All of the following conditions must apply:

Include all sequences of lowercase characters, including those that contain a - or ' character and those that end with a ' character.
Exclude words that end with a -.
The function must only process lines between the start and end marker lines.
Use this regular expression to extract the words from each relevant line of the file: valid_line_words = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line)
Make sure the line is in lowercase before applying the regular expression.
Use the optional encoding parameter when opening files for reading. That is, your open call should look like open(filename, encoding='utf-8'). This is especially useful if your operating system does not set Python's default encoding to UTF-8.
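
To see how the pattern above treats the edge cases named in these rules, here is a small sketch. The helper name words_in_line and the sample sentence are made up for illustration; only the pattern itself comes from the assignment text.

```python
import re

# Pattern quoted from the assignment text above.
WORD_PATTERN = r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"


def words_in_line(line):
    """Return the words found in one line after lowercasing it (hypothetical helper)."""
    return re.findall(WORD_PATTERN, line.lower())


# Made-up sample covering an inner apostrophe, an inner hyphen,
# a trailing hyphen, a digit, and a trailing apostrophe.
sample = "Toby's doc-string ends- here, n1 and the boys' code."
print(words_in_line(sample))
# prints ["toby's", 'doc-string', 'ends', 'here', 'n', 'and', 'the', "boys'", 'code']
```

Note that with this pattern a word written with a trailing hyphen is matched without the hyphen rather than skipped altogether, and digits simply split or end a match.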

The sample text file testing.txt contains the following:

That are after the start and should be dumped.
So should that

and that
and yes, that
*** START OF SYNTHETIC TEST CASE ***
Toby's code was rather "interesting", it had the following issues: short,
meaningless identifiers such as n1 and n; deep, complicated nesting;   
a doc-string drought; very long, rambling and unfocused functions; not 
enough spacing between functions; inconsistent spacing before and 
after operators, just like   this      here. Boy was he going to get a low
style mark.... Let's hope he asks his friend Bob to help him bring his code
up to an acceptable level.
*** END OF SYNTHETIC TEST CASE ***
This is after the end and should be ignored too.

Have a nice day.

Here is my code:

import re

def stripped_lines(lines):
    for line in lines:
        stripped_line = line.rstrip('\n')
        yield stripped_line

def lines_from_file(fname):
    with open(fname, 'rt') as flines:
        for line in stripped_lines(flines):
            yield line

def is_marker_line(line, start='***', end='***'):
    min_len = len(start) + len(end)
    if len(line) < min_len:
        return False
    return line.startswith(start) and line.endswith(end)


def advance_past_next_marker(lines):
    for line in lines:
        if is_marker_line(line):
            break


def lines_before_next_marker(lines):
    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        valid_lines.append(re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line))
    for content_line in valid_lines:
        yield content_line


def lines_between_markers(lines):
    it = iter(lines)
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line


def words(lines):
    text = '\n'.join(lines).lower().split()
    return text

def get_valid_words(fname):
    return words(lines_between_markers(lines_from_file(fname)))

# This must be executed
filename = "valid.txt"
all_words = get_valid_words(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(all_words)))
print("word list:")
print("\n".join(all_words))

Here is my output:

 File "C:/Users/jj.py", line 45, in <module>
text = '\n'.join(lines).lower().split()
builtins.TypeError: sequence item 0: expected str instance, list found

Here is the expected output:

valid.txt loaded ok.
73 valid words found.
word list:
toby's
code
was
rather
interesting
it
had
the
following
issues
short
meaningless
identifiers
such
as
n
and
n
deep
complicated
nesting
a
doc-string
drought
very
long
rambling
and
unfocused
functions
not
enough
spacing
between
functions
inconsistent
spacing
before
and
after
operators
just
like
this
here
boy
was
he
going
to
get
a
low
style
mark
let's
hope
he
asks
his
friend
bob
to
help
him
bring
his
code
up
to
an
acceptable
level

I need help getting my code to work. Any help is appreciated.

Corentin Limier · Answer 1 · October 20, 2018

lines_between_markers(lines_from_file(fname))

gives you a list of lists of valid words (one list per line).

So you just need to flatten it:

def words(lines):
    words_list = [w for line in lines for w in line]
    return words_list

That does the trick.
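
As a rough illustration of why the join fails and why flattening helps, the sketch below uses made-up sample data in place of the real generator output. The itertools.chain.from_iterable variant is an alternative to the comprehension, not something the original code uses.

```python
from itertools import chain

# Stand-in for what the question's generator yields: one list of words
# per line (hypothetical sample data, including an empty line).
per_line_words = [["toby's", "code"], [], ["meaningless", "identifiers"]]

# str.join expects strings, so a sequence of lists raises TypeError.
try:
    "\n".join(per_line_words)
    join_error = None
except TypeError as error:
    join_error = str(error)

# Flattening with a nested comprehension, as in the answer above.
flat_comprehension = [w for line in per_line_words for w in line]

# Equivalent flattening with itertools.chain.from_iterable.
flat_chain = list(chain.from_iterable(per_line_words))

print(join_error)
print(flat_chain)
```

Either form produces a flat list of strings, which "\n".join can then handle.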

But I think you should reconsider the design of your program:

lines_between_markers should yield only the lines between the markers, but it does more than that. The regular expression should be applied to the result of this function, not inside it.
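
One possible way to separate the two responsibilities is sketched below. It is only an illustration of the suggestion above, not the asker's final code: the marker check follows the question, while the constant WORD_PATTERN, the inlined marker skipping, and the sample lines are assumptions made for the example.

```python
import re

# Pattern quoted from the assignment text in the question.
WORD_PATTERN = r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"


def is_marker_line(line, start="***", end="***"):
    """Same check as in the question: the line starts and ends with ***."""
    if len(line) < len(start) + len(end):
        return False
    return line.startswith(start) and line.endswith(end)


def lines_between_markers(lines):
    """Yield only the raw lines between the first two marker lines."""
    it = iter(lines)
    for line in it:  # skip everything up to and including the start marker
        if is_marker_line(line):
            break
    for line in it:  # yield lines until the end marker is reached
        if is_marker_line(line):
            break
        yield line


def words(lines):
    """Apply the regex to each lowercased line and collect all words."""
    return [w for line in lines for w in re.findall(WORD_PATTERN, line.lower())]


# Made-up sample input, already stripped of newlines.
sample_lines = [
    "ignored before",
    "*** START OF SYNTHETIC TEST CASE ***",
    "Toby's code had a doc-string drought;",
    "*** END OF SYNTHETIC TEST CASE ***",
    "ignored after",
]
print(words(lines_between_markers(sample_lines)))
# prints ["toby's", 'code', 'had', 'a', 'doc-string', 'drought']
```

With this split, lines_between_markers knows nothing about words, and words already returns a flat list, so no separate flattening step is needed.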

What you have not done:

Make sure the line is in lowercase before applying the regular expression.

Use the optional encoding parameter when opening files for reading. That is, your open call should look like open(filename, encoding='utf-8').
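
Putting the two missing points together, a minimal sketch of get_specified_words might look like the following. It assumes the marker lines start and end with ***, as in the question; the inside flag, the temporary file, and its contents are made up for the demonstration.

```python
import os
import re
import tempfile

# Pattern quoted from the assignment text in the question.
WORD_PATTERN = r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"


def is_marker_line(line):
    """Assumed marker format: the line starts and ends with ***."""
    return len(line) >= 6 and line.startswith("***") and line.endswith("***")


def get_specified_words(filename):
    """Return the lowercase words found between the two marker lines."""
    found = []
    inside = False
    # The encoding is passed explicitly, as the assignment asks.
    with open(filename, encoding="utf-8") as infile:
        for raw_line in infile:
            line = raw_line.rstrip("\n")
            if is_marker_line(line):
                if inside:
                    break  # end marker reached
                inside = True  # start marker reached
                continue
            if inside:
                # Lowercase first, then apply the regular expression.
                found.extend(re.findall(WORD_PATTERN, line.lower()))
    return found


# Demonstration on a small temporary file (hypothetical contents).
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "testing.txt")
    with open(path, "w", encoding="utf-8") as out:
        out.write(
            "skip me\n"
            "*** START OF SYNTHETIC TEST CASE ***\n"
            "Let's hope Bob helps.\n"
            "*** END OF SYNTHETIC TEST CASE ***\n"
            "skip me too\n"
        )
    demo_words = get_specified_words(path)

print(demo_words)
# prints ["let's", 'hope', 'bob', 'helps']
```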

Regular expression for finding valid words in a file
