How to parse substrings from text with a regular expression when one substring may be missing - PullRequest
0 votes
/ January 28, 2020

I want to parse substrings out of a string that follows a specific format.

The survey response format is:

<survey_name>_<category_name>_<question_type>.<response_type>

Example input strings:

y_survey_category1_1st.no
x_survey_category2_2nd
survey_z_category_3_3rd.yes_more_7
x_survey_category_4_4th.excluded
survey_z_category5.yes_more_7
survey_z_category_6.yes_more_7

Here is what I have so far (the pattern below). It works for most cases, except when the question type is absent (for example, inputs 5 and 6 above).

The constraints for each part are:

 1. survey_name can only be one of the 3 values
 2. category_name will always be present and can have underscores
 3. question_type may be present and may have underscore in it
 4. response_type may be present and may have underscore in it
 5. Either question_type or response_type or both will always be present

(x_survey|y_survey|survey_z)_([\w_]+)_(1st|2nd|3rd|4th)[.]?(.*)

https://regex101.com/r/bGc0gM/1
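
To make the failure concrete, here is a small Python sketch that runs the pattern above against the six sample inputs. The names `ORIGINAL`, `SAMPLES`, `try_original` and `results` are made up for this illustration and are not part of the original post.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(
    r"(x_survey|y_survey|survey_z)_([\w_]+)_(1st|2nd|3rd|4th)[.]?(.*)"
)

SAMPLES = [
    "y_survey_category1_1st.no",
    "x_survey_category2_2nd",
    "survey_z_category_3_3rd.yes_more_7",
    "x_survey_category_4_4th.excluded",
    "survey_z_category5.yes_more_7",
    "survey_z_category_6.yes_more_7",
]


def try_original(line):
    """Return the captured groups, or None when the pattern does not match."""
    m = ORIGINAL.match(line)
    return m.groups() if m else None


results = {s: try_original(s) for s in SAMPLES}
for sample, groups in results.items():
    print(sample, "->", groups)
```

The first four inputs match. The last two return None because `_(1st|2nd|3rd|4th)` is mandatory in the pattern, so a string without a question type can never match.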

Any help on how to change the regex so that it works for all cases?

Answers [ 2 ]

1 vote
/ January 28, 2020

It is hard to write a regex that consumes question_type when it is optional, so I forced the category to consist of letters and underscores and to end in a digit.

category : [a-z_]+\d
all : (x_survey|y_survey|survey_z)_([a-z_]+\d)(?>_(1st|2nd|3rd|4th))?(?>\.(.*))?
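
For anyone trying this in Python, here is a rough port of the pattern above. One assumption to flag: the atomic groups `(?>...)` are swapped for plain non-capturing groups `(?:...)`, because Python's `re` only accepts atomic groups from version 3.11 on. The names `FIRST_ANSWER` and `parse_digit_terminated` are hypothetical.

```python
import re

# Port of the pattern above; (?>...) is replaced by (?:...) as an assumption
# so that it also compiles on Python versions older than 3.11.
FIRST_ANSWER = re.compile(
    r"(x_survey|y_survey|survey_z)_([a-z_]+\d)"
    r"(?:_(1st|2nd|3rd|4th))?"
    r"(?:\.(.*))?"
)


def parse_digit_terminated(line):
    """Return (survey, category, question, response) or None.

    Missing optional parts come back as None.
    """
    m = FIRST_ANSWER.fullmatch(line)
    return m.groups() if m else None


print(parse_digit_terminated("survey_z_category_6.yes_more_7"))
print(parse_digit_terminated("x_survey_category2_2nd"))
```

Note that this only works while every category name ends in a digit; a name such as `category` without a trailing digit would not match.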


0 votes
/ February 2, 2020

A short version first: a code snippet with a regex that will work. The pattern contains extra whitespace for readability, so you need to set the re.X flag. Also set re.I to ignore case.

    import re

    # Capture (numbers are indices into match.groups(); index 5 is the
    # group inside the lookbehind and is never filled):
    # <survey_name>_<category_name>_<question_type>.<response_type>
    #     (0)           (1)            (2) or (4)     (3) or (6)
    pat = r"""^(x_survey|y_survey|survey_z)    # <sn>  (0)
               _                               # _
               ([^.]+)                         # <cn> (1)
       (?:                                     # One of
            _(1st|2nd|3rd|4th)  [.]([\w]+)$ |  # qt (2) & rt (3)
            _(1st|2nd|3rd|4th)            $ |  # qt (4)
        (?<!_(1st|2nd|3rd|4th)) [.]([\w]+)$    # rt (6)
       )
           """
    matcher = re.compile(pat, re.I | re.X)
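
The negative lookbehind in the last alternative is what stops a question type from being swallowed into the category name. A minimal sketch of the difference, using made-up names (`PREFIX`, `WITHOUT_GUARD`, `WITH_GUARD`, `groups_or_none`) and a non-capturing group inside the lookbehind to keep the group count small:

```python
import re

PREFIX = r"^(x_survey|y_survey|survey_z)_([^.]+)"

# "Response only" alternative without and with the lookbehind guard.
WITHOUT_GUARD = re.compile(PREFIX + r"[.](\w+)$", re.I)
WITH_GUARD = re.compile(PREFIX + r"(?<!_(?:1st|2nd|3rd|4th))[.](\w+)$", re.I)


def groups_or_none(pattern, line):
    """Return the captured groups, or None when there is no match."""
    m = pattern.search(line)
    return m.groups() if m else None


line = "y_survey_category1_1st.no"
print(groups_or_none(WITHOUT_GUARD, line))  # category absorbs "_1st"
print(groups_or_none(WITH_GUARD, line))     # rejected, left to another alternative
```

The lookbehind is accepted by Python's `re` here because every alternative inside it has the same length (an underscore plus three characters).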

Here is a longer version with two variants of the solution and test cases included:

"""
Format: <survey_name>_<category_name>_<question_type>.<response_type>

 1. survey_name can only be one of the 3 values
 2. category_name will always be present and can have underscores
 3. question_type may be present and may have underscore in it
 4. response_type may be present and may have underscore in it
 5. Either question_type or response_type or both will always be present

A) <survey_name> always there
    easy to find, one of three: (x_survey|y_survey|survey_z)
B) <category_name> always there
    has 0 or more internal underscores
C) <question_type> optional
    zero or more internal underscores
    ends before a dot or at end of line
    one of 4 values: (1st|2nd|3rd|4th)
D) <response_type> optional
    starts after a .
    ends at end of line

Both category_name and question_type can have zero or more internal
underscores.  This results in an ambiguity, since we have no way of knowing
when category_name ends and question_type starts.

Assume that question_type is one of the 4 values (1st|2nd|3rd|4th).  This
results in 3 valid cases and one that should not match:

Format: <survey_name>_<category_name>_<question_type>.<response_type>

0) Both question_type and response_type present
   (x_survey|y_survey|survey_z)_<category_name>_(1st|2nd|3rd|4th).<response_type>
   -->
   p1 = r"^(x_survey|y_survey|survey_z)_([^.]+)_(1st|2nd|3rd|4th)[.]([\w]+)$"  # noqa:

1) Only question_type and no response_type present
   (x_survey|y_survey|survey_z)_<category_name>_(1st|2nd|3rd|4th)
   -->
   p2 = r"^(x_survey|y_survey|survey_z)_([^.]+)_(1st|2nd|3rd|4th)$"

2) No question_type and only response_type present
   (x_survey|y_survey|survey_z)_<category_name>.<response_type>
   -->
   p3 = r"^(x_survey|y_survey|survey_z)_([^.]+)(?<!_(1st|2nd|3rd|4th))[.]([\w]+)$"  # noqa:

3) Neither question_type nor response_type present
   (x_survey|y_survey|survey_z)_<category_name>

   Neither of p1, p2 nor p3 will match.

Since the patterns are mutually exclusive, we can try them one after the
other, or we can combine the three patterns into one large pattern.

"""
from collections import namedtuple
import re

Response = namedtuple('Response', ['survey_name',
                                   'category_name',
                                   'question_type',
                                   'response_type'])

cases = ["survey_z__CATEGORY",
         "y_survey_category1_1st.no",
         "x_survey_category2_2nd",
         "survey_z_category_3_3rd.yes_more_7",
         "x_survey_category_4_4th.excluded",
         "X_SURVEY_CATEGORY_4_4TH.excluded",
         "survey_z_category5.yes_more_7",
         "survey_z_category_6.yes_more_7",
         "survey_z_category_7._yes_more_77",
         "survey_z_category_8_._yes_more_88",
         "survey_z_category_8888__foo._yes_more_77_",
         "survey_z_category_22_22_1st_2nd._yes_more_77_",
         "survey_z__CATEGORY_2222_1ST__2ND._yes_odd_77_",
         "survey_z__CATEGORY_3333_1ST__2ND",
         "survey_z__CATEGORY_foo_1ST__2ND._yes_odd_77_",
         ]


def parse_survey_response_1(line):
    """Parse a line with a survey response, setting optional values that
    are not present to None.  Return a Response or None when no match.

    Use a list of mutually exclusive patterns for line format:
    <survey_name>_<category_name>_<question_type>.<response_type>
    """
    # Format <sn>_<cn>_(1st|2nd|3rd|4th).<rt>
    # Format <sn>_<cn>_(1st|2nd|3rd|4th)
    # Format: <sn>_<cn>.<rt>
    prfx = r"^(x_survey|y_survey|survey_z)_([^.]+)"
    regexs = [
        prfx + r"_(1st|2nd|3rd|4th)[.]([\w]+)$",       # 4 captures
        prfx + r"_(1st|2nd|3rd|4th)($)",               # 3+1 captures
        prfx + r"(?<!_(1st|2nd|3rd|4th))[.]([\w]+)$",  # 4 captures
        ]
    matchers = [re.compile(r, re.I | re.X) for r in regexs]

    for m in matchers:
        parsed_line = m.search(line)
        if parsed_line:
            map_empty2none = (g if g else None for g in parsed_line.groups())
            return Response._make(map_empty2none)
    return None


def parse_survey_response_2(line):
    """Parse a line with a survey response, setting optional values that
    are not present to None.  Return a Response or None when no match.

    Use one large pattern for line format:
    <survey_name>_<category_name>_<question_type>.<response_type>
    """
    # Capture:
    # <survey_name>_<category_name>_<question_type>.<response_type>
    #     (0)           (1)            (2) or (4)     (3) or (6)
    pat = r"""^(x_survey|y_survey|survey_z)    # <sn>  (0)
               _                               # _
               ([^.]+)                         # <cn> (1)
       (?:                                     # One of
            _(1st|2nd|3rd|4th)  [.]([\w]+)$ |  # qt (2) & rt (3)
            _(1st|2nd|3rd|4th)            $ |  # qt (4)
        (?<!_(1st|2nd|3rd|4th)) [.]([\w]+)$    # rt (6)
       )
           """

    matcher = re.compile(pat, re.I | re.X)
    parsed_line = matcher.search(line)
    if parsed_line:
        pg = list(parsed_line.groups())
        pg[2] = pg[2] if pg[2] else pg[4]  # capture 2 or 4
        pg[3] = pg[3] if pg[3] else pg[6]  # capture 3 or 6
        return Response._make(pg[:4])
    return None


def unparse_survey(response):
    if response.response_type:
        head = '_'.join(e for e in response[:-1] if e)
        unparsed = '.'.join([head, response.response_type])
    else:
        unparsed = '_'.join(e for e in response if e)
    return unparsed


for c in cases:
    p1 = parse_survey_response_1(c)
    p2 = parse_survey_response_2(c)
    print(c)
    print(p1)
    print(p2)
    print(20*'=')
    assert p1 == p2  # both parsers should agree, including on None
    if p1:
        assert c == unparse_survey(p1)  # round trip back to the input

Running it gives:

run reex02.py                                                                                                                                       
survey_z__CATEGORY
None
None
====================
y_survey_category1_1st.no
Response(survey_name='y_survey', category_name='category1', question_type='1st', response_type='no')
Response(survey_name='y_survey', category_name='category1', question_type='1st', response_type='no')
====================
x_survey_category2_2nd
Response(survey_name='x_survey', category_name='category2', question_type='2nd', response_type=None)
Response(survey_name='x_survey', category_name='category2', question_type='2nd', response_type=None)
====================
survey_z_category_3_3rd.yes_more_7
Response(survey_name='survey_z', category_name='category_3', question_type='3rd', response_type='yes_more_7')
Response(survey_name='survey_z', category_name='category_3', question_type='3rd', response_type='yes_more_7')
====================
x_survey_category_4_4th.excluded
Response(survey_name='x_survey', category_name='category_4', question_type='4th', response_type='excluded')
Response(survey_name='x_survey', category_name='category_4', question_type='4th', response_type='excluded')
====================
X_SURVEY_CATEGORY_4_4TH.excluded
Response(survey_name='X_SURVEY', category_name='CATEGORY_4', question_type='4TH', response_type='excluded')
Response(survey_name='X_SURVEY', category_name='CATEGORY_4', question_type='4TH', response_type='excluded')
====================
survey_z_category5.yes_more_7
Response(survey_name='survey_z', category_name='category5', question_type=None, response_type='yes_more_7')
Response(survey_name='survey_z', category_name='category5', question_type=None, response_type='yes_more_7')
====================
survey_z_category_6.yes_more_7
Response(survey_name='survey_z', category_name='category_6', question_type=None, response_type='yes_more_7')
Response(survey_name='survey_z', category_name='category_6', question_type=None, response_type='yes_more_7')
====================
survey_z_category_7._yes_more_77
Response(survey_name='survey_z', category_name='category_7', question_type=None, response_type='_yes_more_77')
Response(survey_name='survey_z', category_name='category_7', question_type=None, response_type='_yes_more_77')
====================
survey_z_category_8_._yes_more_88
Response(survey_name='survey_z', category_name='category_8_', question_type=None, response_type='_yes_more_88')
Response(survey_name='survey_z', category_name='category_8_', question_type=None, response_type='_yes_more_88')
====================
survey_z_category_8888__foo._yes_more_77_
Response(survey_name='survey_z', category_name='category_8888__foo', question_type=None, response_type='_yes_more_77_')
Response(survey_name='survey_z', category_name='category_8888__foo', question_type=None, response_type='_yes_more_77_')
====================
survey_z_category_22_22_1st_2nd._yes_more_77_
Response(survey_name='survey_z', category_name='category_22_22_1st', question_type='2nd', response_type='_yes_more_77_')
Response(survey_name='survey_z', category_name='category_22_22_1st', question_type='2nd', response_type='_yes_more_77_')
====================
survey_z__CATEGORY_2222_1ST__2ND._yes_odd_77_
Response(survey_name='survey_z', category_name='_CATEGORY_2222_1ST_', question_type='2ND', response_type='_yes_odd_77_')
Response(survey_name='survey_z', category_name='_CATEGORY_2222_1ST_', question_type='2ND', response_type='_yes_odd_77_')
====================
survey_z__CATEGORY_3333_1ST__2ND
Response(survey_name='survey_z', category_name='_CATEGORY_3333_1ST_', question_type='2ND', response_type=None)
Response(survey_name='survey_z', category_name='_CATEGORY_3333_1ST_', question_type='2ND', response_type=None)
====================
survey_z__CATEGORY_foo_1ST__2ND._yes_odd_77_
Response(survey_name='survey_z', category_name='_CATEGORY_foo_1ST_', question_type='2ND', response_type='_yes_odd_77_')
Response(survey_name='survey_z', category_name='_CATEGORY_foo_1ST_', question_type='2ND', response_type='_yes_odd_77_')
====================
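
As a possible alternative to the three-way alternation, and not something taken from the answer itself, the same results can be sketched with a lazy category group followed by two optional groups. Constraint 5 (at least one of question_type or response_type present) is then checked in code instead of in the regex. The names `LAZY` and `parse_lazy` are hypothetical.

```python
import re

# Assumption: question_type is one of 1st|2nd|3rd|4th, as in the answer above.
LAZY = re.compile(
    r"^(x_survey|y_survey|survey_z)_([^.]+?)"  # survey name, lazy category
    r"(?:_(1st|2nd|3rd|4th))?"                 # optional question type
    r"(?:[.](\w+))?$",                         # optional response type
    re.I,
)


def parse_lazy(line):
    """Return (survey, category, question, response) or None."""
    m = LAZY.search(line)
    if not m:
        return None
    survey, category, question, response = m.groups()
    if question is None and response is None:
        # Constraint 5: at least one of the two optional parts is required.
        return None
    return survey, category, question, response


for sample in ("survey_z__CATEGORY",
               "x_survey_category2_2nd",
               "survey_z_category_6.yes_more_7",
               "survey_z_category_22_22_1st_2nd._yes_more_77_"):
    print(sample, "->", parse_lazy(sample))
```

The trade-off is that the regex alone no longer enforces constraint 5, but there is only one set of capture groups, so no merging of groups 2/4 and 3/6 is needed.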