Question: How do I create a regular expression to find files two folders deep in a URL?

When I go to this URL (https://www.example.com/blog/author/), it shows me the articles written by the author. I need to write a script that finds all the links to this author's articles on that page. The articles live in a different folder, two folders deep on the server (https://www.example.com/blog/some-folder/article). The folders come in the following two types:

https://www.example.com/some-numerical/this-is-a-post/

e.g. https://www.example.com/123/sample-article

https://www.example.com/some-word/this-is-a-post/

e.g. https://www.example.com/data/sample-post/

How can I do this with regular expressions and Python?

I tried the following code, but I can't get the regular expression right.

import re
import requests
r = requests.get("https://www.example.com/blog/author/abc") 
data = r.content  # Content of response
links = re.findall('https://www.example.com/blog/*+/', data)
print(links)

This just prints the URL: https://www.example.com/blog/

Emma · Answer 1 · June 2, 2019

If we want to match URLs that contain example.com and sample-article, we can start with an expression similar to:

(https?:\/\/)?(www\.)?example\.com\/(.+)\/(sample-article)

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(https?:\/\/)?(www\.)?example\.com\/(.+)\/(sample-article)"

test_str = ("example.com/123/sample-article\n"
    "example.com/dogs/sample-article\n"
    "www.example.com/123/sample-article\n"
    "www.example.com/dogs/sample-article\n"
    "https://www.example.com/dogs/sample-article\n"
    "http://www.example.com/dogs/sample-article")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Report every capture group; optional groups that did not take part in the match print None.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
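The expression above is tied to the literal slug sample-article. For the two URL shapes described in the question (a numeric or a word folder followed by a hyphenated slug), a minimal sketch could look like the following. The pattern, the names `ARTICLE_URL` and `sample_text`, and the assumption that slugs contain only lowercase letters, digits, and hyphens are illustrative and not part of the original answer.

```python
import re

# Hypothetical pattern: the host, then exactly two path segments
# (a numeric or lowercase-word folder, then a hyphenated slug),
# an optional trailing slash, and no further path characters.
ARTICLE_URL = re.compile(
    r"https?://www\.example\.com/(?P<folder>[0-9]+|[a-z]+)/(?P<slug>[a-z0-9-]+)/?(?![\w/-])"
)

# Made-up sample text standing in for the downloaded page.
sample_text = """
https://www.example.com/123/sample-article
https://www.example.com/data/sample-post/
https://www.example.com/blog/
https://www.example.com/blog/author/abc
"""

found = [m.group(0) for m in ARTICLE_URL.finditer(sample_text)]
print(found)
```

When the page is fetched with requests, the text to search would be `r.text` (a str) rather than `r.content` (bytes), because a str pattern cannot be applied to a bytes object. Note that this sketch accepts any two-segment path, so a URL such as https://www.example.com/blog/author/ would also match and may need to be filtered out separately.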

RegEx Circuit

jex.im visualizes regular expressions.

Edit:

If we want to parse HTML here, it would be much better to use an HTML parser. Otherwise, keeping the expression in step with the markup becomes tedious and unnecessary work.
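As a rough illustration of the parser route, here is a sketch that uses only the standard library (`html.parser` and `urllib.parse`). The class `LinkCollector`, the function `article_links`, and the sample markup are hypothetical names and data introduced for this example; the rule "exactly two path segments on the given host" is an assumption drawn from the URL shapes in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def article_links(html, host="www.example.com"):
    """Return hrefs on `host` whose path has exactly two segments."""
    collector = LinkCollector()
    collector.feed(html)
    result = []
    for link in collector.links:
        parts = urlparse(link)
        segments = [s for s in parts.path.split("/") if s]
        if parts.netloc == host and len(segments) == 2:
            result.append(link)
    return result


# Made-up markup standing in for the author page.
sample_html = (
    '<a href="https://www.example.com/123/sample-article">A</a>'
    '<a href="https://www.example.com/data/sample-post/">B</a>'
    '<a href="https://www.example.com/blog/author/abc">C</a>'
    '<a href="/#respond">D</a>'
)
print(article_links(sample_html))
```

Filtering on parsed path segments instead of raw text means attribute order, quoting style, and extra attributes in the markup no longer affect the result.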

If that were not possible, we would start with an expression with left and right boundaries, similar to:

href=\"((https?:\/\/www\.example\.com)\/[a-z]+?\/[0-9]+?\/[a-z-]+\/)\"

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

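# The sample HTML below is served from www.getastra.com, so the host in the pattern is set to match it.
regex = r"href=\"((https?:\/\/www\.getastra\.com)\/[a-z]+?\/[0-9]+?\/[a-z-]+\/)\""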

test_str = ("<div id=\"content\" class=\"\" role=\"main\"><article id=\"post-5463\" class=\"post-entry clearfix post-5463 post type-post status-publish format-standard has-post-thumbnail hentry category-13 tag-wordpress\"><div class=\"post-box\"><div class=\"post-header clearfix\"><div class=\"post-format-icon post-format-standard\"> <i class=\"fa fa-pencil\"></i></div><div class=\"post-info-wrap\"><h2 class=\"post-title\"><a href=\"https://www.getastra.com/blog/911/wordpress-hacked-sending-spam/\" title=\"WordPress Website Hacked &#038; Sending Spam: Symptoms, Causes &#038; Cleanup\" rel=\"bookmark\">WordPress Website Hacked &#038; Sending Spam: Symptoms, Causes &#038; Cleanup</a></h2><div class=\"post-meta clearfix\"><ul><li><img style=\"--aspect-ratio:1;\" alt='' data-src='https://secure.gravatar.com/avatar/fc475099fdd637e1ec27d0b5c73cb876?s=35&#038;d=retro&#038;r=g' data-srcset='https://secure.gravatar.com/avatar/fc475099fdd637e1ec27d0b5c73cb876?s=70&#038;d=retro&#038;r=g 2x' class='avatar avatar-35 photo lazyload' height='35' width='35' /></li><li><a href=\"https://www.getastra.com/blog/author/vikas/\" rel=\"author\">By: <span class=\"fn\"></span></a></li><li><i class=\"fa fa-clock-o\"></i><time class=\"entry-date published\" datetime=\"2019-05-09T16:21:26+05:30\">May 9, 2019</time></li><li><i class=\"fa fa-comments\"></i><a href=\"/#respond\" class=\"comments-link\">Leave a comment</a></li></ul></div></div></div><div class=\"post-media\"> <figure class=\"post-thumbnail-wrapper \"> <a href=\"https://www.getastra.com/blog/911/wordpress-hacked-sending-spam/\" title=\"WordPress Website Hacked &#038; Sending Spam: Symptoms, Causes &#038; Cleanup\" rel=\"bookmark\"> <img width=\"748\" height=\"350\" data-srcset=\"https://www.getastra.com/blog/wp-content/uploads/2019/05/CopyofCopyofCopyofTemplate84_4c3eb9ee60c31a6a26529b08360fa628_2000-748x350.png 748w, \n"
    "https://www.getastra.com/blog/wp-content/uploads/2019/05/CopyofCopyofCopyofTemplate84_4c3eb9ee60c31a6a26529b08360fa628_2000.png 750w\" data-src=\"https://www.getastra.com/blog/wp-content/uploads/2019/05/CopyofCopyofCopyofTemplate84_4c3eb9ee60c31a6a26529b08360fa628_2000-748x350.png\" class=\"attachment-blog-featured size-blog-featured wp-post-image lazyload\" alt=\"WordPress Website Hacked &amp; Sending Spam: Symptoms, Causes &amp; Cleanup\" sizes=\"(max-width: 748px) 100vw, 748px\" style=\"--aspect-ratio:2.1371428571429;\" /><div class=\"thumb-overlay\"> <i class=\"fa fa-link\"></i></div> </a> </figure></div><div class=\"post-con</a></li></ul></div></body></html>")

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Group 1 is the full URL inside href; group 2 is the scheme and host.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Output

Match 1 was found at 395-466: href="https://www.getastra.com/blog/911/wordpress-hacked-sending-spam/"
Group 1 found at 401-465: https://www.getastra.com/blog/911/wordpress-hacked-sending-spam/
Group 2 found at 401-425: https://www.getastra.com
Match 2 was found at 1522-1593: href="https://www.getastra.com/blog/911/wordpress-hacked-sending-spam/"
Group 1 found at 1528-1592: https://www.getastra.com/blog/911/wordpress-hacked-sending-spam/
Group 2 found at 1528-1552: https://www.getastra.com
Match 3 was found at 3065-3136: href="https://www.getastra.com/blog/911/wordpress-hacked-sending-spam/"
Group 1 found at 3071-3135: https://www.getastra.com/blog/911/wordpress-hacked-sending-spam/
Group 2 found at 3071-3095: https://www.getastra.com
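The output above lists the same article URL more than once, because a post is usually linked from its title and again from its thumbnail. A small sketch of collecting each link only once, in page order, might look like this. The name `HREF_PATTERN` and the sample markup are made up for illustration, and the host is written as www.example.com as in the question.

```python
import re

# Same shape as the href expression above, with a single capture group
# so that findall() returns the URLs directly.
HREF_PATTERN = re.compile(
    r'href="(https?://www\.example\.com/[a-z]+?/[0-9]+?/[a-z-]+/)"',
    re.IGNORECASE,
)

# Made-up markup: the first post is linked twice, the author page once.
sample_html = (
    '<h2><a href="https://www.example.com/blog/911/first-post/">First</a></h2>'
    '<a href="https://www.example.com/blog/author/abc/">Author</a>'
    '<figure><a href="https://www.example.com/blog/911/first-post/">img</a></figure>'
    '<a href="https://www.example.com/blog/912/second-post/">Second</a>'
)

all_links = HREF_PATTERN.findall(sample_html)

# dict.fromkeys keeps the first occurrence of each key and preserves order.
unique_links = list(dict.fromkeys(all_links))
print(unique_links)
```

Using `dict.fromkeys` instead of `set` keeps the links in the order they appear on the page.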
