How do I scrape HTML with BeautifulSoup? - PullRequest
0 votes
/ December 7, 2018

I wrote a script that scrapes a site and puts the content into a text file. My problem is that, as in the code below, there are two paragraphs, and I want to get the text from both paragraphs, but separately. So my question is: is there a way to search only for the paragraphs between two specific h2 tags, or how else can this be solved?

HTML:

<h2 class="pt-3" id="mitigation">Mitigation</h2>
<p>Access tokens are an integral part of the security system within Windows and cannot be turned off. However, an attacker must already have administrator level access on the local system to make full use of this technique; be sure to restrict users and accounts to the least privileges they require to do their job.</p><p>Any user can also spoof access tokens if they have legitimate credentials. Follow mitigation guidelines for preventing adversary use of <a href="/techniques/T1078">Valid Accounts</a>. Limit permissions so that users and user groups cannot create tokens. This setting should be defined for the local system account only. GPO: Computer Configuration &gt; [Policies] &gt; Windows Settings &gt; Security Settings &gt; Local Policies &gt; User Rights Assignment: Create a token object. <span  id="scite-ref-19-a" class="scite-citeref-number" data-reference="Microsoft Create Token"><sup><a href="https://docs.microsoft.com/windows/device-security/security-policy-settings/create-a-token-object" target="_blank" data-hasqtip="18" aria-describedby="qtip-18">[19]</a></sup></span> Also define who can create a process level token to only the local and network service through GPO: Computer Configuration &gt; [Policies] &gt; Windows Settings &gt; Security Settings &gt; Local Policies &gt; User Rights Assignment: Replace a process level token. <span  id="scite-ref-20-a" class="scite-citeref-number" data-reference="Microsoft Replace Process Token"><sup><a href="https://docs.microsoft.com/windows/device-security/security-policy-settings/replace-a-process-level-token" target="_blank" data-hasqtip="19" aria-describedby="qtip-19">[20]</a></sup></span></p><p>Also limit opportunities for adversaries to increase privileges by limiting Privilege Escalation opportunities.</p>
<h2 class="pt-3" id="detection">Detection</h2>
<p>If an adversary is using a standard command-line shell, analysts can detect token manipulation by auditing command-line activity. Specifically, analysts should look for use of the <code>runas</code> command. Detailed command-line logging is not enabled by default in Windows. <span  id="scite-ref-21-a" class="scite-citeref-number" data-reference="Microsoft Command-line Logging"><sup><a href="https://technet.microsoft.com/en-us/windows-server-docs/identity/ad-ds/manage/component-updates/command-line-process-auditing" target="_blank" data-hasqtip="20" aria-describedby="qtip-20">[21]</a></sup></span></p><p>If an adversary is using a payload that calls the Windows token APIs directly, analysts can detect token manipulation only through careful analysis of user network activity, examination of running processes, and correlation with other endpoint and network behavior. </p><p>There are many Windows API calls a payload can take advantage of to manipulate access tokens (e.g., <code>LogonUser</code> <span  id="scite-ref-22-a" class="scite-citeref-number" data-reference="Microsoft LogonUser"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa378184(v=vs.85).aspx" target="_blank" data-hasqtip="21" aria-describedby="qtip-21">[22]</a></sup></span>, <code>DuplicateTokenEx</code> <span  id="scite-ref-23-a" class="scite-citeref-number" data-reference="Microsoft DuplicateTokenEx"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa446617(v=vs.85).aspx" target="_blank" data-hasqtip="22" aria-describedby="qtip-22">[23]</a></sup></span>, and <code>ImpersonateLoggedOnUser</code> <span  id="scite-ref-24-a" class="scite-citeref-number" data-reference="Microsoft ImpersonateLoggedOnUser"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa378612(v=vs.85).aspx" target="_blank" data-hasqtip="23" aria-describedby="qtip-23">[24]</a></sup></span>). 
Please see the referenced Windows API pages for more information.</p><p>Query systems for process and thread token information and look for inconsistencies such as user owns processes impersonating the local SYSTEM account. <span  id="scite-ref-3-a" class="scite-citeref-number" data-reference="BlackHat Atkinson Winchester Token Manipulation"><sup><a href="https://www.blackhat.com/docs/eu-17/materials/eu-17-Atkinson-A-Process-Is-No-One-Hunting-For-Token-Manipulation.pdf" target="_blank" data-hasqtip="2" aria-describedby="qtip-2">[3]</a></sup></span></p>

Code:

import requests
from bs4 import BeautifulSoup
import time
from docx import Document

def linkgenerator_getlink():
    link = "https://attack.mitre.org/techniques/"
    for i in range(1001, 1224):
        fullurl = link + "T" + str(i) + "/"
        source = requests.get(fullurl).text
        time.sleep(15)  # wait 15 seconds between requests
        soup = BeautifulSoup(source, 'lxml')

        document = Document()
        # the page's <h1> text becomes the document title
        document.add_heading(soup.find('h1').text.strip(), 0)

        # every <p> on the page is added to the document, in page order
        for x in soup.find_all("p"):
            document.add_paragraph(x.text)
        document.save('C:\\Users\\XXX\\Desktop\\script\\' + "T%s.docx" % i)
        print("========== %s-es szamu doksi is ready ==========" % i)



linkgenerator_getlink()

Answers [ 2 ]

0 votes
/ December 7, 2018

This prints the inner text of the first <p> tag that follows each <h2> tag with the given class:

import bs4 as bs

content = """<h2 class="pt-3" id="mitigation">Mitigation</h2>
<p>Access tokens are an integral part of the security system within Windows and cannot be turned off. However, an attacker must already have administrator level access on the local system to make full use of this technique; be sure to restrict users and accounts to the least privileges they require to do their job.</p><p>Any user can also spoof access tokens if they have legitimate credentials. Follow mitigation guidelines for preventing adversary use of <a href="/techniques/T1078">Valid Accounts</a>. Limit permissions so that users and user groups cannot create tokens. This setting should be defined for the local system account only. GPO: Computer Configuration &gt; [Policies] &gt; Windows Settings &gt; Security Settings &gt; Local Policies &gt; User Rights Assignment: Create a token object. <span  id="scite-ref-19-a" class="scite-citeref-number" data-reference="Microsoft Create Token"><sup><a href="https://docs.microsoft.com/windows/device-security/security-policy-settings/create-a-token-object" target="_blank" data-hasqtip="18" aria-describedby="qtip-18">[19]</a></sup></span> Also define who can create a process level token to only the local and network service through GPO: Computer Configuration &gt; [Policies] &gt; Windows Settings &gt; Security Settings &gt; Local Policies &gt; User Rights Assignment: Replace a process level token. <span  id="scite-ref-20-a" class="scite-citeref-number" data-reference="Microsoft Replace Process Token"><sup><a href="https://docs.microsoft.com/windows/device-security/security-policy-settings/replace-a-process-level-token" target="_blank" data-hasqtip="19" aria-describedby="qtip-19">[20]</a></sup></span></p><p>Also limit opportunities for adversaries to increase privileges by limiting Privilege Escalation opportunities.</p>
<h2 class="pt-3" id="detection">Detection</h2>
<p>If an adversary is using a standard command-line shell, analysts can detect token manipulation by auditing command-line activity. Specifically, analysts should look for use of the <code>runas</code> command. Detailed command-line logging is not enabled by default in Windows. <span  id="scite-ref-21-a" class="scite-citeref-number" data-reference="Microsoft Command-line Logging"><sup><a href="https://technet.microsoft.com/en-us/windows-server-docs/identity/ad-ds/manage/component-updates/command-line-process-auditing" target="_blank" data-hasqtip="20" aria-describedby="qtip-20">[21]</a></sup></span></p><p>If an adversary is using a payload that calls the Windows token APIs directly, analysts can detect token manipulation only through careful analysis of user network activity, examination of running processes, and correlation with other endpoint and network behavior. </p><p>There are many Windows API calls a payload can take advantage of to manipulate access tokens (e.g., <code>LogonUser</code> <span  id="scite-ref-22-a" class="scite-citeref-number" data-reference="Microsoft LogonUser"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa378184(v=vs.85).aspx" target="_blank" data-hasqtip="21" aria-describedby="qtip-21">[22]</a></sup></span>, <code>DuplicateTokenEx</code> <span  id="scite-ref-23-a" class="scite-citeref-number" data-reference="Microsoft DuplicateTokenEx"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa446617(v=vs.85).aspx" target="_blank" data-hasqtip="22" aria-describedby="qtip-22">[23]</a></sup></span>, and <code>ImpersonateLoggedOnUser</code> <span  id="scite-ref-24-a" class="scite-citeref-number" data-reference="Microsoft ImpersonateLoggedOnUser"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa378612(v=vs.85).aspx" target="_blank" data-hasqtip="23" aria-describedby="qtip-23">[24]</a></sup></span>). 
Please see the referenced Windows API pages for more information.</p><p>Query systems for process and thread token information and look for inconsistencies such as user owns processes impersonating the local SYSTEM account. <span  id="scite-ref-3-a" class="scite-citeref-number" data-reference="BlackHat Atkinson Winchester Token Manipulation"><sup><a href="https://www.blackhat.com/docs/eu-17/materials/eu-17-Atkinson-A-Process-Is-No-One-Hunting-For-Token-Manipulation.pdf" target="_blank" data-hasqtip="2" aria-describedby="qtip-2">[3]</a></sup></span></p>"""

soup = bs.BeautifulSoup(content, features="html.parser")

for h2_tag in soup('h2', {'class': 'pt-3'}):
    print(h2_tag.next_sibling.next_sibling.text)
    print("")  # blank line after each paragraph

Output:

Access tokens are an integral part of the security system within Windows and cannot be turned off. However, an attacker must already have administrator level access on the local system to make full use of this technique; be sure to restrict users and accounts to the least privileges they require to do their job.

If an adversary is using a standard command-line shell, analysts can detect token manipulation by auditing command-line activity. Specifically, analysts should look for use of the runas command. Detailed command-line logging is not enabled by default in Windows. [21]
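
The snippet above returns only the first paragraph after each heading. If every paragraph between two headings is needed, one possible approach is to walk the siblings of each <h2> and stop at the next <h2>. The sketch below assumes the headings and paragraphs are siblings in the same parent, as in the HTML from the question; `SAMPLE_HTML` and `paragraphs_by_section` are hypothetical names, and the sample markup is a shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup, Tag

# Shortened stand-in for the markup in the question (an assumption, not the real page).
SAMPLE_HTML = """<h2 class="pt-3" id="mitigation">Mitigation</h2>
<p>First mitigation paragraph.</p><p>Second mitigation paragraph.</p>
<h2 class="pt-3" id="detection">Detection</h2>
<p>First detection paragraph.</p><p>Second detection paragraph.</p>"""


def paragraphs_by_section(html):
    """Map the id of each <h2 class="pt-3"> to the texts of the <p> tags after it."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for h2_tag in soup.find_all("h2", class_="pt-3"):
        texts = []
        for sibling in h2_tag.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace and other bare strings between tags
            if sibling.name == "h2":
                break  # the next heading starts a new section
            if sibling.name == "p":
                texts.append(sibling.get_text().strip())
        sections[h2_tag.get("id")] = texts
    return sections


sections = paragraphs_by_section(SAMPLE_HTML)
for section_id, texts in sections.items():
    print(section_id)
    for text in texts:
        print("  " + text)
```

Each section's paragraphs end up in their own list, so they can be written to the output separately.
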
0 votes
/ December 7, 2018

As long as you know exactly which tags you need, you can simply hard-code them. Otherwise, you may need to create variables to iterate over. But you will have a better idea of that, since you know what the HTML looks like.

import bs4

r = '''<h2 class="pt-3" id="mitigation">Mitigation</h2>
        <p>Access tokens are an integral part of the security system within Windows and cannot be turned off. However, an attacker must already have administrator level access on the local system to make full use of this technique; be sure to restrict users and accounts to the least privileges they require to do their job.</p><p>Any user can also spoof access tokens if they have legitimate credentials. Follow mitigation guidelines for preventing adversary use of <a href="/techniques/T1078">Valid Accounts</a>. Limit permissions so that users and user groups cannot create tokens. This setting should be defined for the local system account only. GPO: Computer Configuration &gt; [Policies] &gt; Windows Settings &gt; Security Settings &gt; Local Policies &gt; User Rights Assignment: Create a token object. <span  id="scite-ref-19-a" class="scite-citeref-number" data-reference="Microsoft Create Token"><sup><a href="https://docs.microsoft.com/windows/device-security/security-policy-settings/create-a-token-object" target="_blank" data-hasqtip="18" aria-describedby="qtip-18">[19]</a></sup></span> Also define who can create a process level token to only the local and network service through GPO: Computer Configuration &gt; [Policies] &gt; Windows Settings &gt; Security Settings &gt; Local Policies &gt; User Rights Assignment: Replace a process level token. <span  id="scite-ref-20-a" class="scite-citeref-number" data-reference="Microsoft Replace Process Token"><sup><a href="https://docs.microsoft.com/windows/device-security/security-policy-settings/replace-a-process-level-token" target="_blank" data-hasqtip="19" aria-describedby="qtip-19">[20]</a></sup></span></p><p>Also limit opportunities for adversaries to increase privileges by limiting Privilege Escalation opportunities.</p>
        <h2 class="pt-3" id="detection">Detection</h2>
        <p>If an adversary is using a standard command-line shell, analysts can detect token manipulation by auditing command-line activity. Specifically, analysts should look for use of the <code>runas</code> command. Detailed command-line logging is not enabled by default in Windows. <span  id="scite-ref-21-a" class="scite-citeref-number" data-reference="Microsoft Command-line Logging"><sup><a href="https://technet.microsoft.com/en-us/windows-server-docs/identity/ad-ds/manage/component-updates/command-line-process-auditing" target="_blank" data-hasqtip="20" aria-describedby="qtip-20">[21]</a></sup></span></p><p>If an adversary is using a payload that calls the Windows token APIs directly, analysts can detect token manipulation only through careful analysis of user network activity, examination of running processes, and correlation with other endpoint and network behavior. </p><p>There are many Windows API calls a payload can take advantage of to manipulate access tokens (e.g., <code>LogonUser</code> <span  id="scite-ref-22-a" class="scite-citeref-number" data-reference="Microsoft LogonUser"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa378184(v=vs.85).aspx" target="_blank" data-hasqtip="21" aria-describedby="qtip-21">[22]</a></sup></span>, <code>DuplicateTokenEx</code> <span  id="scite-ref-23-a" class="scite-citeref-number" data-reference="Microsoft DuplicateTokenEx"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa446617(v=vs.85).aspx" target="_blank" data-hasqtip="22" aria-describedby="qtip-22">[23]</a></sup></span>, and <code>ImpersonateLoggedOnUser</code> <span  id="scite-ref-24-a" class="scite-citeref-number" data-reference="Microsoft ImpersonateLoggedOnUser"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa378612(v=vs.85).aspx" target="_blank" data-hasqtip="23" aria-describedby="qtip-23">[24]</a></sup></span>). 
Please see the referenced Windows API pages for more information.</p><p>Query systems for process and thread token information and look for inconsistencies such as user owns processes impersonating the local SYSTEM account. <span  id="scite-ref-3-a" class="scite-citeref-number" data-reference="BlackHat Atkinson Winchester Token Manipulation"><sup><a href="https://www.blackhat.com/docs/eu-17/materials/eu-17-Atkinson-A-Process-Is-No-One-Hunting-For-Token-Manipulation.pdf" target="_blank" data-hasqtip="2" aria-describedby="qtip-2">[3]</a></sup></span></p>'''


# name the parser explicitly so bs4 does not have to guess one (and warn about it)
html = bs4.BeautifulSoup(r, 'html.parser')

# assuming the 1st paragraph you want is id="mitigation"
# find that, then grab the next sibling
para_1 = html.find('h2', {'id':'mitigation'})
p1 = para_1.find_next_sibling('p').text

para_2 = html.find('h2', {'id':'detection'})
p2 = para_2.find_next_sibling('p').text
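
If the headings are not known in advance, the same lookup could be driven by a list of ids rather than one variable per heading. This is only a sketch: `first_paragraphs`, `SAMPLE_HTML`, and the `"examples"` id are hypothetical, and the sample markup is a shortened stand-in for the HTML above.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question (an assumption, not the real page).
SAMPLE_HTML = """<h2 class="pt-3" id="mitigation">Mitigation</h2>
<p>First mitigation paragraph.</p><p>Second mitigation paragraph.</p>
<h2 class="pt-3" id="detection">Detection</h2>
<p>First detection paragraph.</p>"""


def first_paragraphs(html, heading_ids):
    """Return {heading id: text of the first <p> sibling after it, or None}."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for heading_id in heading_ids:
        heading = soup.find("h2", {"id": heading_id})
        # find() returns None when no such heading exists on the page
        paragraph = heading.find_next_sibling("p") if heading is not None else None
        result[heading_id] = paragraph.get_text().strip() if paragraph is not None else None
    return result


found = first_paragraphs(SAMPLE_HTML, ["mitigation", "detection", "examples"])
for heading_id, text in found.items():
    print(heading_id, "->", text)
```

Note that `find_next_sibling('p')` does not stop at the next <h2>, so a heading with no paragraph of its own would pick up the first paragraph of a later section.
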