Trying to parse a page and split it into headings and content. The problem is that both have the same class and tags; how do I separate them?
0 votes
/ April 22, 2019

I am trying to scrape the page http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html and split it into two parts: heading and content. The problem is that both have the same class and tags. Apart from using regular expressions and hard-coding, how can I tell them apart and extract them into two columns in Excel?

In the picture (https://ibb.co/8X5xY9C), or on the linked site, the bold text (apart from the alphabet letters such as (A) and the "back to top" links) is the heading, and the non-bold explanation directly below it is the content. Further down the page the content also includes "li" and "ul" blocks, which should stay under their respective heading.

#Code to Start With
from bs4 import BeautifulSoup
import requests

url = "http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html"
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")
# Both selectors match, but nothing here ties a heading to its content yet
headings = soup.find_all('strong')
content = soup.find_all('div', {"class": "comp-rich-text"})

The Excel output should look something like this:

https://i.stack.imgur.com/NsMmm.png

Answers [ 2 ]

1 vote
/ April 23, 2019

I thought about this a bit more and came up with a better solution. Rather than cluttering my initial answer, I decided to add a second solution here:

After thinking it over again and following my logic of splitting the HTML by headlines (essentially breaking it wherever we find <strong> tags), I chose to convert each section to a string with .prettify(), split on those specific strings/tags, and read the pieces back into BeautifulSoup to pull out the text. From what I can see it doesn't appear to have missed anything, but you'll have to search through the dataframe to double-check:

import requests
from bs4 import BeautifulSoup
import pandas as pd


url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.get(url, headers=headers)  # send the User-Agent defined above
soup = BeautifulSoup(response.text, 'html.parser')

sections = soup.find_all('div',{'class':'accordion-section-content'})

results = {}
for section in sections:
    splits = section.prettify().split('<strong>')
    for each in splits:
        try:
            headline, content = each.split('</strong>')[0].strip(), each.split('</strong>')[1]
            headline = BeautifulSoup(headline, 'html.parser').text.strip()
            content = BeautifulSoup(content, 'html.parser').text.strip()

            content_split = content.split('\n')
            content = ' '.join([ text.strip() for text in content_split if text != ''])

            results[headline] = content
        except IndexError:
            # No closing </strong> in this chunk (text before the first heading)
            continue

df = pd.DataFrame(list(results.items()), columns = ['Headings','Content'])
df.to_csv('C:/test.csv', index=False)

Output:

print (df)
                                         Headings                                            Content
0                                Age requirements  Applicants must be at least 18 years old at th...
1                                   Affordability  Our affordability calculator is the same one u...
2                        Agricultural restriction  The only acceptable agricultural tie is where ...
3         Annual percentage rate of charge (APRC)  The APRC is all fees associated with the mortg...
4                                  Adverse credit  We consult credit reference agencies to look a...
5                          Applicants (number of)           The maximum number of applicants is two.
6                          Armed Forces personnel  Unsecured personal loans are only acceptable f...
7                                    Back to back  Back to back is typically where the vendor has...
8                       Customer funded purchase:  when the customer has funded the purchase usin...
9                                       Bridging:  residential mortgage applications where the cu...
10                                     Inherited:  a recently inherited property where the benefi...
11                                       Porting:  where a fixed/discounted rate was ported to a ...
12                          Repossessed property:  where the vendor is the mortgage lender in pos...
13                                 Part exchange:  where the vendor is a large national house bui...
14                                Bank statements  We accept internet bank statements in paper fo...
15                                          Bonus  For guaranteed bonuses we will consider an ave...
16              British National working overseas  Applicants must be resident in the UK. Applica...
17                           Builder's Incentives  The maximum amount of acceptable incentive is ...
18                           Buy-to-let (purpose)  A buy-to-let mortgage can be used for:  Purcha...
19                                Capital Raising  - Acceptable purposes  permanent home improvem...
20                     Buy-to-let (affordability)  Buy to Let affordability must be assessed usin...
21              Buy-to-let (eligibility criteria)  The property must be in England, Scotland, Wal...
22             Definition of a portfolio landlord  We define a portfolio landlord as a customer w...
23                              Carer's Allowance  Carer's Allowance is paid to people aged 16 or...
24                                       Cashback  Where a mortgage product includes a cashback f...
25                              Casual employment  Contract/agency workers with income paid throu...
26                     Certification of documents  When submitting copies of documents, please en...
27                                  Child Benefit  We can accept up to 100% of working tax credit...
28                                Childcare costs  We use the actual amount the customer has decl...
29   When should childcare costs not be included?  There are a number of situations where childca...
..                                            ...                                                ...
108                                 Shared equity  We lend on the Government-backed shared equity...
109                              Shared ownership  We do not lend against Shared Ownership proper...
110                              Solicitors' fees  We have a panel of solicitors for our fees ass...
111                             Source of deposit  We reserve the right to ask for proof of depos...
112                      Sole trader/partnerships  We will take an average of the last two years'...
113                        Standard variable rate  A standard variable rate  (SVR) is a type of v...
114                                 Student loans  Repayment of student loans is dependent on rec...
115                                        Tenure  Acceptable property tenure: Feuhold, Freehold,...
116                                          Term  Minimum term is 3 years  Residential - Maximum...
117                     Unacceptable income types  The following forms of income are classed as u...
118                        Bereavement allowance:  paid to widows, widowers or surviving civil pa...
119                Employee benefit trusts (EBT):  this is a tax mitigation scheme used in conjun...
120                                     Expenses:  not acceptable as they're paid to reimburse pe...
121                              Housing Benefit:  payment of full or partial contribution to cla...
122                               Income Support:  payment for people on low incomes, working les...
123                       Job Seeker's Allowance:  paid to people who are unemployed or working 1...
124                                      Stipend:  a form of salary paid for internship/apprentic...
125                           Third Party Income:  earned by a spouse, partner, parent who are no...
126                             Universal Credit:  only certain elements of the Universal Credit ...
127                              Universal Credit  The Standard Allowance element, which is the n...
128               Valuations: day one instruction  We are now instructing valuations on day one f...
129                         Valuation instruction  A valuation will be automatically instructed w...
130                                Valuation fees  A valuation will always be obtained using a pa...
131                                  Please note:  W  hen upgrading the free valuation for a home...
132                       Adding fees to the loan  Product fees are the only fees which can be ad...
133                                   Product fee  This fee is paid when the mortgage is arranged...
134                                Working abroad  Previously, we required applicants to be  empl...
135                                  Acceptable -  We may consider applications from people who: ...
136                              Not acceptable -  We will not consider applications from  people...
137                Working and Family Tax Credits  We can accept up to 100% of Working Tax Credit...

[138 rows x 2 columns]
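
To check the splitting logic without hitting the live site, here is a minimal offline sketch of the same idea. The markup in SAMPLE_HTML is made up to resemble the page (headings and body text in identical <p> tags), and the helper name split_on_strong is hypothetical. It uses str(section) instead of .prettify(), which is a small simplification on my part, not something the code above does.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the page: headings and text share the same <p> tags.
SAMPLE_HTML = """
<div class="accordion-section-content">
  <p><strong>Age requirements</strong></p>
  <p>Applicants must be at least 18 years old.</p>
  <p><strong>Applicants (number of)</strong></p>
  <p>The maximum number of applicants is two.</p>
  <ul><li>First point</li><li>Second point</li></ul>
</div>
"""


def split_on_strong(html):
    """Return {heading: content} by cutting the markup at each <strong> tag."""
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for section in soup.find_all("div", {"class": "accordion-section-content"}):
        # The first piece is whatever precedes the first heading, so skip it.
        for chunk in str(section).split("<strong>")[1:]:
            head_html, closing, body_html = chunk.partition("</strong>")
            if not closing:
                continue  # malformed chunk without a closing tag
            # Re-parse each fragment so only the visible text remains.
            heading = BeautifulSoup(head_html, "html.parser").get_text(" ", strip=True)
            content = BeautifulSoup(body_html, "html.parser").get_text(" ", strip=True)
            results[heading] = content
    return results


if __name__ == "__main__":
    for heading, content in split_on_strong(SAMPLE_HTML).items():
        print(heading, "->", content)
```

Because the results are keyed by heading, two headings with identical text would overwrite each other, which is also true of the code above; collecting (heading, content) pairs in a list would avoid that.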

0 votes
/ April 23, 2019

EDIT: SEE THE OTHER SOLUTION PROVIDED

This is tricky. I essentially tried to grab the headlines and then use them to grab all the text after each headline, up to the next headline. The code below is a bit messy and needs some cleaning up, but hopefully it gets you to a point where you can work with it, or moves you in the right direction:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.get(url, headers=headers)  # send the User-Agent defined above
soup = BeautifulSoup(response.text, 'html.parser')

sections = soup.find_all('div',{'class':'accordion-section-content'})
results = {}
for section in sections:
    headlines = section.find_all('strong')
    headlines = [each.text for each in headlines ]

    for i, headline in enumerate(headlines):
        if headline != headlines[-1]:
            next_headline = headlines[i+1]
        else:
            next_headline = ''
        try:
            find_content = section(text=headline)[0].parent.parent.find_next_siblings()
            if ':' in headline and 'Gifted deposit' not in headline and 'Help to Buy' not in headline:
                content = section(text=headline)[0].parent.nextSibling
                results[headline] = content.strip()
                break

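        except Exception:
            # Fall back to a regex match; escape the headline because it may contain ( ) etc.
            find_content = section(text=re.compile(re.escape(headline)))[0].parent.parent.find_next_siblings()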
        if find_content == []:
            try:
                find_content = section(text=headline)[0].parent.parent.parent.find_next_siblings()
            except Exception:
                find_content = section(text=re.compile(re.escape(headline)))[0].parent.parent.parent.find_next_siblings()

        content = []
        for sibling in find_content:
            if next_headline not in sibling.text or headline == headlines[-1]:
                content.append(sibling.text)
            else:
                content = '\n'.join(content)
                results[headline.strip()] = content.strip()
                break
        if headline == headlines[-1]:
            content = '\n'.join(content)
            results[headline] = content.strip()

df = pd.DataFrame(results.items())
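
For comparison, the "collect everything after a heading until the next heading" idea can be sketched more compactly by walking the direct children of each section. This is not the code above, only an illustration under a strong assumption: every heading sits in its own block (a <p> containing a <strong>), and everything up to the next such block is its content. The sample markup and the helper name group_by_heading are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each heading is a <p> wrapping a <strong>.
SAMPLE_HTML = """
<div class="accordion-section-content">
  <p><strong>Age requirements</strong></p>
  <p>Applicants must be at least 18 years old.</p>
  <p><strong>Term</strong></p>
  <p>Minimum term is 3 years.</p>
  <ul><li>Residential</li><li>Buy-to-let</li></ul>
</div>
"""


def group_by_heading(html):
    """Map each heading to the text of the blocks that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    for section in soup.find_all("div", class_="accordion-section-content"):
        current = None
        # find_all(True, recursive=False) yields only the direct child tags.
        for block in section.find_all(True, recursive=False):
            strong = block.find("strong")
            if strong is not None:
                # A block containing <strong> starts a new heading.
                current = strong.get_text(" ", strip=True)
                grouped[current] = []
            elif current is not None:
                # Any other block (p, ul, ...) belongs to the latest heading.
                grouped[current].append(block.get_text(" ", strip=True))
    return {heading: "\n".join(parts) for heading, parts in grouped.items()}


if __name__ == "__main__":
    print(group_by_heading(SAMPLE_HTML))
```

On the real page the assumption does not always hold: where a bold label and its text share one block (for example the entries ending in a colon), this sketch would keep the label and drop the text on the same line, so those cases would need separate handling.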