For loop: indentation problem, the loops need rearranging
30 January 2020

The following script is supposed to collect data from several pages.

Collecting the data is hard because the URL stays the same from page to page. The table I am scraping is in a pop-up window.

URL: https://www.formularylookup.com/

Here is some of the data together with the pager (data from the other pages is not included):

html = '<div class="k-grid k-widget" data-role="grid" id="lookupDetailsGrid" style="height: 75%;"><div class="k-grid-header" style="padding-right: 5px;"><div class="k-grid-header-wrap"><table role="grid"><colgroup><col class="k-hierarchy-col"/><col/><col/><col/><col/><col/></colgroup><thead role="rowgroup"><tr role="row"><th class="k-hierarchy-cell k-header"> </th><th class="col-plan k-header" data-field="PlanName" data-index="0" data-title="Plan" id="bbabe5e5-37a9-4e77-a67d-753ebd33ef7a" role="columnheader" rowspan="1">Plan</th><th class="col-status k-header" data-field="UnifiedTierShortName" data-index="1" data-title="Status" id="4d4d4a85-131c-4372-8bc3-5c525a49efda" role="columnheader" rowspan="1">Status</th><th class="col-raw-status k-header" data-field="DrugListTierName" data-index="2" data-title="Raw Status" id="1f9a4790-b954-40fa-aded-8a84b2cc5e34" role="columnheader" rowspan="1">Raw Status</th><th class="col-restrictions k-header" data-field="Restrictions" data-index="3" data-title="Restrictions" id="6c4c0c52-5c36-4191-9412-134b0c2c9f5a" role="columnheader" rowspan="1">Restrictions</th><th class="col-alternatives k-header" data-field="CoveredAlternatives" data-index="4" data-title="Covered Alternatives" id="8ab600f2-bdc6-4da2-8f5c-b4e932bdaa4c" role="columnheader" rowspan="1">Covered Alternatives</th></tr></thead></table></div></div><div class="k-grid-content" style="height: 476px;"><table class="k-selectable" data-role="selectable" role="treegrid" style="height: auto;"><colgroup><col class="k-hierarchy-col"/><col/><col/><col/><col/><col/></colgroup><tbody role="rowgroup"><tr class="k-master-row" data-uid="9c23ff5f-4fcc-4d1d-8d6a-2d56893e046e" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Tropicana Atlantic City</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td 
class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="63890" mmit_isformulary="false">120 </div></td></tr><tr class="k-alt k-master-row" data-uid="c964375c-83e7-41b0-af5f-84e4c26649be" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">WCA Hospital</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="63891" mmit_isformulary="false">120 </div></td></tr><tr class="k-master-row" data-uid="c3cf9b00-1515-41d9-a6a5-c2ab0a72b5a3" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Tropicana Entertainment Inc</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="63892" mmit_isformulary="false">120 </div></td></tr><tr class="k-alt k-master-row" data-uid="8b7dbae9-88db-426d-957a-09e35fcfcc54" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Riverside Health Care</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="63893" mmit_isformulary="false">120 </div></td></tr><tr class="k-master-row" data-uid="15f8ec55-d3b2-4f63-b0c9-bf3105b19bad" role="row"><td class="k-hierarchy-cell"><a 
class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Transylvania University</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="63894" mmit_isformulary="false">120 </div></td></tr><tr class="k-alt k-master-row" data-uid="32d72970-aeea-43e8-9c6b-18edbc00bfea" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">WestPoint Home</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="63895" mmit_isformulary="false">120 </div></td></tr><tr class="k-master-row" data-uid="13442ffb-67ef-4f90-8108-5b2cff58281c" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">CVR Partners</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="63896" mmit_isformulary="false">120 </div></td></tr><tr class="k-alt k-master-row" data-uid="b449ce4a-c522-4800-a8b6-aca35ef7a7c0" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">CVR Energy</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td 
class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="63897" mmit_isformulary="false">120 </div></td></tr><tr class="k-master-row" data-uid="94aa0a86-cac9-4a74-a413-d1cee3b9aff8" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Kentucky Teachers Retirement System</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="63898" mmit_isformulary="false">120 </div></td></tr><tr class="k-alt k-master-row" data-uid="7be626dd-3c80-45ea-b76f-46f6b662f383" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">SEIU Local 32BJ District 36 Benefits Fund</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="64379" mmit_isformulary="false">120 </div></td></tr><tr class="k-master-row" data-uid="ec9245d6-88d3-4a37-8346-221320a930bd" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Bass Pro Shops</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="64574" mmit_isformulary="false">120 </div></td></tr><tr class="k-alt k-master-row" data-uid="18c672c0-42f3-4f77-93b7-4b40b85169f8" role="row"><td class="k-hierarchy-cell"><a class="" href="#" 
tabindex="-1"></a></td><td class="col-plan" role="gridcell">Laborers Union Local 872</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="64575" mmit_isformulary="false">120 </div></td></tr><tr class="k-master-row" data-uid="630a2c44-4256-47f3-a1c7-d43635af1b30" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Ironworkers District Council of Tennessee Valley &amp; Vicinity</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="64576" mmit_isformulary="false">120 </div></td></tr><tr class="k-alt k-master-row" data-uid="3fb4e1b8-d1b4-487e-9348-1ed456e360df" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Teamsters Local 830</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="65073" mmit_isformulary="false">120 </div></td></tr><tr class="k-master-row" data-uid="b785e533-e7f6-45fc-83a7-275f3cd4001e" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">UA Local 74</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" 
role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="65079" mmit_isformulary="false">120 </div></td></tr><tr class="k-alt k-master-row" data-uid="5460f5b9-5882-45fc-b40b-6125084a51ad" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">1199SEIU Home Care Industry Benefit</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="3201" mmit_isformulary="false">121 </div></td></tr><tr class="k-master-row" data-uid="a2bf1ccb-66ae-43b8-a59a-8d19e5f79c9b" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">United Federation of Teachers (UFT)</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="15354" mmit_isformulary="false">121 </div></td></tr><tr class="k-alt k-master-row" data-uid="92710963-d8a7-4177-85b6-703437893e9d" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Medical University of South Carolina</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="15375" mmit_isformulary="false">121 </div></td></tr><tr class="k-master-row" data-uid="3fa555a5-f1ac-4b8c-aa45-45012275959c" role="row"><td 
class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">University of South Carolina by BlueChoice</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="61594" mmit_isformulary="false">121 </div></td></tr><tr class="k-alt k-master-row" data-uid="e51abae1-36b3-423c-a95b-7c6b783f43d7" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">School Employees Retirement System of Ohio (SERS)</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="17026" mmit_isformulary="false">58 </div></td></tr><tr class="k-master-row" data-uid="efe3a535-661b-4c29-9458-079db9151209" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Ohio Teachers Retirement System (STRS)</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="17027" mmit_isformulary="false">58 </div></td></tr><tr class="k-alt k-master-row" data-uid="39dc9ebd-bee8-4209-a4b7-dba8b7e58349" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Ohio PERS Non-Medicare Plan</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td 
class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="60209" mmit_isformulary="false">58 </div></td></tr><tr class="k-master-row" data-uid="1371938b-8242-4ad2-b7e7-6ccee0fabdb3" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Perdue Farms</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="62392" mmit_isformulary="false">121 </div></td></tr><tr class="k-alt k-master-row" data-uid="aba9a103-a22f-4d75-bec5-ae6e619029da" role="row"><td class="k-hierarchy-cell"><a class="" href="#" tabindex="-1"></a></td><td class="col-plan" role="gridcell">Montana State Fund Workers Comp</td><td class="col-status icon-status icon-status-not-covered" role="gridcell">Not Covered</td><td class="col-raw-status" role="gridcell">Not reimbursed</td><td class="col-restrictions" role="gridcell"></td><td class="col-alternatives" role="gridcell"><div class="HasAlternativesPopUp" mmit_id="65760" mmit_isformulary="false">121 </div></td></tr></tbody></table></div><div class="k-pager-wrap k-grid-pager k-widget k-floatwrap" data-role="pager"><a class="k-link k-pager-nav k-pager-first" data-page="1" href="#" tabindex="-1" title="Go to the first page"><span class="k-icon k-i-seek-w">Go to the first page</span></a><a class="k-link k-pager-nav" data-page="28" href="#" tabindex="-1" title="Go to the previous page"><span class="k-icon k-i-arrow-w">Go to the previous page</span></a><ul class="k-pager-numbers k-reset"><li class="k-current-page"><span class="k-link k-pager-nav">29</span></li><li><a class="k-link" data-page="20" href="#" tabindex="-1" 
title="More pages">...</a></li><li><a class="k-link" data-page="21" href="#" tabindex="-1">21</a></li><li><a class="k-link" data-page="22" href="#" tabindex="-1">22</a></li><li><a class="k-link" data-page="23" href="#" tabindex="-1">23</a></li><li><a class="k-link" data-page="24" href="#" tabindex="-1">24</a></li><li><a class="k-link" data-page="25" href="#" tabindex="-1">25</a></li><li><a class="k-link" data-page="26" href="#" tabindex="-1">26</a></li><li><a class="k-link" data-page="27" href="#" tabindex="-1">27</a></li><li><a class="k-link" data-page="28" href="#" tabindex="-1">28</a></li><li><span class="k-state-selected">29</span></li></ul><a class="k-link k-pager-nav k-state-disabled" data-page="29" href="#" tabindex="-1" title="Go to the next page"><span class="k-icon k-i-arrow-e">Go to the next page</span></a><a class="k-link k-pager-nav k-pager-last k-state-disabled" data-page="29" href="#" tabindex="-1" title="Go to the last page"><span class="k-icon k-i-seek-e">Go to the last page</span></a><span class="k-pager-info k-label">1401 - 1424 of 1424 items</span></div></div>'

Current behavior: the code takes the first item from page 1, appends it to total, and clicks the next-page button (to page 2). It then takes the second item from page 1, appends it to total, and clicks next page (to page 3). Then it takes the third item from page 1, appends it to total, clicks next page (to page 4), and so on.

Desired behavior: scrape all items on page 1 and append them to total, click next page, scrape all the new items on page 2 and append them to total, click next page, etc.

I am using a combination of BS4 and Selenium.

        html = self.browser.page_source
        total = [] 
        soup = BeautifulSoup(html, "lxml")

        for e in soup.find_all("div", {"id":"lookupDetailsGrid"}):
            #max_page gets me the total number of pages to go through (27 pages)
            max_page = e.find("a", {"title":"Go to the last page"})["data-page"]

            for page in range(1,int(max_page)):

                for detail_grid in e.find_all("tr", {"class":"k-master-row"}):

                    pharm = detail_grid.find("td", {"class":"col-plan"}).text
                    print(pharm)
                    data = {"Plan": pharm}
                    total.append(data)

                    time.sleep(5)

                    #Clicks next page button
                    self.browser.find_element_by_xpath('//*[@id="lookupDetailsGrid"]/div[3]/a[3]').click()

        df = pd.DataFrame(total)
        print(df)

The result is:

CVS Caremark Performance Standard Control w/Advanced Specialty Control
CVS Caremark Performance Standard Opt-Out w/ Advanced Specialty Control 
CVS Caremark Advanced Control Formulary
CVS Caremark Performance Standard Control
AT&T

These are the first 5 results on page 1. There are 50 per page.

Question: how should I arrange the indentation?
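One possible arrangement, as a minimal sketch rather than a tested fix for this site: read a fresh copy of the page source on every page, loop over all rows of that page, and only then click "next", once per page and outside the row loop. The names `scrape_all_pages`, `get_html`, `parse_rows` and `click_next` are hypothetical. In the script above they would presumably wrap `self.browser.page_source`, the `find_all("tr", {"class":"k-master-row"})` lookup, and the XPath click. Plain callables stand in for the browser so the loop structure can be run on its own.

```python
from typing import Callable, Dict, Iterable, List


def scrape_all_pages(
    get_html: Callable[[], str],
    parse_rows: Callable[[str], Iterable[str]],
    click_next: Callable[[], None],
    max_page: int,
) -> List[Dict[str, str]]:
    """Collect every row of every page, clicking 'next' once per page."""
    total: List[Dict[str, str]] = []
    for page in range(1, max_page + 1):   # + 1 so the last page is included
        html = get_html()                 # fresh snapshot of the current page
        for plan in parse_rows(html):     # inner loop: all rows on this page
            total.append({"Plan": plan})
        if page < max_page:               # outside the inner loop: one click per page
            click_next()
    return total


if __name__ == "__main__":
    # Stand-in for the browser: three fake "pages" of plan names.
    pages = ["A\nB", "C\nD", "E"]
    state = {"index": 0}

    def click() -> None:
        state["index"] += 1

    rows = scrape_all_pages(
        get_html=lambda: pages[state["index"]],
        parse_rows=str.splitlines,
        click_next=click,
        max_page=len(pages),
    )
    print([r["Plan"] for r in rows])  # ['A', 'B', 'C', 'D', 'E']
```

Two details matter here. The click sits at the indentation level of the page loop, not the row loop. The HTML is also re-read inside the page loop, because a `BeautifulSoup` object built once from an earlier `page_source` does not change when the browser moves to another page.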

Desired result: 50 items per page, 1330 items in total across 27 pages.
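The expected totals can be checked against the pager label in the grid (`k-pager-info`, which reads `1401 - 1424 of 1424 items` in the sample HTML above). The sketch below assumes the label always has the form "X - Y of Z items". The helper names `parse_pager_info` and `expected_pages` are made up for illustration.

```python
import math
import re
from typing import Tuple


def parse_pager_info(label: str) -> Tuple[int, int, int]:
    """Return (first_item, last_item, total_items) from a pager label."""
    match = re.search(r"(\d+)\s*-\s*(\d+)\s+of\s+(\d+)\s+items", label)
    if match is None:
        raise ValueError(f"unrecognised pager label: {label!r}")
    first, last, total = (int(group) for group in match.groups())
    return first, last, total


def expected_pages(total_items: int, page_size: int) -> int:
    """Number of pages needed to show total_items at page_size per page."""
    return math.ceil(total_items / page_size)


if __name__ == "__main__":
    first, last, total = parse_pager_info("1401 - 1424 of 1424 items")
    print(first, last, total)          # 1401 1424 1424
    print(expected_pages(total, 50))   # 29
    print(expected_pages(1330, 50))    # 27
```

Comparing `len(total)` from the scrape with the total in the label is a cheap way to notice a skipped page or a page that was scraped twice.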

Latest update:

html = self.browser.page_source
total = []
soup = BeautifulSoup(html, "lxml")

for grid in soup.find_all("div", {"class":"popup k-window-content k-content"}):

    for e in grid.find_all("div", {"id":"lookupDetailsGrid"}):
        max_page = e.find("a", {"title":"Go to the last page"})["data-page"]

        for b in range(1, int(max_page)):
            self.browser.find_element_by_xpath('//*[@id="lookupDetailsGrid"]/div[3]/a[3]').click()
            print("page_clicked")
            time.sleep(2)

            for detail_grid in e.find_all("tr", {"class":"k-master-row"}):
                pharm = detail_grid.find("td", {"class":"col-plan"}).text

                data = {"Plan": pharm}
                total.append(data)
                print(pharm)
...
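The fixed `time.sleep(2)` after each click may be too short on a slow response and wasteful on a fast one. A possible alternative, sketched here with a hypothetical helper `wait_until`, is to poll until the grid has actually changed, for example until the text of the first row differs from the previous page. Selenium ships `WebDriverWait` for the same purpose; the stdlib-only version below is an illustration, not a replacement for it.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 10.0,
    interval: float = 0.25,
) -> bool:
    """Poll condition() until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Fake "page" whose first row changes on the third poll.
    polls = {"count": 0}

    def first_row_changed() -> bool:
        polls["count"] += 1
        return polls["count"] >= 3

    print(wait_until(first_row_changed, timeout=2.0, interval=0.01))  # True
    print(wait_until(lambda: False, timeout=0.05, interval=0.01))     # False
```

In the scraping loop, the condition would compare the first plan name parsed from a freshly read `self.browser.page_source` with the one remembered before the click.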