Replace a table in an HTML file with text (e.g. @@## There was a table here)
1 vote
/ April 25, 2019

I am extracting text from an HTML file in Python using BeautifulSoup. I want to extract all the text data and drop the tables. Is there a way to replace each table in the HTML with a piece of text instead (e.g. "@@## There was a table here @@##")?

I was able to read the HTML file with BeautifulSoup and remove the tables using strip_tables(html). But I am not sure how to remove a table and replace it with text indicating that a table was there.

from bs4 import BeautifulSoup

def strip_tables(soup):
    """Removes all tables from the soup object."""
    for table in soup(["table"]):
        table.extract()
    return soup

sample_html_file = "/Path/file.html"
# read_from_file reads the file and returns a file handle for BeautifulSoup
html = read_from_file(sample_html_file)
soup = BeautifulSoup(html, "lxml")
my_text = strip_tables(soup).text

This is the text of the HTML file, including the table:

By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine  President and Chief Executive OfficerSunnyvale, California  October 4, 2018

Table of Contents  TABLE OF CONTENTS             Page   QUESTIONS AND ANSWERS REGARDING  THIS SOLICITATION AND VOTING AT THE ANNUAL MEETING      1   PROPOSAL ONEELECTION OF  DIRECTORS      7   Classes of our Board      7   Director NomineesClass III Directors      7   Continuing DirectorsClass I and Class II Directors      8   Board of Directors Recommendation      11   PROPOSAL TWOTO APPROVE  AN AMENDMENT TO OUR 2016 EQUITY INCENTIVE PLAN TO INCREASE THE NUMBER OF SHARES OF COMMON STOCK AUTHORIZED FOR ISSUANCE UNDER SUCH PLAN      12   Summary of the Amended 2016 Plan      13   Summary of U.S. Federal Income Tax Consequences      20   New Plan Benefits      22   Existing Plan Benefits to Employees and Directors      23   Board of Directors Recommendation      23   PROPOSAL THREETO APPROVE  AN AMENDMENT TO OUR 2007 EMPLOYEE STOCK PURCHASE PLAN TO INCREASE THE NUMBER OF SHARES OF COMMON STOCK AUTHORIZED FOR ISSUANCE UNDER SUCH PLAN        A-1   APPENDIX B     AMENDED AND RESTATED 2007 EMPLOYEE STOCK PURCHASE PLAN      B-1    ii    Table of Contents    PROXY STATEMENT FOR  ACCURAY INCORPORATED  2018 ANNUAL MEETING OF STOCKHOLDERS  TO BE HELD ON NOVEMBER 16, 2018      

This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)

This is the data after strip_tables:

By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine  President and Chief Executive OfficerSunnyvale, California  October 4, 2018
     This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)

Expected result:

By order of the Board of Directors, /s/ JOSHUA H. LEVINE Joshua H. Levine  President and Chief Executive OfficerSunnyvale, California  October 4, 2018 
" @@## There was a table here @@## "
This proxy statement (Proxy Statement) is furnished to our stockholders of record as of the close of business on September 20, 2018 (the Record Date)

1 Answer

2 votes
/ April 26, 2019

Try using replace_with() (spelled replaceWith() in older BeautifulSoup code) instead of extract() in your strip_tables function. Hope this helps.

def strip_tables(soup):
    """Replaces every table in the soup object with a placeholder string."""
    for table in soup(["table"]):
        table.replace_with(" @@## There was a table here @@## ")
    return soup
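
For anyone who wants to try this without the original file, here is a minimal self-contained sketch of the same idea. It assumes BeautifulSoup 4 is installed. The names replace_tables and PLACEHOLDER and the sample HTML are made up for illustration. It uses Python's built-in "html.parser" rather than "lxml" so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical placeholder text; use whatever marker suits your pipeline.
PLACEHOLDER = " @@## There was a table here @@## "


def replace_tables(soup, placeholder=PLACEHOLDER):
    """Swap every <table> element for a plain-text placeholder, in place."""
    for table in soup.find_all("table"):
        # replace_with() removes the tag from the tree and puts the
        # given string in the same position.
        table.replace_with(placeholder)
    return soup


# Illustrative input, not the proxy statement from the question.
sample = (
    "<html><body><p>Before the table.</p>"
    "<table><tr><td>Page</td><td>1</td></tr></table>"
    "<p>After the table.</p></body></html>"
)

soup = replace_tables(BeautifulSoup(sample, "html.parser"))
text = soup.get_text()
print(text)
```

Because the placeholder is inserted where the table used to be, it appears in the extracted text between the surrounding paragraphs, which is what the expected result in the question asks for.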