Question

I am trying to build a nested table of contents from HTML heading tags.

My HTML file:

<html>
<head>
  <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
  <h1>
            My report Name
  </h1>
  <h1 id="2">First Chapter                          </h1>
  <h2 id="3"> First Sub-chapter of the first chapter</h2>
  <ul>
    <h1 id="text1">Useless h1</h1>
    <p>
      some text
    </p>
  </ul>
  <h2 id="4">Second Sub-chapter of the first chapter </h2>
  <ul>
    <h1 id="text2">Useless h1</h1>
    <p>
      some text
    </p>
  </ul>
  <h1 id="5">Second Chapter                          </h1>
  <h2 id="6">First Sub-chapter of the Second chapter </h2>
  <ul>
    <h1 id="text6">Useless h1</h1>
    <p>
      some text
    </p>
  </ul>
  <h2 id="7">Second Sub-chapter of the Second chapter </h2>
  <ul>
    <h1 id="text6">Useless h1</h1>
    <p>
      some text
    </p>
  </ul>
</body>
</html>

My Python code:

from lxml import html
from bs4 import BeautifulSoup as soup
import re
import codecs
#Access to the local URL(Html file)
f = codecs.open("C:\\x\\test.html", 'r')
page = f.read()
f.close()
#html parsing
page_soup = soup(page,"html.parser")
tree = html.fromstring(page)
#extract report name
ref = page_soup.find("h1",{"id": False}).text.strip()
print("the name of the report is : " + ref + " \n")

chapters = page_soup.findAll('h1', attrs={'id': re.compile("^[0-9]*$")})
print("We have " + str(len(chapters)) + " chapter(s)")
for index, chapter in enumerate(chapters):
    print(str(index+1) +"-" + str(chapter.text.strip()) + "\n")

sub_chapters = page_soup.findAll('h2', attrs={'id': re.compile("^[0-9]*$")})
print("We have " + str(len(sub_chapters)) + " sub_chapter(s)")
for index, sub_chapter in enumerate(sub_chapters):
    print(str(index+1) +"-" +str(sub_chapter.text.strip()) + "\n")

With this code I can get all the chapters and all the sub-chapters, but that is not my goal.

My goal is to get the following as the table of contents:

1-First Chapter
    1-First sub-chapter of the first chapter
    2-Second sub-chapter of the first chapter
2-Second Chapter    
    1-First sub-chapter of the Second chapter
    2-Second sub-chapter of the Second chapter

Any recommendations or ideas on how to achieve this table-of-contents format?

bob0the0mighty · Answer 1 · March 12, 2019

If you are willing to change your HTML layout to something like the following:

<html>

<head>
  <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>

<body>
  <article>
    <h1>
      My report Name
    </h1>
    <section>
      <h2 id="chapter-one">First Chapter</h2>
      <section>
        <h3 id="one-one"> First Sub-chapter of the first chapter</h3>
        <ul>
          <h4 id="text1">Useless h4</h4>
          <p>
            some text
          </p>
        </ul>
      </section>
      <section>
        <h3 id="one-two">Second Sub-chapter of the first chapter</h3>
        <ul>
          <h4 id="text2">Useless h4</h4>
          <p>
            some text
          </p>
        </ul>
      </section>
    </section>
    <section>
      <h2 id="chapter-two">Second Chapter </h2>
      <section>
        <h3 id="two-one">First Sub-chapter of the Second chapter</h3>
        <ul>
          <h4 id="text6">Useless h4</h4>
          <p>
            some text
          </p>
        </ul>
      </section>
      <section>
        <h3 id="two-two">Second Sub-chapter of the Second chapter</h3>
        <ul>
          <h4 id="text6">Useless h4</h4>
          <p>
            some text
          </p>
        </ul>
      </section>
    </section>
  </article>
</body>

</html>

Then your Python code becomes a bit simpler:

from lxml import html
from bs4 import BeautifulSoup as soup
import re
import codecs

# Access the local HTML file
with codecs.open("index.html", 'r') as f:
    page = f.read()

# HTML parsing
page_soup = soup(page, "html.parser")
tree = html.fromstring(page)

# Extract the report name (the only h1 in the new layout)
ref = page_soup.find("h1").text.strip()
print("the name of the report is : " + ref + " \n")

# Chapters are the h2 headings; sub-chapters are the h3 headings
# inside the same enclosing section
chapters = page_soup.find_all('h2')
for index, chapter in enumerate(chapters):
    print(str(index+1) + "-" + str(chapter.text.strip()))
    sub_chapters = chapter.find_parent().find_all("h3")
    for index2, sub_chapter in enumerate(sub_chapters):
        print("\t" + str(index2+1) + "-" + str(sub_chapter.text.strip()))

I slightly updated the page-reading code and tried to make the updated script more idiomatic Python.

Also note that in:

sub_chapters = chapter.find_parent().find_all("h3")

find_all is called on the chapter's parent element, not on the whole document, so it returns only the h3 headings that belong to that chapter.
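The difference between the two search scopes can be seen in a small sketch. The SAMPLE markup below is a hypothetical, shortened version of the restructured layout, not the full file from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the restructured layout above
SAMPLE = """
<article>
  <section>
    <h2>First Chapter</h2>
    <section><h3>Sub 1.1</h3></section>
    <section><h3>Sub 1.2</h3></section>
  </section>
  <section>
    <h2>Second Chapter</h2>
    <section><h3>Sub 2.1</h3></section>
  </section>
</article>
"""

page_soup = BeautifulSoup(SAMPLE, "html.parser")
first_chapter = page_soup.find("h2")

# Searching from the document root returns every h3 in the file
all_h3 = [h.text for h in page_soup.find_all("h3")]

# Searching from the chapter's parent <section> returns only its own h3 tags
scoped_h3 = [h.text for h in first_chapter.find_parent().find_all("h3")]

print(all_h3)     # ['Sub 1.1', 'Sub 1.2', 'Sub 2.1']
print(scoped_h3)  # ['Sub 1.1', 'Sub 1.2']
```

This only works because each chapter heading sits directly inside its own section; with the flat layout from the question, the parent of every heading is body and the scoped search would return all sub-chapters.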

Ajax1234 · Answer 2 · March 12, 2019

You can use itertools.groupby after collecting all the data associated with each chapter:

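from itertools import groupby, count
import re
from bs4 import BeautifulSoup as soup

# `content` is assumed to hold the HTML source from the question as a string
# Collect [tag name, text] for every h1/h2 whose id is purely numeric
data = [[i.name, re.sub(r'\s+$', '', i.text)] for i in soup(content, 'html.parser').find_all(re.compile('h1|h2'), {'id':re.compile(r'^\d+$')})]
# Split the headings into alternating runs: an h1 run, then the h2 run that follows it
grouped, _count = [[a, list(b)] for a, b in groupby(data, key=lambda x:x[0] == 'h1')], count(1)
# Pair each chapter title with the titles of its sub-chapters
new_grouped = [[grouped[i][-1][0][-1], [c for _, c in grouped[i+1][-1]]] for i in range(0, len(grouped), 2)]
final_string = '\n'.join(f'{next(_count)}-{a}\n'+'\n'.join(f'\t{i}-{c}' for i, c in enumerate(b, 1)) for a, b in new_grouped)
print(final_string)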

Output:

1-First Chapter
    1- First Sub-chapter of the first chapter
    2-Second Sub-chapter of the first chapter
2-Second Chapter
    1-First Sub-chapter of the Second chapter
    2-Second Sub-chapter of the Second chapter
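The same result can probably be reached without groupby, by walking the numbered headings once in document order and keeping two counters. This is only a sketch under the assumptions of the question's HTML (chapters are h1 and sub-chapters are h2, both with purely numeric ids); build_toc and the sample string are hypothetical names introduced here, not part of either answer:

```python
import re
from bs4 import BeautifulSoup


def build_toc(html_text):
    """Return a nested table of contents as a tab-indented string."""
    page_soup = BeautifulSoup(html_text, "html.parser")
    # Only headings with a purely numeric id count as chapters/sub-chapters;
    # find_all returns them in document order
    headings = page_soup.find_all(["h1", "h2"], id=re.compile(r"^\d+$"))

    lines = []
    chapter_no = 0
    sub_no = 0
    for tag in headings:
        title = tag.get_text(strip=True)
        if tag.name == "h1":
            # A new chapter starts: bump its number, restart sub-chapter numbering
            chapter_no += 1
            sub_no = 0
            lines.append(f"{chapter_no}-{title}")
        else:
            sub_no += 1
            lines.append(f"\t{sub_no}-{title}")
    return "\n".join(lines)


# Hypothetical, shortened sample in the flat layout used in the question
sample = (
    '<h1>My report Name</h1>'
    '<h1 id="2">First Chapter   </h1>'
    '<h2 id="3"> First Sub-chapter</h2>'
    '<ul><h1 id="text1">Useless h1</h1></ul>'
    '<h2 id="4">Second Sub-chapter</h2>'
    '<h1 id="5">Second Chapter</h1>'
    '<h2 id="6">Third Sub-chapter</h2>'
)
print(build_toc(sample))
```

Unlike the groupby version, this strips whitespace on both sides of each title and does not require every chapter to be followed by at least one sub-chapter.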

Automatically generate a nested table of contents from heading tags using Python
