Санировать HTML-контент с помощью Python - PullRequest
1 голос
/ 20 сентября 2019

Я работаю с внешним API, который отправляет мне текст из электронных писем HTML.Текст приходит без структуры HTML (например, <html> ... </html> и т. Д.).Мне нужно санировать этот текст и выводить в Slack.Я попытался использовать BeautifulSoup и Bleach, ни один из которых не работает, предположительно из-за частичного характера HTML во входных данных.

Пример входного текста выглядит так:

&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Bacon ipsum dolor amet cupim meatball ham hock pancetta ball tip ribeye cow brisket bresaola short ribs drumstick short loin. Turkey pastrami boudin andouille fatback tenderloin pork beef jowl rump hamburger buffalo capicola prosciutto. Meatball jerky pig filet mignon cow. Tenderloin flank tongue venison. Spare ribs fatback jerky pig boudin biltong filet mignon pancetta capicola.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Jerky salami brisket, landjaeger beef ribs meatball swine alcatra. Pork chop doner kielbasa jowl biltong tri-tip. Sausage sirloin prosciutto ribeye meatball capicola andouille picanha rump bacon turkey kevin pancetta landjaeger jowl. Spare ribs burgdoggen landjaeger buffalo capicola cow corned beef flank frankfurter boudin salami t-bone doner. Kevin filet mignon ribeye, pork belly andouille chuck pig drumstick. Short ribs tri-tip ball tip rump flank.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Pig biltong doner fatback. Tail hamburger kielbasa pastrami buffalo boudin cupim, pig jerky prosciutto venison pork chop chuck sirloin kevin. Bresaola bacon drumstick ball tip salami ribeye capicola beef ribs. Meatball tenderloin drumstick bresaola rump short ribs. Salami venison chuck burgdoggen.&lt;/div&gt;
&lt;div style=&#39;box-sizing:border-box;margin:0px 0px 24px;background-image:initial;background-position:initial;background-size:initial;background-repeat:initial;background-origin:initial;background-clip:initial;border:0px;padding:0px;vertical-align:baseline;color:rgb(51,51,51);font-family:Georgia,&quot;Bitstream Charter&quot;,serif;font-size:16px&#39;&gt;Strip steak ham prosciutto, biltong meatball kielbasa boudin shankle ground round bacon. Alcatra short loin chuck shankle hamburger shank, buffalo sausage turkey prosciutto tongue kielbasa venison. Shank cow turducken beef ribs meatloaf pork belly. Pastrami leberkas ball tip pancetta short loin sirloin turducken rump hamburger cupim strip steak ground round brisket filet mignon pork. Beef shankle kevin tail picanha bacon beef ribs cow ground round pig ham rump. Bresaola spare ribs tenderloin pastrami, ham jowl short loin hamburger shankle tail venison pig meatloaf.&lt;/div&gt;

Я хотел бы следующий вывод для ввода выше:

Bacon ipsum dolor amet cupim meatball ham hock pancetta ball tip ribeye cow brisket bresaola short ribs drumstick short loin. Turkey pastrami boudin andouille fatback tenderloin pork beef jowl rump hamburger buffalo capicola prosciutto. Meatball jerky pig filet mignon cow. Tenderloin flank tongue venison. Spare ribs fatback jerky pig boudin biltong filet mignon pancetta capicola.
Jerky salami brisket, landjaeger beef ribs meatball swine alcatra. Pork chop doner kielbasa jowl biltong tri-tip. Sausage sirloin prosciutto ribeye meatball capicola andouille picanha rump bacon turkey kevin pancetta landjaeger jowl. Spare ribs burgdoggen landjaeger buffalo capicola cow corned beef flank frankfurter boudin salami t-bone doner. Kevin filet mignon ribeye, pork belly andouille chuck pig drumstick. Short ribs tri-tip ball tip rump flank.
Pig biltong doner fatback. Tail hamburger kielbasa pastrami buffalo boudin cupim, pig jerky prosciutto venison pork chop chuck sirloin kevin. Bresaola bacon drumstick ball tip salami ribeye capicola beef ribs. Meatball tenderloin drumstick bresaola rump short ribs. Salami venison chuck burgdoggen.
Strip steak ham prosciutto, biltong meatball kielbasa boudin shankle ground round bacon. Alcatra short loin chuck shankle hamburger shank, buffalo sausage turkey prosciutto tongue kielbasa venison. Shank cow turducken beef ribs meatloaf pork belly. Pastrami leberkas ball tip pancetta short loin sirloin turducken rump hamburger cupim strip steak ground round brisket filet mignon pork. Beef shankle kevin tail picanha bacon beef ribs cow ground round pig ham rump. Bresaola spare ribs tenderloin pastrami, ham jowl short loin hamburger shankle tail venison pig meatloaf.

Я использовал следующую простую процедуру Bleach:

def textify(html):
 text = bleach.clean(html)
 return text

С BeautifulSoup я также использовал некоторые регулярные выражения для очистки вывода:

def textify(html):
  html = re.sub('<br>', '\n', html)
  soup = BeautifulSoup(html)
  text = soup.getText()
  text = re.sub(r'\&lt;', '<', text)
  text = re.sub(r'\&gt\;', '>', text)
  text = re.sub(r'\&\#39\;', "'", text)
  return text

1 Ответ

1 голос
/ 20 сентября 2019

Сначала необходимо удалить строки из строки перед тем, как передать их в отбеливатель или в красивую пару, используя html-модуль стандартной библиотеки :

from html import unescape

html = "&lt;div style=&#39;bo...div&gt;"
unescaped_html = unescape(html)

text = bleach.clean(unescaped_html)
soup = BeautifulSoup(unescaped_html)
...