Question

Существует ли какая-либо утилита (или пример исходного кода), которая усекает HTML (для предварительного просмотра) в Java? Я хочу выполнить усечение на сервере, а не на клиенте.

Я использую HTMLUnit для разбора HTML.

UPDATE:
Я хочу иметь возможность предварительного просмотра HTML-кода, чтобы усечение сохраняло структуру HTML-кода, удаляя элементы после желаемой длины вывода.

Scott Brady · Answer 1 · 07 декабря 2011

Я написал еще одну Java-версию truncateHTML. Эта функция усекает строку до количества символов, сохраняя целые слова и теги HTML.

public static String truncateHTML(String text, int length, String suffix) {
    // if the plain text is shorter than the maximum length, return the whole text
    if (text.replaceAll("<.*?>", "").length() <= length) {
        return text;
    }
    String result = "";
    boolean trimmed = false;
    if (suffix == null) {
        suffix = "...";
    }

    /*
     * This pattern creates tokens, where each line starts with the tag.
     * For example, "One, <b>Two</b>, Three" produces the following:
     *     One,
     *     <b>Two
     *     </b>, Three
     */
    Pattern tagPattern = Pattern.compile("(<.+?>)?([^<>]*)");

    /*
     * Checks for an empty tag, for example img, br, etc.
     */
    Pattern emptyTagPattern = Pattern.compile("^<\\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param).*>$");

    /*
     * Modified the pattern to also include H1-H6 tags
     * Checks for closing tags, allowing leading and ending space inside the brackets
     */
    Pattern closingTagPattern = Pattern.compile("^<\\s*/\\s*([a-zA-Z]+[1-6]?)\\s*>$");

    /*
     * Modified the pattern to also include H1-H6 tags
     * Checks for opening tags, allowing leading and ending space inside the brackets
     */
    Pattern openingTagPattern = Pattern.compile("^<\\s*([a-zA-Z]+[1-6]?).*?>$");

    /*
     * Find &nbsp; &gt; ...
     */
    Pattern entityPattern = Pattern.compile("(&[0-9a-z]{2,8};|&#[0-9]{1,7};|[0-9a-f]{1,6};)");

    // splits all html-tags to scanable lines
    Matcher tagMatcher =  tagPattern.matcher(text);
    int numTags = tagMatcher.groupCount();

    int totalLength = suffix.length();
    List<String> openTags = new ArrayList<String>();

    boolean proposingChop = false;
    while (tagMatcher.find()) {
        String tagText = tagMatcher.group(1);
        String plainText = tagMatcher.group(2);

        if (proposingChop &&
                tagText != null && tagText.length() != 0 &&
                plainText != null && plainText.length() != 0) {
            trimmed = true;
            break;
        }

        // if there is any html-tag in this line, handle it and add it (uncounted) to the output
        if (tagText != null && tagText.length() > 0) {
            boolean foundMatch = false;

            // if it's an "empty element" with or without xhtml-conform closing slash
            Matcher matcher = emptyTagPattern.matcher(tagText);
            if (matcher.find()) {
                foundMatch = true;
                // do nothing
            }

            // closing tag?
            if (!foundMatch) {
                matcher = closingTagPattern.matcher(tagText);
                if (matcher.find()) {
                    foundMatch = true;
                    // delete tag from openTags list
                    String tagName = matcher.group(1);
                    openTags.remove(tagName.toLowerCase());
                }
            }

            // opening tag?
            if (!foundMatch) {
                matcher = openingTagPattern.matcher(tagText);
                if (matcher.find()) {
                    // add tag to the beginning of openTags list
                    String tagName = matcher.group(1);
                    openTags.add(0, tagName.toLowerCase());
                }
            }

            // add html-tag to result
            result += tagText;
        }

        // calculate the length of the plain text part of the line; handle entities (e.g. &nbsp;) as one character
        int contentLength = plainText.replaceAll("&[0-9a-z]{2,8};|&#[0-9]{1,7};|[0-9a-f]{1,6};", " ").length();
        if (totalLength + contentLength > length) {
            // the number of characters which are left
            int numCharsRemaining = length - totalLength;
            int entitiesLength = 0;
            Matcher entityMatcher = entityPattern.matcher(plainText);
            while (entityMatcher.find()) {
                String entity = entityMatcher.group(1);
                if (numCharsRemaining > 0) {
                    numCharsRemaining--;
                    entitiesLength += entity.length();
                } else {
                    // no more characters left
                    break;
                }
            }

            // keep us from chopping words in half
            int proposedChopPosition = numCharsRemaining + entitiesLength;
            int endOfWordPosition = plainText.indexOf(" ", proposedChopPosition-1);
            if (endOfWordPosition == -1) {
                endOfWordPosition = plainText.length();
            }
            int endOfWordOffset = endOfWordPosition - proposedChopPosition;
            if (endOfWordOffset > 6) { // chop the word if it's extra long
                endOfWordOffset = 0;
            }

            proposedChopPosition = numCharsRemaining + entitiesLength + endOfWordOffset;
            if (plainText.length() >= proposedChopPosition) {
                result += plainText.substring(0, proposedChopPosition);
                proposingChop = true;
                if (proposedChopPosition < plainText.length()) {
                    trimmed = true;
                    break; // maximum length is reached, so get off the loop
                }
            } else {
                result += plainText;
            }
        } else {
            result += plainText;
            totalLength += contentLength;
        }
        // if the maximum length is reached, get off the loop
        if(totalLength >= length) {
            trimmed = true;
            break;
        }
    }

    for (String openTag : openTags) {
        result += "</" + openTag + ">";
    }
    if (trimmed) {
        result += suffix;
    }
    return result;
}

Nico · Answer 2 · 09 июня 2011

Здесь есть функция PHP, которая делает это: http://snippets.dzone.com/posts/show/7125

Я сделал быстрый и грязный порт Java для начальной версии, но в комментариях есть последующие улучшенные версии, которые стоит рассмотреть (особенно те, которые работают с целыми словами):

public static String truncateHtml(String s, int l) {
  Pattern p = Pattern.compile("<[^>]+>([^<]*)");

  int i = 0;
  List<String> tags = new ArrayList<String>();

  Matcher m = p.matcher(s);
  while(m.find()) {
      if (m.start(0) - i >= l) {
          break;
      }

      String t = StringUtils.split(m.group(0), " \t\n\r\0\u000B>")[0].substring(1);
      if (t.charAt(0) != '/') {
          tags.add(t);
      } else if ( tags.get(tags.size()-1).equals(t.substring(1))) {
          tags.remove(tags.size()-1);
      }
      i += m.start(1) - m.start(0);
  }

  Collections.reverse(tags);
  return s.substring(0, Math.min(s.length(), l+i))
      + ((tags.size() > 0) ? "</"+StringUtils.join(tags, "></")+">" : "")
      + ((s.length() > l) ? "\u2026" : "");

}

Примечание: Вам понадобится Apache Commons Lang для StringUtils.join().

Stefan Kendall · Answer 3 · 23 марта 2010

Я думаю, вам нужно написать свой собственный XML-парсер, чтобы выполнить это. Вытащите узел тела, добавьте узлы, пока длина двоичного кода не станет <некоторый фиксированный размер, а затем пересоберите документ. Если HTMLUnit не создает семантический XHTML, я бы порекомендовал <a href="http://home.ccil.org/~cowan/XML/tagsoup/" rel="nofollow noreferrer"> tagsoup .

Если вам нужен анализатор / обработчик XML, я бы порекомендовал XOM .

user3615395 · Answer 4 · 09 сентября 2014

public class SimpleHtmlTruncator {

    public static String truncateHtmlWords(String text, int max_length) {
        String input = text.trim();
        if (max_length > input.length()) {
            return input;
        }
        if (max_length < 0) {
            return new String();
        }
        StringBuilder output = new StringBuilder();
        /**
         * Pattern pattern_opentag = Pattern.compile("(<[^/].*?[^/]>).*");
         * Pattern pattern_closetag = Pattern.compile("(</.*?[^/]>).*"); Pattern
         * pattern_selfclosetag = Pattern.compile("(<.*?/>).*");*
         */
        String HTML_TAG_PATTERN = "<(\"[^\"]*\"|'[^']*'|[^'\">])*>";
        Pattern pattern_overall = Pattern.compile(HTML_TAG_PATTERN + "|" + "\\s*\\w*\\s*");
        Pattern pattern_html = Pattern.compile("(" + HTML_TAG_PATTERN + ")" + ".*");
        Pattern pattern_words = Pattern.compile("(\\s*\\w*\\s*).*");
        int characters = 0;
        Matcher all = pattern_overall.matcher(input);
        while (all.find()) {
            String matched = all.group();
            Matcher html_matcher = pattern_html.matcher(matched);
            Matcher word_matcher = pattern_words.matcher(matched);
            if (html_matcher.matches()) {
                output.append(html_matcher.group());
            } else if (word_matcher.matches()) {
                if (characters < max_length) {
                    String word = word_matcher.group();
                    if (characters + word.length() < max_length) {
                        output.append(word);
                    } else {
                        output.append(word.substring(0,
                                (max_length - characters) > word.length()
                                ? word.length() : (max_length - characters)));
                    }
                    characters += word.length();
                }
            }
        }
        return output.toString();
    }

    public static void main(String[] args) {
        String text = SimpleHtmlTruncator.truncateHtmlWords("<html><body><br/><p>abc</p><p>defghij</p><p>ghi</p></body></html>", 4);
        System.out.println(text);
    }
}

Ralph · Answer 5 · 03 января 2013

Я нашел этот блог: dencat: усечение HTML в Java

Содержит порт Java Python, функция шаблона Django truncate_html_words

David Z · Answer 6 · 24 марта 2010

Я могу предложить вам сценарий Python, который я написал для этого: http://www.ellipsix.net/ext-tmp/summarize.txt. К сожалению, у меня нет версии Java, но вы можете перевести ее самостоятельно и изменить, если хотите. Это не очень сложно, просто что-то, что я взломал вместе для своего сайта, но я использую его чуть больше года, и в целом, похоже, работает довольно хорошо.

Если вам нужно что-то надежное, то синтаксический анализатор XML (или SGML) почти наверняка станет лучшей идеей, чем я.

HTML усечение в Java

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

Ответы [ 6 ]

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

HTML усечение в Java

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

Ответы [ 6 ]

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Похожие темы