Question

I need to extract the content of most common document file formats, such as:

PDF
HTML
DOC / DOCX, etc.

For most of these file formats I plan to use:

http://tika.apache.org/

However, Tika does not currently support MHTML (*.mht) files (http://en.wikipedia.org/wiki/MHTML). There are a few examples in C# (http://www.codeproject.com/KB/files/MhtBuilder.aspx), but I have not found any in Java.

I tried opening the *.mht file in 7-Zip and it failed, although WinZip was able to unpack the file into images and text (CSS, HTML, script) as separate text and binary files.

According to the MSDN page (http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content) and the Code Project page I mentioned earlier, MHT files use GZip compression.

Trying to decompress the file in Java results in the following exceptions. With java.util.zip.GZIPInputStream:

java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:16)

And with java.util.zip.ZipFile:

 java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(Unknown Source)
at java.util.zip.ZipFile.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:21)

Please suggest how to decompress it.

Thanks.

Favonius · Answer 1 · 13 July 2010

Frankly, I was not expecting a solution any time soon and was about to give up, but somehow I stumbled on these pages:

http://en.wikipedia.org/wiki/MIME#Multipart_messages

http://msdn.microsoft.com/en-us/library/ms527355%28EXCHG.10%29.aspx

They do not look like much at first glance, but if you read them carefully you will get the hint. After reading them, I fired up IE and saved a few random pages as *.mht files. Let me go through one line by line...

But let me explain beforehand that my ultimate goal was to separate/extract the HTML content and parse it. The solution is not complete in itself, because it depends on the character set or encoding I choose when saving. Even so, it will extract the individual files with minor hitches.

I hope this will be useful for anyone trying to parse/decompress *.mht/MHTML files :)

======= Explanation ======== ** Taken from an MHT file **

From: "Saved by Windows Internet Explorer 7"

This is the software used to save the file.

Subject: Google
Date: Tue, 13 Jul 2010 21:23:03 +0530
MIME-Version: 1.0

The subject, date, and MIME version: much like the mail format.

  Content-Type: multipart/related;
type="text/html";

This is the part that tells us it is a multipart document. A multipart document contains one or more distinct sets of data combined into a single body, and a multipart Content-Type field must appear in the entity's header. Here we can also see the type given as "text/html".

boundary="----=_NextPart_000_0007_01CB22D1.93BBD1A0"

Of all of this, this is the most important part. It is the unique delimiter that separates the different parts (HTML, images, CSS, script, etc.). Once you have it, everything becomes easy. Now I just have to iterate through the document, find the different sections, and save them according to their Content-Transfer-Encoding (base64, quoted-printable, etc.).
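The boundary handling described above can be sketched on its own. This is a minimal illustration, not the parser from this answer: the class name MhtSplitter and both method names are hypothetical, and it assumes the whole file has already been read into a String. Each delimiter line is "--" followed by the boundary value, and the final one has an extra "--" appended.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the MHTParser class in this answer. */
final class MhtSplitter {
    // Matches boundary="..." (quoted) or boundary=token (unquoted) in a Content-Type header.
    private static final Pattern BOUNDARY = Pattern.compile(
            "boundary\\s*=\\s*(?:\"([^\"]+)\"|([^\\s;]+))", Pattern.CASE_INSENSITIVE);

    /** Returns the boundary parameter value, or null if none is found. */
    static String extractBoundary(String headers) {
        Matcher m = BOUNDARY.matcher(headers);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    /** Splits the document into raw parts (part headers plus body) at the delimiter lines. */
    static List<String> splitParts(String mht, String boundary) {
        String delimiter = "--" + boundary;
        List<String> parts = new ArrayList<>();
        StringBuilder current = null; // stays null until the first delimiter is seen
        for (String line : mht.split("\r?\n", -1)) {
            if (line.startsWith(delimiter)) {
                if (current != null) {
                    parts.add(current.toString());
                }
                if (line.startsWith(delimiter + "--")) {
                    current = null; // closing delimiter: ignore anything after it
                    break;
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            parts.add(current.toString()); // tolerate a missing closing delimiter
        }
        return parts;
    }
}
```

Calling extractBoundary on the top-level headers and passing the result to splitParts yields one string per part; each string still starts with that part's own headers, followed by a blank line and the encoded body.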

SAMPLE

 ------=_NextPart_000_0007_01CB22D1.93BBD1A0
 Content-Type: text/html;
 charset="utf-8"
 Content-Transfer-Encoding: quoted-printable
 Content-Location: http://www.google.com/webhp?sourceid=navclient&ie=UTF-8

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" =
.
.
.
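In the sample above, the charset sits on a continuation line of the Content-Type header. A hedged sketch of reading one part's headers, joining such folded lines before looking up parameters, could look like this. PartHeaders, parse and parameter are hypothetical names introduced for illustration, and the input is assumed to be one raw part (headers, blank line, body).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the MHTParser class in this answer. */
final class PartHeaders {
    /** Parses the header block of one part; header names are lower-cased. */
    static Map<String, String> parse(String part) {
        Map<String, String> headers = new LinkedHashMap<>();
        String lastName = null;
        for (String line : part.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // the first blank line ends the header block
            }
            boolean folded = line.startsWith(" ") || line.startsWith("\t");
            if (folded && lastName != null) {
                // Continuation line: append it to the previous header's value.
                headers.merge(lastName, line.trim(), (a, b) -> a + " " + b);
            } else {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    lastName = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                    headers.put(lastName, line.substring(colon + 1).trim());
                }
            }
        }
        return headers;
    }

    /** Returns a parameter such as charset from a header value, or the fallback if absent. */
    static String parameter(String headerValue, String name, String fallback) {
        if (headerValue == null) {
            return fallback;
        }
        for (String piece : headerValue.split(";")) {
            String[] kv = piece.trim().split("=", 2);
            if (kv.length == 2 && kv[0].trim().equalsIgnoreCase(name)) {
                return kv[1].trim().replace("\"", "");
            }
        }
        return fallback;
    }
}
```

With the sample part, parse(...).get("content-transfer-encoding") gives "quoted-printable", and parameter(headers.get("content-type"), "charset", "us-ascii") gives "utf-8".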

** JAVA CODE **

An interface to define the constants.

public interface IConstants 
{
    public String BOUNDARY = "boundary";
    public String CHAR_SET = "charset";
    public String CONTENT_TYPE = "Content-Type";
    public String CONTENT_TRANSFER_ENCODING = "Content-Transfer-Encoding";
    public String CONTENT_LOCATION = "Content-Location";

    public String UTF8_BOM = "=EF=BB=BF";

    public String UTF16_BOM1 = "=FF=FE";
    public String UTF16_BOM2 = "=FE=FF";
}

The main parser class:

/**
 * This program and the accompanying materials are made available under the terms of the Eclipse Public License v1.0
 * which accompanies this distribution, and is available at
 * http://www.eclipse.org/legal/epl-v10.html
 */
package com.test.mht.core;

import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import sun.misc.BASE64Decoder;

/**
 * Parses a *.mht file and decomposes it into its constituent parts.
 * @author Manish Shukla 
 */

public class MHTParser implements IConstants
{
    private File mhtFile;
    private File outputFolder;

    public MHTParser(File mhtFile, File outputFolder) {
        this.mhtFile = mhtFile;
        this.outputFolder = outputFolder;
    }

    /**
     * @throws Exception
     */
    public void decompress() throws Exception
    {
        BufferedReader reader = null;

        String type = "";
        String encoding = "";
        String location = "";
        String filename = "";
        String charset = "utf-8";
        StringBuilder buffer = null;

        try
        {
            reader = new BufferedReader(new FileReader(mhtFile));

            final String boundary = getBoundary(reader);
            if(boundary == null)
                throw new Exception("Failed to find document 'boundary'... Aborting");

            String line = null;
            int i = 1;
            while((line = reader.readLine()) != null)
            {
                String temp = line.trim();
                if(temp.contains(boundary)) 
                {
                    if(buffer != null) {
                        writeBufferContentToFile(buffer,encoding,filename,charset);
                        buffer = null;
                    }

                    buffer = new StringBuilder();
                }else if(temp.startsWith(CONTENT_TYPE)) {
                    type = getType(temp);
                }else if(temp.startsWith(CHAR_SET)) {
                    charset = getCharSet(temp);
                }else if(temp.startsWith(CONTENT_TRANSFER_ENCODING)) {
                    encoding = getEncoding(temp);
                }else if(temp.startsWith(CONTENT_LOCATION)) {
                    location = temp.substring(temp.indexOf(":")+1).trim();
                    i++;
                    filename = getFileName(location,type);
                }else {
                    if(buffer != null) {
                        buffer.append(line + "\n");
                    }
                }
            }

        }finally 
        {
            if(null != reader)
                reader.close();
        }

    }

    private String getCharSet(String temp) 
    {
        String t = temp.split("=")[1].trim();
        return t.substring(1, t.length()-1);
    }

    /**
     * Save the file as per character set and encoding 
     */
    private void writeBufferContentToFile(StringBuilder buffer,String encoding, String filename, String charset) 
    throws Exception
    {

        if(!outputFolder.exists())
            outputFolder.mkdirs();

        byte[] content = null; 

        boolean text = true;

        if(encoding.equalsIgnoreCase("base64")){
            content = getBase64EncodedString(buffer);
            text = false;
        }else if(encoding.equalsIgnoreCase("quoted-printable")) {
            content = getQuotedPrintableString(buffer);         
        }
        else
            content = buffer.toString().getBytes();

        if(!text)
        {
            BufferedOutputStream bos = null;
            try
            {
                bos = new BufferedOutputStream(new FileOutputStream(filename));
                bos.write(content);
                bos.flush();
            }finally {
                bos.close();
            }
        }else 
        {
            BufferedWriter bw = null;
            try
            {
                bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(filename), charset));
                bw.write(new String(content));
                bw.flush();
            }finally {
                bw.close();
            }
        }
    }

    /**
     * When a *.mht file is saved with 'utf-8' encoding, the body starts with '=EF=BB=BF' (the encoded byte order mark).</br>
     * @see http://en.wikipedia.org/wiki/Byte_order_mark
     */
    private byte[] getQuotedPrintableString(StringBuilder buffer) 
    {
        //Set<String> uniqueHex = new HashSet<String>();
        //final Pattern p = Pattern.compile("(=\\p{XDigit}{2})*");

        String temp = buffer.toString().replaceAll(UTF8_BOM, "").replaceAll("=\n", "");

        //Matcher m = p.matcher(temp);
        //while(m.find()) {
        //  uniqueHex.add(m.group());
        //}

        //System.out.println(uniqueHex);

        //for (String hex : uniqueHex) {
            //temp = temp.replaceAll(hex, getASCIIValue(hex.substring(1)));
        //}     

        return temp.getBytes();
    }

    /*private String getASCIIValue(String hex) {
        return ""+(char)Integer.parseInt(hex, 16);
    }*/
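    /**
     * Decodes a base64 part. java.util.Base64 (Java 8+) replaces the internal
     * sun.misc.BASE64Decoder, which newer JDKs no longer ship; with this change
     * the sun.misc import at the top of the file is no longer needed.
     * The MIME decoder ignores the line breaks inside the encoded body.
     */
    private byte[] getBase64EncodedString(StringBuilder buffer) throws Exception {
        return java.util.Base64.getMimeDecoder().decode(buffer.toString());
    }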

    /**
     * Tries to get a qualified file name. If the name is not apparent it tries to guess it from the URL.
     * Otherwise it returns 'unknown.<type>'
     */
    private String getFileName(String location, String type) 
    {
        final Pattern p = Pattern.compile("(\\w|_|-)+\\.\\w+");
        String ext = "";
        String name = "";
        if(type.toLowerCase().endsWith("jpeg"))
            ext = "jpg";
        else
            ext = type.split("/")[1];

        if(location.endsWith("/")) {
            name = "main";
        }else
        {
            name = location.substring(location.lastIndexOf("/") + 1);

            Matcher m = p.matcher(name);
            String fname = "";
            while(m.find()) {
                fname = m.group();
            }

            if(fname.trim().length() == 0)
                name = "unknown";
            else
                return getUniqueName(fname.substring(0,fname.indexOf(".")), fname.substring(fname.indexOf(".") + 1, fname.length()));
        }
        return getUniqueName(name,ext);
    }

    /**
     * Returns a qualified unique output file path for the parsed path.</br>
     * If the file already exists, it appends a numerical suffix and tries again.
     */
    private String getUniqueName(String name,String ext)
    {
        int i = 1;
        File file = new File(outputFolder,name + "." + ext);
        if(file.exists())
        {
            while(true)
            {
                file = new File(outputFolder, name + i + "." + ext);
                if(!file.exists())
                    return file.getAbsolutePath();
                i++;
            }
        }

        return file.getAbsolutePath();
    }

    private String getType(String line) {
        return splitUsingColonSpace(line);
    }

    private String getEncoding(String line){
        return splitUsingColonSpace(line);
    }

    private String splitUsingColonSpace(String line) {
        return line.split(":\\s*")[1].replaceAll(";", "");
    }

    /**
     * Gives you the boundary string
     */
    private String getBoundary(BufferedReader reader) throws Exception 
    {
        String line = null;

        while((line = reader.readLine()) != null)
        {
            line = line.trim();
            if(line.startsWith(BOUNDARY)) {
                return line.substring(line.indexOf("\"") + 1, line.lastIndexOf("\""));
            }
        }

        return null;
    }
}
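Note that getQuotedPrintableString above only strips the BOM marker and the soft line breaks; the =XX escape decoding is commented out. A minimal sketch of a fuller quoted-printable decoder follows. It is an illustration under stated assumptions, not the answer's code: the class QuotedPrintable and its method names are hypothetical, and the encoded body is assumed to contain only ASCII characters, as quoted-printable requires.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

/** Hypothetical helper for illustration; not part of the MHTParser class in this answer. */
final class QuotedPrintable {
    /** Decodes =XX escapes and removes soft line breaks ("=" at the end of a line). */
    static byte[] decode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n = encoded.length();
        int i = 0;
        while (i < n) {
            char c = encoded.charAt(i);
            if (c != '=') {
                out.write(c); // literal character (assumed ASCII)
                i++;
            } else if (i + 1 < n && encoded.charAt(i + 1) == '\n') {
                i += 2; // soft line break "=\n"
            } else if (i + 2 < n && encoded.charAt(i + 1) == '\r'
                    && encoded.charAt(i + 2) == '\n') {
                i += 3; // soft line break "=\r\n"
            } else if (i + 2 < n && isHex(encoded.charAt(i + 1))
                    && isHex(encoded.charAt(i + 2))) {
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 3; // escaped byte "=XX"
            } else {
                out.write('='); // malformed escape: keep the '=' as-is
                i++;
            }
        }
        return out.toByteArray();
    }

    /** Decodes and interprets the resulting bytes in the part's declared charset. */
    static String decodeToString(String encoded, Charset charset) {
        return new String(decode(encoded), charset);
    }

    private static boolean isHex(char ch) {
        return Character.digit(ch, 16) >= 0;
    }
}
```

Decoding to bytes first and applying the charset afterwards matters for multi-byte encodings: in UTF-8, "=C3=A9" is one character, so the escapes cannot be replaced one at a time with (char) casts. A leading "=EF=BB=BF" decodes to a U+FEFF character at the start of the string, which can be dropped after decoding if it is not wanted.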

Regards,

wener · Answer 2 · 03 March 2015

You don't need to do this on your own.

With the dependency

<dependency>
    <groupId>org.apache.james</groupId>
    <artifactId>apache-mime4j</artifactId>
    <version>0.7.2</version>
</dependency>

Run it on your MHT file

public static void main(String[] args)
{
    MessageTree.main(new String[]{"YOUR MHT FILE PATH"});
}

MessageTree will

/**
 * Displays a parsed Message in a window. The window will be divided into
 * two panels. The left panel displays the Message tree. Clicking on a
 * node in the tree shows information on that node in the right panel.
 *
 * Some of this code have been copied from the Java tutorial's JTree section.
 */

Then you can look into it.

;-)

David Turner · Answer 3 · 12 April 2016

Late to the party, but expanding on @wener's answer for anyone else who stumbles across this.

The Apache Mime4J library appears to be the most readily accessible solution for processing EML or MHTML, far easier than rolling your own!

My prototype 'parseMhtToFile' function below rips HTML files and other artefacts out of a Cognos active report mht file, but it could be adapted to other purposes.

It is written in Groovy and requires the Apache Mime4J 'core' and 'dom' jars (currently 0.7.2).

import org.apache.james.mime4j.dom.Message
import org.apache.james.mime4j.dom.Multipart
import org.apache.james.mime4j.dom.field.ContentTypeField
import org.apache.james.mime4j.message.DefaultMessageBuilder
import org.apache.james.mime4j.stream.MimeConfig

/**
 * Use Mime4J MessageBuilder to parse an mhtml file (assumes multipart) into
 * separate html files.
 * Files will be written to outDir (or parent) as baseName + partIdx + ext.
 */
void parseMhtToFile(File mhtFile, File outDir = null) {
    if (!outDir) {outDir = mhtFile.parentFile }
    // File baseName will be used in generating new filenames
    def mhtBaseName = mhtFile.name.replaceFirst(~/\.[^\.]+$/, '')

    // -- Set up Mime parser, using Default Message Builder
    MimeConfig parserConfig  = new MimeConfig();
    parserConfig.setMaxHeaderLen(-1); // The default is a mere 10k
    parserConfig.setMaxLineLen(-1); // The default is only 1000 characters.
    parserConfig.setMaxHeaderCount(-1); // Disable the check for header count.
    DefaultMessageBuilder builder = new DefaultMessageBuilder();
    builder.setMimeEntityConfig(parserConfig);

    // -- Parse the MHT stream data into a Message object
    println "Parsing ${mhtFile}...";
    InputStream mhtStream = mhtFile.newInputStream()
    Message message = builder.parseMessage(mhtStream);

    // -- Process the resulting body parts, writing to file
    assert message.getBody() instanceof Multipart
    Multipart multipart = (Multipart) message.getBody();
    def parts = multipart.getBodyParts();
    parts.eachWithIndex { p, i ->
        ContentTypeField cType = p.header.getField('content-type')
        println "${p.class.simpleName}\t${i}\t${cType.mimeType}"

        // Assume mime sub-type is a "good enough" file-name extension 
        // e.g. text/html = html, image/png = png, application/json = json
        String partFileName = "${mhtBaseName}_${i}.${cType.subType}"
        File partFile = new File(outDir, partFileName)

        // Write part body stream to file
        println "Writing ${partFile}...";
        if (partFile.exists()) partFile.delete();
        InputStream partStream = p.body.inputStream;
        partFile.append(partStream);
    }
}

Usage is simple:

File mhtFile = new File('<path>', 'Report-en-au.mht')
parseMhtToFile(mhtFile)
println 'Done.'

Output:

Parsing <path>\Report-en-au.mht...
BodyPart    0   text/html
Writing <path>\Report-en-au_0.html...
BodyPart    1   image/png
Writing <path>\Report-en-au_1.png...
Done.

Thoughts on other improvements:

For 'text' MIME parts, you can access a Reader instead of a Stream, which may be more appropriate for text mining, as the OP requested.
For generated filename extensions, I would use another library to look up an appropriate extension rather than assume the MIME sub-type is adequate.
Handle single-part (non-multipart) and recursive multipart mhtml files and other complexities. These may require a MimeStreamParser with a custom ContentHandler implementation.
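As a rough illustration of the extension point above, here is a small sketch that checks a lookup table before falling back to the MIME sub-type. It is an assumption-laden stand-in for a real MIME database: the class MimeExtensions, the method extensionFor and the table contents are hypothetical choices, and Map.of requires Java 9 or later.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration; a real MIME database would cover far more types. */
final class MimeExtensions {
    // Types whose sub-type is not a usable file extension (illustrative, not exhaustive).
    private static final Map<String, String> KNOWN = Map.of(
            "text/html", "html",
            "text/css", "css",
            "text/plain", "txt",
            "text/javascript", "js",
            "application/javascript", "js",
            "application/x-javascript", "js",
            "application/octet-stream", "bin",
            "image/jpeg", "jpg",
            "image/svg+xml", "svg",
            "image/x-icon", "ico");

    /** Returns a file extension for a Content-Type value, ignoring any parameters. */
    static String extensionFor(String contentType) {
        String type = contentType.trim().toLowerCase(Locale.ROOT);
        int semi = type.indexOf(';');
        if (semi >= 0) {
            type = type.substring(0, semi).trim(); // drop "; charset=..." and similar
        }
        String known = KNOWN.get(type);
        if (known != null) {
            return known;
        }
        // Fall back to the sub-type, keeping only filename-safe characters.
        int slash = type.indexOf('/');
        String sub = slash >= 0 ? type.substring(slash + 1) : type;
        sub = sub.replaceAll("[^a-z0-9]", "");
        return sub.isEmpty() ? "bin" : sub;
    }
}
```

For example, extensionFor("image/jpeg") returns "jpg" where the raw sub-type would have produced "jpeg", and types missing from the table, such as "image/png", still fall back to the sub-type.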

Wajdy Essam · Answer 4 · 12 July 2010

I used http://jtidy.sourceforge.net to parse/read/index MHT files (but as ordinary files, not as compressed files).

Roki · Answer 5 · 12 July 2010

You could try http://www.chilkatsoft.com/mht-features.asp. It can pack/unpack, and you can then process the result as ordinary files. Download link: http://www.chilkatsoft.com/java.asp

How to read or parse MHTML (.mht) files in Java