Question

My simple requirement: read a huge (> one million lines) text file line by line (for this example, assume it is a CSV of some kind) and keep a reference to the start of each line for faster lookup later (read a line starting at X).

First I tried the naive and easy way, using a StreamReader and accessing the underlying BaseStream.Position. Unfortunately, that does not work the way I intended:

Given a file containing the following

Foo
Bar
Baz
Bla
Fasel

and this very simple code

using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
  string line;
  long pos = sr.BaseStream.Position;
  while ((line = sr.ReadLine()) != null) {
    Console.Write("{0:d3} ", pos);
    Console.WriteLine(line);
    pos = sr.BaseStream.Position;
  }
}

the output is:

000 Foo
025 Bar
025 Baz
025 Bla
025 Fasel

I can imagine that the stream is trying to be helpful/efficient and probably reads in (large) chunks whenever new data is needed. For me this is bad.
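The buffering guess can be checked without touching the disk. The sketch below is only an illustration: it assumes the sample file uses CRLF line endings with no trailing newline (which gives 25 bytes, consistent with the 025 printed above), and the names BufferingDemo and PositionAfterFirstLine are made up for this example.

```csharp
using System;
using System.IO;
using System.Text;

static class BufferingDemo
{
    // Returns BaseStream.Position right after the first ReadLine call.
    public static long PositionAfterFirstLine(byte[] data)
    {
        using (var sr = new StreamReader(new MemoryStream(data)))
        {
            sr.ReadLine(); // the caller has only seen "Foo" at this point
            return sr.BaseStream.Position;
        }
    }

    static void Main()
    {
        // Assumed file content: five lines, CRLF terminators, no trailing newline.
        byte[] data = Encoding.ASCII.GetBytes("Foo\r\nBar\r\nBaz\r\nBla\r\nFasel");
        Console.WriteLine(data.Length);                  // 25
        Console.WriteLine(PositionAfterFirstLine(data)); // 25: the whole input was buffered
    }
}
```

The reader fills its internal buffer on the first read, so the underlying stream is already at the end of this small input while the caller has only received the first line.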

Finally, the question: is there a way to get the (byte, char) offset while reading a file line by line, without using a basic Stream and messing with \r, \n, \r\n, string encoding, and so on manually? Not a big deal, really; I just don't like building things that might already exist.

Thomas Levesque · Answer 1 · 07 April 2010

You can create a TextReader wrapper that tracks the current position in the underlying TextReader:

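public class TrackingTextReader : TextReader
{
    private readonly TextReader _baseReader;
    private int _position; // characters consumed so far (not bytes)

    public TrackingTextReader(TextReader baseReader)
    {
        _baseReader = baseReader;
    }

    public override int Read()
    {
        int c = _baseReader.Read();
        // Count only real characters; Read() returns -1 at end of input.
        if (c != -1)
            _position++;
        return c;
    }

    public override int Peek()
    {
        return _baseReader.Peek();
    }

    // Offset in characters from the point where reading started.
    public int Position
    {
        get { return _position; }
    }
}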

You can then use it as follows:

string text = @"Foo
Bar
Baz
Bla
Fasel";

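using (var reader = new StringReader(text))
using (var trackingReader = new TrackingTextReader(reader))
{
    string line;
    // Capture the position before each ReadLine so it marks where the line starts.
    int start = trackingReader.Position;
    while ((line = trackingReader.ReadLine()) != null)
    {
        Console.WriteLine("{0:d3} {1}", start, line);
        start = trackingReader.Position;
    }
}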

Quynh Nguyen · Answer 2 · 23 October 2012

After searching, testing, and doing something crazy, here is my code for the solution (I am currently using this code in my product).

public sealed class TextFileReader : IDisposable
{

    FileStream _fileStream = null;
    BinaryReader _binReader = null;
    StreamReader _streamReader = null;
    List<string> _lines = null;
    long _length = -1;

    /// <summary>
    /// Initializes a new instance of the <see cref="TextFileReader"/> class with default encoding (UTF8).
    /// </summary>
    /// <param name="filePath">The path to text file.</param>
    public TextFileReader(string filePath) : this(filePath, Encoding.UTF8) { }

    /// <summary>
    /// Initializes a new instance of the <see cref="TextFileReader"/> class.
    /// </summary>
    /// <param name="filePath">The path to text file.</param>
    /// <param name="encoding">The encoding of text file.</param>
    public TextFileReader(string filePath, Encoding encoding)
    {
        if (!File.Exists(filePath))
            throw new FileNotFoundException("File (" + filePath + ") is not found.");

        _fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read);
        _length = _fileStream.Length;
        _binReader = new BinaryReader(_fileStream, encoding);
    }

    /// <summary>
    /// Reads a line of characters from the current stream at the current position and returns the data as a string.
    /// </summary>
    /// <returns>The next line from the input stream, or null if the end of the input stream is reached</returns>
    public string ReadLine()
    {
        if (_binReader.PeekChar() == -1)
            return null;

        string line = "";
        int nextChar = _binReader.Read();
        while (nextChar != -1)
        {
            char current = (char)nextChar;
            if (current.Equals('\n'))
                break;
            else if (current.Equals('\r'))
            {
                int pickChar = _binReader.PeekChar();
                if (pickChar != -1 && ((char)pickChar).Equals('\n'))
                    nextChar = _binReader.Read();
                break;
            }
            else
                line += current;
            nextChar = _binReader.Read();
        }
        return line;
    }

    /// <summary>
    /// Reads some lines of characters from the current stream at the current position and returns the data as a collection of string.
    /// </summary>
    /// <param name="totalLines">The total number of lines to read (set as 0 to read from current position to end of file).</param>
    /// <returns>The next lines from the input stream, or an empty collection if the end of the input stream is reached</returns>
    public List<string> ReadLines(int totalLines)
    {
        if (totalLines < 1 && this.Position == 0)
            return this.ReadAllLines();

        _lines = new List<string>();
        int counter = 0;
        string line = this.ReadLine();
        while (line != null)
        {
            _lines.Add(line);
            counter++;
            if (totalLines > 0 && counter >= totalLines)
                break;
            line = this.ReadLine();
        }
        return _lines;
    }

    /// <summary>
    /// Reads all lines of characters from the current stream (from beginning to end) and returns the data as a collection of string.
    /// </summary>
    /// <returns>All lines from the input stream, or an empty collection if the stream is empty</returns>
    public List<string> ReadAllLines()
    {
        if (_streamReader == null)
            _streamReader = new StreamReader(_fileStream);
        _streamReader.BaseStream.Seek(0, SeekOrigin.Begin);
        _lines = new List<string>();
        string line = _streamReader.ReadLine();
        while (line != null)
        {
            _lines.Add(line);
            line = _streamReader.ReadLine();
        }
        return _lines;
    }

    /// <summary>
    /// Gets the length of text file (in bytes).
    /// </summary>
    public long Length
    {
        get { return _length; }
    }

    /// <summary>
    /// Gets or sets the current reading position.
    /// </summary>
    public long Position
    {
        get
        {
            if (_binReader == null)
                return -1;
            else
                return _binReader.BaseStream.Position;
        }
        set
        {
            if (_binReader == null)
                return;
            else if (value >= this.Length)
                this.SetPosition(this.Length);
            else
                this.SetPosition(value);
        }
    }

    void SetPosition(long position)
    {
        _binReader.BaseStream.Seek(position, SeekOrigin.Begin);
    }

    /// <summary>
    /// Gets the lines after reading.
    /// </summary>
    public List<string> Lines
    {
        get
        {
            return _lines;
        }
    }

    /// <summary>
    /// Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources.
    /// </summary>
    public void Dispose()
    {
        if (_binReader != null)
            _binReader.Close();
        if (_streamReader != null)
        {
            _streamReader.Close();
            _streamReader.Dispose();
        }
        if (_fileStream != null)
        {
            _fileStream.Close();
            _fileStream.Dispose();
        }
    }

    ~TextFileReader()
    {
        this.Dispose();
    }
}
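Reading at the byte level, as the class above does, keeps the stream position meaningful, and that is what the original question needs for its lookup table. Here is a minimal, independent sketch of that idea, not the class above: LineIndex, Build and ReadLineAt are hypothetical names, and it assumes every line terminator contains '\n' and that the byte 0x0A never occurs inside a character (true for ASCII and UTF-8, not for UTF-16).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class LineIndex
{
    // Scans the stream once and records the byte offset where each line starts.
    public static List<long> Build(Stream stream)
    {
        var offsets = new List<long>();
        long pos = 0;
        bool atLineStart = true;
        int b;
        stream.Seek(0, SeekOrigin.Begin);
        while ((b = stream.ReadByte()) != -1)
        {
            if (atLineStart) { offsets.Add(pos); atLineStart = false; }
            pos++;
            if (b == '\n') atLineStart = true; // next byte, if any, begins a new line
        }
        return offsets;
    }

    // Seeks to a recorded offset and decodes a single line from there.
    public static string ReadLineAt(Stream stream, long offset, Encoding encoding)
    {
        stream.Seek(offset, SeekOrigin.Begin);
        // Deliberately not disposed: disposing the reader would close the stream.
        var reader = new StreamReader(stream, encoding);
        return reader.ReadLine();
    }

    static void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("Foo\r\nBar\r\nBaz\r\nBla\r\nFasel");
        using (var ms = new MemoryStream(data))
        {
            List<long> offsets = Build(ms);
            Console.WriteLine(string.Join(",", offsets));               // 0,5,10,15,20
            Console.WriteLine(ReadLineAt(ms, offsets[3], Encoding.UTF8)); // Bla
        }
    }
}
```

ReadByte is slow on an unbuffered stream, so for a real file it would probably make sense to wrap the FileStream in a BufferedStream or scan a byte array chunk by chunk.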

Anton Purin · Answer 3 · 02 March 2014

This is a really tough problem. After a very long and exhausting review of the different solutions on the Internet (including the ones from this thread, thank you!) I had to reinvent the wheel and write my own.

I had the following requirements:

Performance - reading must be very fast, so reading one character at a time or using reflection is not acceptable; buffering is required
Streaming - the file can be huge, so reading it into memory entirely is not acceptable
Tailing - tailing the file must be possible
Long lines - lines can be very long, so the buffer size cannot be fixed
Stable - a single-byte error showed up immediately during use. Unfortunately for me, several implementations I found had stability problems

public class OffsetStreamReader
{
    private const int InitialBufferSize = 4096;    
    private readonly char _bom;
    private readonly byte _end;
    private readonly Encoding _encoding;
    private readonly Stream _stream;
    private readonly bool _tail;

    private byte[] _buffer;
    private int _processedInBuffer;
    private int _informationInBuffer;

    public OffsetStreamReader(Stream stream, bool tail)
    {
        _buffer = new byte[InitialBufferSize];
        _processedInBuffer = InitialBufferSize;

        if (stream == null || !stream.CanRead)
            throw new ArgumentException("stream");

        _stream = stream;
        _tail = tail;
        _encoding = Encoding.UTF8;

        _bom = '\uFEFF';
        _end = _encoding.GetBytes(new [] {'\n'})[0];
    }

    public long Offset { get; private set; }

    public string ReadLine()
    {
        // Underlying stream closed
        if (!_stream.CanRead)
            return null;

        // EOF
        if (_processedInBuffer == _informationInBuffer)
        {
            if (_tail)
            {
                _processedInBuffer = _buffer.Length;
                _informationInBuffer = 0;
                ReadBuffer();
            }

            return null;
        }

        var lineEnd = Search(_buffer, _end, _processedInBuffer);
        var haveEnd = true;

        // File ended but no finalizing newline character
        if (lineEnd.HasValue == false && _informationInBuffer + _processedInBuffer < _buffer.Length)
        {
            if (_tail)
                return null;
            else
            {
                lineEnd = _informationInBuffer;
                haveEnd = false;
            }
        }

        // No end in current buffer
        if (!lineEnd.HasValue)
        {
            ReadBuffer();
            if (_informationInBuffer != 0)
                return ReadLine();

            return null;
        }

        var arr = new byte[lineEnd.Value - _processedInBuffer];
        Array.Copy(_buffer, _processedInBuffer, arr, 0, arr.Length);

        Offset = Offset + lineEnd.Value - _processedInBuffer + (haveEnd ? 1 : 0);
        _processedInBuffer = lineEnd.Value + (haveEnd ? 1 : 0);

        return _encoding.GetString(arr).TrimStart(_bom).TrimEnd('\r', '\n');
    }

    private void ReadBuffer()
    {
        var notProcessedPartLength = _buffer.Length - _processedInBuffer;

        // Extend buffer to be able to fit whole line to the buffer
        // Was     [NOT_PROCESSED]
        // Become  [NOT_PROCESSED        ]
        if (notProcessedPartLength == _buffer.Length)
        {
            var extendedBuffer = new byte[_buffer.Length + _buffer.Length/2];
            Array.Copy(_buffer, extendedBuffer, _buffer.Length);
            _buffer = extendedBuffer;
        }

        // Copy not processed information to the beginning
        // Was    [PROCESSED NOT_PROCESSED]
        // Become [NOT_PROCESSED          ]
        Array.Copy(_buffer, (long) _processedInBuffer, _buffer, 0, notProcessedPartLength);

        // Read more information to the empty part of buffer
        // Was    [ NOT_PROCESSED                   ]
        // Become [ NOT_PROCESSED NEW_NOT_PROCESSED ]
        _informationInBuffer = notProcessedPartLength + _stream.Read(_buffer, notProcessedPartLength, _buffer.Length - notProcessedPartLength);

        _processedInBuffer = 0;
    }

    private int? Search(byte[] buffer, byte byteToSearch, int bufferOffset)
    {
        for (int i = bufferOffset; i < buffer.Length - 1; i++)
        {
            if (buffer[i] == byteToSearch)
                return i;
        }
        return null;
    }
}

Sergey Alekseev · Answer 4 · 15 July 2012

While Thomas Levesque's solution works well, here is mine. It uses reflection, so it will be slower, but it is encoding-independent. In addition, I added a Seek extension.

/// <summary>Useful <see cref="StreamReader"/> extensions.</summary>
public static class StreamReaderExtentions
{
    /// <summary>Gets the position within the <see cref="StreamReader.BaseStream"/> of the <see cref="StreamReader"/>.</summary>
    /// <remarks><para>This method is quite slow. It uses reflection to access private <see cref="StreamReader"/> fields. Don't use it too often.</para></remarks>
    /// <param name="streamReader">Source <see cref="StreamReader"/>.</param>
    /// <exception cref="ArgumentNullException">Occurs when passed <see cref="StreamReader"/> is null.</exception>
    /// <returns>The current position of this stream.</returns>
    public static long GetPosition(this StreamReader streamReader)
    {
        if (streamReader == null)
            throw new ArgumentNullException("streamReader");

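        // The private field names used below are implementation details of StreamReader and may
        // differ between runtime versions; verify them against the runtime you target.
        var charBuffer = (char[])streamReader.GetType().InvokeMember("charBuffer", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);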
        var charPos = (int)streamReader.GetType().InvokeMember("charPos", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);
        var charLen = (int)streamReader.GetType().InvokeMember("charLen", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);

        var offsetLength = streamReader.CurrentEncoding.GetByteCount(charBuffer, charPos, charLen - charPos);

        return streamReader.BaseStream.Position - offsetLength;
    }

    /// <summary>Sets the position within the <see cref="StreamReader.BaseStream"/> of the <see cref="StreamReader"/>.</summary>
    /// <remarks>
    /// <para><see cref="StreamReader.BaseStream"/> should be seekable.</para>
    /// <para>This method is quite slow. It uses reflection and flushes the charBuffer of the <see cref="StreamReader.BaseStream"/>. Don't use it too often.</para>
    /// </remarks>
    /// <param name="streamReader">Source <see cref="StreamReader"/>.</param>
    /// <param name="position">The point relative to origin from which to begin seeking.</param>
    /// <param name="origin">Specifies the beginning, the end, or the current position as a reference point for origin, using a value of type <see cref="SeekOrigin"/>. </param>
    /// <exception cref="ArgumentNullException">Occurs when passed <see cref="StreamReader"/> is null.</exception>
    /// <exception cref="ArgumentException">Occurs when <see cref="StreamReader.BaseStream"/> is not seekable.</exception>
    /// <returns>The new position in the stream. This position can differ from <paramref name="position"/> because of the preamble.</returns>
    public static long Seek(this StreamReader streamReader, long position, SeekOrigin origin)
    {
        if (streamReader == null)
            throw new ArgumentNullException("streamReader");

        if (!streamReader.BaseStream.CanSeek)
            throw new ArgumentException("Underlying stream should be seekable.", "streamReader");

        var preamble = (byte[])streamReader.GetType().InvokeMember("_preamble", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);
        if (preamble.Length > 0 && position < preamble.Length) // preamble or BOM must be skipped
            position += preamble.Length;

        var newPosition = streamReader.BaseStream.Seek(position, origin); // seek
        streamReader.DiscardBufferedData(); // this updates the buffer

        return newPosition;
    }
}
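The seek half of this answer needs no reflection when the target byte offset is already known: moving BaseStream and then calling DiscardBufferedData is enough. A small sketch using only public StreamReader members follows; SeekDemo and ReadLineAt are illustrative names, and the offset 15 assumes the CRLF sample data shown in the code.

```csharp
using System;
using System.IO;
using System.Text;

static class SeekDemo
{
    // Repositions the reader at a known byte offset and reads one line from there.
    public static string ReadLineAt(StreamReader sr, long offset)
    {
        sr.BaseStream.Seek(offset, SeekOrigin.Begin);
        sr.DiscardBufferedData(); // drop characters buffered from the old position
        return sr.ReadLine();
    }

    static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("Foo\r\nBar\r\nBaz\r\nBla\r\nFasel");
        using (var sr = new StreamReader(new MemoryStream(data)))
        {
            Console.WriteLine(sr.ReadLine());       // Foo
            Console.WriteLine(ReadLineAt(sr, 15));  // Bla (byte 15 is where "Bla" starts)
        }
    }
}
```

Without the DiscardBufferedData call, the next ReadLine would keep serving characters from the reader's old buffer and ignore the seek.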

Sani Singh Huttunen · Answer 5 · 07 April 2010

Would this work:

using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
  string line;
  long pos = 0;
  while ((line = sr.ReadLine()) != null) {
    Console.Write("{0:d3} ", pos);
    Console.WriteLine(line);
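    // Note: line.Length counts characters and excludes the line terminator,
    // so pos falls behind the real byte offset after the first line.
    pos += line.Length;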
  }
}
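For the counting approach to give byte offsets, each step has to add the encoded size of the line plus the size of its terminator. The following is a sketch under loud assumptions: the encoding is known up front (UTF-8 without a BOM here) and every line ends with the same terminator (CRLF here). OffsetByCounting and LineStarts are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class OffsetByCounting
{
    // Returns the byte offset of each line start, assuming a single fixed terminator.
    public static List<long> LineStarts(byte[] data, Encoding enc, string newline)
    {
        var starts = new List<long>();
        int newlineBytes = enc.GetByteCount(newline);
        using (var sr = new StreamReader(new MemoryStream(data), enc))
        {
            long pos = 0;
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                starts.Add(pos);
                // Encoded size of the text plus the terminator that ReadLine stripped.
                pos += enc.GetByteCount(line) + newlineBytes;
            }
        }
        return starts;
    }

    static void Main()
    {
        Encoding enc = new UTF8Encoding(false);                      // assumption: UTF-8, no BOM
        byte[] data = enc.GetBytes("Foo\r\nB\u00E4r\r\nBaz");        // "ä" takes two bytes
        Console.WriteLine(string.Join(",", LineStarts(data, enc, "\r\n"))); // 0,5,11
    }
}
```

This breaks down as soon as the file mixes \n and \r\n or starts with a BOM, because ReadLine does not report which terminator it consumed; in those cases a byte-level scan is the safer route.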

Reading text files line by line, with exact offset/position reporting
