You need to read the file as a sequence of bytes and treat it as binary data.
Then, to locate the PDF part of the file, you need to read it a second time as a String so that you can run a regular expression against it.
The string must use an encoding that does not alter the bytes in any way. Code page 28591 (ISO 8859-1) does exactly that: every byte of the original file maps to one character as is.
For this, I wrote the following helper function:
function ConvertTo-BinaryString {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes.
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [String]$Path
    )
    $Stream = New-Object System.IO.FileStream -ArgumentList $Path, 'Open', 'Read'
    # Note: Codepage 28591 (ISO 8859-1) returns a 1-to-1 char to byte mapping
    $Encoding = [Text.Encoding]::GetEncoding(28591)
    $StreamReader = New-Object System.IO.StreamReader -ArgumentList $Stream, $Encoding
    $BinaryText = $StreamReader.ReadToEnd()
    $StreamReader.Close()
    $Stream.Close()
    return $BinaryText
}
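The 1-to-1 mapping this function relies on can be checked with a small sketch like the one below. It is only an illustration: the variable names are made up for the example, and it works on an in-memory byte array instead of a file. It round-trips all 256 byte values through code page 28591 and, for contrast, through UTF-8.

```powershell
# Sketch: check that code page 28591 round-trips every byte value unchanged.
$latin1 = [System.Text.Encoding]::GetEncoding(28591)

# All 256 possible byte values, 0x00 through 0xFF
$allBytes = [byte[]](0..255)

# Decode the bytes to a string, then encode the string back to bytes
$text      = $latin1.GetString($allBytes)
$roundTrip = $latin1.GetBytes($text)

# One character per byte, and the bytes come back identical
$sameLength = ($text.Length -eq $allBytes.Length)
$sameBytes  = (($allBytes -join ',') -eq ($roundTrip -join ','))

# For contrast: UTF-8 replaces invalid sequences (lone bytes 0x80-0xFF),
# so the original bytes cannot be recovered from the decoded string
$utf8Text = [System.Text.Encoding]::UTF8.GetString($allBytes)
$utf8Back = [System.Text.Encoding]::UTF8.GetBytes($utf8Text)
$utf8Same = (($allBytes -join ',') -eq ($utf8Back -join ','))
```

With the Latin-1 encoding both `$sameLength` and `$sameBytes` should be `$true`, while `$utf8Same` should be `$false`. That is why a match index found in the string can be used directly as an offset into the byte array.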
Using the ConvertTo-BinaryString function above, you can extract the binary part from a multi-part file like this:
$inputFile = 'D:\blah.txt'
$outputFile = 'D:\blah.pdf'
# read the file as byte array
$fileBytes = [System.IO.File]::ReadAllBytes($inputFile)
# and again as string where every byte has a 1-to-1 mapping to the file's original bytes
$binString = ConvertTo-BinaryString -Path $inputFile
# create your regex, all as hex-escaped byte characters; equivalent to '(?s)%PDF.*%%EOF[\r\n]{0,2}'
$regex = [Regex]'(?s)(\x25\x50\x44\x46[\x00-\xFF]*\x25\x25\x45\x4F\x46[\x0D\x0A]{0,2})'
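$match = $regex.Match($binString)
# stop here if the file contains no PDF data, otherwise an empty file would be written
if (-not $match.Success) { throw "No PDF data found in '$inputFile'." }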
# use a MemoryStream object to store the result
$stream = New-Object System.IO.MemoryStream
$stream.Write($fileBytes, $match.Index, $match.Length)
# save the binary data of the match as a series of bytes
[System.IO.File]::WriteAllBytes($outputFile, $stream.ToArray())
# clean up
$stream.Dispose()
Regex details:
(                 Match the regular expression below and capture its match into backreference number 1
   \x25           Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
   \x50           Match the ASCII or ANSI character with position 0x50 (80 decimal => P) in the character set
   \x44           Match the ASCII or ANSI character with position 0x44 (68 decimal => D) in the character set
   \x46           Match the ASCII or ANSI character with position 0x46 (70 decimal => F) in the character set
   [\x00-\xFF]    Match a single character in the range between ASCII character 0x00 (0 decimal) and ASCII character 0xFF (255 decimal)
      *           Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   \x25           Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
   \x25           Match the ASCII or ANSI character with position 0x25 (37 decimal => %) in the character set
   \x45           Match the ASCII or ANSI character with position 0x45 (69 decimal => E) in the character set
   \x4F           Match the ASCII or ANSI character with position 0x4F (79 decimal => O) in the character set
   \x46           Match the ASCII or ANSI character with position 0x46 (70 decimal => F) in the character set
   [\x0D\x0A]     Match a single character present in the list below
                  ASCII character 0x0D (13 decimal)
                  ASCII character 0x0A (10 decimal)
      {0,2}       Between zero and 2 times, as many times as possible, giving back as needed (greedy)
)
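To see what the greedy quantifier means in practice, here is a minimal sketch that runs the same pattern against a small in-memory sample. The sample content (a boundary-delimited wrapper around a fake PDF with two `%%EOF` markers) and the variable names are hypothetical and only serve the illustration.

```powershell
# Hypothetical sample: text header, a fake "PDF" with two %%EOF markers, text trailer
$latin1 = [System.Text.Encoding]::GetEncoding(28591)
$sample = "--boundary`r`nContent-Type: application/pdf`r`n`r`n" +
          "%PDF-1.4`n1 0 obj`n%%EOF`n2 0 obj`n%%EOF`r`n" +
          "--boundary--`r`n"
$sampleBytes = $latin1.GetBytes($sample)

# Same pattern as in the answer
$regex = [Regex]'(?s)(\x25\x50\x44\x46[\x00-\xFF]*\x25\x25\x45\x4F\x46[\x0D\x0A]{0,2})'
$match = $regex.Match($latin1.GetString($sampleBytes))

if ($match.Success) {
    # Index and Length are character offsets, which equal byte offsets under Latin-1
    $pdfBytes = New-Object byte[] $match.Length
    [Array]::Copy($sampleBytes, $match.Index, $pdfBytes, 0, $match.Length)
    $pdfText = $latin1.GetString($pdfBytes)
}
```

Because `[\x00-\xFF]*` is greedy, the match runs from the first `%PDF` to the last `%%EOF` (plus up to two trailing line-break characters), so both markers end up in `$pdfText`. That suits PDFs with incremental updates, which can carry several `%%EOF` markers, but it also means that a file containing two separate PDFs would be captured as one block.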