IndexError: list index out of range now that I have changed how I read the file
0 votes
/ June 5, 2018

I am trying to read and reformat a very large (2 GB+) .out file that is structured like a CSV. Previously I used the standard open() function without this problem, but I switched to codecs.open() because open() had trouble with some characters.

Now it throws

Traceback (most recent call last):
  line 21, in <module>
    if(r[5]==""):
IndexError: list index out of range

on the first row, although there is definitely an element in r[5]. (The run time is 0.301 s.)

import sys
import csv
import datetime
import codecs
maxInt=sys.maxsize
decrement=True

while decrement:
    decrement=False
    try:
        csv.field_size_limit(maxInt)
    except OverflowError:
        maxInt = int(maxInt/10)
        decrement = True

with codecs.open("file.out", 'rU', 'utf-16-be') as source:
    rdr = csv.reader(source)
    with open("out.csv","w", newline='') as result:
        wtr = csv.writer(result)
        wtr.writerow(("Column1", "column2", "column3", "etc..."))
        for r in rdr:
            if(r[5]==""):
                continue
            wtr.writerow((datetime.datetime.strptime(r[5], '%m/%d/%Y').strftime('%Y-%m-%d'), r[3], r[7], r[9]+r[10]+" "+r[12]))

Using utf-8 throws UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 12: invalid continuation byte

Using latin-1 or ISO-8859-1 throws UnicodeEncodeError: 'charmap' codec can't encode characters in position 57-58: character maps to <undefined>, although only after running for much longer.

The input file looks like this:

"A00017","K","G","1999","4530","01/12/1999","","","","PEOPLE TO ELECT MANGINELLI","","","","258 MAGNIOLIA DRIVE","SELDEN","NY","11784","","","404.57","","","","","","","2","","NAA","07/22/1999 08:43:59"
"A00037","K","G","1999","999999","01/12/1999","","","","CITIZENS TO ELECT TEDISCO TO ASSEMBLY","","","","","","","","","","0","","","","","","","2","","",""
"A00037","K","N","1999","1693","01/15/1999","","","","OUTSTANDING LOAN","","","","2176 GUILDERLAND AVE","SCHENECTADY","NY","12306","","","10474.8","10474.8","","","OTHER","","PREVIOUS LOAN FROM JAMES TEDISCO","","P","JM","07/15/1999 15:08:17"
"A00037","J","N","2000","1694","01/13/2000","","","","OUTSTANDING LOAN","","","","2176 GUILDERLAND","SCHENECTADY","NY","12306","","","10474.8","10474.8","","","OTHER","","LOANS FROM PREVIOUS CAMPAIGNS FROM J","","P","JM","01/14/1900 16:35:09"
"A00037","K","X","2000","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/20/2000 00:00:00"
"A00037","J","X","2001","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/17/2001 00:00:00"
"A00037","K","X","2002","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/19/2002 00:00:00"
"A00037","J","X","2003","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/21/2003 00:00:00"
"A00037","K","X","2003","999999","","","","","","","","","","","","","","","","","","","","","","","","","07/16/2003 00:00:00"
"A00037","J","X","2004","999999","","","","","","","","","","","","","","","","","","","","","","","","","01/22/2004 00:00:00"

I got this far thanks to:

"Line contains NULL byte" in CSV reader (Python)

_csv.Error: field larger than field limit (131072)

1 Answer

0 votes
/ June 5, 2018

In 'file.out', the file you are reading from, find the character that separates the cells in each row. It may be a tab ('\t') or a comma (','); pass it to the reader as the delimiter argument.

Try printing r and look at the character between the column names or values in the row.

rdr = csv.reader(source,delimiter=<separator>)
...
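
One way to check this is to print the first few parsed rows together with their field counts. The sketch below is only an illustration: preview_rows is a hypothetical helper, and the tab-separated sample is made up for the demo, so the real file may well use a different separator or encoding. It also uses the built-in open() with an encoding argument and newline="", which is what the csv documentation suggests for Python 3, instead of codecs.open().

```python
import csv
import os
import tempfile


def preview_rows(path, encoding, delimiter=",", limit=3):
    """Return (field_count, row) pairs for the first `limit` rows.

    Hypothetical helper for diagnosis only; it is not part of the csv module.
    """
    preview = []
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding=encoding) as source:
        for index, row in enumerate(csv.reader(source, delimiter=delimiter)):
            if index >= limit:
                break
            preview.append((len(row), row))
    return preview


# Demo on a small, made-up tab-separated sample (an assumption for illustration).
with tempfile.NamedTemporaryFile(
    "w", suffix=".out", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write('"A00017"\t"K"\t"G"\t"1999"\t"4530"\t"01/12/1999"\n')
    sample_path = tmp.name

wrong = preview_rows(sample_path, "utf-8", delimiter=",")   # separator guessed wrong
right = preview_rows(sample_path, "utf-8", delimiter="\t")  # separator matches the data
os.remove(sample_path)

for count, row in wrong + right:
    print(count, repr(row))
```

If the count printed for a row is smaller than 6, r[5] does not exist for that row, and the repr of the row usually shows which character really sits between the values.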