A less memory-intensive way to parse a large JSON file in Python - PullRequest
October 4, 2019

Here is my code:

import json
data = []
with open("review.json") as f:
    for line in f:
        data.append(json.loads(line))

lst_string = []
lst_num = []
for i in range(len(data)):
    if (data[i]["stars"] == 5.0):
        x = data[i]["text"]
        for word in x.split():
            if word in lst_string:
                lst_num[lst_string.index(word)] += 1
            else:
                lst_string.append(word)
                lst_num.append(1)

result = set(zip(lst_string, lst_num))
print(result)
with open("set.txt", "w") as g:
    g.write(str(result))

I am trying to write out the set of all words in reviews that were given 5 stars. The reviews come from a JSON file I pulled, with one JSON object per line, formatted like this:

{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}
{"review_id":"GJXCdrto3ASJOqKeVWPi6Q","user_id":"yXQM5uF2jS6es16SJzNHfg","business_id":"NZnhc2sEQy3RmzKTZnqtwQ","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon!  I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level!  \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit.  Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room.  Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure.  That was superb!  Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen.  The team of stylists clearly gets along extremely well, as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement.  It was so much fun to be there! \n\nNext Travis started with the flat iron.  The way he flipped his wrist to get volume all around without over-doing it and making me look like a Texas pagent girl was admirable.  It's also worth noting that he didn't fry my hair -- something that I've had happen before with less skilled stylists.  At the end of the blowout & style my hair was perfectly bouncey and looked terrific.  The only thing better?  That this awesome blowout lasted for days! \n\nTravis, I will see you every single time I'm out in Vegas.  You make me feel beauuuutiful!","date":"2017-01-14 21:30:33"}
{"review_id":"2TzJjDVDEuAW6MR5Vuc1ug","user_id":"n6-Gk65cPZL6Uz8qRm3NYw","business_id":"WTqjgwHlXbSFevF32_DJVw","stars":1.0,"useful":3,"funny":0,"cool":0,"text":"I have to say that this office really has it together, they are so organized and friendly!  Dr. J. Phillipp is a great dentist, very friendly and professional.  The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable!  I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit!  I highly recommend this office for the nice synergy the whole office has!","date":"2016-11-09 20:09:03"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"Went in for a lunch. Steak sandwich was delicious, and the Caesar salad had an absolutely delicious dressing, with a perfect amount of dressing, and distributed perfectly across each leaf. I know I'm going on about the salad ... But it was perfect.\n\nDrink prices were pretty good.\n\nThe Server, Dawn, was friendly and accommodating. Very happy with her.\n\nIn summation, a great pub experience. Would go again!","date":"2018-01-09 20:56:38"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":5.0,"useful":0,"funny":0,"cool":0,"text":"a b aa bb a b","date":"2018-01-09 20:56:38"}

but it uses all the memory on my computer before it can write the output to the text file. How can I do this in a less memory-intensive way?

1 Answer

October 4, 2019

Get the text only where stars == 5:

Data:

  • Based on the question, the data is a file containing one dict (JSON object) per line.

Get the text into a list:

  • Given the data from the Yelp Challenge, getting the text of the 5-star reviews into a list does not take that much memory.
    • Windows Resource Manager showed an increase of about 1.3 GB, but the size of the text_list object was only about 25 MB.
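import json

# Keep only the text of 5-star reviews, reading the file one line at a time.
text_list = list()
with open("review.json", encoding="utf8") as f:
    for line in f:
        line = json.loads(line)  # each line holds one JSON object
        if line['stars'] == 5:
            text_list.append(line['text'])

print(text_list)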

>>> ['Test text, example 1!', 'Test text, example 2!']
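The answer does not say how the 25 MB figure for text_list was obtained. If it came from sys.getsizeof, that call reports only the list's own array of references, not the strings it points to, which may account for part of the gap to the 1.3 GB seen in the resource manager. The sketch below compares the two measurements on stand-in data; deep_size_of_str_list and sample_text_list are hypothetical names, not part of the original answer.

```python
import sys

def deep_size_of_str_list(items: list) -> int:
    """Size of the list object plus the size of every string it references.

    Assumes the list holds distinct str objects; a string referenced
    twice would be counted twice.
    """
    return sys.getsizeof(items) + sum(sys.getsizeof(s) for s in items)

# Stand-in data; the real text_list would be built from review.json.
sample_text_list = ["Test text, example %d!" % i for i in range(1000)]

shallow_bytes = sys.getsizeof(sample_text_list)   # list object only
deep_bytes = deep_size_of_str_list(sample_text_list)  # list + strings

print(f"shallow: {shallow_bytes} bytes, with strings: {deep_bytes} bytes")
```

For long review texts the second number can be many times larger than the first.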

Additionally:

  • Everything after loading the data seems to require a lot of memory, which is not released.
  • While cleaning the text, memory usage in Windows Resource Manager grew by 16 GB, although the final size of clean_text was also only about 25 MB.
    • Interestingly, deleting clean_text does not free the 16 GB of memory.
    • In Jupyter Lab, restarting the kernel frees the memory.
    • In PyCharm, stopping the process also frees the memory.
    • I tried running the garbage collector manually, but that did not free the memory.
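One way to check whether Python itself still holds the memory is the standard-library tracemalloc module, which tracks allocations made by the interpreter rather than the process total shown by the operating system. The sketch below uses a small stand-in workload (clean_text_demo is a hypothetical name). If the traced figure drops after del while the resource manager does not, one possible explanation, not confirmed by the answer, is that the allocator keeps freed memory for reuse instead of returning it to the OS.

```python
import gc
import tracemalloc

tracemalloc.start()

# Stand-in workload: a large list of small word lists.
clean_text_demo = [("word%d" % i).split() for i in range(100_000)]
current_before, peak_before = tracemalloc.get_traced_memory()

del clean_text_demo
gc.collect()  # explicit collection; reference counting has already freed the lists
current_after, peak_after = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"traced before del: {current_before} bytes")
print(f"traced after del:  {current_after} bytes")
print(f"peak:              {peak_after} bytes")
```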

Clean text_list:

import string

def clean_string(value: str) -> list:
    value = value.lower()
    # Strip all ASCII punctuation characters.
    value = value.translate(str.maketrans('', '', string.punctuation))
    # Split on whitespace into a list of words.
    value = value.split()
    return value

clean_text = [clean_string(item) for item in text_list]
print(clean_text)

>>> [['test', 'text', 'example', '1'], ['test', 'text', 'example', '2']]

Count the words in clean_text:

from collections import Counter

words = Counter()

for item in clean_text:
    words.update(item)

print(words)

>>> Counter({'test': 2, 'text': 2, 'example': 2, '1': 1, '2': 1})
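The three steps above each keep a full intermediate list (text_list, then clean_text) in memory. A minimal sketch of a single-pass variant follows, assuming the same one-object-per-line file layout: each review is cleaned and counted as soon as it is read, so only the Counter grows. The function name count_five_star_words and the module-level table are hypothetical and not part of the original answer.

```python
import json
import string
from collections import Counter

# Build the punctuation-stripping table once instead of on every review.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def count_five_star_words(path: str) -> Counter:
    """Count lowercase, punctuation-free words in the text of 5-star reviews."""
    words = Counter()
    with open(path, encoding="utf8") as f:
        for line in f:
            review = json.loads(line)  # one JSON object per line
            if review['stars'] == 5:
                cleaned = review['text'].lower().translate(_PUNCT_TABLE)
                words.update(cleaned.split())
    return words
```

Calling count_five_star_words("review.json") would return the same kind of Counter as above; whether this resolves the memory growth described earlier has not been measured here.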
...