Creating a single JSON object by reading all .txt files from a directory
0 votes
/ May 3, 2020

I'm using the 20 Newsgroups dataset from scikit-learn. There are 20 .txt files, and each of them has a structure like the one below, with the newsgroup name, document_id, From, and Subject. I want to read all 20 files from a directory and convert them into a JSON object or CSV so I can feed them into Elasticsearch for indexing.

Each new article starts with "Newsgroup:", document_id, etc. Below is one example.

Newsgroup: sci.space
document_id: 59497
From: et@teal.csn.org (Eric H. Taylor)
Subject: Re: Gravity waves, was: Predicting gravity wave quantization & Cosmic Noise

In article <C4KvJF.4qo@well.sf.ca.us> metares@well.sf.ca.us (Tom Van Flandern) writes:
>crb7q@kelvin.seas.Virginia.EDU (Cameron Randale Bass) writes:
>> Bruce.Scott@launchpad.unc.edu (Bruce Scott) writes:
>>> "Existence" is undefined unless it is synonymous with "observable" in
>>> physics.
>> [crb] Dong ....  Dong ....  Dong ....  Do I hear the death-knell of
>> string theory?
>
>     I agree.  You can add "dark matter" and quarks and a lot of other
>unobservable, purely theoretical constructs in physics to that list,
>including the omni-present "black holes."
>
>     Will Bruce argue that their existence can be inferred from theory
>alone?  Then what about my original criticism, when I said "Curvature
>can only exist relative to something non-curved"?  Bruce replied:
>"'Existence' is undefined unless it is synonymous with 'observable' in
>physics.  We cannot observe more than the four dimensions we know about."
>At the moment I don't see a way to defend that statement and the
>existence of these unobservable phenomena simultaneously.  -|Tom|-

"I hold that space cannot be curved, for the simple reason that it can have
no properties."
"Of properties we can only speak when dealing with matter filling the
space. To say that in the presence of large bodies space becomes curved,
is equivalent to stating that something can act upon nothing. I,
for one, refuse to subscribe to such a view." - Nikola Tesla

----
 ET  "Tesla was 100 years ahead of his time. Perhaps now his time comes."
----

Newsgroup: comp.os.ms-windows.misc
document_id: 10002
Subject: Re: Win31 & doublespace
From: edowdy@vax1.umkc.edu

In article <4363@hpwala.wal.hp.com>, chrisa@hpwarr.hp.com ( Chris Almy) writes:
> 
>   Doublespace, although I do not trust it for my hard disks, sounds
>   great for floppies. The thouoght of having to mount the disk
>   is anoying but something I can deal with. The problem arises 
>   when under windows. Is there a way to mount and unmount while
>   under windows or is this part of the upgrades soon to be 
>   available from other vendors?

Each .txt file contains nearly 1,000 documents with Newsgroup, document_id, From, and Subject fields, so the next article starts again with "Newsgroup: ...".

I'm doing the following to read the files from the directory, but I'm not sure how to extract the four fields above and write them to JSON/CSV.

import glob

# path is the directory that contains the 20 .txt files
files = glob.glob(path + '\\*.txt')

# iterate over the list, getting each file
for fle in files:
    # open the file and then call .read() to get the text
    with open(fle) as f:
        text = f.read()

1 Answer

0 votes
/ May 4, 2020

Edit:

    import functools
    import glob
    import operator

    targeted_fields = ['Newsgroup', 'document_id', 'From', 'Subject']

    _article_list = []
    _final_dict_list = []

    # path is the directory that contains the .txt files
    files = glob.glob(path + '\\*.txt')
    # iterate over the list, getting each file
    for fle in files:
        _tmp_subject_list = []
        _tmp_header_list = []
        with open(fle) as f:
            data = [x.strip("\n").split(':', 1) for x in f.readlines()]
            for i, each in enumerate(data):
                if each[0] == targeted_fields[0]:
                    # a new article starts: flush the previous one
                    _article_list.append([*_tmp_header_list,
                                          functools.reduce(operator.concat, _tmp_subject_list, [])])
                    _tmp_subject_list = []
                    _tmp_header_list = [each]

                elif each[0] in targeted_fields[1:3]:
                    # 'document_id' and 'From' header lines
                    _tmp_header_list.append(each)

                else:
                    # 'Subject' line and the article body
                    _tmp_subject_list.append(each)

                if i == len(data) - 1:
                    # end of file: flush the last article
                    _article_list.append([*_tmp_header_list,
                                          functools.reduce(operator.concat, _tmp_subject_list, [])])

    _article_list = [x for x in _article_list if len(x) > 1]    # removing the empty entries

    for x in _article_list:
        _final_dict_list.append({y[0]: ' '.join(y[1:]) for y in x})
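
A minimal usage sketch: the resulting _final_dict_list can be serialized the same way as _final_list further below, for example:

    import json

    # preview the first parsed article (header fields plus the concatenated Subject/body text)
    print(json.dumps(_final_dict_list[:1], indent=2))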

The approach below also works when each file contains multiple articles:

import glob

targeted_fields = ['Newsgroup', 'document_id', 'From', 'Subject']

_final_list = []

# path is the directory that contains the .txt files
files = glob.glob(path + '\\*.txt')
# iterate over the list, getting each file
for fle in files:

    with open(fle) as f:
        data = [x.split(':', 1) for x in f.readlines()]
        _temp_list = []
        for each in data:
            # collect only lines whose key is one of the targeted fields
            if len(each) > 1 and each[0] in targeted_fields:
                _temp_list.append(each)
            # once all four fields have been collected, store the article and reset
            if len(_temp_list) == len(targeted_fields):
                _final_list.append({x[0]: x[1].strip("\n") for x in _temp_list})
                _temp_list = []

_final_list will be a list of dictionaries; sample format (using 2 articles in 2 files, hence 4 results):

[ { 'From': ' et@teal.csn.org (Eric H. Taylor)',
    'Newsgroup': ' sci.space',
    'Subject': ' Re: Gravity waves, was: Predicting gravity wave quantization '
               '& Cosmic Noise',
    'document_id': ' 59497'},
  { 'From': ' et@teal.csn.org (Eric H. Taylor)2',
    'Newsgroup': ' sci.space2',
    'Subject': ' Re: Gravity waves, was: Predicting gravity wave quantization '
               '& Cosmic Noise2',
    'document_id': ' 594972'},
  { 'From': ' et@teal.csn.org (Eric H. Taylor)',
    'Newsgroup': ' sci.space',
    'Subject': ' Re: Gravity waves, was: Predicting gravity wave quantization '
               '& Cosmic Noise',
    'document_id': ' 59497'},
  { 'From': ' et@teal.csn.org (Eric H. Taylor)2',
    'Newsgroup': ' sci.space2',
    'Subject': ' Re: Gravity waves, was: Predicting gravity wave quantization '
               '& Cosmic Noise2',
    'document_id': ' 594972'}]

To convert the final result to JSON:

import json

data = json.dumps(_final_list)

JSON output:

[{"Newsgroup": " sci.space", "document_id": " 59497", "From": " et@teal.csn.org (Eric H. Taylor)", "Subject": " Re: Gravity waves, was: Predicting gravity wave quantization & Cosmic Noise"}, {"Newsgroup": " sci.space2", "document_id": " 594972", "From": " et@teal.csn.org (Eric H. Taylor)2", "Subject": " Re: Gravity waves, was: Predicting gravity wave quantization & Cosmic Noise2"}, {"Newsgroup": " sci.space", "document_id": " 59497", "From": " et@teal.csn.org (Eric H. Taylor)", "Subject": " Re: Gravity waves, was: Predicting gravity wave quantization & Cosmic Noise"}, {"Newsgroup": " sci.space2", "document_id": " 594972", "From": " et@teal.csn.org (Eric H. Taylor)2", "Subject": " Re: Gravity waves, was: Predicting gravity wave quantization & Cosmic Noise2"}]
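
To also write the JSON to a file (a minimal sketch; the file name articles.json is just an example):

import json

# write the list of dictionaries to disk as a single JSON array
with open('articles.json', 'w') as f:
    json.dump(_final_list, f, indent=2)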

To convert to CSV:

import csv

keys = _final_list[0].keys()
# newline='' avoids blank lines between rows on Windows
with open('people.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(_final_list)

CSV output:

Newsgroup,document_id,From,Subject
 sci.space, 59497, et@teal.csn.org (Eric H. Taylor)," Re: Gravity waves, was: Predicting gravity wave quantization & Cosmic Noise"
 sci.space2, 594972, et@teal.csn.org (Eric H. Taylor)2," Re: Gravity waves, was: Predicting gravity wave quantization & Cosmic Noise2"
 sci.space, 59497, et@teal.csn.org (Eric H. Taylor)," Re: Gravity waves, was: Predicting gravity wave quantization & Cosmic Noise"
 sci.space2, 594972, et@teal.csn.org (Eric H. Taylor)2," Re: Gravity waves, was: Predicting gravity wave quantization & Cosmic Noise2"
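
Since the end goal is Elasticsearch, the same list of dictionaries can be bulk-indexed with the official Python client. A minimal sketch, assuming a local Elasticsearch instance and a hypothetical index name newsgroups:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# assumption: Elasticsearch running locally; adjust the URL for your cluster
es = Elasticsearch("http://localhost:9200")

# one bulk action per parsed article; "newsgroups" is a hypothetical index name
actions = [{"_index": "newsgroups", "_source": doc} for doc in _final_list]
bulk(es, actions)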