How to sort a list of text from JSON data in pandas? - PullRequest
1 vote
/ March 25, 2020

I converted JSON data from a single folder into a pandas DataFrame, but the rows did not come out in order. Does anyone know how to sort the data?

This is the output of json_files:


However, my labels have the following order:

Label
0   BuzzFeed_Real_1
1   BuzzFeed_Real_2
2   BuzzFeed_Real_3
3   BuzzFeed_Real_4
4   BuzzFeed_Real_5
5   BuzzFeed_Real_6
6   BuzzFeed_Real_7
7   BuzzFeed_Real_8
8   BuzzFeed_Real_9
9   BuzzFeed_Real_10
10  BuzzFeed_Fake_1
11  BuzzFeed_Fake_2
12  BuzzFeed_Fake_3
13  BuzzFeed_Fake_4
14  BuzzFeed_Fake_5
15  BuzzFeed_Fake_6
16  BuzzFeed_Fake_7
17  BuzzFeed_Fake_8
18  BuzzFeed_Fake_9
19  BuzzFeed_Fake_10

Does anyone know how to sort the data based on the label? Thanks.

Here is my code:

import os, json
import pandas as pd
import numpy as np

path_to_json = 'data/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('json')]

#Here I define my pandas dataframe with the columns I want to get from json
jsons_data = pd.DataFrame(columns=['text','title'])

#We need both json and an index number so use enumerate()
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json,js)) as json_file:
        json_text = json.load(json_file)

        #the same structure 
        text = json_text['text']
        title = json_text['title']

        #Here I push a list of data into pandas DataFrame at row given by 'index'
        jsons_data.loc[index] = [text,title]

#Now we have the pertinent json data in our DataFrame

And this is the output of jsons_data:

text    title
0   Story highlights Obams reaffirms US commitment...   Obama in NYC: 'We all have a role to play' in ...
1   Well THAT’S Weird. If the Birther movement is ...   The AP, In 2004, Said Your Boy Obama Was BORN ...
2   The man arrested Monday in connection with the...   Bombing Suspect Filed Anti-Muslim Discriminati...
3   The Haitians in the audience have some newswor...   'Reporters' FLEE When Clintons Get EXPOSED!
4   Chicago Environmentalist Scumbags\n\nLeftists ...   The Black Sphere with Kevin Jackson
5   Obama weighs in on the debate\n\nPresident Bar...   Obama weighs in on the debate
6   Story highlights Ted Cruz refused to endorse T...   Donald Trump's rise puts Ted Cruz in a bind
7   Last week I wrote an article titled “Donald Tr...   More Milestone Moments for Donald Trump! – Eag...
8   Story highlights Trump has 45%, Clinton 42% an...   Georgia poll: Donald Trump, Hillary Clinton in...
9   Story highlights "This, though, is certain: to...   Hillary Clinton on police shootings: 'too many...
10  McCain Criticized Trump for Arpaio’s Pardon… S...   NFL Superstar Unleashes 4 Word Bombshell on Re...
11  On Saturday, September 17 at 8:30 pm EST, an e...   Another Terrorist Attack in NYC…Why Are we STI...
12  Less than a day after protests over the police...   Donald Trump: Drugs a 'Very, Very Big Factor' ...
13  Dolly Kyle has written a scathing “tell all” b...   HILLARY ON DISABLED CHILDREN During Easter Egg...
14  Former President Bill Clinton and his Clinton ...   Charity: Clinton Foundation Distributed “Water...
15  I woke up this morning to find a variation of ...   Proof The Mainstream Media Is Manipulating The...
16  Thanks in part to the declassification of Defe...   Declassified Docs Show That Obama Admin Create...
17  Critical Counties is a CNN series exploring 11...   Critical counties: Wake County, NC, could put ...
18  The Democrats are using an intimidation tactic...   Why is it “RACIST” to Question Someone’s Birth...
19  Back when the news first broke about the pay-t...   Clinton Foundation Spent 5.7% on Charity; Rest...

Answers [ 2 ]

0 votes
/ March 25, 2020

This should give you what you need to build an index from the file names. Let me know if you need help setting the index, and whether you want a two-level index or a single combined one:

import os, json
import pandas as pd
import numpy as np

path_to_json = 'data/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('json')]

#Here I define my pandas dataframe with the columns I want to get from json
jsons_data = pd.DataFrame(columns=['text','title'])

#We need both json and an index number so use enumerate()
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json,js)) as json_file:
        json_text = json.load(json_file)

        #the same structure 
        text = json_text['text']
        title = json_text['title']

        #Here I push a list of data into pandas DataFrame at row given by 'index'
        jsons_data.loc[index] = [text,title]

# Add column to your data frame containing 'json_files' list values
jsons_data['json_files'] = json_files

import re

# Create Regex to identify 'Fake' or 'Real' BuzzFeed
news_type = r"(Fake|Real)"

# Create Regex to extract numeric count
news_type_count = r"(\d+)"

# Extract news type to column
jsons_data['news_type'] = jsons_data['json_files'].str.extract(pat=news_type)

# Extract numeric count to column
jsons_data['news_type_count'] = jsons_data['json_files'].str.extract(pat=news_type_count)

# Convert numeric count to integer
jsons_data['news_type_count'] = jsons_data['news_type_count'].astype(int)

# Sort dataframe by 'news_type' and 'news_type_count'
jsons_data = jsons_data.sort_values(by=['news_type', 'news_type_count'])

# Print head of dataframe
print(jsons_data.head())
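
The answer mentions a two-level index or a single combined index. A minimal sketch of both options follows. It uses a small hypothetical stand-in for jsons_data: the three rows and their values are made up for illustration, and only the column names news_type and news_type_count come from the code above.

```python
import pandas as pd

# Hypothetical stand-in for jsons_data after the extraction steps above.
jsons_data = pd.DataFrame({
    'title': ['t1', 't2', 't3'],
    'news_type': ['Real', 'Fake', 'Real'],
    'news_type_count': [10, 1, 2],
})

# Option 1: a two-level index built from both columns, sorted level by level.
double_indexed = jsons_data.set_index(['news_type', 'news_type_count']).sort_index()

# Option 2: a single string index such as 'Real_2'. Sorting on the integer
# column first keeps 'Real_2' ahead of 'Real_10'.
ordered = jsons_data.sort_values(['news_type', 'news_type_count'])
single_indexed = ordered.set_index(
    ordered['news_type'] + '_' + ordered['news_type_count'].astype(str)
).drop(columns=['news_type', 'news_type_count'])

print(double_indexed)
print(single_indexed)
```

Both options place Fake before Real because the default order is ascending. Passing ascending=[False, True] to sort_index or sort_values should put Real first while keeping the numbers ascending.
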

0 votes
/ March 25, 2020

You can use the linked solution, splitting each value so that the Fake and Real strings are sorted in descending order and the numbers in ascending order:

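# Unsorted file names (same order as the index in the DataFrame printed further down)
L = ['BuzzFeed_Real_5-Webpage.json', 'BuzzFeed_Fake_9-Webpage.json',
     'BuzzFeed_Fake_6-Webpage.json', 'BuzzFeed_Fake_5-Webpage.json',
     'BuzzFeed_Fake_8-Webpage.json', 'BuzzFeed_Real_6-Webpage.json',
     'BuzzFeed_Real_7-Webpage.json', 'BuzzFeed_Real_8-Webpage.json',
     'BuzzFeed_Real_9-Webpage.json', 'BuzzFeed_Real_2-Webpage.json',
     'BuzzFeed_Real_4-Webpage.json', 'BuzzFeed_Real_1-Webpage.json',
     'BuzzFeed_Real_10-Webpage.json', 'BuzzFeed_Fake_4-Webpage.json',
     'BuzzFeed_Fake_10-Webpage.json', 'BuzzFeed_Fake_1-Webpage.json',
     'BuzzFeed_Fake_2-Webpage.json', 'BuzzFeed_Real_3-Webpage.json',
     'BuzzFeed_Fake_3-Webpage.json', 'BuzzFeed_Fake_7-Webpage.json']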

class reversor:
    def __init__(self, obj):
        self.obj = obj

    def __eq__(self, other):
        return other.obj == self.obj

    def __lt__(self, other):
        return other.obj < self.obj

a = sorted(L, key=lambda x: (reversor(x.split('_')[1]), int(x.split('_')[2].split('-')[0])))
print (a)
['BuzzFeed_Real_1-Webpage.json', 'BuzzFeed_Real_2-Webpage.json',
 'BuzzFeed_Real_3-Webpage.json', 'BuzzFeed_Real_4-Webpage.json', 
 'BuzzFeed_Real_5-Webpage.json', 'BuzzFeed_Real_6-Webpage.json', 
 'BuzzFeed_Real_7-Webpage.json', 'BuzzFeed_Real_8-Webpage.json', 
 'BuzzFeed_Real_9-Webpage.json', 'BuzzFeed_Real_10-Webpage.json', 
 'BuzzFeed_Fake_1-Webpage.json', 'BuzzFeed_Fake_2-Webpage.json', 
 'BuzzFeed_Fake_3-Webpage.json', 'BuzzFeed_Fake_4-Webpage.json', 
 'BuzzFeed_Fake_5-Webpage.json', 'BuzzFeed_Fake_6-Webpage.json', 
 'BuzzFeed_Fake_7-Webpage.json', 'BuzzFeed_Fake_8-Webpage.json', 
 'BuzzFeed_Fake_9-Webpage.json', 'BuzzFeed_Fake_10-Webpage.json']
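
The reversor class is needed only because the string part must sort in descending order while the number sorts in ascending order. A possible alternative, offered as a sketch and not as part of the original answer, is to map each type to an explicit rank. The names sort_key, TYPE_RANK and PATTERN and the four sample file names are assumptions for illustration.

```python
import re

# Hypothetical sample; file names are assumed to look like
# 'BuzzFeed_<Real|Fake>_<number>-Webpage.json'.
files = [
    'BuzzFeed_Fake_2-Webpage.json',
    'BuzzFeed_Real_10-Webpage.json',
    'BuzzFeed_Real_9-Webpage.json',
    'BuzzFeed_Fake_10-Webpage.json',
]

# Lower rank sorts first, so 'Real' precedes 'Fake'.
TYPE_RANK = {'Real': 0, 'Fake': 1}
PATTERN = re.compile(r'_(Real|Fake)_(\d+)')


def sort_key(name):
    """Return (type rank, numeric suffix) for one file name."""
    match = PATTERN.search(name)
    if match is None:
        raise ValueError(f'unexpected file name: {name}')
    news_type, number = match.groups()
    return TYPE_RANK[news_type], int(number)


ordered_files = sorted(files, key=sort_key)
print(ordered_files)
```

An explicit rank dictionary also makes it easy to add further categories later, at the cost of failing loudly on names that do not match the assumed pattern.
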

A similar idea in pandas: split the values into new columns, then sort with DataFrame.sort_values:

df = pd.DataFrame({'a':L})
df = df.join(df['a'].str.split('_', expand=True))
df['num'] = df[2].str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values([1, 'num'], ascending=[False, True])
print (df)
                                a         0     1                2  num
11   BuzzFeed_Real_1-Webpage.json  BuzzFeed  Real   1-Webpage.json    1
9    BuzzFeed_Real_2-Webpage.json  BuzzFeed  Real   2-Webpage.json    2
17   BuzzFeed_Real_3-Webpage.json  BuzzFeed  Real   3-Webpage.json    3
10   BuzzFeed_Real_4-Webpage.json  BuzzFeed  Real   4-Webpage.json    4
0    BuzzFeed_Real_5-Webpage.json  BuzzFeed  Real   5-Webpage.json    5
5    BuzzFeed_Real_6-Webpage.json  BuzzFeed  Real   6-Webpage.json    6
6    BuzzFeed_Real_7-Webpage.json  BuzzFeed  Real   7-Webpage.json    7
7    BuzzFeed_Real_8-Webpage.json  BuzzFeed  Real   8-Webpage.json    8
8    BuzzFeed_Real_9-Webpage.json  BuzzFeed  Real   9-Webpage.json    9
12  BuzzFeed_Real_10-Webpage.json  BuzzFeed  Real  10-Webpage.json   10
15   BuzzFeed_Fake_1-Webpage.json  BuzzFeed  Fake   1-Webpage.json    1
16   BuzzFeed_Fake_2-Webpage.json  BuzzFeed  Fake   2-Webpage.json    2
18   BuzzFeed_Fake_3-Webpage.json  BuzzFeed  Fake   3-Webpage.json    3
13   BuzzFeed_Fake_4-Webpage.json  BuzzFeed  Fake   4-Webpage.json    4
3    BuzzFeed_Fake_5-Webpage.json  BuzzFeed  Fake   5-Webpage.json    5
2    BuzzFeed_Fake_6-Webpage.json  BuzzFeed  Fake   6-Webpage.json    6
19   BuzzFeed_Fake_7-Webpage.json  BuzzFeed  Fake   7-Webpage.json    7
4    BuzzFeed_Fake_8-Webpage.json  BuzzFeed  Fake   8-Webpage.json    8
1    BuzzFeed_Fake_9-Webpage.json  BuzzFeed  Fake   9-Webpage.json    9
14  BuzzFeed_Fake_10-Webpage.json  BuzzFeed  Fake  10-Webpage.json   10

a = df['a'].tolist()
print (a)
['BuzzFeed_Real_1-Webpage.json', 'BuzzFeed_Real_2-Webpage.json',
 'BuzzFeed_Real_3-Webpage.json', 'BuzzFeed_Real_4-Webpage.json', 
 'BuzzFeed_Real_5-Webpage.json', 'BuzzFeed_Real_6-Webpage.json', 
 'BuzzFeed_Real_7-Webpage.json', 'BuzzFeed_Real_8-Webpage.json', 
 'BuzzFeed_Real_9-Webpage.json', 'BuzzFeed_Real_10-Webpage.json', 
 'BuzzFeed_Fake_1-Webpage.json', 'BuzzFeed_Fake_2-Webpage.json', 
 'BuzzFeed_Fake_3-Webpage.json', 'BuzzFeed_Fake_4-Webpage.json', 
 'BuzzFeed_Fake_5-Webpage.json', 'BuzzFeed_Fake_6-Webpage.json', 
 'BuzzFeed_Fake_7-Webpage.json', 'BuzzFeed_Fake_8-Webpage.json', 
 'BuzzFeed_Fake_9-Webpage.json', 'BuzzFeed_Fake_10-Webpage.json']
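
To connect either approach back to the loading loop in the question, one option is to sort the file names before reading them, so the rows are created in the desired order. This is a minimal sketch, assuming file names of the form BuzzFeed_<type>_<n>-Webpage.json and JSON objects with text and title keys. The temporary directory, the three sample files, the helper type_and_number and the label column are made up so the example runs on its own.

```python
import json
import os
import tempfile

import pandas as pd


def type_and_number(name):
    # 'BuzzFeed_Real_10-Webpage.json' -> ('Real', 10)
    _, news_type, rest = name.split('_')
    return news_type, int(rest.split('-')[0])


with tempfile.TemporaryDirectory() as path_to_json:
    # Hypothetical sample files standing in for the real data/ folder.
    for name in ['BuzzFeed_Real_10-Webpage.json',
                 'BuzzFeed_Fake_1-Webpage.json',
                 'BuzzFeed_Real_2-Webpage.json']:
        with open(os.path.join(path_to_json, name), 'w') as fh:
            json.dump({'text': 'text of ' + name, 'title': 'title of ' + name}, fh)

    json_files = [f for f in os.listdir(path_to_json) if f.endswith('.json')]

    # Two stable passes: numbers ascending first, then type descending,
    # so 'Real' comes before 'Fake' and ties keep their numeric order.
    json_files.sort(key=lambda n: type_and_number(n)[1])
    json_files.sort(key=lambda n: type_and_number(n)[0], reverse=True)

    # Collect one dict per file and build the DataFrame once at the end.
    rows = []
    for name in json_files:
        with open(os.path.join(path_to_json, name)) as fh:
            data = json.load(fh)
        rows.append({'label': name.split('-')[0],
                     'text': data['text'],
                     'title': data['title']})

jsons_data = pd.DataFrame(rows, columns=['label', 'text', 'title'])
print(jsons_data)
```

Building the frame from a list of dicts avoids growing it row by row with .loc, and keeping the label next to text and title makes the row order easy to verify.
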