As @jezrael points out, the data is not uniform: some records have five pieces of information, others six.
To clean this up and load it into a DataFrame, you can do the following:
import requests as r
import pandas as pd
raw = r.get('https://s3.amazonaws.com/todel162/kaggle_unicode1.txt')
# the raw data has some non-ASCII characters, which you can likely ignore.
# I also drop the last line if it is blank, as that breaks the parsing.
data = raw.text.encode('ascii', errors='ignore').decode()
lines = [d.strip() for d in data.split('\n')]
if lines[-1] == '':
    lines = lines[:-1]
# then split out sections of data
# this one line replaces the commented-out for-loop below more elegantly
blurbs = [l.split('**') for l in '**'.join(lines).split('****')]
# blurbs = []
# blurb = []
# for line in lines:
#     if line == '':
#         blurbs.append(blurb)
#         blurb = []
#     else:
#         blurb.append(line)
# it seems each section can have either 5 or 6 elements, so write a function
# that returns a uniform-format record, and use pandas.DataFrame.from_records
# to load the records into a dataframe
def get_record(blurb):
    if len(blurb) == 6:
        return blurb
    return blurb[:3] + [''] + blurb[3:]
cols = ['task_name', 'task_description', 'task_date', 'other', 'task_prize', 'task_teams']
df = pd.DataFrame.from_records([get_record(b) for b in blurbs], columns=cols)
df.head()
This outputs the following:
Out[8]:
                                          task_name  \
0  TalkingData AdTracking Fraud Detection Challenge
1        CVPR 2018 WAD Video Segmentation Challenge
2         iMaterialist Challenge (Fashion) at FGVC5
3       iMaterialist Challenge (Furniture) at FGVC5
4               Google Landmark Retrieval Challenge

                                     task_description               task_date  \
0  Can you detect fraudulent click traffic for mo...   Featured13 days to go
1   Can you segment each objects within image fram...  Research2 months to go
2           Image classification of fashion products.   Researcha month to go
3     Image Classification of Furniture & Home Goods.   Researcha month to go
4   Given an image, can you find all of the same l...   Researcha month to go

        other  task_prize   task_teams
0                 $25,000  3,382 teams
1                  $2,500     32 teams
2                  $2,500     67 teams
3                  $2,500    238 teams
4  image data      $2,500    129 teams
As you can see, the data is correctly split into columns. From there you can convert types, drop the other column,
and so on, and then analyze the dataset.
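The cleanup steps mentioned above could be sketched roughly as follows. This is only an illustration: the small frame is built by hand from two rows of the output shown earlier, the names clean, task_category and task_deadline are made up for this example, and the assumption that task_date always starts with either Featured or Research comes only from the five rows visible above.

```python
import pandas as pd

# Two sample rows copied from the output above; in practice df comes from
# the parsing code in this answer.
df = pd.DataFrame({
    'task_name': ['TalkingData AdTracking Fraud Detection Challenge',
                  'Google Landmark Retrieval Challenge'],
    'task_date': ['Featured13 days to go', 'Researcha month to go'],
    'other': ['', 'image data'],
    'task_prize': ['$25,000', '$2,500'],
    'task_teams': ['3,382 teams', '129 teams'],
})

# Drop the filler column that was only needed to align 5- and 6-element records.
clean = df.drop(columns=['other'])

# Strip the dollar sign and thousands separators, then convert to integers.
clean['task_prize'] = (clean['task_prize']
                       .str.replace(r'[$,]', '', regex=True)
                       .astype(int))

# Keep only the leading number from strings such as '3,382 teams'.
clean['task_teams'] = (clean['task_teams']
                       .str.extract(r'([\d,]+)', expand=False)
                       .str.replace(',', '', regex=False)
                       .astype(int))

# Assumption: the category is fused onto the front of the deadline text
# and is one of 'Featured' or 'Research'.
parts = clean['task_date'].str.extract(r'^(Featured|Research)(.*)$')
clean['task_category'] = parts[0]
clean['task_deadline'] = parts[1]

print(clean.dtypes)
```

If the full file contains prizes that are not dollar amounts, astype(int) will raise an error; pd.to_numeric(..., errors='coerce') is a more forgiving alternative that turns such values into NaN.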