I have a dataframe that contains news headlines for each day, and I am trying to analyse the sentiment intensity per day, i.e. to say whether the overall mood of the day's news is positive, negative or neutral. Here is the dataframe df_news:
Date name
0 2017-10-20 Gucci debuts art installation at its Ginza sto...
1 2018-08-01 Gucci Joins Paris Fashion Week for Its Spring ...
2 2018-04-20 Gucci launches its new creative hub Gucci ArtL...
3 2017-10-20 Gucci to launch homeware line Gucci Decor - CP...
4 2017-12-07 GUCCI opens new store at Miami Design District...
5 2018-01-12 Gucci opens Gucci Garden in Florence - LUXUO
6 2018-02-26 GUCCI's wild experiment with the Fall Winter 2...
7 2018-08-09 Gucci Revamped London Flagship Store | The Imp...
8 2018-08-01 Alessandro Michele Announces new Gucci Home co...
9 2017-10-20 Before He Picks Up the CFDA’s International Aw...
I tried to get the sentiment intensity with the following code, which uses SentimentIntensityAnalyzer from nltk.sentiment.vader:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import unicodedata

sid = SentimentIntensityAnalyzer()
for date, row in df_news.T.iteritems():
    try:
        sentence = unicodedata.normalize('NFKD', df_news.loc[date, 'name']).encode('ascii','ignore')
        #print((sentence))
        ss = sid.polarity_scores(str(sentence))
        df_news.set_value(date, 'compound', ss['compound'])
        df_news.set_value(date, 'neg', ss['neg'])
        df_news.set_value(date, 'neu', ss['neu'])
        df_news.set_value(date, 'pos', ss['pos'])
    except TypeError:
        print(df_news.loc[date, 'name'])
        print(date)
However, I get a TypeError for certain dates. Because of the try/except those rows are simply skipped, which produces the following table:
name compound neg neu pos
Date
2017-10-20 Gucci debuts art installation at its Ginza sto...
2018-08-01 Gucci Joins Paris Fashion Week for Its Spring ...
2018-04-20 Gucci launches its new creative hub Gucci ArtL... 0.4404 0 0.756 0.244
2017-10-20 Gucci to launch homeware line Gucci Decor - CP...
2017-12-07 GUCCI opens new store at Miami Design District... 0 0 1 0
2018-01-12 Gucci opens Gucci Garden in Florence - LUXUO 0 0 1 0
2018-02-26 GUCCI's wild experiment with the Fall Winter 2... 0 0 1 0
2018-08-09 Gucci Revamped London Flagship Store | The Imp... 0.3182 0 0.602 0.398
2018-08-01 Alessandro Michele Announces new Gucci Home co...
2017-10-20 Before He Picks Up the CFDA’s International Aw...
But when I remove the try/except to understand why it fails, I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-2e9dbfc62bce> in <module>
4 for date, row in df_news.T.iteritems():
5 # try:
----> 6 sentence = unicodedata.normalize('NFKD', df_news.loc[date, 'name']).encode('ascii','ignore')
7 #print((sentence))
8 ss = sid.polarity_scores(str(sentence))
TypeError: normalize() argument 2 must be str, not Series
Then I thought the problem was that some of the values were not strings, but, for example, the first one is:
>>>type(df_news['name'][0])
str
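Since the dtype looks fine, I also wondered whether repeated dates in the Date index could be the reason: .loc with a repeated index label returns a Series rather than a single string, which would match the error message. A quick diagnostic sketch (nothing assumed beyond df_news having Date as its index, as in the result table above):

# With 'Date' as the index, a repeated date makes df_news.loc[date, 'name']
# return a Series of headlines instead of a single str
print(df_news.index.duplicated().any())             # True -> some dates repeat
print(df_news.index[df_news.index.duplicated()])    # which dates repeat

repeated = df_news.index[df_news.index.duplicated()][0]
print(type(df_news.loc[repeated, 'name']))          # pandas Series, not str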
Here is how I get the data:
doc_data = {
    "size": 10,
    "query": {
        "bool": {
            "must": [
                {"term": {"text": "gucci"}}
            ]
        }
    }
}
docs = create_doc("https://elastic:rKzWu2WbXI@db.luxurynsight.com/luxurynsight_v2/news/_search",doc_data)
information_df = pd.DataFrame.from_dict(docs.json()["hits"]["hits"])
# Reading the JSON file
df_news = pd.read_json('data.json')
# Converting the element wise _source feature datatype to dictionary
df_news._source = df_news._source.apply(lambda x: dict(x))
# Creating name column
df_news['name'] = df_news._source.apply(lambda x: x['name'])
# Creating createdAt column
df_news['createdAt'] = df_news._source.apply(lambda x: x['createdAt'])
df_news['createdAt'] = pd.to_datetime(df_news['createdAt'], unit='ms')
df_news['createdAt'] = pd.DatetimeIndex(df_news.createdAt).normalize()
#df_news.createdAt.dt.normalize()
df_news['Date'] = df_news['createdAt']
df_news = df_news[['name','Date']]
df_news = df_news.set_index('Date')
information_df._source = information_df.apply(lambda x: dict(x))
df_news.reset_index()
This gives back:
Date name
0 2017-10-20 Gucci debuts art installation at its Ginza sto...
1 2018-08-01 Gucci Joins Paris Fashion Week for Its Spring ...
2 2018-04-20 Gucci launches its new creative hub Gucci ArtL...
3 2017-10-20 Gucci to launch homeware line Gucci Decor - CP...
4 2017-12-07 GUCCI opens new store at Miami Design District...
5 2018-01-12 Gucci opens Gucci Garden in Florence - LUXUO
6 2018-02-26 GUCCI's wild experiment with the Fall Winter 2...
7 2018-08-09 Gucci Revamped London Flagship Store | The Imp...
8 2018-08-01 Alessandro Michele Announces new Gucci Home co...
9 2017-10-20 Before He Picks Up the CFDA’s International Aw...
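(In case it helps to reproduce without the Elasticsearch call, an equivalent frame can be built by hand. This is only a stand-in sketch, with placeholder headlines rather than my real data, but the structure and the repeated dates are the same:)

import pandas as pd

# Stand-in for df_news: headlines indexed by 'Date', with deliberately repeated dates
df_news = pd.DataFrame({
    'Date': pd.to_datetime(['2017-10-20', '2017-10-20', '2018-01-12']),
    'name': [
        'placeholder headline A',
        'placeholder headline B',
        'Gucci opens Gucci Garden in Florence - LUXUO',
    ],
}).set_index('Date')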
Edit:
I grouped the articles that appear on the same day and put each day's articles into a list:
# get date out of the index to column
df_news = df_news.reset_index()
# optional
df_news['Date'] = pd.to_datetime(df_news['Date'])
# groupby and output group rows as list
df_news = df_news.groupby('Date')['name'].apply(list)
df_news.head()
This returns:
Date
2017-10-20 [Gucci debuts art installation at its Ginza st...
2017-12-07 [GUCCI opens new store at Miami Design Distric...
2018-01-12 [Gucci opens Gucci Garden in Florence - LUXUO]
2018-02-26 [GUCCI's wild experiment with the Fall Winter ...
2018-04-20 [Gucci launches its new creative hub Gucci Art...
2018-08-01 [Gucci Joins Paris Fashion Week for Its Spring...
2018-08-09 [Gucci Revamped London Flagship Store | The Im...
Name: name, dtype: object
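What I ultimately want from this grouped Series is a single sentiment per day, roughly like the sketch below (this assumes that joining a day's headlines into one text and scoring that is an acceptable way to measure the daily mood, which I am not sure about):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

# df_news is now a Series indexed by Date; each value is a list of headlines.
# Join each day's headlines into one text and keep the compound score.
daily_sentiment = df_news.apply(
    lambda titles: sid.polarity_scores(' '.join(titles))['compound']
)
print(daily_sentiment.head())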
So when I try to apply Stael's answer:
sentence = df_news.loc[date, 'name'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))
i.e. to normalize every element of the Series, I get the following error:
---------------------------------------------------------------------------
IndexingError Traceback (most recent call last)
<ipython-input-173-1bc93a0a065c> in <module>
5 try:
6 #sentence = unicodedata.normalize('NFKD', df_news.loc[date, 'name']).encode('ascii','ignore')
----> 7 sentence = df_news.loc[date, 'name'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))
8 ss = sid.polarity_scores(str(sentence))
9 df_news.set_value(date, 'compound', ss['compound'])
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1470 except (KeyError, IndexError):
1471 pass
-> 1472 return self._getitem_tuple(key)
1473 else:
1474 # we by definition only have the 0th axis
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
873
874 # no multi-index, so validate all of the indexers
--> 875 self._has_valid_tuple(tup)
876
877 # ugly hack for GH #836
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
218 for i, k in enumerate(key):
219 if i >= self.obj.ndim:
--> 220 raise IndexingError('Too many indexers')
221 try:
222 self._validate_key(k, i)
IndexingError: Too many indexers
And when I try to use only the date in the indexer, i.e. sentence = df_news.loc[date].apply(lambda x: ...), I get:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-176-308d1f6c6644> in <module>
5 try:
6 #sentence = unicodedata.normalize('NFKD', df_news.loc[date, 'name']).encode('ascii','ignore')
----> 7 sentence = df_news.loc[date].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii','ignore'))
8 ss = sid.polarity_scores(str(sentence))
9 df_news.set_value(date, 'compound', ss['compound'])
AttributeError: 'list' object has no attribute 'apply'
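As far as I can tell, this last error happens because after the groupby each value of df_news is a plain Python list, not a Series, so .apply does not exist on it. The closest I have got is a list comprehension like the sketch below (reusing sid and unicodedata from the code above), but I am still not sure how to turn the per-headline scores into one daily score:

for date in df_news.index:
    titles = df_news.loc[date]                     # plain Python list of headline strings
    normalized = [
        unicodedata.normalize('NFKD', t).encode('ascii', 'ignore')
        for t in titles
    ]
    scores = [sid.polarity_scores(str(t)) for t in normalized]
    # still need a way to combine these per-headline scores into one daily value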