As @Merijn van Tilborg said, you need a very clear idea of what your sentences look like, because if a sentence contains more than one pronoun, this approach may not give the result you want.
However, you can handle those cases too. We can try it with the dplyr and tidytext packages, but first we need to clean the data a little:
# define the words that mark each gender
female <- c("She", "Her")
male <- c("He", "His")

# your data, with examples of several cases
df <- data.frame(
  line = c(1, 2, 3, 4, 5, 6),
  text = c("She is happy",          # female
           "Her dog is happy",      # female (although the subject, the dog, is not)
           "He is happy",           # male
           "His dog is happy",      # male
           "It is happy",           # ?
           "She and he are happy"), # both!
  stringsAsFactors = FALSE          # life saver
)
Now we can try something like this:
library(tidytext)
library(dplyr)

df %>%
  unnest_tokens(word, text) %>%  # one word per row (lowercased by default)
  mutate(gender = ifelse(word %in% tolower(female), 'female',
                  ifelse(word %in% tolower(male), 'male', 'unknown'))) %>%  # detect male and female, remember tolower!
  filter(gender != 'unknown') %>%  # remove the unknown
  right_join(df, by = "line") %>%  # join with the original sentences, keeping all of them
  select(-word)  # remove the now useless column
  line gender                 text
1    1 female         She is happy
2    2 female     Her dog is happy
3    3   male          He is happy
4    4   male     His dog is happy
5    5   <NA>          It is happy
6    6 female She and he are happy
7    6   male She and he are happy
You can see that sentences 1, 2, 3 and 4 match your pattern, "it" is left undefined, and when a sentence contains both male and female words the row is duplicated, once for each gender found.
Finally, you can collapse everything to one row per sentence by adding this to the dplyr chain:
%>% group_by(text, line) %>% summarise(gender = paste(gender, collapse = ','))
# A tibble: 6 x 3
# Groups:   text [?]
  text                  line gender
  <chr>                <dbl> <chr>
1 He is happy              3 male
2 Her dog is happy         2 female
3 His dog is happy         4 male
4 It is happy              5 NA
5 She and he are happy     6 female,male
6 She is happy             1 female
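One thing to watch in the collapsed output: paste() turns a missing value into the literal string "NA", and it repeats a label once for every matching word in the sentence. If that matters for you, something along these lines might help. This is only a sketch: `matches` stands in for the long (uncollapsed) result shown earlier, and `label_gender` is a helper name I made up, not part of dplyr.

```r
library(dplyr)

# Hypothetical stand-in for the long result: one row per matched word.
matches <- data.frame(
  line   = c(1, 2, 3, 4, 5, 6, 6),
  gender = c("female", "female", "male", "male", NA, "female", "male"),
  text   = c("She is happy", "Her dog is happy", "He is happy",
             "His dog is happy", "It is happy",
             "She and he are happy", "She and he are happy"),
  stringsAsFactors = FALSE
)

# Made-up helper: drop NAs and duplicates before pasting,
# and return a real NA when nothing was found.
label_gender <- function(gender) {
  found <- sort(unique(gender[!is.na(gender)]))
  if (length(found) == 0) NA_character_ else paste(found, collapse = ",")
}

collapsed <- matches %>%
  group_by(line, text) %>%
  summarise(gender = label_gender(gender)) %>%
  ungroup()

collapsed
```

With this, a sentence without any match keeps a true NA (so is.na() works on it), and a sentence that mentions the same gender several times gets that label only once.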
EDIT: Let's try it with your data:
data1 <- read.table(text="
data1.Gender A B C D E data1.Description
1 Female 0 0 0 0 0 'Ranjit Singh President of Boparan Holdings Limited Ranjit is President of Boparan Holdings Limited.'
2 Female 0 0 0 NA NA 'He founded the business in 1993 and has more than 25 years’ experience in the food industry.'
3 Female 0 0 0 NA NA 'Ranjit is particularly skilled at growing businesses, both organically and through acquisition.'
4 Female 0 0 0 NA NA 'Notable acquisitions include Northern Foods and Brookes Avana in 2011.'
5 Female 0 0 0 NA NA 'Ranjit and his wife Baljinder Boparan are the sole shareholders of Boparan Holdings, the holding company for 2 Sisters Food Group.'
6 Female 0 0 0 NA NA 's'",stringsAsFactors = FALSE)
# define the words that mark each gender; in this case I've also added the names
female <- c("She", "Her", "Baljinder")
male <- c("He", "His", "Ranjit")

# clean the data
df <- data.frame(
  line = rownames(data1),
  text = data1$data1.Description,
  stringsAsFactors = FALSE)
library(tidytext)
library(dplyr)

df %>%
  unnest_tokens(word, text) %>%  # one word per row
  mutate(gender = ifelse(word %in% tolower(female), 'female',
                  ifelse(word %in% tolower(male), 'male', 'unknown'))) %>%  # detect male and female, remember tolower!
  filter(gender != 'unknown') %>%  # remove the unknown
  right_join(df) %>%  # join with the original sentences, keeping all of them
  select(-word) %>%
  group_by(text, line) %>%
  summarise(gender = paste(gender, collapse = ','))
The result:
Joining, by = "line"
# A tibble: 6 x 3
# Groups:   text [?]
  text                                                            line  gender
  <chr>                                                           <chr> <chr>
1 He founded the business in 1993 and has more than 25 years’ ex~ 2     male
2 Notable acquisitions include Northern Foods and Brookes Avana ~ 4     NA
3 Ranjit and his wife Baljinder Boparan are the sole shareholder~ 5     male,male,fe~
4 Ranjit is particularly skilled at growing businesses, both org~ 3     male
5 Ranjit Singh President of Boparan Holdings Limited Ranjit is P~ 1     male,male
6 s                                                               6     NA
The real challenge is defining all the words that you want to treat as "male" or "female".
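As the word lists grow, it may be easier to keep them in a small lookup table and join against it instead of nesting ifelse() calls. A rough sketch, assuming a made-up `lexicon` data frame and toy `sentences` (neither comes from your data, and the pronouns listed are just examples you would extend):

```r
library(dplyr)
library(tidytext)

# Hypothetical lookup table; add names or other words that fit your data.
lexicon <- data.frame(
  word   = c("she", "her", "hers", "herself", "he", "his", "him", "himself"),
  gender = c(rep("female", 4), rep("male", 4)),
  stringsAsFactors = FALSE
)

# Toy sentences, only for illustration.
sentences <- data.frame(
  line = 1:3,
  text = c("She is happy", "I saw him yesterday", "It is happy"),
  stringsAsFactors = FALSE
)

tagged <- sentences %>%
  unnest_tokens(word, text) %>%           # one lowercased word per row
  inner_join(lexicon, by = "word") %>%    # keep only words found in the lexicon
  select(line, gender) %>%
  distinct() %>%                          # one row per line/gender pair
  right_join(sentences, by = "line") %>%  # bring back sentences with no match
  arrange(line)

tagged
```

Adding a new word is then just one more row in `lexicon`, and the lowercase comparison is handled by storing the words in lowercase to match what unnest_tokens() produces.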