Как извлечь конкретный текст из корпуса? - PullRequest
1 голос
/ 31 октября 2019

У меня есть корпус с 213 документами разной длины. Моя цель состоит в том, чтобы извлечь из каждого документа определенный фрагмент текста, который относится к «фискальной политике». Что усложняет мою попытку, так это то, что фрагмент текста, который я хочу извлечь, отличается от текста к тексту. Единственные ключевые слова, которые регулярно появляются в начале: фискальная политика или фискальная политика , но не более того.

Давайте рассмотрим пример:

df <- data.frame(Text = c("Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. Regarding regards fiscal policies, almost all euro area countries have submitted their updated stability programmes. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. Against the background of current good times, it is essential that sound budgetary positions are reached in countries with fiscal imbalances and that a pro-cyclical loosening is avoided in all member countries. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. Turning to fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. MORE TEXT"))

cp <- corpus (df)

Конечная цель состоит в том, чтобы получить корпус, подобный этому:

df <- data.frame(Text = c("As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future.", "Regarding regards fiscal policies, almost all euro area countries have submitted their updated stability programmes.", "As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. Against the background of current good times, it is essential that sound budgetary positions are reached in countries with fiscal imbalances and that a pro-cyclical loosening is avoided in all member countries.", "Turning to fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes."))

cp <- corpus(df)

Обратите внимание, что я был бы счастлив, даже если бы я только получил немного интереса плюс "БОЛЬШЕ ТЕКСТА", который я не делаюхотеть. Я мог бы просто сделать это. Мне не удается туда добраться, хотя. До сих пор я безуспешно пытался использовать corpus_segment , а также безуспешно играть с фреймом данных.

Может ли кто-нибудь помочь мне с этим?

Большое спасибо!

1 Ответ

3 голосов
/ 31 октября 2019

Решение Base R, не требующее функции корпуса:

trimws(grep("fiscal polic.*", unlist(strsplit(df$Text, "[.]")), ignore.case = TRUE, value = TRUE), "both")

В ответ на дальнейший вопрос - найти индекс и использовать его подмножество данных:

# Return vector of sentences containing pattern: 

trimws(grep("fiscal polic.*", unlist(strsplit(df$Text, "[.]")), ignore.case = TRUE, value = TRUE), "both")

# Store the matched text as a vector: 

matched_text <- trimws(grep("fiscal .*", unlist(strsplit(df$Text, "[.]")), ignore.case = TRUE, value = TRUE), "both")

#Get the index of the dataframe for each element:

matched_text_idx <- sapply(matched_text, function(x){which(grepl(x, df$Text))})

# If you want to subset the dataframe to contain only the elements which contain pattern: 

df$Text[(which(grepl("fiscal polic.*", df$Text)))]

Данные:

    df <- data.frame(Text = c("Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. Regarding regards fiscal policies, almost all euro area countries have submitted their updated stability programmes. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. Against the background of current good times, it is essential that sound budgetary positions are reached in countries with fiscal imbalances and that a pro-cyclical loosening is avoided in all member countries. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. Turning to fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. MORE TEXT"), stringsAsFactors = FALSE)
...