I have a data frame (tibble) of words that looks like this:
   text      confidence type          start_time end_time
   <chr>     <chr>      <chr>         <chr>      <chr>
 1 Angela    0.7482     pronunciation 0.04       0.32
 2 very      1.0        pronunciation 0.32       0.59
 3 powerful  1.0        pronunciation 0.59       1.29
 4 .         0.0        punctuation   NA         NA
 5 And       1.0        pronunciation 1.3        1.65
 6 with      1.0        pronunciation 1.65       1.87
 7 every     1.0        pronunciation 1.88       2.24
 8 hurricane 1.0        pronunciation 2.24       2.75
 9 there's   0.8826     pronunciation 2.75       2.96
10 that      1.0        pronunciation 2.96       3.22
11 one's     0.6438     pronunciation 3.22       3.73
12 own       0.748      pronunciation 3.73       4.02
13 .         0.0        punctuation   NA         NA
14 It's      0.9278     pronunciation 4.02       4.19
15 usually   0.851      pronunciation 4.19       4.51
I'm trying to create a sentence ID value so that I can group the words into sentences. I'd like each ID to end on a row where type = punctuation, with the next ID starting on the following row, like this:
   text      confidence type          start_time end_time sentence_id
   <chr>     <chr>      <chr>         <chr>      <chr>          <dbl>
 1 Angela    0.7482     pronunciation 0.04       0.32               1
 2 very      1.0        pronunciation 0.32       0.59               1
 3 powerful  1.0        pronunciation 0.59       1.29               1
 4 .         0.0        punctuation   NA         NA                 1
 5 And       1.0        pronunciation 1.3        1.65               2
 6 with      1.0        pronunciation 1.65       1.87               2
 7 every     1.0        pronunciation 1.88       2.24               2
 8 hurricane 1.0        pronunciation 2.24       2.75               2
 9 there's   0.8826     pronunciation 2.75       2.96               2
10 that      1.0        pronunciation 2.96       3.22               2
11 one's     0.6438     pronunciation 3.22       3.73               2
12 own       0.748      pronunciation 3.73       4.02               2
13 .         0.0        punctuation   NA         NA                 2
14 It's      0.9278     pronunciation 4.02       4.19               3
15 usually   0.851      pronunciation 4.19       4.51               3
I'm sure there is a relatively simple way to do this, but I can't quite figure it out. Does anyone have suggestions? In case it helps, here is the dput:
structure(list(text = c("Angela", "very", "powerful", ".", "And",
"with", "every", "hurricane", "there's", "that", "one's", "own",
".", "It's", "usually"), confidence = c("0.7482", "1.0", "1.0",
"0.0", "1.0", "1.0", "1.0", "1.0", "0.8826", "1.0", "0.6438",
"0.748", "0.0", "0.9278", "0.851"), type = c("pronunciation",
"pronunciation", "pronunciation", "punctuation", "pronunciation",
"pronunciation", "pronunciation", "pronunciation", "pronunciation",
"pronunciation", "pronunciation", "pronunciation", "punctuation",
"pronunciation", "pronunciation"), start_time = c("0.04", "0.32",
"0.59", NA, "1.3", "1.65", "1.88", "2.24", "2.75", "2.96", "3.22",
"3.73", NA, "4.02", "4.19"), end_time = c("0.32", "0.59", "1.29",
NA, "1.65", "1.87", "2.24", "2.75", "2.96", "3.22", "3.73", "4.02",
NA, "4.19", "4.51")), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
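For reference, here is a rough sketch of one way the sentence_id column shown above might be computed. It is only a sketch under stated assumptions: the name words is hypothetical, only the text and type columns are rebuilt to keep it short, and it assumes every sentence ends with exactly one punctuation row. The idea is to count how many punctuation rows come before each row and add 1.

```r
# Hypothetical reduced copy of the data: only the two columns needed here.
words <- data.frame(
  text = c("Angela", "very", "powerful", ".", "And", "with", "every",
           "hurricane", "there's", "that", "one's", "own", ".", "It's",
           "usually"),
  type = c("pronunciation", "pronunciation", "pronunciation", "punctuation",
           "pronunciation", "pronunciation", "pronunciation", "pronunciation",
           "pronunciation", "pronunciation", "pronunciation", "pronunciation",
           "punctuation", "pronunciation", "pronunciation"),
  stringsAsFactors = FALSE
)

# TRUE on rows that close a sentence.
is_punct <- words$type == "punctuation"

# Shift the flags down by one row so the punctuation row itself stays in the
# sentence it closes, then take a running count and start numbering at 1.
words$sentence_id <- cumsum(c(0, head(is_punct, -1))) + 1

print(words)
```

If dplyr is loaded, the same idea could probably be written as mutate(sentence_id = cumsum(lag(type == "punctuation", default = FALSE)) + 1), after which group_by(sentence_id) would group the words by sentence.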