How do I search a list within a data frame? - PullRequest
1 vote
/ May 17, 2019

I am trying to find specific terms in a data.frame. It has 7 columns and 1356 rows. The two columns I am interested in are of type list. I would like to know where the word "hunter" appears in either of those columns.

If I use sapply to check the data type of each column, I get the following:

sapply(dataframe, class)

         ID    pdf_name     keyword    page_num    line_num   line_text  token_text 
"integer"    "factor" "character"   "integer"   "integer"      "list"      "list" 

When I try to filter out the rows of my data.frame that do not include my search term, using

filter(dataframe, !grepl("hunt",token_text))

, I get a printout of the entire data.frame. Ideally, I would like a printout of only those rows where the search term is present in one of the lists. Here is the head of my data, as dput output:

structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L), pdf_name = structure(c(1L, 
1L, 1L, 1L, 1L, 1L), .Label = c("Ames - 1994 - The Northwest Coast Complex Hunter-Gatherers, Eco.pdf", 
"Byers and Broughton - 2004 - Holocene Environmental Change, Artiodactyl Abundan.pdf", 
"Byers et al. - 2005 - Holocene artiodactyl population histories and larg.pdf", 
"Clarkson and Bellas - 2014 - Mapping stone using GIS spatial modelling to pred.pdf", 
"Codding and Jones - 2013 - Environmental productivity predicts migration, dem.pdf", 
"Elston and Zeanah - 2002 - Thinking outside the box a new perspective on die.pdf", 
"Elston et al. - 2014 - Living outside the box An updated perspective on .pdf", 
"FinlaysonBillWa_2017_2ExpandingNotionsOfHu_TheDiversityOfHunterG.pdf", 
"FinlaysonBillWa_2017_3ConceptualisingSubsi_TheDiversityOfHunterG.pdf", 
"FinlaysonBillWa_2017_5OkhotskAndSushenHist_TheDiversityOfHunterG.pdf", 
"FinlaysonBillWa_2017_6ComparativeAnalysisO_TheDiversityOfHunterG.pdf", 
"FinlaysonBillWa_2017_7LetsStartWithOurAcad_TheDiversityOfHunterG.pdf", 
"FinlaysonBillWa_2017_8ExperimentalEthnoarc_TheDiversityOfHunterG.pdf", 
"Fowler et al. - 2013 - Archaeology in the Great Basin and Southwest Pap.pdf", 
"Fulkerson - 2017 - Engendering the Past The Status of Gender and Fem.pdf", 
"GowdyJohnM_1998_2WhatHuntersDoForALiv_LimitedWantsUnlimited.pdf", 
"GowdyJohnM_1998_3SharingTalkingAndGiv_LimitedWantsUnlimited.pdf", 
"GowdyJohnM_1998_5BeyondTheOriginalAff_LimitedWantsUnlimited.pdf", 
"GowdyJohnM_1998_8TheFutureOfHunterGat_LimitedWantsUnlimited.pdf", 
"Gray - 2011 - The Evolutionary Biology of Education How Our Hun.pdf", 
"Grayson and Woolfenden - 2016 - Giant Sloths and Sabertooth Cats Archaeology of .pdf", 
"GraysonDonaldKW_2016_ClovisCometsAndClimat_GiantSlothsAndSaberto.pdf", 
"GraysonDonaldKW_2016_ExtinctMammalsDangero_GiantSlothsAndSaberto.pdf", 
"Hildebrandt and McGuire - 2003 - Large-Game Hunting, Gender-Differentiated Work Org.pdf", 
"Hockett - 1991 - Toward Distinguishing Human and Raptor Patterning .pdf", 
"Hockett - 2005 - Middle and Late Holocene Hunting in the Great Basi.pdf", 
"Hockett - 2010 - Back to Study Hall Further Reflections on Large G.pdf", 
"Hockett et al. - 2013 - Large-scale trapping features from the Great Basin.pdf", 
"Hockett et al. - 2014 - Identifying Dart and Arrow Points in The Great Bas.pdf", 
"Janz - 2016 - Fragmented Landscapes and Economies of Abundance.pdf", 
"Kintigh - 1997 - Thoughts on Writing in Archaeology With Special Re.pdf", 
"LaBelle and Pelton - 2013 - Communal hunting along the Continental Divide of N.pdf", 
"Lawson and Borgerhoff Mulder - 2016 - The offspring quantity-quality trade-off and human.pdf", 
"Lemke - 2016 - Hunting Architecture and Foraging Lifeways beneath.pdf", 
"Lew-Levy et al. - 2017 - How Do Hunter-Gatherer Children Learn Subsistence .pdf", 
"Louderback et al. - 2011 - Middle-Holocene climates and human population dens.pdf", 
"M. W. Lake - 2014 - Trends in Archaeological Simulation.pdf", 
"Madsen and Simms - 1998 - The Fremont Complex A Behavioral Perspective.pdf", 
"Margaret W. Conkey and Joan M. Gero - 1997 - Programme to Practice Gender and Feminism in Arch.pdf", 
"Ross et al. - 2016 - Evidence for quantity–quality trade-offs, sex-spec.pdf", 
"Silva et al. - 2014 - Historical ethnobotany an overview of selected st.pdf", 
"Smith et al. - 2013 - Paleoindian technological provisioning strategies .pdf", 
"Stirn - 2014 - Modeling site location patterns amongst late-prehi.pdf", 
"Trigger - 1984 - Archaeology at the Crossroads What's New.pdf"
), class = "factor"), keyword = c("table", "table", "table", 
"table", "table", "table"), page_num = c(2L, 2L, 2L, 3L, 3L, 
3L), line_num = c(29L, 38L, 63L, 98L, 102L, 106L), line_text = list(
    "Salmon have advantages for foragers (72, 111); they occur at predictable times, in predictable places, and in once prodigious numbers. ", 
    "Such variation in clumping is not predictable. ", "People inevitably began taking advantage of the rich, predictable resource. ", 
    "Matson reasons that intensification, sedentism, and ownership of resource patches evolved among hunter-gatherers when the resources were sufficiently abundant, reliable, predictable, and limited geographically and temporally. ", 
    "Matson holds that intensification, inequality, and sedentism each flow as inevitable consequences of the stmcture of the resource base, but only intensification and status differentials are causally linked. ", 
    "Matson's view is that Northwest Coast societies would only develop in an environment that was reliably rich and predictable. "), 
    token_text = list(list(c("salmon", "have", "advantages", 
    "for", "foragers", "72", "111", "they", "occur", "at", "predictable", 
    "times", "in", "predictable", "places", "and", "in", "once", 
    "prodigious", "numbers")), list(c("such", "variation", "in", 
    "clumping", "is", "not", "predictable")), list(c("people", 
    "inevitably", "began", "taking", "advantage", "of", "the", 
    "rich", "predictable", "resource")), list(c("matson", "reasons", 
    "that", "intensification", "sedentism", "and", "ownership", 
    "of", "resource", "patches", "evolved", "among", "hunter", 
    "gatherers", "when", "the", "resources", "were", "sufficiently", 
    "abundant", "reliable", "predictable", "and", "limited", 
    "geographically", "and", "temporally")), list(c("matson", 
    "holds", "that", "intensification", "inequality", "and", 
    "sedentism", "each", "flow", "as", "inevitable", "consequences", 
    "of", "the", "stmcture", "of", "the", "resource", "base", 
    "but", "only", "intensification", "and", "status", "differentials", 
    "are", "causally", "linked")), list(c("matson's", "view", 
    "is", "that", "northwest", "coast", "societies", "would", 
    "only", "develop", "in", "an", "environment", "that", "was", 
    "reliably", "rich", "and", "predictable")))), row.names = c(NA, 
6L), class = "data.frame")

Answers [ 2 ]

0 votes
/ May 17, 2019

This is a tidyverse solution. It is a bit messy because of the structure of your data. I unlisted your last column into strings. I saved your dput as df.

First, I unlist your last column and collapse each element into a single string. Then I select only the columns you are interested in, and third, I find which rows contain the word "hunter".

library(dplyr)
library(stringr)
df %>% 
  dplyr::mutate(token_text = unlist(lapply(lapply(token_text, unlist), paste, collapse = " "))) %>% 
  dplyr::select(line_text, token_text) %>% 
  lapply(function(x) which(stringr::str_detect(x, "hunter")))

$`line_text`
[1] 4

$token_text
[1] 4
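
For comparison, the same per-column row lookup can be sketched in base R without collapsing the tokens into one string. This is only an illustration under stated assumptions: the three-row `df` below is a hypothetical stand-in for the question's data, and `rows_with_term` is a helper name made up for this example.

```r
# Toy stand-in for the question's data (hypothetical, three rows only).
# token_text mirrors the dput: each cell is a one-element list that holds
# a character vector of tokens.
df <- data.frame(ID = 1:3)
df$line_text <- list(
  "Such variation in clumping is not predictable. ",
  "Ownership of resource patches evolved among hunter-gatherers. ",
  "People inevitably began taking advantage of the resource. "
)
df$token_text <- list(
  list(c("such", "variation", "in", "clumping", "is", "not", "predictable")),
  list(c("ownership", "of", "resource", "patches", "evolved", "among",
         "hunter", "gatherers")),
  list(c("people", "inevitably", "began", "taking", "advantage", "of",
         "the", "resource"))
)

# For one list column, return the row numbers whose cell contains the term.
# unlist() flattens the extra nesting; fixed = TRUE treats the term literally.
rows_with_term <- function(col, term) {
  found <- vapply(
    col,
    function(cell) any(grepl(term, unlist(cell), fixed = TRUE)),
    logical(1)
  )
  which(found)
}

hits <- lapply(df[c("line_text", "token_text")], rows_with_term, term = "hunter")
hits
```

On this toy data both entries of `hits` should point at row 2, the only row that mentions "hunter". Keeping the tokens as a vector means the match is tested per token rather than against one pasted string.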

0 votes
/ May 17, 2019

Here is an example using a fake data frame that I made from the sentences dataset in stringr. It is a long character vector, but we will split it on whitespace so that listcol is a list column of the individual words in each sentence:

library(tidyverse)

dataframe <- sentences %>%
  enframe(name = "rowid", value = "sentence") %>%
  mutate(listcol = str_split(sentence, "\\s"))
dataframe
#> # A tibble: 720 x 3
#>    rowid sentence                                    listcol  
#>    <int> <chr>                                       <list>   
#>  1     1 The birch canoe slid on the smooth planks.  <chr [8]>
#>  2     2 Glue the sheet to the dark blue background. <chr [8]>
#>  3     3 It's easy to tell the depth of a well.      <chr [9]>
#>  4     4 These days a chicken leg is a rare dish.    <chr [9]>
#>  5     5 Rice is often served in round bowls.        <chr [7]>
#>  6     6 The juice of lemons makes fine punch.       <chr [7]>
#>  7     7 The box was thrown beside the parked truck. <chr [8]>
#>  8     8 The hogs were fed chopped corn and garbage. <chr [8]>
#>  9     9 Four hours of steady work faced us.         <chr [7]>
#> 10    10 Large size in stockings is hard to sell.    <chr [8]>
#> # … with 710 more rows

So we have a data frame with a non-list column rowid and a list column listcol. We can filter to keep only the rows where the sentence contains "The". The trick is to use map_lgl (or sapply) to check each list element and see whether any of its elements match the pattern with str_detect (or grepl).

dataframe %>%
  filter(map_lgl(listcol, ~ any(str_detect(., "The"))))
#> # A tibble: 284 x 3
#>    rowid sentence                                          listcol   
#>    <int> <chr>                                             <list>    
#>  1     1 The birch canoe slid on the smooth planks.        <chr [8]> 
#>  2     4 These days a chicken leg is a rare dish.          <chr [9]> 
#>  3     6 The juice of lemons makes fine punch.             <chr [7]> 
#>  4     7 The box was thrown beside the parked truck.       <chr [8]> 
#>  5     8 The hogs were fed chopped corn and garbage.       <chr [8]> 
#>  6    11 The boy was there when the sun rose.              <chr [8]> 
#>  7    13 The source of the huge river is the clear spring. <chr [10]>
#>  8    18 The soft cushion broke the man's fall.            <chr [7]> 
#>  9    19 The salt breeze came across from the sea.         <chr [8]> 
#> 10    20 The girl at the booth sold fifty bonds.           <chr [8]> 
#> # … with 274 more rows

Created on 2019-05-16 by the reprex package (v0.2.1)
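
Applied to the question's layout, the same idea might look like the sketch below. It assumes a miniature tibble, `toy`, that is hypothetical and only mimics the dput, where each token_text cell is a list wrapping a character vector. That is why `unlist` is called before `str_detect`. The helper name `cell_has` is also made up for this illustration. Note that the question's call negates the match with `!`, so it keeps the rows that do not contain the term.

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical miniature of the question's data frame.
toy <- tibble(
  ID = 1:3,
  line_text = list(
    "Such variation in clumping is not predictable. ",
    "Resource patches evolved among hunter-gatherers. ",
    "People began taking advantage of the resource. "
  ),
  token_text = list(
    list(c("such", "variation", "in", "clumping", "is", "not", "predictable")),
    list(c("resource", "patches", "evolved", "among", "hunter", "gatherers")),
    list(c("people", "began", "taking", "advantage", "of", "the", "resource"))
  )
)

# TRUE when any string inside a (possibly nested) cell matches the pattern.
cell_has <- function(cell, pattern) any(str_detect(unlist(cell), pattern))

# Keep rows where either list column mentions the term (no negation).
hunter_rows <- toy %>%
  filter(
    map_lgl(line_text, ~ cell_has(.x, "hunter")) |
      map_lgl(token_text, ~ cell_has(.x, "hunter"))
  )
hunter_rows
```

With the real data, replacing `toy` with the full data frame should return only the rows where "hunter" appears in line_text or token_text.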
