How do I build a list of results from a text file using a regular expression in R?
1 vote
/ April 5, 2019

I need to convert all the arguments in a character vector of text into an easy-to-reference format: a list with 3 columns (presenter, time, and text), using R (sorry, I should have been clearer).

For example, the presenter should be

# HARPER'S

the time should be

# [Day 1, 9:00 A.M.]

and the text should be the rest of the argument.

I need to count the arguments in the text (each line that starts with a header such as

# HARPER'S [Day 1, 9:00 A.M.]

begins a new argument). I want to create a new list object named "arguments", in which each element is a sublist containing three elements ("Presenter", "Time", and "Text").

Then I need to extract the presenter's name and the time into two character vectors (also stripping the padding), and store them as the "Presenter" element and the "Time" element of the sublist for that argument.

This is the text: 
 [1] "HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young, the word hacking was"  
  [2] "used to describe the work of brilliant students who explored and expanded the"    
  [3] "uses to which this new technology might be employed.  There was even talk of a"   
  [4] "\"hacker ethic.\"  Somehow, in the succeeding years, the word has taken on dark"  
  [5] "connotations, suggestion the actions of a criminal.  What is the hacker ethic,"   
  [6] "and does it survive?"                                                             
  [7] ""                                                                                 
  [8] "ADELAIDE [Day 1, 9:25 A.M.]:  the hacker ethic survives, and it is a fraud.  It"  
  [9] "survives in anyone excited by technology's power to turn many small,"             
 [10] "insignificant things into one vast, beautiful thing.  It is a fraud because"      
 [11] "there is nothing magical about computers that causes a user to undergo"           
 [12] "religious conversion and devote himself to the public good.  Early automobile"    
 [13] "inventors were hackers too.  At first the elite drove in luxury.  Later"          
 [14] "practically everyone had a car.  Now we have traffic jams, drunk drivers, air"    
 [15] "pollution, and suburban sprawl.  The old magic of an automobile occasionally"     
 [16] "surfaces, but we possess no delusions that it automatically invades the"          
 [17] "consciousness of anyone who sits behind the wheel.  Computers are power, and"     
 [18] "direct contact with power can bring out the best or worst in a person.  It's"     
 [19] "tempting to think that everyone exposed to the technology will be grandly"        
 [20] "inspired, but, alas, it just ain't so."                                           
 [21] ""                                                                                 
 [22] "BRAND [Day 1, 9:54 A.M.]:  The hacker ethic involves several things.  One is"     
 [23] "avoiding waste; insisting on using idle computer power -- often hacking into a"   
 [24] "system to do so, while taking the greatest precautions not to damage the"         
 [25] "system.  A second goal of many hackers is the free exchange of  technical"        
 [26] "information.  These hackers feel that patent and copyright restrictions slow"     
 [27] "down technological advances.  A third goal is the advancement of human"           
 [28] "knowledge for its own sake.  Often this approach is unconventional.  People we"   
 [29] "call crackers often explore systems and do mischief.  The are called hackers by"  
 [30] "the press, which doesn't understand the issues."                                  
 [31] ""                                                                                 
 [32] "KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed early on because the"    
 [33] "explorations of basement tinkerers were very local.  Once we all became"          
 [34] "connected, the work of these investigations rippled through the world.  today"    
 [35] "the hacking spirit is alive and kicking in video, satellite TV, and radio.  In"   
 [36] "some fields they are called chippers, because the modify and peddle altered"      
 [37] "chips.  Everything that was once said about \"phone phreaks\" can be said about"  
 [38] "them too."

I tried to compute the number of arguments.

length(grep("^([A-Z]+'*[A-Z]*)", text_data))
arguments = list(presenters = regmatches(text_data, regexpr("^([A-Z]+'*[A-Z]*)", text_data)), time = regmatches(text_data, regexpr("(\\[.*\\])", text_data)), text =  regmatches(paste(unlist(text_data), collapse =" ")), regexpr("(:\\s.*)", regmatches(paste(unlist(text_data), collapse =" "))))
text_data

The length of the "arguments" list should be 55.

An example of the output would be: example of the data output format

Thank you very much for your help.

Answers [ 4 ]

1 vote
/ April 5, 2019

This is your input:

text_data = """HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young, the word hacking was
used to describe the work of brilliant students who explored and expanded the
uses to which this new technology might be employed.  There was even talk of a
\"hacker ethic.\"  Somehow, in the succeeding years, the word has taken on dark
connotations, suggestion the actions of a criminal.  What is the hacker ethic,
and does it survive? 

ADELAIDE [Day 1, 9:25 A.M.]:  the hacker ethic survives, and it is a fraud.  It
survives in anyone excited by technology's power to turn many small,
insignificant things into one vast, beautiful thing.  It is a fraud because
there is nothing magical about computers that causes a user to undergo
religious conversion and devote himself to the public good.  Early automobile
inventors were hackers too.  At first the elite drove in luxury.  Later
practically everyone had a car.  Now we have traffic jams, drunk drivers, air
pollution, and suburban sprawl.  The old magic of an automobile occasionally
surfaces, but we possess no delusions that it automatically invades the
consciousness of anyone who sits behind the wheel.  Computers are power, and
direct contact with power can bring out the best or worst in a person.  It's
tempting to think that everyone exposed to the technology will be grandly
inspired, but, alas, it just ain't so.

BRAND [Day 1, 9:54 A.M.]:  The hacker ethic involves several things.  One is
avoiding waste; insisting on using idle computer power -- often hacking into a
system to do so, while taking the greatest precautions not to damage the
system.  A second goal of many hackers is the free exchange of  technical
information.  These hackers feel that patent and copyright restrictions slow
down technological advances.  A third goal is the advancement of human
knowledge for its own sake.  Often this approach is unconventional.  People we
call crackers often explore systems and do mischief.  The are called hackers by
the press, which doesn't understand the issues.

KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed early on because the
explorations of basement tinkerers were very local.  Once we all became
connected, the work of these investigations rippled through the world.  today
the hacking spirit is alive and kicking in video, satellite TV, and radio.  In
some fields they are called chippers, because the modify and peddle altered
chips.  Everything that was once said about \"phone phreaks\" can be said about
them too."""

Extract the three variables using a regex:

import re
# Raw string, so the backslashes reach the regex engine unchanged.
argument = re.findall(r"(?P<presenter>[A-Z']+)\s\[(?P<time>[^\]]+)\]:\s+(?P<text>[\w\W]*?)(?=\n\n|\Z)", text_data)
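
Note that re.findall returns plain tuples, so the group names in the pattern do not appear in the result. A minimal sketch of how re.finditer and groupdict() keep those names; the shortened `sample` string and the `parse_arguments` helper are illustrative names introduced here, not part of the original answer:

```python
import re

# Shortened, hypothetical sample in the same layout as the transcript.
sample = (
    "HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young,\n"
    "the word hacking was used.\n"
    "\n"
    "KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed\n"
    "early on."
)

PATTERN = re.compile(
    r"(?P<presenter>[A-Z']+)\s\[(?P<time>[^\]]+)\]:\s+"
    r"(?P<text>[\w\W]*?)(?=\n\n|\Z)"
)

def parse_arguments(text):
    """Return one dict per argument, keyed by the regex group names."""
    return [m.groupdict() for m in PATTERN.finditer(text)]

parsed = parse_arguments(sample)
for item in parsed:
    print(item["presenter"], "|", item["time"])
```

Each element of `parsed` is a dict with the keys presenter, time, and text, which can be easier to work with than positional tuples.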

In case you want to build a dictionary from them:

mydict = {'presenter':[],'time':[],'text':[]}
for i in argument:
    mydict['presenter'].append(i[0])
    mydict['time'].append(i[1])
    mydict['text'].append(i[2])
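
The question asks for one sublist per argument rather than one column per field. A rough Python equivalent, assuming `argument` is the list of (presenter, time, text) tuples returned by re.findall; the two tuples below are shortened stand-ins, not real output:

```python
# Shortened stand-ins for the tuples returned by re.findall.
argument = [
    ("HARPER'S", "Day 1, 9:00 A.M.", "When the computer was young ..."),
    ("ADELAIDE", "Day 1, 9:25 A.M.", "the hacker ethic survives ..."),
]

# One dict per argument, using the element names from the question.
arguments = [
    {"Presenter": presenter.strip(), "Time": time.strip(), "Text": text.strip()}
    for presenter, time, text in argument
]

print(len(arguments))             # number of arguments
print(arguments[0]["Presenter"])  # presenter of the first argument
```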

Or, if you want to save them to a CSV file:

import csv
with open("filename.csv", "w", newline="") as mycsv:  # newline="" as the csv module recommends
    writers = csv.writer(mycsv)
    header = ['presenter','time','text']
    writers.writerow(header)
    for item in argument:
        writers.writerow(item)
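
The text fields contain line breaks, which the csv module handles by quoting the field. A small sketch showing that a multi-line field survives a write/read round trip; it uses an in-memory io.StringIO buffer instead of a file, and shortened stand-in data, purely to keep the example self-contained:

```python
import csv
import io

# Shortened stand-ins for the tuples returned by re.findall.
argument = [
    ("HARPER'S", "Day 1, 9:00 A.M.", "When the computer was young,\nthe word hacking was used."),
    ("KK", "Day 1, 11:19 A.M.", "The hacker ethic went unnoticed\nearly on."),
]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["presenter", "time", "text"])
writer.writerows(argument)

# Read the same data back; DictReader takes the keys from the header row.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["text"])
```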

To load the CSV file:

import pandas as pd
df = pd.read_csv("filename.csv")
df

Output:

   presenter |  time              | text
--------------------------------------------------------------------------------------
0   HARPER'S |  Day 1, 9:00 A.M.  | When the computer was young, the word hacking ...
1   ADELAIDE |  Day 1, 9:25 A.M.  | the hacker ethic survives, and it is a fraud. ...
2   BRAND    |  Day 1, 9:54 A.M.  | The hacker ethic involves several things. One...
3   KK       |  Day 1, 11:19 A.M. | The hacker ethic went unnoticed early on becau...
1 vote
/ April 5, 2019
library(magrittr)
library(data.table)

text2df <- function(text) {
    idx <- c(1, which(text == ""), length(text))
    apply(matrix(c(idx[-length(idx)], idx[-1]), ncol = 2), 1, function(id1_id2) {
        presenter_text <- text[id1_id2[1]:id1_id2[2]]
        first_row <- paste(presenter_text[1:2], collapse = "") # presenter_text[1] can be ''
        presenter_name <- strsplit(first_row, split = " [", fixed = T)[[1]][1]
        presentation_time <- strsplit(first_row, split = "]: ", fixed = T)[[1]][1] %>% 
            gsub(paste0(presenter_name, " ["), "", ., fixed = T)
        presentation_text <- paste(c(
            gsub(paste0(presenter_name, " [", presentation_time, "]:"), "", first_row, fixed = T) %>% 
                stringi::stri_trim_left() # remove leading spaces
            , presenter_text[3:length(presenter_text)] %>% .[!is.na(.)] # filter NA if only one row of text
        ), collapse = "")
        data.table(presenter = presenter_name, time = presentation_time, text = presentation_text)
    }) %>% rbindlist
}
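
The R function above splits the character vector at the empty strings and then takes each header apart with fixed-string splits. For comparison, a minimal Python sketch of the same idea on a list of lines; the `lines` sample and the `split_arguments` helper are illustrative and not a line-by-line translation of the R code:

```python
from itertools import groupby

# Shortened stand-in for the character vector in the question.
lines = [
    "HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young,",
    "the word hacking was used.",
    "",
    "KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed",
    "early on.",
]

def split_arguments(lines):
    """Group consecutive non-empty lines and split each header into fields."""
    result = []
    for is_blank, block in groupby(lines, key=lambda s: s == ""):
        if is_blank:
            continue  # blank lines only separate arguments
        joined = " ".join(block)
        header, _, text = joined.partition("]:")
        presenter, _, time = header.partition(" [")
        result.append({"presenter": presenter, "time": time, "text": text.strip()})
    return result

rows = split_arguments(lines)
print(rows)
```

Joining the lines with a single space keeps the words at line boundaries separated.
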
1 vote
/ April 5, 2019

Given the format of your text, this regular expression should do the job: it captures the presenter, the time, and the text in three groups, and re.findall finds every match and returns them as a list in which each element is a tuple holding those three pieces of information. Try this regex:

(.*?)\s+(\[[^[\]]*\]):\s*([\w\W]*?)(?=\n\n|\Z)


Sample Python code:

import re

s = """HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young, the word hacking was
used to describe the work of brilliant students who explored and expanded the
uses to which this new technology might be employed.  There was even talk of a
\"hacker ethic.\"  Somehow, in the succeeding years, the word has taken on dark
connotations, suggestion the actions of a criminal.  What is the hacker ethic,
and does it survive? 

ADELAIDE [Day 1, 9:25 A.M.]:  the hacker ethic survives, and it is a fraud.  It
survives in anyone excited by technology's power to turn many small,
insignificant things into one vast, beautiful thing.  It is a fraud because
there is nothing magical about computers that causes a user to undergo
religious conversion and devote himself to the public good.  Early automobile
inventors were hackers too.  At first the elite drove in luxury.  Later
practically everyone had a car.  Now we have traffic jams, drunk drivers, air
pollution, and suburban sprawl.  The old magic of an automobile occasionally
surfaces, but we possess no delusions that it automatically invades the
consciousness of anyone who sits behind the wheel.  Computers are power, and
direct contact with power can bring out the best or worst in a person.  It's
tempting to think that everyone exposed to the technology will be grandly
inspired, but, alas, it just ain't so.

BRAND [Day 1, 9:54 A.M.]:  The hacker ethic involves several things.  One is
avoiding waste; insisting on using idle computer power -- often hacking into a
system to do so, while taking the greatest precautions not to damage the
system.  A second goal of many hackers is the free exchange of  technical
information.  These hackers feel that patent and copyright restrictions slow
down technological advances.  A third goal is the advancement of human
knowledge for its own sake.  Often this approach is unconventional.  People we
call crackers often explore systems and do mischief.  The are called hackers by
the press, which doesn't understand the issues.

KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed early on because the
explorations of basement tinkerers were very local.  Once we all became
connected, the work of these investigations rippled through the world.  today
the hacking spirit is alive and kicking in video, satellite TV, and radio.  In
some fields they are called chippers, because the modify and peddle altered
chips.  Everything that was once said about \"phone phreaks\" can be said about
them too."""

argument = re.findall(r'(.*?)\s+(\[[^[\]]*\]):\s*([\w\W]*?)(?=\n\n|\Z)', s)
print(argument)

This prints a list of tuples, each with three elements: presenter, time, and text.

[("HARPER'S", '[Day 1, 9:00 A.M.]', 'When the computer was young, the word hacking was\nused to describe the work of brilliant students who explored and expanded the\nuses to which this new technology might be employed.  There was even talk of a\n"hacker ethic."  Somehow, in the succeeding years, the word has taken on dark\nconnotations, suggestion the actions of a criminal.  What is the hacker ethic,\nand does it survive? '), ('ADELAIDE', '[Day 1, 9:25 A.M.]', "the hacker ethic survives, and it is a fraud.  It\nsurvives in anyone excited by technology's power to turn many small,\ninsignificant things into one vast, beautiful thing.  It is a fraud because\nthere is nothing magical about computers that causes a user to undergo\nreligious conversion and devote himself to the public good.  Early automobile\ninventors were hackers too.  At first the elite drove in luxury.  Later\npractically everyone had a car.  Now we have traffic jams, drunk drivers, air\npollution, and suburban sprawl.  The old magic of an automobile occasionally\nsurfaces, but we possess no delusions that it automatically invades the\nconsciousness of anyone who sits behind the wheel.  Computers are power, and\ndirect contact with power can bring out the best or worst in a person.  It's\ntempting to think that everyone exposed to the technology will be grandly\ninspired, but, alas, it just ain't so."), ('BRAND', '[Day 1, 9:54 A.M.]', "The hacker ethic involves several things.  One is\navoiding waste; insisting on using idle computer power -- often hacking into a\nsystem to do so, while taking the greatest precautions not to damage the\nsystem.  A second goal of many hackers is the free exchange of  technical\ninformation.  These hackers feel that patent and copyright restrictions slow\ndown technological advances.  A third goal is the advancement of human\nknowledge for its own sake.  Often this approach is unconventional.  People we\ncall crackers often explore systems and do mischief.  The are called hackers by\nthe press, which doesn't understand the issues."), ('KK', '[Day 1, 11:19 A.M.]', 'The hacker ethic went unnoticed early on because the\nexplorations of basement tinkerers were very local.  Once we all became\nconnected, the work of these investigations rippled through the world.  today\nthe hacking spirit is alive and kicking in video, satellite TV, and radio.  In\nsome fields they are called chippers, because the modify and peddle altered\nchips.  Everything that was once said about "phone phreaks" can be said about\nthem too.')]
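
The captured text still contains the original line breaks and double spaces, and the time keeps its brackets. If one clean single-line string per argument is preferred, one option is to collapse every run of whitespace into a single space; len() then gives the number of arguments. This is only a sketch: the `normalize` helper is a name introduced here, and the tuples are shortened stand-ins for the list printed above.

```python
import re

# Shortened stand-ins for the tuples printed above.
argument = [
    ("HARPER'S", "[Day 1, 9:00 A.M.]", "When the computer was young,\nthe word  hacking was used. "),
    ("KK", "[Day 1, 11:19 A.M.]", "The hacker ethic went unnoticed\nearly on."),
]

def normalize(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

# Drop the surrounding brackets from the time and clean up the text.
cleaned = [(p, t.strip("[]"), normalize(body)) for p, t, body in argument]

print(len(cleaned))  # number of arguments
print(cleaned[0])
```
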
0 votes
/ April 5, 2019
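import re
# `line` is one header line of the transcript; this sample is the first line of the input.
line = "HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young, the word hacking was"
matchObj = re.search(r'(.*?)\s*\[(.*?)\]:\s*(.*)', line)
print(matchObj.group(1))  # presenter
print(matchObj.group(2))  # time
print(matchObj.group(3))  # text on the header line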

This may help. With groups you can extract the individual pieces; if you want to change the logic, change what is inside the parentheses "()".
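
Since this approach works on one line at a time, the same idea can be used to count the arguments by testing each line for a header. A sketch, assuming the input is a list of lines and that every argument starts with an upper-case name followed by a bracketed time; the `HEADER` pattern is an assumption based on the sample, not something stated in the question:

```python
import re

# Shortened stand-in for the character vector in the question.
lines = [
    "HARPER'S [Day 1, 9:00 A.M.]:  When the computer was young,",
    "the word hacking was used.",
    "",
    "KK [Day 1, 11:19 A.M.]:  The hacker ethic went unnoticed",
    "early on.",
]

# Header = upper-case name (apostrophes allowed), a space, "[...]", a colon.
HEADER = re.compile(r"^([A-Z']+) \[([^\]]+)\]:")

# Keep the (presenter, time) pair for every line that matches the header.
headers = [m.groups() for m in map(HEADER.match, lines) if m]

print(len(headers))  # number of arguments
print(headers)
```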

...