How to split and create new CSV files based on date/day in R? - PullRequest
4 votes
/ April 19, 2020

Hi, I have an 8 GB file that I need to analyze. However, my RAM is limited. To work efficiently, I decided to split my CSV file by rows using the following code:

library(tidyverse)

sample_df <- readr::read_csv("sample.csv") #Read in the csv file
dput(sample_df)

#break the large CSV so RAM and Rstudio doesn't crash

groups <- (split(sample_df, (seq(nrow(sample_df))-1) %/% 20)) #here I want 20 rows per file until last row is reached

for (i in seq_along(groups)) {
  write.csv(groups[[i]], paste0("sample_output_file", i, ".csv")) #iterate and write file
}

This worked perfectly until my senior mentor asked me to run the analysis per date/day. That is where I ran into trouble: by splitting on row count, I ended up spreading each date across several CSV files. This brings back the RAM and memory-management problem when I have to read 3-4 CSVs to analyze a single day.

The sample file is here: https://github.com/THsTestingGround/SO_splitbydate_question/blob/master/sample.csv

So can anyone help me split the following sample CSV file, which I read in initially, based on date? I want all of Apr 1 together in one CSV file, then Apr 2 in another, and so on. I made an attempt but could not get it to work.

I was also wondering whether readr::read_csv_chunked could help here in any way. I didn't see anything in the documentation that covers this.
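On the read_csv_chunked idea: a minimal sketch of how it might be used to route rows into per-day files without loading the whole CSV. Everything outside readr and stringr here is an assumption for illustration: the four-row stand-in file, the by_day output folder, the append_by_day helper name, and the tiny chunk size. The day key reuses a "month day" pattern on createdAt, so it ignores the year.

```r
library(readr)
library(stringr)

# Tiny stand-in for the large file (hypothetical rows, same createdAt layout)
in_file <- tempfile(fileext = ".csv")
write_csv(data.frame(
  createdAt = c("Fri Apr 01 04:04:32 +0000 2020",
                "Fri Apr 02 04:04:40 +0000 2020",
                "Fri Apr 02 04:04:44 +0000 2020",
                "Fri Apr 03 04:05:12 +0000 2020"),
  text = c("a", "b", "c", "d")
), in_file)

# Hypothetical output folder, recreated empty so appends start clean
out_dir <- file.path(tempdir(), "by_day")
unlink(out_dir, recursive = TRUE)
dir.create(out_dir)

# Called once per chunk: split the chunk by day and append to per-day files
append_by_day <- function(chunk, pos) {
  day <- str_replace(chunk$createdAt, "^\\w+\\s+(\\w+\\s+\\d+).*", "\\1")
  pieces <- split(chunk, day)
  for (d in names(pieces)) {
    out_file <- file.path(out_dir, paste0(str_replace_all(d, "\\s+", "_"), ".csv"))
    # The header is written only when the file does not exist yet
    write_csv(pieces[[d]], out_file, append = file.exists(out_file))
  }
}

read_csv_chunked(
  in_file,
  callback = SideEffectChunkCallback$new(append_by_day),
  chunk_size = 2,  # tiny for the demo; something far larger on real data
  col_types = cols(.default = col_character())  # same types in every chunk
)
```

Reading every column as character keeps long IDs intact and avoids type guesses that differ between chunks; columns can be converted later when a single day's file is read back in.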

Here is the dput of the CSV file:

dput(sample_df)
structure(list(createdAt = c("Fri Apr 01 04:04:32 +0000 2020", 
"Fri Apr 01 04:04:36 +0000 2020", "Fri Apr 01 04:04:37 +0000 2020", 
"Fri Apr 02 04:04:40 +0000 2020", "Fri Apr 02 04:04:44 +0000 2020", 
"Fri Apr 02 04:04:46 +0000 2020", "Fri Apr 02 04:04:54 +0000 2020", 
"Fri Apr 02 04:04:56 +0000 2020", "Fri Apr 02 04:05:07 +0000 2020", 
"Fri Apr 02 04:05:12 +0000 2020", "Fri Apr 03 04:05:12 +0000 2020", 
"Fri Apr 03 04:05:19 +0000 2020", "Fri Apr 03 04:05:27 +0000 2020", 
"Fri Apr 03 04:05:33 +0000 2020", "Fri Apr 03 04:05:36 +0000 2020", 
"Fri Apr 03 04:06:11 +0000 2020", "Fri Apr 03 04:07:08 +0000 2020", 
"Fri Apr 03 04:07:14 +0000 2020", "Fri Apr 03 04:07:15 +0000 2020", 
"Fri Apr 03 04:07:20 +0000 2020", "Fri Apr 03 04:07:30 +0000 2020", 
"Fri Apr 03 04:07:51 +0000 2020", "Fri Apr 03 04:08:04 +0000 2020", 
"Fri Apr 03 04:08:09 +0000 2020", "Fri Apr 03 04:08:15 +0000 2020", 
"Fri Apr 03 04:08:22 +0000 2020", "Fri Apr 03 04:08:36 +0000 2020", 
"Fri Apr 03 04:08:46 +0000 2020", "Fri Apr 03 04:08:46 +0000 2020", 
"Fri Apr 03 04:09:01 +0000 2020", "Fri Apr 03 04:09:08 +0000 2020", 
"Fri Apr 03 04:09:10 +0000 2020", "Fri Apr 03 04:09:15 +0000 2020", 
"Fri Apr 03 04:09:26 +0000 2020", "Fri Apr 03 04:09:27 +0000 2020", 
"Fri Apr 03 04:09:28 +0000 2020", "Fri Apr 03 04:09:28 +0000 2020", 
"Fri Apr 03 04:09:35 +0000 2020", "Fri Apr 03 04:09:36 +0000 2020", 
"Fri Apr 03 04:09:41 +0000 2020", "Fri Apr 03 04:09:45 +0000 2020", 
"Fri Apr 03 04:10:16 +0000 2020", "Fri Apr 03 04:10:19 +0000 2020", 
"Fri Apr 03 04:10:22 +0000 2020", "Fri Apr 03 04:10:26 +0000 2020", 
"Fri Apr 03 04:10:31 +0000 2020", "Fri Apr 03 04:10:48 +0000 2020", 
"Fri Apr 04 04:11:19 +0000 2020", "Fri Apr 04 04:11:32 +0000 2020", 
"Fri Apr 04:11:44 +0000 2020"), timestamp = c(1.58589e+12, 1.58589e+12, 
1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 
1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 
1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 
1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 
1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 
1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 
1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 
1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 
1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 1.58589e+12, 
1.58589e+12, 1.58589e+12, 1.58589e+12), id_str = c(1.24593e+18, 
1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 
1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 
1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 
1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 
1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 
1.24593e+18, 1.25e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 
1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 
1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 
1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18, 
1.24593e+18, 1.24593e+18, 1.24593e+18, 1.24593e+18), text = c("Finally. Make your own mask. Protect yourself and others. #coronavirus", 
"@ArvinderSoin do you feel the use of only masks for IPD rounds, in an environment where no patients have been teste…", 
"India, you actually deserve him for electing him.\n\nAb batti bhujao aur #corona bhagav.\n\nNo testing kits, no masks,…", 
"great picture to sum up everything\n#mask #maskefficiency #noclothmask #maskprotection #surgicalmask #N95 #FFP1…", 
"The greatest hazard to public health is official misinformation.\n\nAsian countries were wearing masks from the begin…", 
"#Florida official says @3M is selling face masks to foreign countries instead of his state amid #COVID19 crisis.\n", 
"Wearing masks is one of the protective measures preventing catching the novel #Coronavirus as the pandemic spreads…", 
"It took Americans two and a half months to start wearing masks. Think about why, maybe it could explain why the peo…", 
"#coronavirus watching me put on the same surgical mask 2 shifts in a row\n\n#COVID<U+30FC>19 #nurse", 
"Back in stock! NIOSH N95, go to our website.\nOnly 11,000 masks \n\n#facemask #facemasks #N95…", 
"Hence the vital importance of wearing masks when outside - #coronavirus #coronavirusindia #COVID2019india…", 
"@Read5000YrLeap @SenSchumer buy trump facemasks. support trump 2020 and be safe. ships from midwest. #Boycott3M… ", 
"When going out for essential activities, members of the public should wear reusable, non-medical cloth face coverin…", 
"@jmcmaccarr buy trump facemasks. support trump 2020 and be safe. ships from midwest. #Boycott3M @seanhannity…", 
"It took Americans two and a half months to start wearing masks. Think about why, maybe it could explain why the peo…", 
"@CNN Just #WearMask People    wearing a mask Nationwide ... SAVES…", 
"That is less than 4 million per week.  In Taiwan, everyone is allocated 3 surgical masks per week.  For Australia t…", 
"@Constitution999 @ChuckCallesto @realDonaldTrump buy trump facemasks. support trump 2020 and be safe. ships from mi…", 
"Regard the debate of face mask in general public, the evidence of effectiveness is quite clear #Covid19…", 
"Normalize putting on of masks. #COVID19 came to change the world order.", 
"@TwitterSafety the Honduran gov’t is lying on Twitter. Saying that they are making thousands of masks, protective v…", 
"Trump explaining that if you need a mask you can go to Walmart. Also that Costco has some great deals on caskets an…", 
"When lockdown is over... I just may add this to my “don’t forget..” along with my wallet, gloves, mask, hand saniti…", 
"Make your own mask: #covid19\n", "Please, everyone should wear a mask in public. Use whatever you can get hold of. Something is better than nothing (…", 
"@kittywuv1 So incredibly mesmerizing, even with the custom #covid19 mask!<U+0001F970><U+0001F60D><U+0001F618><U+0001F637><U+0001F497>", 
"@BeauTFC Happy to report that we’ve developed a 3-D printed mask. Passed N95 equivalent fit-test with Bitrex (surgi…", 
"On a lighter note. \n\nIt is questionable if these common surgical masks and cloth masks will protect us from…", 
"Medical workers face big mask shortage. This UF doctor came up with way to make many \n\n…", 
"Homemade face coverings. Well, I tried it didn't come out straight but it should work. <U+0001F637> #homemade #facecoverings…", 
"#covid19 In Africa, \"where are no masks, no treatment, no reanimation\", \"the same way experimental treatment for AI…", 
"@theblondeMD Happy to report that we’ve developed a 3-D printed mask. Passed N95 equivalent fit-test with Bitrex (s…", 
"I wouldn’t do a thing anyone from #China says to do. The masks they keep sending around the world are faulty, they…", 
"@TIME [covid19],important:\n1.from_air-&gt;mask-&gt;mask_reuse.\n2.from_touch-&gt;clean_hands.\n\nps1.20200328.…", 
"@3M stop selling masks to foreign companies. We WILL remember this!\n#COVID19Pandemic \n#covid19\n#N95masks", 
"Awareness for using mask by @WHO #recommendations @CMOTamilNadu #COVID19 #Corona @MoHFW_INDIA #TNHealth #CVB…", 
"@Rakshitwa @beingdumber @taapsee Nitish Kumar asked for 10 lakh N95 masks but got 50,000. Sought five lakh PPE kits…", 
"@CNN You mean the masks everyone was saying #Covit19 #COVID<U+30FC>19 #coronavirus can pass right through as per what was…", 
"2 BILLION masks = global production capacity in 2.5 MONTHS = quantity of what China imported in 5 WEEKS since Jan…", 
"@CDCgov @CDCDirector @SF_DPH Please remember those with #COPD #LungDisease #HeartDisease when requiring #masks for…", 
"If you have to go out and can’t avoid being around people, wear a mask.  Masks are a complement to social distancin…", 
"@CTVVancouver According to Dr \"doom\" Bonnie Henry, masks aren't of any use to the general public, in fact, she clai…", 
"@maddow Next time you talk about the government stating everyone needs to wear a mask ask a government official whe…", 
"Wear a mask in you are unwell or taking care of a person with suspected 2019-nCoV infection.\nInfo source: WHO…", 
"7/9 For those who need a #COVID19 mask ASAP and have no talent, time or materials to make a mask. We give you the e…", 
"jasminesade_art\nIs taking orders for masks (w/ filter pocket) \nMsg jasminesade_art if interested <U+0001F496> \n.\n.\n.\n.\n.\n.", 
"What China do to cut down the spread dramatically are only to make people stay at home and wear masks!!!!!@PHE_uk…", 
"@CNN hey i thought we were boycotting China\nthen why the Americans need Chinese masks?\ngo fuck yourself \n#BoycottChina #coronavirus", 
"@CNN @CillizzaCNN [covid19],important:\n1.from_air-&gt;mask-&gt;mask_reuse.\n2.from_touch-&gt;clean_hands.\n\nps1.20200328.…", 
"@kr3at #WearMask Everyone  !!!\n\n\nSimply  wearing a mask Nationwide ... SAVES #CZECHOSLOVAKIA…"
), retweetCount = c(1372, 9, NA, 8, 30, NA, NA, NA, NA, NA, 34, 
NA, NA, NA, NA, NA, 192, NA, NA, NA, 50, NA, 221, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, 17, 1948, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, 53, NA, 1948, NA), favorite_count = c(3488, 
23, NA, 7, 46, NA, NA, NA, NA, NA, 62, NA, NA, NA, NA, NA, 710, 
NA, NA, NA, 48, NA, 506, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
29, 4963, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 164, 
NA, 4963, NA), url = c("twitter.com/33617860/status/1245925124483809280", 
"twitter.com/1106803026/status/1245925141046935552", "twitter.com/421517829/status/1245925143479595008", 
"twitter.com/1245594213795778560/status/1245925159724171264", 
"twitter.com/2178012643/status/1245925173858975744", "twitter.com/1220529001241989120/status/1245925183010963456", 
"twitter.com/1115874631/status/1245925217790124032", "twitter.com/1243781317747077120/status/1245925225327235072", 
"twitter.com/2729830110/status/1245925273230438400", "twitter.com/1240114893178667008/status/1245925291374964736", 
"twitter.com/88875512/status/1245925292972969984", "twitter.com/1245907384993812480/status/1245925320282136576", 
"twitter.com/3431854829/status/1245925357116481536", "twitter.com/1245907384993812480/status/1245925380973871104", 
"twitter.com/1243781317747077120/status/1245925393095217152", 
"twitter.com/1230706447257751552/status/1245925541644992512", 
"twitter.com/4437322348/status/1245925779117985792", "twitter.com/1245907384993812480/status/1245925802442555392", 
"twitter.com/829633267942903808/status/1245925807211663360", 
"twitter.com/403961389/status/1245925829755969536", "twitter.com/17183161/status/1245925869010292736", 
"twitter.com/1408320152/status/1245925960550993920", "twitter.com/1245663286881902592/status/1245926011679600640", 
"twitter.com/244306637/status/1245926036321103872", "twitter.com/24327965/status/1245926059318448128", 
"twitter.com/1164222471639318528/status/1245926089068646400", 
"twitter.com/16328861/status/1245926148967727104", "twitter.com/6125082/status/1.24592618943e+18", 
"twitter.com/3685052935/status/1245926191850065920", "twitter.com/868528766355558400/status/1245926251455365120", 
"twitter.com/1223273206636851200/status/1245926283093012480", 
"twitter.com/16328861/status/1245926292274311168", "twitter.com/1160039103905390592/status/1245926310670565376", 
"twitter.com/1236738668905127936/status/1245926356468162560", 
"twitter.com/400431217/status/1245926363833532416", "twitter.com/1244269086088945664/status/1245926365116809216", 
"twitter.com/850227053139853312/status/1245926366781902848", 
"twitter.com/244314850/status/1245926393822605312", "twitter.com/1244446404178665472/status/1245926398578978816", 
"twitter.com/3184694718/status/1245926421601509376", "twitter.com/82208845/status/1245926438143807488", 
"twitter.com/1216588869530836992/status/1245926569303891968", 
"twitter.com/4770303330/status/1245926579936432128", "twitter.com/1245580876047499264/status/1245926591806361600", 
"twitter.com/904740870817120256/status/1245926610181574656", 
"twitter.com/934146138/status/1245926629022433280", "twitter.com/1223547711468777472/status/1245926703257366528", 
"twitter.com/840838036707393536/status/1245926832618131456", 
"twitter.com/1236738668905127936/status/1245926888087773184", 
"twitter.com/1230706447257751552/status/1245926935042994176"), 
    friendCount = c(1018, 326, 1205, 48, 3690, 1584, 55, 42, 
    580, 11, 3610, 13, 110, 13, 42, 382, 43, 13, 106, 4195, 599, 
    8, 89, 414, 280, 931, 5001, 1602, 1327, 227, 310, 5001, 26, 
    65, 2371, 31, 523, 228, 8, 671, 499, 1324, 333, 5, 852, 5457, 
    7, 48, 65, 382), screenNames = c("DayssiOK", "DrAmbrishMithal", 
    "LuvAminaKausar", "Sunnie09370280", "balajis", "World_In_Mins", 
    "CGTNOfficial", "a7BdaSSeyL4czNw", "ShellBell915", "remedair", 
    "RitasArtCafe", "trumpfacemasks", "SCC_OES", "trumpfacemasks", 
    "a7BdaSSeyL4czNw", "REX38225222", "e2p71828", "trumpfacemasks", 
    "lamsonlinshen", "SteveJumaaa", "patfloTO", "tenforadollar", 
    "sashir_milne", "rdesai711", "agrothey", "foreskinjim1", 
    "rover223", "scanman", "AlDubest2Evry1", "HurtadoMarleen", 
    "johnmik63542947", "rover223", "CowlSolomon", "spacetinyearth", 
    "jmegown52302", "DrPonnarasu", "pankajupa120", "JoaoNewman", 
    "LalalaHK1", "SaturniaC", "NYCMediaMix", "ToscasReturn", 
    "JamesDallas9175", "cornzal", "CEDRdigital", "NadraRae", 
    "SiluMa4", "1Wa49R41L3pVzQj", "spacetinyearth", "REX38225222"
    ), userID = c(33617860, 1106803026, 421517829, 1.24559e+18, 
    2178012643, 1.22e+18, 1115874631, 1.24e+18, 2729830110, 1.24e+18, 
    88875512, 1.24591e+18, 3431854829, 1.24591e+18, 1.24e+18, 
    1.23071e+18, 4437322348, 1.24591e+18, 8.29633e+17, 403961389, 
    17183161, 1408320152, 1.24566e+18, 244306637, 24327965, 1.16422e+18, 
    16328861, 6125082, 3685052935, 8.68529e+17, 1.22327e+18, 
    16328861, 1.16004e+18, 1.24e+18, 400431217, 1.24427e+18, 
    8.50227e+17, 244314850, 1.24445e+18, 3184694718, 82208845, 
    1.22e+18, 4770303330, 1.24558e+18, 9.04741e+17, 934146138, 
    1.22355e+18, 8.40838e+17, 1.24e+18, 1.23071e+18), language = c("en", 
    "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", 
    "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", 
    "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", 
    "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", 
    "en", "en", "en", "en", "en", "en", "en", "en", "en"), replyToScreenName = c("None", 
    "ArvinderSoin", "None", "None", "None", "World_In_Mins", 
    "None", "None", "None", "None", "None", "Read5000YrLeap", 
    "None", "jmcmaccarr", "None", "CNN", "None", "Constitution999", 
    "None", "None", "TwitterSafety", "None", "None", "None", 
    "None", "kittywuv1", "BeauTFC", "None", "None", "None", "None", 
    "theblondeMD", "None", "TIME", "3M", "None", "Rakshitwa", 
    "CNN", "None", "CDCgov", "None", "CTVVancouver", "maddow", 
    "None", "CEDRdigital", "None", "None", "CNN", "CNN", "kr3at"
    ), replyToID = c("None", "1.13442E+18", "None", "None", "None", 
    "1.22053E+18", "None", "None", "None", "None", "None", "154243839", 
    "None", "48150879", "None", "759251", "None", "1.04747E+18", 
    "None", "None", "95731075", "None", "None", "None", "None", 
    "1.21653E+18", "1.05676E+18", "None", "None", "None", "None", 
    "230792524", "None", "14293310", "378197959", "None", "9.81585E+17", 
    "759251", "None", "146569971", "None", "16313405", "16129920", 
    "None", "9.04741E+17", "None", "None", "759251", "759251", 
    "139283160"), retweetUserScreenName = c(NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
    ), retweetUserID = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), followersCount = c(1452, 
    3844, 2398, 1, 179896, 1283, 14036740, 24, 329, 3, 7133, 
    2, 1050, 2, 24, 121, 4, 2, 38, 2533, 235, 2, 5, 148, 2312, 
    265, 1572, 8067, 1265, 167, 13, 1574, 1, 2, 972, 1, 107, 
    7, 0, 73, 295, 1160, 849, 1, 7519, 1749, 0, 4, 2, 121), userMentions = c(NA, 
    "ArvinderSoin", NA, NA, NA, "3M", NA, NA, NA, NA, NA, "Read5000YrLeap", 
    NA, "jmcmaccarr", NA, "CNN", NA, "Constitution999", NA, NA, 
    "TwitterSafety", NA, NA, NA, NA, "kittywuv1", "BeauTFC", 
    NA, NA, NA, NA, "theblondeMD", NA, "TIME", "3M", "WHO", "Rakshitwa", 
    "CNN", NA, "CDCgov", NA, "CTVVancouver", "maddow", NA, NA, 
    NA, NA, "CNN", "CNN", "kr3at"), userMentionsID = c(NA, 1.13442e+18, 
    NA, NA, NA, 378197959, NA, NA, NA, NA, NA, 154243839, NA, 
    48150879, NA, 759251, NA, 1.05e+18, NA, NA, 95731075, NA, 
    NA, NA, NA, 1.21653e+18, 1.05676e+18, NA, NA, NA, NA, 230792524, 
    NA, 14293310, 378197959, 14499829, 9.81585e+17, 759251, NA, 
    146569971, NA, 16313405, 16129920, NA, NA, NA, NA, 759251, 
    759251, 139283160), hashtag1 = c("coronavirus", NA, "corona", 
    "mask", NA, "Florida", "Coronavirus", NA, "coronavirus", 
    "facemask", "coronavirus", "Boycott3M", NA, "Boycott3M", 
    NA, "WearMask", NA, NA, "Covid19", "COVID19", NA, NA, NA, 
    "covid19", NA, "covid19", NA, NA, NA, "homemade", "covid19", 
    NA, "China", NA, "COVID19Pandemic", "recommendations", NA, 
    "Covit19", NA, "COPD", NA, NA, NA, NA, "COVID19", NA, NA, 
    "BoycottChina", NA, "WearMask"), hashtag2 = c(NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA), mediatype = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), mediaURL = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA)), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -50L), spec = structure(list(
    cols = list(createdAt = structure(list(), class = c("collector_character", 
    "collector")), timestamp = structure(list(), class = c("collector_double", 
    "collector")), id_str = structure(list(), class = c("collector_double", 
    "collector")), text = structure(list(), class = c("collector_character", 
    "collector")), retweetCount = structure(list(), class = c("collector_double", 
    "collector")), favorite_count = structure(list(), class = c("collector_double", 
    "collector")), url = structure(list(), class = c("collector_character", 
    "collector")), friendCount = structure(list(), class = c("collector_double", 
    "collector")), screenNames = structure(list(), class = c("collector_character", 
    "collector")), userID = structure(list(), class = c("collector_double", 
    "collector")), language = structure(list(), class = c("collector_character", 
    "collector")), replyToScreenName = structure(list(), class = c("collector_character", 
    "collector")), replyToID = structure(list(), class = c("collector_character", 
    "collector")), retweetUserScreenName = structure(list(), class = c("collector_logical", 
    "collector")), retweetUserID = structure(list(), class = c("collector_logical", 
    "collector")), followersCount = structure(list(), class = c("collector_double", 
    "collector")), userMentions = structure(list(), class = c("collector_character", 
    "collector")), userMentionsID = structure(list(), class = c("collector_double", 
    "collector")), hashtag1 = structure(list(), class = c("collector_character", 
    "collector")), hashtag2 = structure(list(), class = c("collector_logical", 
    "collector")), mediatype = structure(list(), class = c("collector_logical", 
    "collector")), mediaURL = structure(list(), class = c("collector_logical", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

1 Answer

5 votes
/ April 19, 2020

We can create a variable from createdAt and then use group_split to get a list of data frames. Here we extract the month-and-day substring with str_replace: the pattern skips the first word and the whitespace after it, captures the next word, the whitespace, and a run of digits, and the replacement keeps only that captured group.

library(dplyr)
library(stringr)
sample_df %>% 
  mutate(month_day = str_replace(createdAt, 
           "^\\w+\\s+(\\w+\\s+\\d+).*", "\\1")) %>%
  group_split(month_day)
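
To get from that list to one CSV per day, which is what the question asks for, something like the following may work. It is only a sketch: the three-row sample_df, the tempdir() output folder, and the sample_<day>.csv naming are assumptions made for illustration.

```r
library(dplyr)
library(stringr)
library(readr)

# Small stand-in for sample_df (hypothetical rows, same createdAt layout)
sample_df <- tibble(
  createdAt = c("Fri Apr 01 04:04:32 +0000 2020",
                "Fri Apr 02 04:04:40 +0000 2020",
                "Fri Apr 02 04:04:44 +0000 2020"),
  text = c("a", "b", "c")
)

by_day <- sample_df %>%
  mutate(month_day = str_replace(createdAt,
           "^\\w+\\s+(\\w+\\s+\\d+).*", "\\1")) %>%
  group_split(month_day)

out_dir <- tempdir()  # assumption: write somewhere disposable
for (i in seq_along(by_day)) {
  day_df <- by_day[[i]]
  # Turn the group's key, e.g. "Apr 01", into a file-friendly label "Apr_01"
  day_label <- str_replace_all(day_df$month_day[1], "\\s+", "_")
  write_csv(day_df, file.path(out_dir, paste0("sample_", day_label, ".csv")))
}
```

Note that this still needs the whole data set in memory once, so on the 8 GB file it would only help after the data has been reduced or read in pieces.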

NOTE: mutate is not required, because month_day can be created on the fly inside group_split itself
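
A short sketch of that variant, assuming a small stand-in sample_df (the three rows below are made up; only the createdAt layout matches the question). The grouping expression is passed straight to group_split, which forwards it to group_by:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for sample_df
sample_df <- tibble(
  createdAt = c("Fri Apr 01 04:04:32 +0000 2020",
                "Fri Apr 02 04:04:40 +0000 2020",
                "Fri Apr 02 04:04:44 +0000 2020")
)

# month_day is computed inside group_split, no separate mutate step
by_day <- sample_df %>%
  group_split(month_day = str_replace(createdAt,
                "^\\w+\\s+(\\w+\\s+\\d+).*", "\\1"))

length(by_day)  # one list element per distinct month_day
```

With the default .keep = TRUE, each piece still carries the month_day column, which is convenient for naming output files.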
