Fastest way to add headers and remove specific columns from a large text file using R
January 24, 2019

I'm trying to add headers to a text file and, in the same pass, remove certain columns, because those columns contain whitespace that causes truncation problems later in my ETL. Since these files can be up to 16 GB, I don't want to actually load the data into R and write it back out; I doubt that is even possible given memory constraints.

Sample data, heavily simplified and reduced in size for ease of use. Copy this into a .txt file named "TargetTest.txt":

185002 ~SA     ~000620~1195~1195~000~0000~Y~A~             ~S255392488
185002 ~SA     ~000620~1195~1195~000~0000~Y~A~             ~S255392488
185002 ~SA     ~000620~1195~1195~000~0000~Y~A~             ~S255392488
185002 ~SA     ~000620~1195~1195~000~0000~Y~A~             ~S255392488
185002 ~SA     ~000620~1195~1195~000~0000~Y~A~             ~S255392488
185002 ~SA     ~000620~1195~1195~000~0000~Y~A~             ~S255392488
185002 ~SA     ~000620~1195~1195~000~0000~Y~A~             ~S255392488
185002 ~SA     ~000620~1195~1195~000~0000~Y~A~             ~S255392488
185002 ~SA     ~000620~1195~1195~000~0000~Y~A~             ~S255392488
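
For convenience, the sample file can also be created from R rather than by copy and paste. This is only a minimal sketch: writing to `tempdir()` and the names `sampleLine`, `samplePath`, and `fieldCounts` are assumptions made for illustration, not part of the original script.

```r
# One sample record, copied verbatim from the data above ("~"-delimited).
sampleLine <- "185002 ~SA     ~000620~1195~1195~000~0000~Y~A~             ~S255392488"

# Assumption: write to a temporary directory instead of the real working folder.
samplePath <- file.path(tempdir(), "TargetTest.txt")

# Nine identical rows, as in the sample.
writeLines(rep(sampleLine, 9), samplePath)

# Sanity check: count the fields in every row after splitting on the delimiter.
fieldCounts <- lengths(strsplit(readLines(samplePath), "~", fixed = TRUE))
```

Each sample row contains 10 delimiters, so `fieldCounts` should report 11 fields per row, far fewer than the full header list used below.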

I've had some luck running code from the command prompt using shell(). My code so far:

usrHeaderNames <- c("PolicyNumber", "ReportSuffix", "ReportAccount", "PlanVariationCode", "ReportCode", "FranchiseCodeOne", "FranchiseCodeTwo", "UNETRegionCode", "FinancialArrangementIndicator", "Filler10", "EmployeeID", "MemberLastName", "EmployeeSex", "EmployeeDateofBirth", "EmployeeZIPCode", "EmployeeStatus", "Filler17", "DependentNumber", "IndividualID", "MemberRelationshipCode", "MemberFirstName", "MemberDateofBirth", "MemberSex", "MedicareEligibilityIndicator", "MemberMarket", "PatientNumber", "EmployeePOSInOutofAreaInd.", "EmployeePPOInOutofAreaInd.", "MemberProductCode", "Filler30", "PHIIndicator", "Filler32", "ClaimReferenceNumber", "DateProcessed", "ElectronicBillingIndicator", "ClaimsOfficeNumber", "TransactionCode", "DateClaimReceived", "ClaimAdjusterNumber", "ProcessingOfficeNumber", "UniqueCheckIdentifier", "TransactionType", "StateTaxEligibilityIndicator", "DocumentControlSerialNumber", "FilmingOfficeNumber", "ProviderType", "ProviderFullName", "ProviderTaxIDPrefix", "ProviderTaxID", "ProviderTaxIDSuffix", "ProviderIPA", "PremiumProviderDerivedBenefitTierLevelIndicator", "ProviderZIPCode", "ProviderSpecialtyCode", "PremiumProviderIndicator", "ProviderMarket", "MPIN", "ProviderNetworkParticipatingInd.", "CoveringPhysicianIndicator", "Filler60", "NationalDrugCode", "Filler62", "CauseCode", "DischargeStatus", "Filler65", "PlaceofService", "ServiceCode", "ServiceCodeModifier", "ProcedureModifier2", "DateofServiceFrom", "DateofServiceTo", "ServiceCount", "Filler73", "CapitatedEncounterIndicator", "HospitalDRG", "Filler76", "BilledAmount", "NotCoveredAmount", "RemarkCode", "ChargeLevelRemarkCode", "Filler81", "ReconsideredNotCoveredAmount", "ReconsiderationRemarkCode", "ClaimLevelRemarkCode", "Filler85", "BenefitsLimitations", "DiscountAmount", "DiscountType", "ProviderContractType", "AllowableExpense", "Deductible", "Copay", "Coinsurance", "GrossBenefitsPayable", "OtherInsuranceAmount", "OtherInsuranceIndicator", "OtherInsuranceType", 
"MiscellaneousReductionsAmount", "NetPaid", "BenefitPlanComplianceIndicator", "PayeeType", "TaxRecordIndicator", "Out-of-PocketOffsetAmount", "ClaimStatusCode", "OverrideCode", "ServiceOrder", "PayoutSummaryCategory", "Filler108", "CheckSuppressionIndicator", "PCPTaxIDPrefix", "PCPTaxID", "PCPTaxIDSuffix", "ProviderClassificationCode", "RevenueCode-1", "RevenueCode-2", "RevenueCode-3", "RevenueCode-4", "RevenueCode-5", "RevenueCode-6", "RevenueCode-7", "Fillerreservedarea", "HRAAmount", "NPINumber", "PrimaryDiagnosis", "SecondaryDiagnosis", "TertiaryDiagnosis", "ICD-10INDICATOR", "RevenueCode-8", "RevenueCode-9", "RevenueCode-10", "RevenueCode-11", "RevenueCode-12", "RevenueCode-13", "RevenueCode-14", "RevenueCode-15", "RevenueCode-16", "RevenueCode-17", "RevenueCode-18", "RevenueCode-19", "RevenueCode-20", "RevenueCodeCount1", "RevenueCodeCount2", "RevenueCodeCount3", "RevenueCodeCount4", "RevenueCodeCount5", "RevenueCodeCount6", "RevenueCodeCount7", "RevenueCodeCount8", "RevenueCodeCount9", "RevenueCodeCount10", "RevenueCodeCount11", "RevenueCodeCount12", "RevenueCodeCount13", "RevenueCodeCount14", "RevenueCodeCount15", "RevenueCodeCount16", "RevenueCodeCount17", "RevenueCodeCount18", "RevenueCodeCount19", "RevenueCodeCount20", "RevenueSourceChargeAmt1", "RevenueSourceChargeAmt2", "RevenueSourceChargeAmt3", "RevenueSourceChargeAmt4", "RevenueSourceChargeAmt5", "RevenueSourceChargeAmt6", "RevenueSourceChargeAmt7", "RevenueSourceChargeAmt8", "RevenueSourceChargeAmt9", "RevenueSourceChargeAmt10", "RevenueSourceChargeAmt11", "RevenueSourceChargeAmt12", "RevenueSourceChargeAmt13", "RevenueSourceChargeAmt14", "RevenueSourceChargeAmt15", "RevenueSourceChargeAmt16", "RevenueSourceChargeAmt17", "RevenueSourceChargeAmt18", "RevenueSourceChargeAmt19", "RevenueSourceChargeAmt20", "PrimarySurgicalProcedureCode", "SecondarySurgicalProcedureCode", "TertiarySurgicalProcedureCode", "RESERVED")
usrCompleteFolderPath <- "C:/Users/pboswell/Downloads/"
usrColumnIgnore <- c("1", "1", "1", "0", "1", "1", "1", "1", "1", "1", "0", "0", "1", "1", "0", "1", "1", "1", "1", "0", "0", "0", "0", "1", "1", "0", "1", "1", "1", "1", "1", "1", "0", "1", "1", "1", "1", "0", "1", "1", "1", "1", "1", "1", "1", "0", "0", "1", "0", "1", "1", "1", "0", "0", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "0", "0", "0", "0", "0", "0", "1", "1", "1", "0", "1", "0", "1", "1", "1", "1", "1", "1", "1", "1", "1", "0", "0", "1", "0", "0", "0", "0", "1", "0", "0", "1", "1", "0", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "0", "0", "0", "0", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1")
usrWorkingFileName <- "TargetTest.txt"
vcColumnIndex <- which(usrColumnIgnore==0)
vcDelimiter <- "~"
vcOuputFileName <- "TargetTestNew.txt"




cmdScript <- paste0(
  "cd ",gsub("/","\\\\",usrCompleteFolderPath)
  ," && "
  ,"echo ",paste0(usrHeaderNames[vcColumnIndex],collapse=vcDelimiter)," > ",vcOuputFileName
  ," && "
  ,"for /f \"tokens=",paste0(vcColumnIndex,collapse=",")," delims=~\" %1 in (",usrWorkingFileName,") DO echo ",paste0("%",paste0(seq.int(vcColumnIndex),collapse=paste0(vcDelimiter,"%")))," >> ",vcOuputFileName
)

For the header part, I was able to create a header file easily using echo ____ > ____:

shell(paste0("echo ",paste0(usrHeaderNames[vcColumnIndex],collapse=vcDelimiter)," > ",vcOuputFileName))

and then append the actual data using type ____ >> ____:

shell(paste0("type ",usrWorkingFileName," >> ",vcOuputFileName))

but I thought I could combine the steps and append only the columns I need to the header file, using the FOR /F ["options"] %%parameter IN ("Text string to process") DO command approach:

shell(paste0("for /f \"tokens=",paste0(vcColumnIndex,collapse=",")," delims=~\" %1 in (",usrWorkingFileName,") DO echo ",paste0("%",paste0(seq.int(vcColumnIndex),collapse=paste0(vcDelimiter,"%")))," >> ",vcOuputFileName))

But this uses a for loop, which is terrible for large datasets. The original header/append method took 2 minutes on 500 MB, whereas the new for-loop method had still not finished after 30 minutes (I cancelled the process).
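
For comparison, the header-then-append step (without any column filtering) can be sketched in base R alone, with no call to shell(). The file names, the placeholder header names, and the use of `tempdir()` below are assumptions for illustration. To my understanding `file.append()` copies the file contents internally rather than reading them into an R vector, but how it performs on multi-gigabyte files is untested here.

```r
# Hypothetical stand-ins for the real inputs, written to a temp directory.
workDir    <- tempdir()
inputFile  <- file.path(workDir, "TargetTest.txt")
outputFile <- file.path(workDir, "TargetTestNew.txt")
writeLines(c("a~b~c", "d~e~f"), inputFile)

headerNames <- c("ColOne", "ColTwo", "ColThree")  # placeholder names
delimiter   <- "~"

# Step 1: write the header line (the equivalent of `echo ... > file`).
writeLines(paste0(headerNames, collapse = delimiter), outputFile)

# Step 2: append the data file to it (the equivalent of `type ... >> file`).
# file.append() returns TRUE on success.
appended <- file.append(outputFile, inputFile)

result <- readLines(outputFile)
```

This keeps every column, so it only mirrors the fast two-step method and does not solve the column-removal part.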

1) Do I have to use a for loop to do this on Windows?

2) Does the Linux awk or cut command perform better (i.e. in batch mode)? If so, is there a Windows port that I can use and that will work when run from R code?

3) Can this be done with a different method, for example a find-and-replace on the whitespace after the file has been written?
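
One way to picture a different method is to stream the file through R in chunks, so that only a limited number of lines is in memory at any time. The following is a rough sketch under stated assumptions, not a benchmarked solution: `filterColumns()`, `chunkSize`, and `trim` are hypothetical names introduced here, the file is assumed to have one record per line with no quoted delimiters, and whether this is fast enough for a 16 GB file has not been measured.

```r
# Stream a delimited file in chunks, keep selected columns, and write a header.
# filterColumns(), chunkSize and trim are hypothetical names for this sketch.
filterColumns <- function(inputFile, outputFile, headerNames, keepIndex,
                          delimiter = "~", chunkSize = 100000L, trim = FALSE) {
  inCon  <- file(inputFile, open = "r")
  outCon <- file(outputFile, open = "w")
  on.exit({ close(inCon); close(outCon) }, add = TRUE)

  # Header line for the kept columns only.
  writeLines(paste0(headerNames[keepIndex], collapse = delimiter), outCon)

  repeat {
    # Read at most chunkSize lines; an empty result means end of file.
    chunk <- readLines(inCon, n = chunkSize)
    if (length(chunk) == 0L) break
    fields <- strsplit(chunk, delimiter, fixed = TRUE)
    kept <- vapply(fields, function(f) {
      v <- f[keepIndex]
      v[is.na(v)] <- ""            # fields missing from a short row become empty
      if (trim) v <- trimws(v)     # optionally strip the padding whitespace
      paste0(v, collapse = delimiter)
    }, character(1))
    writeLines(kept, outCon)
  }
  invisible(outputFile)
}

# Usage on a small stand-in file (first four fields of the sample record).
inFile  <- file.path(tempdir(), "TargetTest.txt")
outFile <- file.path(tempdir(), "TargetTestNew.txt")
writeLines(rep("185002 ~SA     ~000620~1195", 3), inFile)
filterColumns(inFile, outFile,
              headerNames = c("PolicyNumber", "ReportSuffix",
                              "ReportAccount", "PlanVariationCode"),
              keepIndex = c(1, 4), chunkSize = 2L, trim = TRUE)
out <- readLines(outFile)
```

With `trim = TRUE` the padding is stripped from the kept fields as well, which touches on the find-and-replace idea in question 3; leave it at `FALSE` to pass the kept columns through unchanged.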

...