Linux command to remove word wrap (embedded line breaks) from an entire CSV file
1 vote
/ April 22, 2020

This is my sample data in a CSV file. As you can see, the record with ID = '51126' has a column whose data is word-wrapped across several lines. The data was entered using Alt + Enter. I need to remove the line breaks so that each record sits on a single line, for the entire CSV file. There are many such line breaks in the file!

ID,OPPORTUNITY ID,CREATED_DATE,TIR NAME,MS Rep,SRC_SSR_REP,REGION,HP PBM NAME,COMPANY NAME,COMPANY ADDRESS,COMPANY CITY,COMPANY STATE,COMPANY ZIPCODE,COMPANY AMID,COMPANY USER CONTACT NAME,COMPANY USER TITLE,COMPANY USER PHONE,COMPANY USER EMAIL,PARTNER COMPANY NAME,PARTNER REP NAME,PARTNER REP EMAIL,PARTNER LID,WHOLESALER,PURCHASEDGE AC NUMBER,USAGE PERIOD,DEAL TYPE,CLWB WORKED ON,DEAL NUMBER,NAMED TERRITORY SLED,MONO HP SHARE %,COLOR HP SHARE %,TOTAL HP TONER SHARE %,DEAL VALUE MONO,DEAL VALUE COLOR,TOTAL TONER DEAL VALUE,EST DISCOUNT VALUE,REBATE TYPE MONO,REBATE TYPE COLOR,DISCOUNT TYPE,DEAL START DATE,DEAL END DATE,DEAL EXTENDED END DATE,DEAL POSITION,ECLIPSE ID,ECLIPSE DEAL STATUS,ECLIPSE APPROVED DATE,ECLIPSE DEAL APPROVED BY,LOST REASON,USAGE FILE LOCATION,CREARTED BY,MODIFIED BY,MODIFIED DATE,FINALISATION_RECEIVED_DATE,FINALISATION_WORKED_DATE,DEAL_PROCESSED_BY,DEAL_FINALISED_BY,FUNNEL_COMMENT,AV_SENT_DATE,PL_REMAN_VALUE,PL_REMAN_SHARE,FINALISATION_DOC_PATH,TIME ELAPSING ON,APPROVAL SENT DATE,APPROVAL RECEIVED DATE,SECONDARY_WHOLESALER,PREVIOUSECLIPSE_ID,PurchasEdge_(Y/N),HP_TONER_UNITS,PL_REMAN_UNITS,FINALISATION_COMMENTS,RENEWAL_POSITION,PROGRAM_NAME,CUSTOMERONBOARDEDON
51128,OPP-048699,3/23/2020 21:02,Adam Dohm,Cheryl Glenn,Tiffany Debose,MARKET SOURCE,,"Flathead Valley School District (Kalispell, Whitefish, Columbia Falls)",233 1st Ave E,Kalispell,MT,59901,,Joe Biangone,Purchasing,406-758-8392,biangonej@sd5.k12.mt.us,TONERPORT INCORPORATED, ,,10293955,ESSENDANT,,12 months,Renewal,,CL091515474R4-A,SLED,97,100,98,21592,16781,38373,2452,Defend,Defend,Defend,4/15/2020 0:00,4/14/2021 0:00,4/14/2021 0:00,Won,42921984,,,,,/E/Data/Funnel/Submit/FLATHEAD VALLEY SCHOOL DISTRICT USAGE_51128.xlsx,Tiffany Debose,Tiffany Debose,3/26/2020 14:49,3/26/2020 0:00,,Bhavana P V,,,,613.97,1.6,,,,,NA,42085906,N,179,3,3/26 - Deal added on eclipse ,,SMBA,
51126,OPP-048697,3/23/2020 19:52,Xavier Weems,,Tiffany Debose,EAST,Vladimir Jaksic,"Gray Television, Inc.","​Gray Television, Inc.
4370 Peachtree Rd, NE.
​Atlanta, Ga  30319
​

",,GA,30319,DN042973875,Dottie Boudreau,Manager,404-266-8333,dottie@gray.tv,"STAPLES, INC", ,,"10264576,10252948",NA,,12 months,New,,CL200351126,Commercial - Named,84,89,86,16143,7335,23478,3149,Defend,Defend,Defend,,,,AV summary and PPT sent,,,,,,"/E/Data/Funnel/Submit/GRAY TELEVISION, INC USAGE_51126.xlsb",Tiffany Debose,Tiffany Debose,3/26/2020 8:55,,,Deepthi K,,,3/26/2020 0:00,3239.96,13.8,,6/24/2020 0:00,,,NA,,N,168,27,3/24/2020 - sent for specialist approval 3/26/2020 - aV sent,,MCBigDeal,
51125,OPP-048696,3/23/2020 18:01,Xavier Weems,,Tiffany Debose,WEST,Jenni HoGlin,STURM FINANCIAL GROUP,3033 East First Avenue,Denver,CO,80206,,,,,,"STAPLES, INC", ,,"10264576,10252948",NA,,12 months,New,,CL200351125,Commercial - Non Named,42,87,65,10201,14198,24399,6369,Winback,Defend,Winback,,,,AV summary and PPT sent,,,,,,/E/Data/Funnel/Submit/STURM FINANCIAL GROUP USAGE_51125.xlsx,Tiffany Debose,Tiffany Debose,3/24/2020 7:49,,,Teja Ravi,,,3/24/2020 0:00,8417.66,34.5,,6/22/2020 0:00,,,NA,,N,127,67,3/24-AV Summary and PPT sent,,SMBA,

The output should look like the example below. I have shown only ID = 51126 and 51125 for reference; 51128 would be there as well. There are 73 columns!

"ID","OPPORTUNITY ID","CREATED_DATE","TIR NAME","MS Rep","SRC_SSR_REP","REGION","HP PBM NAME","COMPANY NAME","COMPANY ADDRESS","COMPANY CITY","COMPANY STATE","COMPANY ZIPCODE","COMPANY AMID","COMPANY USER CONTACT NAME","COMPANY USER TITLE","COMPANY USER PHONE","COMPANY USER EMAIL","PARTNER COMPANY NAME","PARTNER REP NAME","PARTNER REP EMAIL","PARTNER LID","WHOLESALER","PURCHASEDGE AC NUMBER","USAGE PERIOD","DEAL TYPE","CLWB WORKED ON","DEAL NUMBER","NAMED TERRITORY SLED","MONO HP SHARE %","COLOR HP SHARE %","TOTAL HP TONER SHARE %","DEAL VALUE MONO","DEAL VALUE COLOR","TOTAL TONER DEAL VALUE","EST DISCOUNT VALUE","REBATE TYPE MONO","REBATE TYPE COLOR","DISCOUNT TYPE","DEAL START DATE","DEAL END DATE","DEAL EXTENDED END DATE","DEAL POSITION","ECLIPSE ID","ECLIPSE DEAL STATUS","ECLIPSE APPROVED DATE","ECLIPSE DEAL APPROVED BY","LOST REASON","USAGE FILE LOCATION","CREARTED BY","MODIFIED BY","MODIFIED DATE","FINALISATION_RECEIVED_DATE","FINALISATION_WORKED_DATE","DEAL_PROCESSED_BY","DEAL_FINALISED_BY","FUNNEL_COMMENT","AV_SENT_DATE","PL_REMAN_VALUE","PL_REMAN_SHARE","FINALISATION_DOC_PATH","TIME ELAPSING ON","APPROVAL SENT DATE","APPROVAL RECEIVED DATE","SECONDARY_WHOLESALER","PREVIOUSECLIPSE_ID","PurchasEdge_(Y/N)","HP_TONER_UNITS","PL_REMAN_UNITS","FINALISATION_COMMENTS","RENEWAL_POSITION","PROGRAM_NAME","CUSTOMERONBOARDEDON"
"51126","OPP-048697","3/23/2020 19:52",Xavier Weems","","Tiffany Debose","EAST","Vladimir Jaksic","Gray Television, Inc.","​Gray Television, Inc. 4370 Peachtree Rd, NE. Atlanta, Ga  30319","","GA","30319","DN042973875","Dottie Boudreau","Manager","404-266-8333","dottie@gray.tv","STAPLES, INC","","","10264576,10252948","NA","","12 months","New","","CL200351126","Commercial - Named","84","89","86","16143","7335","23478","3149","Defend","Defend","Defend","","","","AV summary and PPT sent","","","","","","/E/Data/Funnel/Submit/GRAY TELEVISION, INC USAGE_51126.xlsb","Tiffany Debose","Tiffany Debose","3/26/2020 8:55","","","Deepthi K","","","3/26/2020 0:00","3239.96","13.8","","6/24/2020 0:00","","","NA","","N","168","27","3/24/2020 - sent for specialist approval 3/26/2020 - aV sent","","MCBigDeal",""
"51125","OPP-048696","3/23/2020 18:01","Xavier Weems","","Tiffany Debose","WEST","Jenni HoGlin","STURM FINANCIAL GROUP","3033 East First Avenue","Denver","CO","80206","","","","","","STAPLES, INC","","","10264576,10252948","NA","","12 months","New","","CL200351125","Commercial - Non Named","42","87","65","10201","14198","24399","6369","Winback","Defend","Winback","","","","AV summary and PPT sent","","","","","","/E/Data/Funnel/Submit/STURM FINANCIAL GROUP USAGE_51125.xlsx","Tiffany Debose","Tiffany Debose","3/24/2020 7:49","","","Teja Ravi","","","3/24/2020 0:00","8417.66","34.5","","6/22/2020 0:00","","","NA","","N","127","67","3/24-AV Summary and PPT sent","","SMBA",""

I tried the code below to remove the line breaks:

awk -F '"[^"]+"' 'NF<73{s = s $0; next} s{print s; s=""} 1; END{if (s) print s}' file

and also

awk -F, 'NF!=73&&!line{line=$0;next} NF!=73&&line{line=line $0} {n=split(line, a, ",")} n==73{print line;line=""}' file.csv

Nothing actually seems to work!

Please suggest Linux code that does not use external Unix packages.

Answers [ 3 ]

2 votes
/ April 22, 2020

Try this:

gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }'  

Demo

$ gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' < file1.txt
ID,OPPORTUNITY ID,CREATED_DATE,TIR NAME,MS Rep,SRC_SSR_REP,REGION,HP PBM NAME,COMPANY NAME,COMPANY ADDRESS,COMPANY CITY,COMPANY STATE,COMPANY ZIPCODE,COMPANY AMID,COMPANY USER CONTACT NAME,COMPANY USER TITLE,COMPANY USER PHONE,COMPANY USER EMAIL,PARTNER COMPANY NAME,PARTNER REP NAME,PARTNER REP EMAIL,PARTNER LID,WHOLESALER,PURCHASEDGE AC NUMBER,USAGE PERIOD,DEAL TYPE,CLWB WORKED ON,DEAL NUMBER,NAMED TERRITORY SLED,MONO HP SHARE %,COLOR HP SHARE %,TOTAL HP TONER SHARE %,DEAL VALUE MONO,DEAL VALUE COLOR,TOTAL TONER DEAL VALUE,EST DISCOUNT VALUE,REBATE TYPE MONO,REBATE TYPE COLOR,DISCOUNT TYPE,DEAL START DATE,DEAL END DATE,DEAL EXTENDED END DATE,DEAL POSITION,ECLIPSE ID,ECLIPSE DEAL STATUS,ECLIPSE APPROVED DATE,ECLIPSE DEAL APPROVED BY,LOST REASON,USAGE FILE LOCATION,CREARTED BY,MODIFIED BY,MODIFIED DATE,FINALISATION_RECEIVED_DATE,FINALISATION_WORKED_DATE,DEAL_PROCESSED_BY,DEAL_FINALISED_BY,FUNNEL_COMMENT,AV_SENT_DATE,PL_REMAN_VALUE,PL_REMAN_SHARE,FINALISATION_DOC_PATH,TIME ELAPSING ON,APPROVAL SENT DATE,APPROVAL RECEIVED DATE,SECONDARY_WHOLESALER,PREVIOUSECLIPSE_ID,PurchasEdge_(Y/N),HP_TONER_UNITS,PL_REMAN_UNITS,FINALISATION_COMMENTS,RENEWAL_POSITION,PROGRAM_NAME,CUSTOMERONBOARDEDON
51128,OPP-048699,3/23/2020 21:02,Adam Dohm,Cheryl Glenn,Tiffany Debose,MARKET SOURCE,,"Flathead Valley School District (Kalispell, Whitefish, Columbia Falls)",233 1st Ave E,Kalispell,MT,59901,,Joe Biangone,Purchasing,406-758-8392,biangonej@sd5.k12.mt.us,TONERPORT INCORPORATED, ,,10293955,ESSENDANT,,12 months,Renewal,,CL091515474R4-A,SLED,97,100,98,21592,16781,38373,2452,Defend,Defend,Defend,4/15/2020 0:00,4/14/2021 0:00,4/14/2021 0:00,Won,42921984,,,,,/E/Data/Funnel/Submit/FLATHEAD VALLEY SCHOOL DISTRICT USAGE_51128.xlsx,Tiffany Debose,Tiffany Debose,3/26/2020 14:49,3/26/2020 0:00,,Bhavana P V,,,,613.97,1.6,,,,,NA,42085906,N,179,3,3/26 - Deal added on eclipse ,,SMBA,
51126,OPP-048697,3/23/2020 19:52,Xavier Weems,,Tiffany Debose,EAST,Vladimir Jaksic,"Gray Television, Inc.","​Gray Television, Inc.4370 Peachtree Rd, NE.​Atlanta, Ga  30319​",,GA,30319,DN042973875,Dottie Boudreau,Manager,404-266-8333,dottie@gray.tv,"STAPLES, INC", ,,"10264576,10252948",NA,,12 months,New,,CL200351126,Commercial - Named,84,89,86,16143,7335,23478,3149,Defend,Defend,Defend,,,,AV summary and PPT sent,,,,,,"/E/Data/Funnel/Submit/GRAY TELEVISION, INC USAGE_51126.xlsb",Tiffany Debose,Tiffany Debose,3/26/2020 8:55,,,Deepthi K,,,3/26/2020 0:00,3239.96,13.8,,6/24/2020 0:00,,,NA,,N,168,27,3/24/2020 - sent for specialist approval 3/26/2020 - aV sent,,MCBigDeal,
51125,OPP-048696,3/23/2020 18:01,Xavier Weems,,Tiffany Debose,WEST,Jenni HoGlin,STURM FINANCIAL GROUP,3033 East First Avenue,Denver,CO,80206,,,,,,"STAPLES, INC", ,,"10264576,10252948",NA,,12 months,New,,CL200351125,Commercial - Non Named,42,87,65,10201,14198,24399,6369,Winback,Defend,Winback,,,,AV summary and PPT sent,,,,,,/E/Data/Funnel/Submit/STURM FINANCIAL GROUP USAGE_51125.xlsx,Tiffany Debose,Tiffany Debose,3/24/2020 7:49,,,Teja Ravi,,,3/24/2020 0:00,8417.66,34.5,,6/22/2020 0:00,,,NA,,N,127,67,3/24-AV Summary and PPT sent,,SMBA,
$
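
The one-liner above depends on gawk's RT variable. As a rough sketch of a similar idea for any POSIX awk, one could instead accumulate physical lines until the running count of double quotes is even. The function name join_quoted_newlines is made up for illustration, and replacing each embedded newline with a space (rather than nothing) is an assumption taken from the desired output in the question, not from this answer.

```shell
# join_quoted_newlines: hypothetical helper, not part of the original answer.
# Joins physical lines that belong to one CSV record (POSIX awk only).
join_quoted_newlines() {
    awk '
    {
        # Append this physical line to the pending record; a single space
        # stands in for the embedded newline (an assumption, see above).
        rec = (pending ? rec " " $0 : $0)

        # gsub returns how many quotes it replaced, so this counts them
        # on a scratch copy of the line.
        line = $0
        quotes += gsub(/"/, "", line)

        if (quotes % 2 == 0) {      # balanced quotes: the record is complete
            print rec
            pending = 0
            quotes = 0
        } else {                    # odd count: still inside a quoted field
            pending = 1
        }
    }
    END { if (pending) print rec }  # flush a record that never closed
    ' "$@"
}
```

It could be called as join_quoted_newlines file.csv > joined.csv. An escaped quote ("") counts twice, so it does not disturb the parity check; each blank line inside a quoted field turns into one extra space.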

1 vote
/ April 22, 2020

Assuming the newline character both inside fields and at the end of each record is \n (if it were \n inside fields and \r\n at the end of each record, as in an MS-Excel export, this would be trivial), the following uses GNU awk for several extensions (multi-char RS, RT, FPAT, and \s).

This will join the lines:

awk -v RS='"[^"]+"' -v ORS= '{
    gsub(/\n/,"",RT)
    print $0 RT
}'

This will strip leading/trailing whitespace and wrap every field in quotes:

awk -v FPAT='[^,]*|"[^"]+"' -v OFS=',' '{
    for (i=1;i<=NF;i++) {
        gsub(/^"?\s*|\s*"?$/,"",$i)
        printf "\"%s\"%s", $i, (i<NF ? OFS : ORS)
    }
}'

You can then simply use them together in a pipe:

$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n/,"",RT); print $0 RT}' file |
    awk -v FPAT='[^,]*|"[^"]+"' -v OFS=',' '{for (i=1;i<=NF;i++) {gsub(/^"?\s*|\s*"?$/,"",$i); printf "\"%s\"%s", $i, (i<NF ? OFS : ORS)} }'
"ID","OPPORTUNITY ID","CREATED_DATE","TIR NAME","MS Rep","SRC_SSR_REP","REGION","HP PBM NAME","COMPANY NAME","COMPANY ADDRESS","COMPANY CITY","COMPANY STATE","COMPANY ZIPCODE","COMPANY AMID","COMPANY USER CONTACT NAME","COMPANY USER TITLE","COMPANY USER PHONE","COMPANY USER EMAIL","PARTNER COMPANY NAME","PARTNER REP NAME","PARTNER REP EMAIL","PARTNER LID","WHOLESALER","PURCHASEDGE AC NUMBER","USAGE PERIOD","DEAL TYPE","CLWB WORKED ON","DEAL NUMBER","NAMED TERRITORY SLED","MONO HP SHARE %","COLOR HP SHARE %","TOTAL HP TONER SHARE %","DEAL VALUE MONO","DEAL VALUE COLOR","TOTAL TONER DEAL VALUE","EST DISCOUNT VALUE","REBATE TYPE MONO","REBATE TYPE COLOR","DISCOUNT TYPE","DEAL START DATE","DEAL END DATE","DEAL EXTENDED END DATE","DEAL POSITION","ECLIPSE ID","ECLIPSE DEAL STATUS","ECLIPSE APPROVED DATE","ECLIPSE DEAL APPROVED BY","LOST REASON","USAGE FILE LOCATION","CREARTED BY","MODIFIED BY","MODIFIED DATE","FINALISATION_RECEIVED_DATE","FINALISATION_WORKED_DATE","DEAL_PROCESSED_BY","DEAL_FINALISED_BY","FUNNEL_COMMENT","AV_SENT_DATE","PL_REMAN_VALUE","PL_REMAN_SHARE","FINALISATION_DOC_PATH","TIME ELAPSING ON","APPROVAL SENT DATE","APPROVAL RECEIVED DATE","SECONDARY_WHOLESALER","PREVIOUSECLIPSE_ID","PurchasEdge_(Y/N)","HP_TONER_UNITS","PL_REMAN_UNITS","FINALISATION_COMMENTS","RENEWAL_POSITION","PROGRAM_NAME","CUSTOMERONBOARDEDON"
"51128","OPP-048699","3/23/2020 21:02","Adam Dohm","Cheryl Glenn","Tiffany Debose","MARKET SOURCE","","Flathead Valley School District (Kalispell, Whitefish, Columbia Falls)","233 1st Ave E","Kalispell","MT","59901","","Joe Biangone","Purchasing","406-758-8392","biangonej@sd5.k12.mt.us","TONERPORT INCORPORATED","","","10293955","ESSENDANT","","12 months","Renewal","","CL091515474R4-A","SLED","97","100","98","21592","16781","38373","2452","Defend","Defend","Defend","4/15/2020 0:00","4/14/2021 0:00","4/14/2021 0:00","Won","42921984","","","","","/E/Data/Funnel/Submit/FLATHEAD VALLEY SCHOOL DISTRICT USAGE_51128.xlsx","Tiffany Debose","Tiffany Debose","3/26/2020 14:49","3/26/2020 0:00","","Bhavana P V","","","","613.97","1.6","","","","","NA","42085906","N","179","3","3/26 - Deal added on eclipse","","SMBA",""
"51126","OPP-048697","3/23/2020 19:52","Xavier Weems","","Tiffany Debose","EAST","Vladimir Jaksic","Gray Television, Inc.","Gray Television, Inc.4370 Peachtree Rd, NE.​Atlanta, Ga  30319","","GA","30319","DN042973875","Dottie Boudreau","Manager","404-266-8333","dottie@gray.tv","STAPLES, INC","","","10264576,10252948","NA","","12 months","New","","CL200351126","Commercial - Named","84","89","86","16143","7335","23478","3149","Defend","Defend","Defend","","","","AV summary and PPT sent","","","","","","/E/Data/Funnel/Submit/GRAY TELEVISION, INC USAGE_51126.xlsb","Tiffany Debose","Tiffany Debose","3/26/2020 8:55","","","Deepthi K","","","3/26/2020 0:00","3239.96","13.8","","6/24/2020 0:00","","","NA","","N","168","27","3/24/2020 - sent for specialist approval 3/26/2020 - aV sent","","MCBigDeal",""
"51125","OPP-048696","3/23/2020 18:01","Xavier Weems","","Tiffany Debose","WEST","Jenni HoGlin","STURM FINANCIAL GROUP","3033 East First Avenue","Denver","CO","80206","","","","","","STAPLES, INC","","","10264576,10252948","NA","","12 months","New","","CL200351125","Commercial - Non Named","42","87","65","10201","14198","24399","6369","Winback","Defend","Winback","","","","AV summary and PPT sent","","","","","","/E/Data/Funnel/Submit/STURM FINANCIAL GROUP USAGE_51125.xlsx","Tiffany Debose","Tiffany Debose","3/24/2020 7:49","","","Teja Ravi","","","3/24/2020 0:00","8417.66","34.5","","6/22/2020 0:00","","","NA","","N","127","67","3/24-AV Summary and PPT sent","","SMBA",""

Otherwise, see "What's the most robust way to efficiently parse CSV using awk?" for how to do what you want with a single call to any awk.
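
Short of the full parser in the linked question, the quoting step on its own can also be approximated without FPAT or \s. The following is only a minimal sketch for POSIX awk: it assumes every record already sits on one line (for example, after the joining command above), and the names requote_csv and trim are illustrative, not taken from the linked question.

```shell
# requote_csv: hypothetical helper; wraps every CSV field in double quotes.
# Assumes one complete record per input line.
requote_csv() {
    awk '
    # Strip leading and trailing blanks and tabs from one field.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

    {
        out = ""; field = ""; inq = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "\"") {
                if (inq && substr($0, i + 1, 1) == "\"") {
                    field = field "\"\""   # escaped quote: keep it as data
                    i++
                } else {
                    inq = !inq             # opening or closing quote: drop it
                }
            } else if (c == "," && !inq) {
                out = out "\"" trim(field) "\","   # comma outside quotes ends a field
                field = ""
            } else {
                field = field c
            }
        }
        print out "\"" trim(field) "\""    # last field has no trailing comma
    }' "$@"
}
```

For instance, the input line 1,"Gray, Inc.", ,x, would come out as "1","Gray, Inc.","","x","". Walking the record one character at a time is slower than FPAT but keeps commas inside quoted fields intact in any awk.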

0 votes
/ April 22, 2020

This might work for you (GNU sed):

sed -E ':a;N;s/^([^"]*("[^"]*"[^"]*)*"[^"\n]*)\n/\1/;ta;P;D' file |
sed -E ':a;s/^([^"]*("[^"]*"[^"]*)*"[^",]*),/\1\n/;ta;s/"//g;s/[^,]*/"&"/g;y/\n/,/'

The solution has two parts:

  1. Remove all newlines between double quotes.
  2. Surround every comma-separated field with double quotes.

The first sed invocation appends following lines (removing the intervening newlines) until the line contains a balanced set of double quotes. The first such line is then printed, and the remainder is processed together with the next line, until all lines have been printed.

The second invocation replaces any commas inside double quotes with newlines, then removes all double quotes and wraps every comma-free field in double quotes. Finally, the newlines are replaced with commas.
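
Whichever pipeline is used, a quick sanity check is to confirm that every output record has the 73 columns mentioned in the question. The following is a small sketch in POSIX awk, not part of this answer; report_field_counts is a hypothetical helper name, and it assumes each record is already on a single line.

```shell
# report_field_counts: hypothetical helper; lists records whose field count
# differs from the expected number. Usage: report_field_counts 73 [file...]
report_field_counts() {
    expected=$1
    shift
    awk -v expected="$expected" '
    {
        n = 1; inq = 0
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (c == "\"") inq = !inq          # an escaped "" toggles twice, net zero
            else if (c == "," && !inq) n++     # only unquoted commas separate fields
        }
        if (n != expected) printf "record %d: %d fields\n", NR, n
    }' "$@"
}
```

Piping the final output through report_field_counts 73 should print nothing if every record has 73 fields; any line it does print points at a record that was joined or split incorrectly.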

...