gsub: keep the string up to the first number, with upper- and lowercase input
0 votes
/ January 28, 2019

I want to remove everything after the first number. My data looks like this:

[1] NA                                   "ITEM 1. BUSINESS"                  
[3] "ITEM 1A. RISK FACTORS"              "ITEM 1B. UNRESOLVED STAFF COMMENTS"
[5] "ITEM 2. PROPERTIES"                 "ITEM 3. LEGAL PROCEEDINGS"       

I am trying to keep only the item label, so that I end up with:

NA           ITEM1
ITEM1A      ITEM1B
ITEM2       ITEM3

(or even keeping the space, as in ITEM 1, ITEM 2, etc.)

I tried the following without success:

x <- toupper(x)
x <- gsub("[^[:alnum:][:space:]]","", x)
x <- gsub(" ", "", x)
x <- substr(x, start = 1, stop = 7)
x <- gsub("\\[digits]*","", x)

I also tried:

y <- str_extract(x, "Item")
y <- str_extract(toupper(words$item), "ITEM")

Data:

c(NA, "ITEM 1. BUSINESS", "ITEM 1A. RISK FACTORS", "ITEM 1B. UNRESOLVED STAFF COMMENTS", 
"ITEM 2. PROPERTIES", "ITEM 3. LEGAL PROCEEDINGS", "ITEM 4. MINE SAFETY DISCLOSURES", 
"ITEM 5. MARKET FOR REGISTRANT’S COMMON EQUITY, RELATED STOCKHOLDER MATTERS AND ISSUER PURCHASES OF EQUITY SECURITIES", 
"ITEM 6. SELECTED FINANCIAL DATA ", "ITEM 7. MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS ", 
"ITEM 7A. QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK", 
"ITEM 8. FINANCIAL STATEMENTS AND SUPPLEMENTARY DATA", "ITEM 9. CHANGES IN AND DISAGREEMENTS WITH ACCOUNTANTS ON ACCOUNTING AND FINANCIAL DISCLOSURE", 
"ITEM 9A. CONTROLS AND PROCEDURES", "ITEM 9B.  OTHER INFORMATION", 
"ITEM 10. DIRECTORS, EXECUTIVE OFFICERS AND CORPORATE GOVERNANCE", 
"ITEM 11. EXECUTIVE COMPENSATION", "ITEM 12. SECURITY OWNERSHIP OF CERTAIN BENEFICIAL OWNERS AND MANAGEMENT AND RELATED STOCKHOLDER MATTERS", 
"ITEM 13. CERTAIN RELATIONSHIPS AND RELATED TRANSACTIONS, AND DIRECTOR INDEPENDENCE", 
"ITEM 14. PRINCIPAL ACCOUNTING FEES AND SERVICES", "ITEM 15. EXHIBITS, FINANCIAL STATEMENT SCHEDULE", 
"Item 1.    Business", "Item 1A.    Risk Factors", "Item 1B.    Unresolved Staff Comments", 
"Item 2.    Properties", "Item 3.    Legal Proceedings", "Item 4.    Mine Safety Disclosure", 
"Item 5.    Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities", 
"Item 6.    Selected Financial Data", "Item 7.    Management’s Discussion and Analysis of Financial Condition and Results of Operations", 
"Item 7A.    Quantitative and Qualitative Disclosures About Market Risk", 
"Item 8.    Financial Statements and Supplementary Data", "Item 9.    Changes in and Disagreements with Accountants on Accounting and Financial Disclosure", 
"Item 9A.    Controls and Procedures", "Item 9B.    Other Information", 
"Item 10.    Directors, Executive Officers and Corporate Governance", 
"Item 11.    Executive Compensation", "Item 12.    Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters", 
"Item 13.    Certain Relationships and Related Transactions, and Director Independence", 
"Item 14.    Principal Accountant Fees and Services", "Item 15.    Exhibits and Financial Statement Schedules(a)(1) and (2).  The following documents have been included in Part II, Item 8. Report of Ernst & Young LLP, Independent Registered Public Accounting Firm, on Financial Statements Consolidated Statements of Financial Position — As of December 31, 2017 and 2016 Consolidated Statements of Income — Years Ended December 31, 2017, 2016 and 2015 Consolidated Statements of Comprehensive Income — Years Ended December 31, 2017, 2016 and 2015 Consolidated Statements of Shareholders’ Equity — Years Ended December 31, 2017, 2016 and 2015 Consolidated Statements of Cash Flows — Years Ended December 31, 2017, 2016 and 2015 Notes to Consolidated Financial Statements", 
"Item 1.  Business.", "Item 1A.  Risk Factors.", "Item 1B.  Unresolved Staff Comments.", 
"Item 2.  Properties.", "Item 3.  Legal Proceedings.", "Item 4.  Mine Safety Disclosures.", 
"Item 5.  Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities.", 
"Item 6.  Selected Financial Data.", "Item 7.  Management's Discussion and Analysis of Financial Condition and Results of Operations. ", 
"Item 7A.  Quantitative and Qualitative Disclosures About Market Risk.", 
"Item 8.  Financial Statements and Supplementary Data.", "Item 9.  Changes in and Disagreements with Accountants on Accounting and Financial Disclosure.", 
"Item 9A.  Controls and Procedures.", "Item 9B.  Other Information.", 
"Item 10.  Directors, Executive Officers and Corporate Governance.", 
"Item 11.  Executive Compensation.", "Item 12.  Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters.", 
"Item 13.  Certain Relationships and Related Transactions, and Director Independence.", 
"Item 14.  Principal Accounting Fees and Services.", "Item 15.  Exhibits, Financial Statement Schedules.", 
"Item 16. Form 10-K Summary.", "Item 4.    Mine Safety Disclosures", 
"Item 4A.    Executive Officers", "Item 5.    Market for the Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities", 
"Item 6.    Selected Financial Data", "Item 7.   Management's Discussion and Analysis of Financial Condition and Results of Operations", 
"Item 8.   Financial Statements and Supplementary Data", "Item 15.    Exhibits, Financial Statement Schedules"
)

Answers [ 2 ]

0 votes
/ January 28, 2019

Here is another way to do it. We can use the \\U flag in the replacement, together with perl = TRUE, to convert the captured text to upper case:

# "test" holds the character vector from the question
s1 <- gsub("^(.*?)\\..*", "\\U\\1", test, perl = TRUE)
s2 <- gsub("\\s+", "", s1)

[1] NA       "ITEM1"  "ITEM1A" "ITEM1B" "ITEM2"  "ITEM3"  "ITEM4"  "ITEM5"  "ITEM6"  "ITEM7"  "ITEM7A"

My first expression cuts each item at the first period: it keeps (and upper-cases) everything before it and drops the rest.
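A minimal sketch of the two steps above on a small vector, assuming entries shaped like those in the question (x_demo is an illustrative name, not part of the original answer):

```r
# Small illustrative vector in the style of the question's data
x_demo <- c(NA, "ITEM 1A. RISK FACTORS", "Item 9B.    Other Information",
            "Item 16. Form 10-K Summary.")

# Lazily capture everything before the first period, drop the rest,
# and upper-case the captured text via \\U (only honoured with perl = TRUE)
s1 <- gsub("^(.*?)\\..*", "\\U\\1", x_demo, perl = TRUE)

# Remove the whitespace between "ITEM" and the number
s2 <- gsub("\\s+", "", s1)

print(s1)
print(s2)
```

Note that this relies on every label ending with a period; a string without one would be returned unchanged by the first gsub.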

0 votes
/ January 28, 2019

We can use sub to capture, as a single group, one or more non-digit characters followed by one or more digits and any letters that directly follow them. In the replacement, we use the backreference (\\1) to that captured group, which discards everything after it.

x1 <- sub("^([^0-9]+[0-9]+[A-Za-z]*).*", "\\1", x)
x1
#[1] NA        "ITEM 1"  "ITEM 1A" "ITEM 1B" "ITEM 2"  "ITEM 3"  "ITEM 4"  "ITEM 5"  "ITEM 6"  "ITEM 7"  "ITEM 7A" "ITEM 8"  "ITEM 9" 
#[14] "ITEM 9A" "ITEM 9B" "ITEM 10" "ITEM 11" "ITEM 12" "ITEM 13" "ITEM 14" "ITEM 15" "Item 1"  "Item 1A" "Item 1B" "Item 2"  "Item 3" 
#[27] "Item 4"  "Item 5"  "Item 6"  "Item 7"  "Item 7A" "Item 8"  "Item 9"  "Item 9A" "Item 9B" "Item 10" "Item 11" "Item 12" "Item 13"
#[40] "Item 14" "Item 15" "Item 1"  "Item 1A" "Item 1B" "Item 2"  "Item 3"  "Item 4"  "Item 5"  "Item 6"  "Item 7"  "Item 7A" "Item 8" 
#[53] "Item 9"  "Item 9A" "Item 9B" "Item 10" "Item 11" "Item 12" "Item 13" "Item 14" "Item 15" "Item 16" "Item 4"  "Item 4A" "Item 5" 
#[66] "Item 6"  "Item 7"  "Item 8"  "Item 15"
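One caveat worth sketching: sub returns strings that do not match the pattern unchanged. The example below assumes you would rather get NA for such entries; the "SIGNATURES" element is hypothetical and does not appear in the question's data.

```r
pattern <- "^([^0-9]+[0-9]+[A-Za-z]*).*"

# "SIGNATURES" is a made-up entry with no digits, used only for illustration
x_demo <- c("ITEM 7A. QUANTITATIVE AND QUALITATIVE DISCLOSURES ABOUT MARKET RISK",
            "SIGNATURES", NA)

x1_demo <- sub(pattern, "\\1", x_demo)

# sub() leaves non-matching strings as they were, so mark them NA explicitly;
# grepl() returns FALSE for NA input, so the existing NA stays NA
x1_demo[!grepl(pattern, x_demo)] <- NA

print(x1_demo)
```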

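If we also want to drop the space, remove it with sub. sub replaces only the first match, which is enough here because each element of x1 contains a single run of whitespace: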

x2 <- sub("\\s+", "", toupper(x1))
head(x2)
#[1] NA       "ITEM1"  "ITEM1A" "ITEM1B" "ITEM2"  "ITEM3" 
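The extraction, upper-casing and space removal can also be folded into one call. This is only a sketch under the assumption that every label looks like "Item", optional whitespace, digits and an optional letter suffix; item_label is a hypothetical helper name.

```r
# Hypothetical helper: capture the leading non-digit text and the number
# (plus any letter suffix) as two groups, skip the whitespace between them,
# and upper-case both groups with \\U (requires perl = TRUE)
item_label <- function(x) {
  sub("^([^0-9]+?)\\s*([0-9]+[A-Za-z]*).*", "\\U\\1\\2", x, perl = TRUE)
}

x_demo <- c(NA, "ITEM 1. BUSINESS", "Item 1A.    Risk Factors",
            "Item 15.  Exhibits, Financial Statement Schedules.")

res <- item_label(x_demo)
print(res)
```

The first group is lazy so that the whitespace before the number is consumed by \\s* rather than captured.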