У меня плотный график, чтобы придумать регулярное выражение на Python, чтобы сопоставлять названия компаний во многих возможных заявлениях об авторских правах, например:
Copyright © 2019 Apple Inc. All rights reserved.
© 2019 Quid, Inc. All Rights Reserved.
© 2009 Database Designs
© 2019 Rediker Software, All Rights Reserved
©2019 EVOSUS, INC. ALL RIGHTS RESERVED
© 2019 Walmart. All Rights Reserved.
© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.
Copyright © 1978-2019 Berkshire Hathaway Inc.
© 2019 McKesson Corporation
© 2019 UnitedHealth Group. All rights reserved.
© Copyright 1999 - 2019 CVS Health
Copyright 2019 General Motors. All Rights Reserved.
© 2019 Ford Motor Company
©2019 AT&T Intellectual Property. All rights reserved.
© 2019 GENERAL ELECTRIC
Copyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.
© 2019 Verizon
© 2019 Fannie Mae
Copyright © 2018 Jonas Construction Software Inc. All rights reserved.
All Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved
© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121
© 2019 JPMorgan Chase & Co.
Copyright © 1995 - 2018 Boeing. All Rights Reserved.
© 2019 Bank of America Corporation. All rights reserved.
© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801
©2019 Cardinal Health. All rights reserved.
То, что я знаю о регулярных выражениях, является лишь очень простой вещью и на данный момент недостаточно для быстрого нахождения хорошего решения.
Из того, что мне кажется, по крайней мере для этих примеров, требования для правильного ввода названия компании следующие:
If there's a '©' or 'Copyright' in the sentence:
After '©' or 'Copyright' - look for a year, e.g. '2019', or a year range, e.g. '1995 - 2018' or '2003-2019' (spaces are to catch as well]):
If there's a dot somewhere after this year/year range, capture the text until the dot. E.g. in 'Copyright © 1978-2019 Berkshire Hathaway Inc.' capture 'Berkshire Hathaway Inc'
If there's no dot but there's the sentence 'All rights reserved', capture from the year/year range until there and also ignore any possible non-alphanumeric characters that precede it, such as spaces and commas. E.g. from '© 2019 Rediker Software, All Rights Reserved' capture 'Rediker Software'
If there's no dot nor the sentence 'All rights reserved', capture from the year/year range until the end. E.g. from '© 2019 Verizon' Capture 'Verizon'
Какой-нибудь совет для хорошего регулярного выражения для этого?