Perl: multiple regular expressions on a single HTML file - PullRequest
0 votes
/ October 21, 2011

I need to run several Perl regular expressions against a single HTML file and store each match in an array.

The HTML file looks like this:

<a href="/jobs_qa">Job QA</a>

Title:
Commercial Bank 
<p></p>
City:
TX   
State:
TX  
Country:

<p></p>
Full Description:
<p></p>
<p> Citi North America Consumer Banking group serves customers through Retail Banking, Credit Cards, Personal Banking and Wealth Management, Small Business Banking and Commercial Banking.     </p>

<p>Commercial Bank Head - Houston-11030087</p>

<p>Description </p>

<p>POSITION SUMMARY</p>

<p>Lead the sales, relationship, and credit management for commercial banking customers in a given marketplace.  Build and motivate talented relationship teams to effectively penetrate the market and gain market share.  Current business segment includes those clients with revenues from $20 to $500+ million annually.    Clients in this segment typically require more complex product offerings and customized credit decisions made in the field.</p>

<p> </p>

<p> </p>



<p>Qualifications </p>

<p>EXPERIENCE
<br />-MBA or equivalent experience
<br />-Minimum 10 years business and/or commercial banking with increasing levels of responsibility

<p> </p>


<a href="http://www.mysite.com/jobs/">http://www.e.com/jobs/commercial-bank-head-houston-citi-houston-tx</a>
<hr>
Title:
Sr Business Relationship 
<p></p>
City:
CO   
State:
CO  
Country:

<p></p>
Full Description:
<p></p>
<p>Effectively acquires, manages and grows profitable account relationships with an extensive percentage of moderately complex and medium sized business customers that have annual gross sales of generally more than $2MM and less than $20MM. Ensures the overall success & growth of an assigned portfolio by deepening relationships of existing customers and through the acquisition of new customers. 
<p></p>
<a href="http://www.mysite.com/jobs/">http://www.e.com/jobs/sr-business-relationship-mgr-wells-fargo-avon-co</a>
<hr>
Title:
Implementation Associate
<p></p>
City:
WI   
State:
WI  
Country:

<p></p>
Full Description:
<p></p>
<p>Works with project managers and project teams to determine implementation strategy, methods and plans for initiatives that typically impact single systems, workflows or products with low risk and complexity or where work is completed under guidance. Coordinates development of business requirements. Develops standard communication and training plans and materials. Implements communications and training plans. Tracks implementation tasks and budgets, identifies and reports issues or escalates as needed and reports project status. Documents or updates best practices, workflows or procedures. May also be responsible to miscellaneous business administrative initiatives.2+ years experience in one or more of the following: administrative support; project management; implementation; or participation in project teams as part of on-going responsibilities in a postion supporting the line of business.Relevant project management and/or implementation experience- Proven organizational, motivational, time management, prioritization, detail orientation
<br /> and multi-tasking skills. 
<br />- Proven oral and written communication skills to support each line of business. 
<br />- Experience with PC applications - Word, Excel, Access, Power Point and Visio.</p>
<p></p>
<a href="http://www.mysite.com/jobs/">http://www.e.com/jobs/implementation-associate-wells-fargo-milwaukee-wi</a>
<hr>
Title:
......... ... ..... ........ 

...............

And so on. That is, I want to group all the content from one title to the next, i.e. $array[0]= "Title: Commercial Bank <p></p>City:TX ........."
and $array[1]= "Title: Sr Business Relationship <p></p> ", and so on.

I would end up with roughly 300 such values.

I also need the HTML tags inside them, because I have to check that the tags are used correctly. I don't know in advance what content sits between the tags.

Here is what I tried:

my $i=0;
my @array;
while ($html =~ m/.*(Title:.*?)Title:/ig)
{
    $array[$i]=$1;
    $i++;
}

foreach (@array)
{
    print "$_";
}

But nothing gets matched at all. Please advise.

1 Answer

5 votes
/ October 21, 2011

Don't use regular expressions to parse HTML. Use an HTML parser. There are many on CPAN. One of my favorites is HTML::TokeParser::Simple.
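
As a rough illustration of the parser-based approach, here is a minimal sketch that walks the token stream and starts a new record whenever a text token begins with "Title:". The helper name `split_records` and the shortened sample input are made up for this example, and the "Title:" test is a heuristic based on the layout shown in the question, not something the module provides. It assumes HTML::TokeParser::Simple is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: returns one string per record. Each record starts at
# a "Title:" text marker and keeps the original HTML tags untouched.
sub split_records {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new( string => $html );
    # Ask the underlying HTML::Parser to deliver each text run in one piece.
    $parser->unbroken_text(1);

    my @records;
    while ( my $token = $parser->get_token ) {
        # A text token that begins with "Title:" opens a new record.
        if ( $token->is_text && $token->as_is =~ /^\s*Title:/ ) {
            push @records, '';
        }
        # Skip everything that precedes the first "Title:" marker.
        next unless @records;
        # as_is returns the token exactly as it appeared in the source.
        $records[-1] .= $token->as_is;
    }
    return @records;
}

# Shortened, made-up stand-in for the real file contents.
my $html = <<'HTML';
<a href="/jobs_qa">Job QA</a>
Title:
Commercial Bank
<p></p>
City:
TX
<hr>
Title:
Sr Business Relationship
<p></p>
HTML

my @records = split_records($html);
print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

Because every token is appended via `as_is`, the tags survive verbatim, which matters if the records are later checked for correct markup. A description that itself contains a text run starting with "Title:" would be split incorrectly, so the marker test may need tightening for real data.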

HTML::Tidy and the W3C validator can help you validate HTML documents.
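
On the loop in the question itself: a likely reason it matches nothing is that, without the /s modifier, `.` does not match a newline, so `Title:.*?Title:` can never span the lines between two headings. Even with /s, the greedy leading `.*` would jump to the last pair of headings, and consuming the closing `Title:` would leave it unavailable as the start of the next match. If the goal is only to cut the text at each "Title:" marker (a plain-text marker, not a tag), a sketch along these lines may be enough; the sample input is a shortened, made-up stand-in for the real file.

```perl
use strict;
use warnings;

# Shortened, made-up stand-in for the real file contents.
my $html = <<'HTML';
<a href="/jobs_qa">Job QA</a>

Title:
Commercial Bank
<p></p>
City:
TX
<hr>
Title:
Sr Business Relationship
<p></p>
<hr>
Title:
Implementation Associate
<p></p>
HTML

# /s lets "." match newlines. The lookahead stops just before the next
# "Title:" (or at end of input) without consuming it, so the following
# match can start there.
my @array;
while ( $html =~ /(Title:.*?)(?=Title:|\z)/sg ) {
    push @array, $1;
}

print scalar(@array), " records\n";
print "---\n$_" for @array;
```

This only chunks the text; it does not parse or validate the tags inside each chunk, and it would split in the wrong place if "Title:" ever appeared inside a description. Unlike the original pattern it is case-sensitive, since the /i flag was dropped.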
