If you point your browser at http://careers.boozallen.com/search?q=software+engineer+CA
and inspect the HTML, you will see markup like this:
<tr class="dbOutputRow2">
<td style="width: 400px;" class="colTitle" headers="hdrTitle"><span class="jobTitle"><a href="http://careers.boozallen.com/job/San-Diego-Network-Engineer%2C-Senior-Job-CA-92101/1645793/">Network Engineer, Senior Job</a></span></td>
<td style="width: auto;" class="colLocation" headers="hdrLocation"><span class="jobLocation">San Diego, CA, US</span></td>
<td style="width: 155px;" class="colDate" headers="hdrDate" nowrap="nowrap"><span class="jobDate">Jan 5, 2012</span></td>
The information you are looking for is in the <span> tags whose class attribute equals jobTitle, jobLocation, or jobDate.
Here is how you can scrape those bits using lxml:
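# Python 2 code: urllib2 does not exist in Python 3
import urllib2
import lxml.html as LH

url = 'http://careers.boozallen.com/search?q=software+engineer+CA'
doc = LH.parse(urllib2.urlopen(url))

def text_content(iterable):
    # Yield the text of each matched element
    for elt in iterable:
        yield elt.text_content()

# Select every span carrying one of the three classes, in document order
data = text_content(doc.xpath('''//span[@class = "jobTitle"
                                  or @class = "jobLocation"
                                  or @class = "jobDate"]'''))

# data is a generator, so zip(*[data]*3) consumes it three items at a time
for title, location, date in zip(*[data]*3):
    print(title,location,date)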
Output:
('Title', 'Location', 'Date')
('Network Engineer, Senior Job', 'San Diego, CA, US', 'Jan 5, 2012')
('Network Integration Engineer, Mid Job', 'San Diego, CA, US', 'Jan 12, 2012')
('Systems Engineer, Senior Job', 'San Diego, CA, US', 'Jan 31, 2012')
('Enterprise Architect, Senior Job', 'Washington, DC, US', 'Jan 23, 2012')
...
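The zip(*[data]*3) line in the script is the least obvious part, so here is a minimal standalone sketch of that grouping idiom. The helper name grouper and the hard-coded sample list are illustrative assumptions, not part of the script above; the sample strings are copied from the output shown. The trick relies on all three zip arguments being the same iterator, so each tuple is built from three consecutive items.

```python
def grouper(iterable, n):
    """Group consecutive items into n-tuples (hypothetical helper name)."""
    # iter() guarantees a single shared iterator; passing a plain list
    # n times to zip would instead repeat each element n times.
    it = iter(iterable)
    # [it] * n is a list of n references to the SAME iterator, so zip
    # pulls n successive items for every tuple it builds.
    return zip(*[it] * n)


# Sample flat stream, in the order the XPath query yields the spans.
flat = [
    'Network Engineer, Senior Job', 'San Diego, CA, US', 'Jan 5, 2012',
    'Systems Engineer, Senior Job', 'San Diego, CA, US', 'Jan 31, 2012',
]

rows = list(grouper(flat, 3))
for title, location, date in rows:
    print('{} | {} | {}'.format(title, location, date))
```

Note that zip stops as soon as the iterator runs dry, so if the number of matched spans is not a multiple of three, the trailing incomplete group is silently dropped.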