LXML / Python - looping over a list of lxml.etree._Element
0 votes
/ April 6, 2020

I am trying to loop over a list of 5 lxml.etree._Element objects.

Here is an excerpt of the part of the HTML I am interested in:

    <div style="" id="ember140" class="pv-deferred-area ember-view">  <div class="pv-deferred-area__content">
        <!---->

  </div>
</div>
<div id="oc-background-section" class="pv-oc ember-view">              <span class="background-details">
              <div id="ember217" class="ember-view"><section id="ember218" class="pv-profile-section pv-profile-section--reorder-enabled background-section artdeco-container-card ember-view"><div id="ember219" class="pv-profile-section-pager ember-view">    <section id="experience-section" class="pv-profile-section experience-section ember-view"><header class="pv-profile-section__card-header">
  <h2 class="pv-profile-section__card-heading">
    Expérience
  </h2>

<a data-control-name="add_position" href="/in/gregoire-de-kermel/edit/position/new/" id="ember221" class="pv-profile-section__header-add-action add-position artdeco-button artdeco-button--tertiary artdeco-button--circle ember-view">      <li-icon type="plus-icon" role="img" aria-label="Ajouter un nouveau poste"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" data-supported-dps="24x24" fill="currentColor" width="24" height="24" focusable="false">
  <path d="M21 13h-8v8h-2v-8H3v-2h8V3h2v8h8v2z"></path>
</svg></li-icon>
</a></header>


  <ul class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more">
<li id="ember223" class="pv-entity__position-group-pager pv-profile-section__list-item ember-view">        <section id="1571672557" class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view">  <div class="display-flex justify-space-between full-width">
    <div class="display-flex flex-column full-width">
<a data-control-name="background_details_company" href="/company/reputation-squad/" id="ember226" class="full-width ember-view">          <div class="pv-entity__logo company-logo">
  <img src="https://media-exp1.licdn.com/dms/image/C4D0BAQE__TgCl2fyUw/company-logo_100_100/0?e=1593648000&amp;v=beta&amp;t=VLSKEVUbJDcULtQwEdrHrH5Gxwq_j7tk2HczgAKn7YU" loading="lazy" alt="Reputation Squad" id="ember228" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image loaded ember-view">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section ">
  <h3 class="t-16 t-black t-bold">Data Scientist</h3>
  <p class="visually-hidden">Nom de l’entreprise</p>
  <p class="pv-entity__secondary-title t-14 t-black t-normal">
      Reputation Squad
        <span class="pv-entity__secondary-title separator">Contrat en alternance</span>
  </p>
    <div class="display-flex">
    <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
      <span class="visually-hidden">Dates d’emploi</span>
      <span>janv. 2020 – Aujourd’hui</span>
    </h4>
      <h4 class="t-14 t-black--light t-normal">
        <span class="visually-hidden">Durée d’emploi</span>
        <span class="pv-entity__bullet-item-v2">4 mois</span>
      </h4>
  </div>

  <h4 class="pv-entity__location t-14 t-black--light t-normal block">
    <span class="visually-hidden">Lieu</span>
    <span>Région de Paris, France</span>
  </h4>

<!---->
</div>

</a>
<!---->    </div>

      <div class="pv-entity__actions">
<a data-control-name="edit_position" href="/in/gregoire-de-kermel/edit/position/1571672557/" id="ember230" class="pv-profile-section__edit-action pv-profile-section__hoverable-action artdeco-button artdeco-button--tertiary artdeco-button--circle ember-view">          <li-icon type="pencil-icon" role="img" aria-label="Modifier le poste Data Scientist"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" data-supported-dps="24x24" fill="currentColor" width="24" height="24" focusable="false">
  <path d="M21.71 5L19 2.29a1 1 0 00-.71-.29 1 1 0 00-.7.29L4 15.85 2 22l6.15-2L21.71 6.45a1 1 0 00.29-.74 1 1 0 00-.29-.71zM6.87 18.64l-1.5-1.5L15.92 6.57l1.5 1.5zM18.09 7.41l-1.5-1.5 1.67-1.67 1.5 1.5z"></path>
</svg></li-icon>
</a><!---->      </div>
  </div>
</section>
</li><li id="ember232" class="pv-entity__position-group-pager pv-profile-section__list-item ember-view">        <section id="1516596236" class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view">  <div class="display-flex justify-space-between full-width">
    <div class="display-flex flex-column full-width">
<a data-control-name="background_details_company" href="/company/credit-agricole-de-la-touraine-et-du-poitou-crto-/" id="ember235" class="full-width ember-view">          <div class="pv-entity__logo company-logo">
  <img src="https://media-exp1.licdn.com/dms/image/C560BAQHz0qZ2RutURA/company-logo_100_100/0?e=1593648000&amp;v=beta&amp;t=uzqwKV9Un5c_b7X3Xo7vqA2KXcQkmBRDWpMUO5Bu1Gc" loading="lazy" alt="Crédit Agricole de la Touraine et du Poitou" id="ember237" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image loaded ember-view">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section mb2">
  <h3 class="t-16 t-black t-bold">Data Scientist</h3>
  <p class="visually-hidden">Nom de l’entreprise</p>
  <p class="pv-entity__secondary-title t-14 t-black t-normal">
      Crédit Agricole de la Touraine et du Poitou
        <span class="pv-entity__secondary-title separator">Contrat en alternance</span>
  </p>
    <div class="display-flex">
    <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
      <span class="visually-hidden">Dates d’emploi</span>
      <span>sept. 2019 – janv. 2020</span>
    </h4>
      <h4 class="t-14 t-black--light t-normal">
        <span class="visually-hidden">Durée d’emploi</span>
        <span class="pv-entity__bullet-item-v2">5 mois</span>
      </h4>
  </div>

  <h4 class="pv-entity__location t-14 t-black--light t-normal block">
    <span class="visually-hidden">Lieu</span>
    <span>Région de Poitiers, France</span>
  </h4>

<!---->
</div>

</a>
        <div id="ember239" class="pv-entity__extra-details t-14 t-black--light ember-view"><p style="line-height:2rem;max-height:8rem;" id="ember240" class="pv-entity__description t-14 t-black t-normal inline-show-more-text inline-show-more-text--is-collapsed ember-view">• Web scraping (Python)<br>• Etude de profilage client (SAS)<br>• Mise en place d'un projet de système de recommandation (Hadoop, SAS, Python)

<!----></p><!----></div>
    </div>

      <div class="pv-entity__actions">
<a data-control-name="edit_position" href="/in/gregoire-de-kermel/edit/position/1516596236/" id="ember241" class="pv-profile-section__edit-action pv-profile-section__hoverable-action artdeco-button artdeco-button--tertiary artdeco-button--circle ember-view">          <li-icon type="pencil-icon" role="img" aria-label="Modifier le poste Data Scientist"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" data-supported-dps="24x24" fill="currentColor" width="24" height="24" focusable="false">
  <path d="M21.71 5L19 2.29a1 1 0 00-.71-.29 1 1 0 00-.7.29L4 15.85 2 22l6.15-2L21.71 6.45a1 1 0 00.29-.74 1 1 0 00-.29-.71zM6.87 18.64l-1.5-1.5L15.92 6.57l1.5 1.5zM18.09 7.41l-1.5-1.5 1.67-1.67 1.5 1.5z"></path>
</svg></li-icon>
</a><!---->      </div>
  </div>
</section>
</li><li id="ember243" class="pv-entity__position-group-pager pv-profile-section__list-item ember-view">        <section id="1427380111" class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view">  <div class="display-flex justify-space-between full-width">
    <div class="display-flex flex-column full-width">
<a data-control-name="background_details_company" href="/company/weblagence/" id="ember246" class="full-width ember-view">          <div class="pv-entity__logo company-logo">
  <img src="https://media-exp1.licdn.com/dms/image/C560BAQHOw0tfMPSiWA/company-logo_100_100/0?e=1593648000&amp;v=beta&amp;t=NqZ8eTVFqA2MK4B1ZFUSE7NgTL_ZPqBIMrexzcYnNok" loading="lazy" alt="WebL'Agence" id="ember248" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image loaded ember-view">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section mb2">
  <h3 class="t-16 t-black t-bold">Python &amp; React Native developer junior</h3>
  <p class="visually-hidden">Nom de l’entreprise</p>
  <p class="pv-entity__secondary-title t-14 t-black t-normal">
      WebL'Agence
<!---->  </p>
    <div class="display-flex">
    <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
      <span class="visually-hidden">Dates d’emploi</span>
      <span>janv. 2019 – août 2019</span>
    </h4>
      <h4 class="t-14 t-black--light t-normal">
        <span class="visually-hidden">Durée d’emploi</span>
        <span class="pv-entity__bullet-item-v2">8 mois</span>
      </h4>
  </div>

  <h4 class="pv-entity__location t-14 t-black--light t-normal block">
    <span class="visually-hidden">Lieu</span>
    <span>Région de Paris, France</span>
  </h4>

<!---->
</div>

</a>
        <div id="ember250" class="pv-entity__extra-details t-14 t-black--light ember-view"><p style="line-height:2rem;max-height:8rem;" id="ember251" class="pv-entity__description t-14 t-black t-normal inline-show-more-text inline-show-more-text--is-collapsed ember-view">• Création d’applications mobiles (React-Native)<br>• Développement d’un modèle d’évaluation de startup «&nbsp;early-stage&nbsp;»<br>• Web scraping (Selenium Python)<br>• Gestionnaire d’un projet de Machine-Learing/OCR+ (externalisation auprès de prestataires externes et utilisation de AWS textract)

    <span class="inline-show-more-text__link-container-collapsed">
        <span>…</span>
      <button class="inline-show-more-text__button link" aria-expanded="false" data-ember-action="" data-ember-action-341="341">
        voir plus
      </button>
    </span>

<!----></p><!----></div>
    </div>

      <div class="pv-entity__actions">
<a data-control-name="edit_position" href="/in/gregoire-de-kermel/edit/position/1427380111/" id="ember252" class="pv-profile-section__edit-action pv-profile-section__hoverable-action artdeco-button artdeco-button--tertiary artdeco-button--circle ember-view">          <li-icon type="pencil-icon" role="img" aria-label="Modifier le poste Python &amp;amp; React Native developer junior"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" data-supported-dps="24x24" fill="currentColor" width="24" height="24" focusable="false">
  <path d="M21.71 5L19 2.29a1 1 0 00-.71-.29 1 1 0 00-.7.29L4 15.85 2 22l6.15-2L21.71 6.45a1 1 0 00.29-.74 1 1 0 00-.29-.71zM6.87 18.64l-1.5-1.5L15.92 6.57l1.5 1.5zM18.09 7.41l-1.5-1.5 1.67-1.67 1.5 1.5z"></path>
</svg></li-icon>
</a><!---->      </div>
  </div>
</section>
</li><li id="ember254" class="pv-entity__position-group-pager pv-profile-section__list-item ember-view">        <section id="708026390" class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view">  <div class="display-flex justify-space-between full-width">
    <div class="display-flex flex-column full-width">
<a data-control-name="background_details_company" href="/search/results/all/?keywords=Gauthier%20Associ%C3%A9s" id="ember257" class="full-width ember-view">          <div class="pv-entity__logo company-logo">
  <img src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" loading="lazy" alt="Gauthier Associés" id="ember259" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image ghost-company loaded ember-view">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section mb2">
  <h3 class="t-16 t-black t-bold">Business Financial Analyst</h3>
  <p class="visually-hidden">Nom de l’entreprise</p>
  <p class="pv-entity__secondary-title t-14 t-black t-normal">
      Gauthier Associés
<!---->  </p>
    <div class="display-flex">
    <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
      <span class="visually-hidden">Dates d’emploi</span>
      <span>juil. 2015 – juin 2019</span>
    </h4>
      <h4 class="t-14 t-black--light t-normal">
        <span class="visually-hidden">Durée d’emploi</span>
        <span class="pv-entity__bullet-item-v2">4 ans</span>
      </h4>
  </div>

  <h4 class="pv-entity__location t-14 t-black--light t-normal block">
    <span class="visually-hidden">Lieu</span>
    <span>Smarves</span>
  </h4>

<!---->
</div>

</a>
        <div id="ember261" class="pv-entity__extra-details t-14 t-black--light ember-view"><p style="line-height:2rem;max-height:8rem;" id="ember262" class="pv-entity__description t-14 t-black t-normal inline-show-more-text inline-show-more-text--is-collapsed ember-view">It started as a 2 months internship in which my tasks were to:<br>• Analysed the company's profitability<br>•   Created the official corporate document on profitability<br>•   Designed and administered a corporate customer satisfaction survey<br><br>Ever since, I am doing yearly financial and business analysis under my own business. It has been now 4 years that I am working with this company, with more and more responsibilities over the time such reporting and analysing the company's investing holdings' profitability.

    <span class="inline-show-more-text__link-container-collapsed">
        <span>…</span>
      <button class="inline-show-more-text__button link" aria-expanded="false" data-ember-action="" data-ember-action-342="342">
        voir plus
      </button>
    </span>

<!----></p><!----></div>
    </div>

      <div class="pv-entity__actions">
<a data-control-name="edit_position" href="/in/gregoire-de-kermel/edit/position/708026390/" id="ember263" class="pv-profile-section__edit-action pv-profile-section__hoverable-action artdeco-button artdeco-button--tertiary artdeco-button--circle ember-view">          <li-icon type="pencil-icon" role="img" aria-label="Modifier le poste Business Financial Analyst"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" data-supported-dps="24x24" fill="currentColor" width="24" height="24" focusable="false">
  <path d="M21.71 5L19 2.29a1 1 0 00-.71-.29 1 1 0 00-.7.29L4 15.85 2 22l6.15-2L21.71 6.45a1 1 0 00.29-.74 1 1 0 00-.29-.71zM6.87 18.64l-1.5-1.5L15.92 6.57l1.5 1.5zM18.09 7.41l-1.5-1.5 1.67-1.67 1.5 1.5z"></path>
</svg></li-icon>
</a><!---->      </div>
  </div>
</section>
</li><li id="ember265" class="pv-entity__position-group-pager pv-profile-section__list-item ember-view">        <section id="813743952" class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view">  <div class="display-flex justify-space-between full-width">
    <div class="display-flex flex-column full-width">
<a data-control-name="background_details_company" href="/company/gsma/" id="ember268" class="full-width ember-view">          <div class="pv-entity__logo company-logo">
  <img src="https://media-exp1.licdn.com/dms/image/C560BAQGmHE5IziHPfw/company-logo_100_100/0?e=1593648000&amp;v=beta&amp;t=uonj7aae0F9Qr9Z7uDJAjX358njW5zCqaCrhF-m5wJU" loading="lazy" alt="GSMA" id="ember270" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image loaded ember-view">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section mb2">
  <h3 class="t-16 t-black t-bold">Intern Analyst - Network 2020</h3>
  <p class="visually-hidden">Nom de l’entreprise</p>
  <p class="pv-entity__secondary-title t-14 t-black t-normal">
      GSMA
<!---->  </p>
    <div class="display-flex">
    <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
      <span class="visually-hidden">Dates d’emploi</span>
      <span>mai 2016 – juil. 2016</span>
    </h4>
      <h4 class="t-14 t-black--light t-normal">
        <span class="visually-hidden">Durée d’emploi</span>
        <span class="pv-entity__bullet-item-v2">3 mois</span>
      </h4>
  </div>

  <h4 class="pv-entity__location t-14 t-black--light t-normal block">
    <span class="visually-hidden">Lieu</span>
    <span>London, Royaume-Uni</span>
  </h4>

<!---->
</div>

</a>
        <div id="ember272" class="pv-entity__extra-details t-14 t-black--light ember-view"><p style="line-height:2rem;max-height:8rem;" id="ember273" class="pv-entity__description t-14 t-black t-normal inline-show-more-text inline-show-more-text--is-collapsed ember-view">Initially, I had the opportunity to use small and large data sets to reconcile, analyse and present to key stakeholders – developing strong excel capability in the process.<br><br>Further developing these skills, I had the opportunity to deliver a project of work (end-to-end) from developing communications for data requests, clean data collection and storage processes and then developing a methodology for estimating market size metrics in a sustainable reporting process to the Network 2020 programme.<br><br>Summary of milestones:<br>•    Reconciling, analysing and presenting information to key stakeholders<br>•  Request, store and model market estimations<br>•    Researching, collection and storing and communicating key business metrics

    <span class="inline-show-more-text__link-container-collapsed">
        <span>…</span>
      <button class="inline-show-more-text__button link" aria-expanded="false" data-ember-action="" data-ember-action-343="343">
        voir plus
      </button>
    </span>

<!----></p><!----></div>
    </div>

      <div class="pv-entity__actions">
<a data-control-name="edit_position" href="/in/gregoire-de-kermel/edit/position/813743952/" id="ember274" class="pv-profile-section__edit-action pv-profile-section__hoverable-action artdeco-button artdeco-button--tertiary artdeco-button--circle ember-view">          <li-icon type="pencil-icon" role="img" aria-label="Modifier le poste Intern Analyst - Network 2020"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" data-supported-dps="24x24" fill="currentColor" width="24" height="24" focusable="false">
  <path d="M21.71 5L19 2.29a1 1 0 00-.71-.29 1 1 0 00-.7.29L4 15.85 2 22l6.15-2L21.71 6.45a1 1 0 00.29-.74 1 1 0 00-.29-.71zM6.87 18.64l-1.5-1.5L15.92 6.57l1.5 1.5zM18.09 7.41l-1.5-1.5 1.67-1.67 1.5 1.5z"></path>
</svg></li-icon>
</a><!---->      </div>
  </div>
</section>
</li>  </ul>

  <div id="ember275" class="pv-experience-section__see-more pv-profile-section__actions-inline ember-view"><button class="pv-profile-section__see-more-inline pv-profile-section__text-truncate-toggle link link-without-hover-state" aria-expanded="false">Afficher 1 expérience de plus
<li-icon aria-hidden="true" type="chevron-down-icon" class="pv-profile-section__toggle-detail-icon" size="small"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16" data-supported-dps="16x16" fill="currentColor" width="16" height="16" focusable="false">
  <path d="M8 9l5.93-4L15 6.54l-6.15 4.2a1.5 1.5 0 01-1.69 0L1 6.54 2.07 5z"></path>
</svg></li-icon></button>

<!----></div>

I saved the excerpt to an HTML file and opened it as follows:

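from io import StringIO
from lxml import etree

def parse_html_file(filename):
    # Read the whole file, closing the handle when done
    with open(filename, encoding="utf8") as f:
        content = f.read()
    # HTMLParser tolerates the non-XML markup found in real pages
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(content), parser)
    return tree

tree = parse_html_file('test.html')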

My list "pv-profile-section__section-info section-info pv-profile-section__section-info--has-more" contains 5 elements.

The goal is to extract the job title, the company name and the contract type.

So far I have done the following:

job_location = tree.xpath(
    './/li[@class="pv-entity__position-group-pager pv-profile-section__list-item ember-view"]')

di = {}

for i in job_location:
    try:
        di['name'] = tree.xpath(
            '//h3[@class="t-16 t-black t-bold"]/text()')
    except:
        di['name'] = 'None'
    try:
        di['name'] = tree.xpath(
            '//h3[@class="t-16 t-black t-bold"]/text()')
    except:
        di['name'] = 'None'    
    try:
        di['contract'] = tree.xpath(
            '//span[@class="pv-entity__secondary-title separator"]/text()')
    except:
        di['contract'] = 'None'

print(di)

This seems to work, but the "job" and "company" variables currently have length 5 while "contract_type" has length 2. Inside the original loop, I would like to record something when an item has no contract_type, as is the case for the last element. When there is nothing, I would like to display "None" for the contract type.

What I have:

{'name': ['Data Scientist', 'Data Scientist', 'Python & React Native developer junior', 'Business Financial Analyst', 'Intern Analyst - Network 2020'], 'contract': ['Contrat en alternance', 'Contrat en alternance']}

What I would like to get:

{'name': ['Data Scientist', 'Data Scientist', 'Python & React Native developer junior', 'Business Financial Analyst', 'Intern Analyst - Network 2020'], 'contract': ['Contrat en alternance', 'Contrat en alternance', '', '', '']}

Could you give me a hint on this task?

1 Answer

0 votes
/ April 7, 2020

If I understand you correctly, something like this may be what you are looking for.

It basically builds two lists: a long one (for "name") and a short one (for "contract"). It then pads the shorter list to the length of the longer one with "NA" (or anything else), and finally adds the two equal-length lists to a dictionary.

Note that job_location now starts from the parent node of the one used in your question (which probably caused some of the confusion).
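The choice of starting node only matters when the inner XPath expressions are relative. As a minimal sketch, assuming a hypothetical two-item snippet rather than the real page, the following compares an expression starting with '//' against one starting with './/' when both are called on each list item:

```python
from lxml import etree

# Hypothetical two-item snippet standing in for the real page.
html = """
<ul>
  <li><h3>Job A</h3><span class="c">Contract A</span></li>
  <li><h3>Job B</h3></li>
</ul>
"""
root = etree.HTML(html)
items = root.xpath('.//li')

# '//' starts at the document root, whatever element it is called on.
absolute = [li.xpath('//h3/text()') for li in items]
# './/' starts at the element the method is called on.
relative = [li.xpath('.//h3/text()') for li in items]

print(absolute)
print(relative)
```

With '//', every item returns both titles; with './/', each item returns only its own. This is why a loop that calls tree.xpath('//...') on each iteration produces the same full lists every time.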

job_location = tree.xpath('.//ul[@class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more"]')
di = {}
for jl in job_location:
    # The leading '.' keeps each search inside the current <ul>
    desc = jl.xpath('.//h3[@class="t-16 t-black t-bold"]/text()')
    cont = jl.xpath('.//span[@class="pv-entity__secondary-title separator"]/text()')
    name = list(desc)
    contract = list(cont)
    # This is where the padding takes place
    contract += ['NA'] * (len(name) - len(contract))
    di['name'] = name
    di['contract'] = contract
print(di)

Output:

{'name': ['Data Scientist', 'Data Scientist', 'Python & React Native developer junior', 'Business Financial Analyst', 'Intern Analyst - Network 2020'], 'contract': ['Contrat en alternance', 'Contrat en alternance', 'NA', 'NA', 'NA']}
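Padding at the end only lines up with the job titles if the entries without a contract type happen to come last, as they do here. A possible alternative, sketched below under assumptions, is to loop over each list item and run relative XPath queries on it, so a missing field gets its default in the right position. The SAMPLE markup is a trimmed, hypothetical stand-in for the page, and first_text and extract_jobs are illustrative helper names, not part of lxml. On the real page the items would be selected with the li class from the question rather than a bare './/li'.

```python
from lxml import etree

# Hypothetical, trimmed-down stand-in for the profile HTML in the question.
SAMPLE = """
<ul class="section-info">
  <li class="item">
    <h3 class="t-16 t-black t-bold">Data Scientist</h3>
    <p class="pv-entity__secondary-title t-14 t-black t-normal">
      Reputation Squad
      <span class="pv-entity__secondary-title separator">Contrat en alternance</span>
    </p>
  </li>
  <li class="item">
    <h3 class="t-16 t-black t-bold">Intern Analyst - Network 2020</h3>
    <p class="pv-entity__secondary-title t-14 t-black t-normal">
      GSMA
    </p>
  </li>
</ul>
"""


def first_text(node, path, default="None"):
    """Return the first non-blank text matched by a relative XPath, or default."""
    for value in node.xpath(path):
        if value.strip():
            return value.strip()
    return default


def extract_jobs(root):
    """Build one dict per <li>, so every field stays aligned with its item."""
    jobs = []
    for li in root.xpath('.//li'):
        jobs.append({
            "name": first_text(li, './/h3[@class="t-16 t-black t-bold"]/text()'),
            "company": first_text(
                li, './/p[@class="pv-entity__secondary-title t-14 t-black t-normal"]/text()'),
            "contract": first_text(
                li, './/span[@class="pv-entity__secondary-title separator"]/text()'),
        })
    return jobs


jobs = extract_jobs(etree.HTML(SAMPLE))
for job in jobs:
    print(job)
```

In this sketch the second item has no contract span, so its "contract" value falls back to the string "None" while its name and company are still filled in. No try/except is needed, because xpath() returns an empty list rather than raising when nothing matches.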
...