Adding a list to a pandas column - PullRequest
0 votes
/ August 26, 2018

I need to add the list idlist to the EventID column of my table. The list has to be added in order, because I took the IDs in order from the source HTML file.

Right now my output looks like this:

     EventID                   EventDate                                          EventName  AmntTickets              PriceRange
0  103577924  Thu, 10/11/2018  8:20 p.m.  Philadelphia Eagles at New York Giants  MetLif...         6655  $134.50  to  $2,222.50
1  103577924  Thu, 10/11/2018  8:21 p.m.  PARKING PASSES ONLY Philadelphia Eagles at New...          929   $20.39  to  $3,602.50
     EventID                   EventDate                                          EventName  AmntTickets              PriceRange
0  103577925  Thu, 10/11/2018  8:20 p.m.  Philadelphia Eagles at New York Giants  MetLif...         6655  $134.50  to  $2,222.50
1  103577925  Thu, 10/11/2018  8:21 p.m.  PARKING PASSES ONLY Philadelphia Eagles at New...          929   $20.39  to  $3,602.50

I need it to look like this:

     EventID                   EventDate                                          EventName  AmntTickets              PriceRange
0  103577924  Thu, 10/11/2018  8:20 p.m.  Philadelphia Eagles at New York Giants  MetLif...         6655  $134.50  to  $2,222.50
1  103577925  Thu, 10/11/2018  8:21 p.m.  PARKING PASSES ONLY Philadelphia Eagles at New...          929   $20.39  to  $3,602.50

My code:

import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml.html as lh
import pprint
import re

with open("htmltabletest.html", encoding="utf-8") as f:
    data = f.read()
    soup = BeautifulSoup(data, 'lxml')
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    dfs = pd.read_html(soup.prettify())
    df = dfs[0]
    dfz=df.rename(columns = {'Event date  Time (local)':'EventDate'}).rename(columns = {'Event name  Venue':'EventName'}).rename(columns = {'Tickets  listed':'AmntTickets'}).rename(columns = {'Price  range':'PriceRange'}).rename(columns = {'Unnamed: 0':'EventID'})
    idlist = []
    for se in soup.find_all('span', id=re.compile(r'min')):
        se = (str(se))
        seeme1 = se.replace('<span id="se-','')
        seeme, sep, tail = seeme1.partition('-')
        idlist.append(seeme)
    for p in idlist:
        dfz = dfz.assign(EventID=p)
        print(dfz)

My HTML file (htmltabletest.html):

<table class="dataTable st-alternateRows" id="eventSearchTable">
<thead>
<tr>
<th id="th-es-rb"><div class="dt-th"> </div></th>
<th id="th-es-ed"><div class="dt-th"><span class="th-divider"> </span>Event date<br/>Time (local)</div></th>
<th id="th-es-en"><div class="dt-th"><span class="th-divider"> </span>Event name<br/>Venue</div></th>
<th id="th-es-ti"><div class="dt-th"><span class="th-divider"> </span>Tickets<br/>listed</div></th>
<th id="th-es-pr"><div class="dt-th es-lastCell"><span class="th-divider"> </span>Price<br/>range</div></th>
</tr>
</thead>
<tbody class="" id="eventSearchTbody"><tr class="even" id="r-se-103577924">
<td class="nowrap"><input class="es-selectedEvent" id="se-103577924-check" name="selectEvent" type="radio"/></td>
<td class="nowrap" id="se-103577924-eventDateTime">Thu, 10/11/2018<br/>8:20 p.m.</td>
<td><div><a class="ellip" href="services/priceanalysis?eventId=103577924&amp;sectionId=0" id="se-103577924-eventName" target="_blank">Philadelphia Eagles at New York Giants</a></div><div id="se-103577924-venue">MetLife Stadium, East Rutherford, NJ</div></td>
<td id="se-103577924-nrTickets">6655</td>
<td class="es-lastCell nowrap" id="se-103577924-priceRange"><span id="se-103577924-minPrice">$134.50</span>  to<br/><span id="se-103577924-maxPrice">$2,222.50</span></td>
</tr><tr class="odd" id="r-se-103577925">
<td class="nowrap"><input class="es-selectedEvent" id="se-103577925-check" name="selectEvent" type="radio"/></td>
<td class="nowrap" id="se-103577925-eventDateTime">Thu, 10/11/2018<br/>8:21 p.m.</td>
<td><div><a class="ellip" href="services/priceanalysis?eventId=103577925&amp;sectionId=0" id="se-103577925-eventName" target="_blank">PARKING PASSES ONLY Philadelphia Eagles at New York Giants</a></div><div id="se-103577925-venue">MetLife Stadium Parking Lots, East Rutherford, NJ</div></td>
<td id="se-103577925-nrTickets">929</td>
<td class="es-lastCell nowrap" id="se-103577925-priceRange"><span id="se-103577925-minPrice">$20.39</span>  to<br/><span id="se-103577925-maxPrice">$3,602.50</span></td>
</tr></tbody>
</table>

1 Answer

0 votes
/ August 26, 2018

If the DataFrame dfz and the list idlist have the same length, you can remove the last for loop entirely and assign the list to the column directly:

dfz["EventID"] = idlist

The full script then becomes:

import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml.html as lh
import pprint
import re

with open("testfile.html") as f:
    data = f.read()
    soup = BeautifulSoup(data, 'lxml')
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    dfs = pd.read_html(soup.prettify())
    df = dfs[0]
    dfz=df.rename(columns = {'Event date  Time (local)':'EventDate'}).rename(columns = {'Event name  Venue':'EventName'}).rename(columns = {'Tickets  listed':'AmntTickets'}).rename(columns = {'Price  range':'PriceRange'}).rename(columns = {'Unnamed: 0':'EventID'})
    idlist = []
    for se in soup.find_all('span', id=re.compile(r'min')):
        se = (str(se))
        seeme1 = se.replace('<span id="se-','')
        seeme, sep, tail = seeme1.partition('-')
        idlist.append(seeme)
    dfz["EventID"] = idlist
    print(dfz)

You then get the DataFrame you asked for:

     EventID                   EventDate                                          EventName  AmntTickets              PriceRange
0  103577924  Thu, 10/11/2018  8:20 p.m.  Philadelphia Eagles at New York Giants  MetLif...         6655  $134.50  to  $2,222.50
1  103577925  Thu, 10/11/2018  8:21 p.m.  PARKING PASSES ONLY Philadelphia Eagles at New...          929   $20.39  to  $3,602.50
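
Why the original loop printed two tables with a single ID each: `DataFrame.assign(EventID=p)` with a scalar `p` broadcasts that value to every row, so each pass overwrites the whole column. Assigning the list itself sets one value per row, in order. A minimal sketch on toy data (the frame `toy` only mirrors a few values from the question for illustration; it is not read from the HTML file):

```python
import pandas as pd

# Toy stand-in for dfz; values copied from the question for illustration only.
toy = pd.DataFrame({"EventID": [None, None], "AmntTickets": [6655, 929]})
idlist = ["103577924", "103577925"]

# A scalar is broadcast to every row, which is what the loop in the question did.
broadcast = toy.assign(EventID=idlist[0])
print(broadcast["EventID"].tolist())

# A list of matching length is assigned element by element, in order.
toy["EventID"] = idlist
print(toy["EventID"].tolist())

# A list of a different length is rejected with ValueError.
try:
    toy["EventID"] = idlist + ["extra-id"]
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```

The last part is why the direct assignment only applies when both lengths match.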

If the DataFrame dfz and the list idlist differ in length, you can use the code below instead. It fills EventID row by row and skips positions that have no matching ID.

import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml.html as lh
import pprint
import re

with open("testfile.html") as f:
    data = f.read()
    soup = BeautifulSoup(data, 'lxml')
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    dfs = pd.read_html(soup.prettify())
    df = dfs[0]
    dfz=df.rename(columns = {'Event date  Time (local)':'EventDate'}).rename(columns = {'Event name  Venue':'EventName'}).rename(columns = {'Tickets  listed':'AmntTickets'}).rename(columns = {'Price  range':'PriceRange'}).rename(columns = {'Unnamed: 0':'EventID'})
    idlist = []
    for se in soup.find_all('span', id=re.compile(r'min')):
        se = (str(se))
        seeme1 = se.replace('<span id="se-','')
        seeme, sep, tail = seeme1.partition('-')
        idlist.append(seeme)

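    # Cast to object first so string IDs can be written into the column.
    dfz["EventID"] = dfz["EventID"].astype(object)
    for pos in range(min(len(dfz), len(idlist))):
        # Write by position with a single indexer to avoid chained assignment.
        dfz.iloc[pos, dfz.columns.get_loc("EventID")] = idlist[pos]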
    print(dfz)
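
A loop-free way to handle the unequal-length case might look like the sketch below. This is an alternative under stated assumptions, not the method used above: the helper name `fill_ids` is hypothetical, rows without a matching ID are assumed to get NaN, and surplus IDs are assumed to be dropped.

```python
import pandas as pd

def fill_ids(frame, ids, column="EventID"):
    """Return a copy of frame with `column` filled from ids by position.

    Hypothetical helper: rows beyond len(ids) get NaN, surplus ids are ignored.
    """
    out = frame.copy()
    # reindex pads with NaN or truncates so there is one value per row.
    padded = pd.Series(ids, dtype="object").reindex(range(len(out)))
    # to_numpy() drops the Series index so the assignment is purely positional.
    out[column] = padded.to_numpy()
    return out

toy = pd.DataFrame({"AmntTickets": [6655, 929, 10]})
short = fill_ids(toy, ["103577924", "103577925"])  # fewer ids than rows
long = fill_ids(toy, ["1", "2", "3", "4"])         # more ids than rows
print(short)
print(long)
```

Returning a copy leaves the original frame untouched, which makes it easier to compare the result against the input while debugging.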