Очистка спецификаций c tag html с использованием BeautifulSoup и selenium - PullRequest
3 голосов
/ 16 июня 2020

Я обязан очистить определенные теги данных в html,

, которые я складываю здесь. это мой код:

Я хочу получить такой результат: http://www.sharecsv.com/s/9fd1d7ae78a6a9ffdc06f0b2dd33e9c7/Doaj.csv

Помогите мне, пожалуйста

1 Ответ

2 голосов
/ 18 июня 2020

Возможно, вам придется немного подправить, но это поможет вам:

import os
import requests
import re
from bs4 import BeautifulSoup
import json
import shutil
import pandas as pd

url = 'https://doaj.org/public-data-dump'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

options = {1:'Journal',2:'Articles'}
choice = int(input('What would you like to search for?\n1: Journals\n2: Articles\nEnter 1 or 2 -> '))

link = 'https://doaj.org' + soup.find('a', text=re.compile(r'Download all %s' %options[choice]))['href']

def download_url(keyword, link, save_path, chunk_size=128):
    output = []
    before = os.listdir(save_path)
    r = requests.get(link, stream=True)
    filename = save_path + '/temp.tar.gz'
    with open(filename, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)

    extract_path = save_path
    shutil.unpack_archive(filename, extract_path)

    os.remove(filename)
    after = os.listdir(save_path)

    newFolder = save_path + '/' + list(set(after) - set(before))[0]
    jsonFiles = os.listdir(newFolder)

    for idx, file in enumerate(jsonFiles):
        print ('Filtering for keyword "%s": File %s of %s' %(keyword,idx+1, len(jsonFiles)))
        with open(newFolder + '/' + file) as json_file:
            jsonData = json.load(json_file) 

        for each in jsonData:
            if 'keywords' in each['bibjson']:
                keywordsList = each['bibjson']['keywords']
                if any(keyword in x for x in keywordsList):
                    output.append(each)

    shutil.rmtree(newFolder)

    return output






save_path = os.getcwd()
keyword = 'covid'
jsonData = download_url(keyword, link, save_path)


titleList = []
authorList = []
yearList = []
linkList = []

for each in jsonData:
    w=1
    try:
        title = each['bibjson']['title']
        titleList.append(title)
    except:
        titleList.append('')

    try:
        authors = ', '.join([ x['name'] for x in each['bibjson']['author'] ])
        authorList.append(authors)
    except:
        authorList.append('')

    try:
        link = each['bibjson']['link'][0]['url']
        linkList.append(link)
    except:
        linkList.append('')

    try:
        year = each['bibjson']['year']
    except:
        year = ''

    try:
        volume = each['bibjson']['journal']['volume']
    except:
        volume = ''

    try:
        number = each['bibjson']['journal']['number']
    except:
        number = ''

    try:
        startPage = each['bibjson']['start_page']
    except:
        startPage = ''

    try:
        endPage = each['bibjson']['end_page']
    except:
        endPage = ''

    yearStr = '%s;%s(%s):%s-%s' %(year, volume, number,startPage, endPage)
    yearList.append(yearStr)


df = pd.DataFrame({'Title':titleList,
                   'Author':authorList,
                   'Year Post':yearList,
                   'Link Full Text':linkList})    

Вывод:

print (df.head(10).to_string())
                                               Title                                             Author             Year Post                                     Link Full Text
0  Alternative Labeling Programs and Purchasing B...  Giovanna Sacchi, Vincenzina Caputo, Rodolfo M....   2015;7(6):7397-7416             http://www.mdpi.com/2071-1050/7/6/7397
1  On a knife’s edge of a COVID-19 pandemic: is c...                                 C. Raina MacIntyre          2020;30(1):-  https://www.phrp.com.au/issues/march-2020-volu...
2  Characteristics of and Public Health Responses...                     Sheng-Qun Deng, Hong-Juan Peng        2020;9(2):575-             https://www.mdpi.com/2077-0383/9/2/575
3  Going viral – Covid-19 impact assessment: A pe...                         Saurabh Bobdey, Sougat Ray       2020;22(1):9-12  http://www.marinemedicalsociety.in/article.asp...
4           Chapter of agroecology put into practice                                   Cláudia de Souza     2014;5(3):126-130  http://periodicos.unb.br/index.php/sust/articl...
5  Outbreak of Novel Coronavirus (SARS-Cov-2): Fi...  Emanuele Amodio, Francesco Vitale, Livia Cimin...         2020;8(1):51-              https://www.mdpi.com/2227-9032/8/1/51
6  On the Coronavirus (COVID-19) Outbreak and the...                       Zaheer Allam, David S. Jones         2020;8(1):46-              https://www.mdpi.com/2227-9032/8/1/46
7  What to Do When A Patient Infected With COVID-...                         Erdinç Kamer, Tahsin Çolak        2020;30(1):1-8  http://cms.galenos.com.tr/Uploads/Article_3654...
8           COVID-19. Punto de vista del cardiólogo.  Adrian Naranjo Dominguez, Alexander Valdés Martín  2020;26(1):e951-e951  http://www.revcardiologia.sld.cu/index.php/rev...
9  Insights into the Recent 2019 Novel Coronaviru...  Hossam  M. Ashour, Walid  F. Elkhatib, Md.  Ma...        2020;9(3):186-             https://www.mdpi.com/2076-0817/9/3/186
...