Regular expression to extract everything inside the article tag
1 vote
/ March 29, 2020

I am trying to get all the content of the article hosted at: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01975-8 I found that the information is inside the tag

<article><div...><..> information.... <></article>

I am trying something like this:

art_sections<-regexpr("<article (.*)?>(.[0-9]*)</article>",thepage)

but I cannot get the information out.

Please let me know if you know how I can solve this.

Answers [ 2 ]

0 votes
/ March 29, 2020

Try extracting all the text (text only) from the article using the rvest package. Note, however, that all HTML tags (including links, images, etc.) are removed.

# install.packages("rvest")

library(rvest)
url <- "https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01975-8"
article <- url %>% 
  read_html %>%
  html_node(css = 'article') %>%
  html_text
article

# Method\n    \n        \n            \n                Open Access\n            \n        \n    \n    \n\n                            Published: 19 March 2020\n                        VALOR2: characterization of large-scale structural variants using linked-reads\n                        Fatih Karaoğlanoğlu1 na1, Camir Ricketts2,4, Ezgi Ebren1, Marzieh Eslami Rasekh3, Iman Hajirasouliha4,5 & Can Alkan1,6 \n                            \n    Genome Biology\n\n                            volume 21, Article number: 72 (2020)\n            Cite this article\n                        \n                        \n    \n        \n            \n                        576 Accesses\n                    \n                \n                \n                    \n                        1 Citations\n                    \n                \n                \n                    \n                        \n                            12 Altmetric\n                        \n                    \n                \n                \n                    Metrics details\n                \n            \n    \n\n                        \n                        \n                        \n                    \n\n                    AbstractMost existing methods for structural variant detection focus on discovery and genotyping of deletions, insertions, and mobile elements. Detection of balanced structural variants with no gain or loss of genomic segments, for example, inversions and translocations, is a particularly challenging task. Furthermore, there are very few algorithms to predict the insertion locus of large interspersed segmental duplications and characterize translocations. Here, we propose novel algorithms to characterize large interspersed segmental duplications, inversions, deletions, and translocations using linked-read sequencing data. 
We redesign our earlier algorithm, VALOR, and implement our new algorithms in a new software package, called VALOR2.BackgroundAlterations of DNA content and organization larger than 50 bp, commonly referred to as genomic structural variations (SVs) [1], are among the major drivers of evolution [2, 3] and diseases of genomic origin [4]. Despite decades of research, they remain difficult to accurately characterize contributing to our lack of full understanding of the etiology of complex diseases, termed missing heritability [5].High-throughput sequencing
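
If the tags matter, the same parsed node can also be serialised back to HTML instead of being flattened to text. The following is only a minimal sketch: it parses a small, made-up inline page (the `page` string is an assumption, not the real article) so that it runs without network access.

```r
library(rvest)

# Hypothetical inline page standing in for the downloaded article.
page <- read_html(
  "<html><body><nav>Menu</nav>
   <article><h1>Title</h1><p>First <a href='#ref'>linked</a> paragraph.</p></article>
   </body></html>"
)

node <- html_node(page, "article")

# Text only: every tag inside <article> is dropped.
text_only <- html_text(node, trim = TRUE)

# Text of each paragraph separately, which keeps some of the structure.
paragraphs <- html_text(html_nodes(node, "p"))

# Markup preserved: serialise the node back to an HTML string.
with_tags <- as.character(node)
```

Here `text_only` and `paragraphs` contain no markup, while `with_tags` still holds the `<article>` element with its links and other child tags.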

Regex solution for extracting all content between the <article> tags (including the text and the other HTML tags)

html <- paste(readLines(url), collapse = " ")
article <- sub(".*(<article.*?>.*</article>).*", "\\1", html)
article

# <article itemscope itemtype=\"http://schema.org/ScholarlyArticle\" lang=\"en\">                     <div class=\"c-article-header\">                                                   <ul class=\"c-article-identifiers\" data-test=\"article-identifier\">                                  <li class=\"c-article-identifiers__item\" data-test=\"article-category\">Method</li>                           <li class=\"c-article-identifiers__item\">                 <span class=\"c-article-identifiers__open\" data-test=\"open-access\">Open Access</span>             </li>                                                 <li class=\"c-article-identifiers__item\"><a href=\"#article-info\" data-track=\"click\" data-track-action=\"publication date\" data-track-category=\"article body\" data-track-label=\"link\">Published: <time datetime=\"2020-03-19\" itemprop=\"datePublished\">19 March 2020</time></a></li>                         </ul>                          <h1 class=\"c-article-title u-h1\" data-test=\"article-title\" data-article-title=\"\" itemprop=\"name headline\">VALOR2: characterization of large-scale structural variants using linked-reads</h1>                         <ul class=\"c-author-list js-list-authors js-etal-collapsed\" data-etal=\"25\" data-etal-small=\"3\" data-test=\"authors-list\"><li class=\"c-author-list__item\" itemprop=\"author\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\"><span itemprop=\"name\"><a data-test=\"author-name\" data-track=\"click\" data-track-action=\"open author\" data-track-category=\"article body\" data-track-label=\"link\" href=\"#auth-1\">Fatih KaraoÄŸlanoÄŸlu</a></span><sup class=\"u-js-hide\"><a href=\"#Aff1\">1</a><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Bilkent University\" /><meta itemprop=\"address\" content=\"grid.18376.3b, 0000 0001 0723 2427, Department of Computer Engineering, Bilkent 
University, Ankara, 06800, Turkey\" /></span></sup><sup class=\"u-js-hide\">Â <a href=\"#na1\">na1</a></sup>, </li><li class=\"c-author-list__item\" itemprop=\"author\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\"><span itemprop=\"name\"><a data-test=\"author-name\" data-track=\"click\" data-track-action=\"open author\" data-track-category=\"article body\" data-track-label=\"link\" href=\"#auth-2\">Camir Ricketts</a></span><sup class=\"u-js-hide\"><a href=\"#Aff2\">2</a>,<a href=\"#Aff4\">4</a><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Cornell University\" /><meta itemprop=\"address\" content=\"grid.5386.8, 000000041936877X, Tri-Institutional Computational Biology &amp; Medicine Program, Cornell University, 1300 York Ave, New York, 10065, NY, USA\" /></span><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Weill Cornell Medicine\" /><meta itemprop=\"address\" content=\"grid.5386.8, 000000041936877X, Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine, 1300 York Ave, New York, 10065, NY, USA\" /></span></sup>, </li><li class=\"c-author-list__item\" itemprop=\"author\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\"><span itemprop=\"name\"><a data-test=\"author-name\" data-track=\"click\" data-track-action=\"open author\" data-track-category=\"article body\" data-track-label=\"link\" href=\"#auth-3\">Ezgi Ebren</a></span><sup class=\"u-js-hide\"><a href=\"#Aff1\">1</a><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Bilkent University\" /><meta itemprop=\"address\" content=\"grid.18376.3b, 0000 0001 0723 2427, Department of Computer Engineering, 
Bilkent University, Ankara, 06800, Turkey\" /></span></sup>, </li><li class=\"c-author-list__item\" itemprop=\"author\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\"><span itemprop=\"name\"><a data-test=\"author-name\" data-track=\"click\" data-track-action=\"open author\" data-track-category=\"article body\" data-track-label=\"link\" href=\"#auth-4\">Marzieh Eslami Rasekh</a></span><sup class=\"u-js-hide\"><a href=\"#Aff3\">3</a><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Boston University\" /><meta itemprop=\"address\" content=\"grid.189504.1, 0000 0004 1936 7558, Graduate Program in Bioinformatics, Boston University, 24 Cummington Mall, Boston, 02215, MA, USA\" /></span></sup>, </li><li class=\"c-author-list__item\" itemprop=\"author\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\"><span itemprop=\"name\"><a data-test=\"author-name\" data-track=\"click\" data-track-action=\"open author\" data-track-category=\"article body\" data-track-label=\"link\" href=\"#auth-5\" data-corresp-id=\"c1\">Iman Hajirasouliha<svg width=\"16\" height=\"16\" class=\"u-icon\"><use xmlns:xlink=\"http://www.w3.org/1999/xlink\" xlink:href=\"#global-icon-email\"></use></svg></a></span><sup class=\"u-js-hide\"><a href=\"#Aff4\">4</a>,<a href=\"#Aff5\">5</a><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Weill Cornell Medicine\" /><meta itemprop=\"address\" content=\"grid.5386.8, 000000041936877X, Department of Physiology and Biophysics, Institute for Computational Biomedicine, Weill Cornell Medicine, 1300 York Ave, New York, 10065, NY, USA\" /></span><span itemprop=\"affiliation\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Organization\" class=\"u-visually-hidden\"><meta itemprop=\"name\" content=\"Weill Cornell Medicine\" 
/><meta itemprop=\"address\" content=\"grid.5386.8, 000000041936877X, Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, 1300 York Ave, New York, 10065, NY, USA\" /></span></sup> &amp; </li><li class=\"c-author-list__item\" itemprop=\"author\" itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\"><span itemprop=\"name\"><a data-test=\"author-name\" data-track=\"click\" data-track-action=\"open author\" data-track-category=\"article body\" data-track-label=\"link\" href=\"#auth-6\" data-corresp-id=\"c2\">Can Alkan<svg width=\"16\" height=\"16\" class=\"u-icon\"><use xmlns:xlink=\"http://www.w3.org/1999/xlink\" xlink:href=\"#global-icon-email\"></use></svg></a>
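
As for why the pattern in the question finds nothing: it requires a space right after "<article", its body part (.[0-9]*) only accepts one character followed by digits, and regexpr() returns the position of a match rather than the matched text. A minimal sketch of a corrected extraction, run on a made-up one-line string (the `html` value below is an assumption used only for illustration):

```r
# Hypothetical one-line page; a real page needs its lines pasted together first.
html <- "<html><body><article class=\"main\"><p>Some 123 text</p></article><footer>x</footer></body></html>"

# <article[^>]*> matches the opening tag with or without attributes;
# .*? is a lazy match that stops at the first closing </article> tag.
pattern <- "<article[^>]*>.*?</article>"

# regexpr() only reports where the match is; regmatches() extracts the text.
m <- regexpr(pattern, html, perl = TRUE)
article <- regmatches(html, m)
article
```

With perl = TRUE the dot does not match line breaks, so either collapse the lines first (as paste(..., collapse = " ") does above) or prefix the pattern with (?s).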
0 votes
/ March 29, 2020

This is not really a regular expression question but a web scraping one, which is best handled with an R library such as rvest.

Below is some sample code and a few links to get you started:

#Loading the rvest package
library('rvest')
#Specifying the url for desired website to be scraped
url <- 'https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01975-8'
#Reading the HTML code from the website
webpage <- read_html(url)
article_html <- html_nodes(webpage,'article')
#Converting the article node(s) to text
html_text(article_html)

Finally, to clean up the text, take a look at stringr, e.g.:

library(stringr)
str_replace_all(x, "[\r\n]" , "")
...
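
A slightly fuller clean-up sketch, assuming the text looks like the raw html_text() output shown earlier (the sample string `x` is made up for illustration). Replacing line breaks with a space rather than an empty string avoids fusing neighbouring words, and str_squish() then collapses the leftover runs of whitespace.

```r
library(stringr)

# Hypothetical sample resembling the raw html_text() output.
x <- "Method\n    \n        Open Access\n\n    Published: 19 March 2020\n"

# Replace each line break with a space so neighbouring words do not fuse.
no_breaks <- str_replace_all(x, "[\r\n]", " ")

# Collapse runs of whitespace to a single space and trim both ends.
clean <- str_squish(no_breaks)
clean
```

On this sample, `clean` is the single line "Method Open Access Published: 19 March 2020".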