Big problem parsing this "simple" HTML page
2 votes
/ November 28, 2010

I'm trying to parse http://www.google.com/finance?q=INDEXDJX:.DJI and can't get it to work, and I can't figure out why:

symbol_list: ["GOOG" "AAPL" "MSFT" "INDEXDJX:.DJI"]
foreach symbol symbol_list [
    url0: rejoin [http://www.google.com/finance/historical?q= symbol]
    ;stock-data: read/lines url
    dir: make-dir/deep to-rebol-file "askpoweruser/stock-download/google/"
    either none? filename: find symbol ":" [
        ; No ":" in the symbol: download the CSV export.
        filename: symbol
        url: rejoin [url0 "&output=csv"]
        content: read url
        out-string: copy rejoin ["Time;Open;High;Low;Close;Volume" newline]
        reversed-quotes: reverse parse/all content ",^/"

        foreach [v c l h o d] reversed-quotes [
            ; Rows whose date field does not convert (e.g. the header row) are skipped.
            either not (error? try [d: to-date d]) [
                d: rejoin [d/year "-" d/month "-" d/day]
                append out-string rejoin [d ";" o ";" h ";" l ";" c ";" v newline]
            ][]
        ]

        write to-rebol-file rejoin [dir symbol ".csv"] out-string
    ][
        ; Symbol contains ":" (e.g. INDEXDJX:.DJI): read the HTML pages and extract the table.
        filename: next next filename
        out: copy []
        for i 0 1 1 [
            p: i
            url: rejoin [url0 "&start=" (p * 200) "&num=" ((p + 1) * 200)]
            content: read url
            ; Skip past three opening <table ...> tags, then copy up to the next </table>.
            rule: [
                to "<table" thru "<table" to ">" thru ">"
                to "<table" thru "<table" to ">" thru ">"
                to "<table" thru "<table" to ">" thru ">"
                copy quotes to </table> to end
            ]
            parse content rule

            ; left-range is a helper that is not shown in this post.
            parse load/markup quotes [
                some [set tag tag! (probe tag) | set x string! (
                    if (not none? tag) [
                        if ((left-range tag 3) = "<td") [
                            replace/all (replace/all x "^/" "") "," ""
                            append out x
                        ]
                    ]
                )]
            ]
            ;write/lines to-rebol-file rejoin [dir filename "_" p ".html"] quotes
        ]

        write to-rebol-file rejoin [dir filename "_temp" ".txt"] mold out
        remove/part out 2
        out-string: copy rejoin ["Time;Open;High;Low;Close;Volume" newline]
        out: reverse out
        insert/only out "" 1

        foreach [x v c l h o d] out [
            either not (error? try [d: to-date d]) [
                d: rejoin [d/year "-" d/month "-" d/day]
                append out-string rejoin [d ";" o ";" h ";" l ";" c ";" v newline]
                probe d
            ][]
        ]
        write/lines to-rebol-file rejoin [dir filename ".csv"] out-string
    ]
]
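
For reference, here is a minimal sketch of what load/markup hands to the block parse and how the cell text can be picked out of it. It assumes REBOL 2. The `sample` string is a made-up fragment, not real Google output, and `find/match tag "td"` is my stand-in for the `left-range` helper, which is not shown in the post.

```rebol
REBOL [Title: "load/markup sketch"]

; Made-up fragment standing in for the table text copied into `quotes`.
sample: {<tr><td class="lm">Nov 26, 2010
<td class="rgt">11,183.50
</table>}

; load/markup splits the text into a flat block of tag! and string! values.
items: load/markup sample

out: copy []
tag: none
parse items [
    some [
        set tag tag!                    ; remember the most recent tag
        | set x string! (
            ; keep only text that directly follows a <td ...> tag
            if all [tag find/match tag "td"] [
                ; drop newlines and thousands separators, as the post does
                append out replace/all replace/all copy x "^/" "" "," ""
            ]
        )
    ]
]

probe out
```

Note that a tag! value holds the tag text without the angle brackets, which is why the sketch matches on "td" rather than "<td".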


In the end I did it another way (see my own answer below), using parse instead of load/markup. load/markup looks simpler at first glance, but Google's HTML doesn't seem to be very friendly to it, so I changed my mind:

parse quotes [
    some [to "<td" thru "<td" to ">" thru ">" [copy x to "<" | copy x to end] (append out replace/all x "^/" "")]
    to end
]
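
To make the rule easier to follow, here is a small self-contained sketch that runs the same pattern against an inline string. It assumes REBOL 2; `sample` is a made-up fragment, and I use parse/all (the post uses plain parse) so that whitespace handling is explicit.

```rebol
REBOL [Title: "Extract <td> cell text with parse"]

; Made-up fragment standing in for the table text held in `quotes`.
sample: {<tr><td class="lm">Nov 26, 2010
<td class="rgt">11,183.50
<td class="rgt">11,092.00
</table>}

out: copy []
parse/all sample [
    some [
        to "<td" thru "<td" to ">" thru ">"     ; move past the opening <td ...> tag
        [copy x to "<" | copy x to end]         ; cell text runs to the next tag, or to the end
        (append out replace/all x "^/" "")      ; strip the newline before storing the cell
    ]
    to end
]

probe out
```

With this sample, `out` should end up as ["Nov 26, 2010" "11,183.50" "11,092.00"]. One caveat of this pattern in REBOL 2: an empty cell makes `copy x` set `x` to none, and `replace/all` would then fail, so real pages may need a guard.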

Sample output:


1 Answer

2 votes
/ November 28, 2010

In the end I gave up on load/markup and used parse directly; it works now:

symbol_list: ["GOOG" "AAPL" "MSFT" "INDEXDJX:.DJI"]
foreach symbol symbol_list [
    url0: rejoin [http://www.google.com/finance/historical?q= symbol]
    ;stock-data: read/lines url
    dir: make-dir/deep to-rebol-file "askpoweruser/stock-download/google/"
    either none? filename: find symbol ":" [
        ; No ":" in the symbol: download the CSV export.
        filename: symbol
        url: rejoin [url0 "&output=csv"]
        content: read url
        out-string: copy rejoin ["Time;Open;High;Low;Close;Volume" newline]
        reversed-quotes: reverse parse/all content ",^/"

        foreach [v c l h o d] reversed-quotes [
            ; Rows whose date field does not convert (e.g. the header row) are skipped.
            either not (error? try [d: to-date d]) [
                d: rejoin [d/year "-" d/month "-" d/day]
                append out-string rejoin [d ";" o ";" h ";" l ";" c ";" v newline]
            ][]
        ]

        write to-rebol-file rejoin [dir symbol ".csv"] out-string
    ][
        ; Symbol contains ":" (e.g. INDEXDJX:.DJI): read the HTML pages and extract the table.
        filename: next next filename
        out: copy []
        for i 0 1 1 [
            p: i
            url: rejoin [url0 "&start=" (p * 200) "&num=" ((p + 1) * 200)]
            content: read url
            ; Skip past three opening <table ...> tags, then copy up to the next </table>.
            rule: [
                to "<table" thru "<table" to ">" thru ">"
                to "<table" thru "<table" to ">" thru ">"
                to "<table" thru "<table" to ">" thru ">"
                copy quotes to </table> to end
            ]
            parse content rule

            ; Collect the text of every <td> cell, with newlines removed.
            parse quotes [
                some [to "<td" thru "<td" to ">" thru ">" [copy x to "<" | copy x to end] (append out replace/all x "^/" "")]
                to end
            ]
            ;write/lines to-rebol-file rejoin [dir filename "_" p ".html"] quotes
        ]

        write to-rebol-file rejoin [dir filename "_temp" ".txt"] mold out
        ;remove/part out 2
        out-string: copy rejoin ["Time;Open;High;Low;Close;Volume" newline]
        out: reverse out

        foreach [v c l h o d] out [
            ; Split the date cell on spaces and commas, then rebuild it as year-month-day.
            d: parse/all d " ,"
            d: to-date rejoin [d/4 "-" d/1 "-" d/2]
            d: rejoin [d/year "-" d/month "-" d/day]
            append out-string rejoin [d ";" o ";" h ";" l ";" c ";" v newline]
        ]
        write to-rebol-file rejoin [dir filename ".csv"] out-string
    ]
]
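
The date handling in the last loop is the least obvious step, so here is a minimal sketch of it in isolation. It assumes REBOL 2 and a cell of the form "Nov 26, 2010", which is what the d/4, d/1, d/2 indexing implies; `cell`, `parts`, `date` and `iso` are names made up for this illustration.

```rebol
REBOL [Title: "Normalise a date cell"]

; Made-up cell text in the "Mon D, YYYY" form.
cell: "Nov 26, 2010"

; Splitting on both space and comma leaves an empty string between
; the comma and the following space, so the year lands in slot 4.
parts: parse/all cell " ,"

; "2010-Nov-26" is a date format that to-date accepts.
date: to-date rejoin [parts/4 "-" parts/1 "-" parts/2]

; Rebuild as numeric year-month-day.
iso: rejoin [date/year "-" date/month "-" date/day]
print iso
```

This should print 2010-11-26. Month and day are not zero-padded, so a date such as November 5 would come out as 2010-11-5.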
