How do I parse an XML file into a columnar format?
1 vote
/ October 26, 2019

I want to parse an XML file in chunks so that it doesn't run out of memory, and parse it into a columnar store, i.e. key1: value1, key2: value2, key3: value3, etc., for each row.

Currently I read the file like this:

#include <fstream>
#include <map>
#include <string>
using namespace std;

string parseFieldFromLine(const string &line, const string &key)
{
    // We're looking for a thing that looks like:
    // [key]="[value]"
    // as part of a larger string.
    // We are given [key], and want to return [value].

    // Find the start of the pattern
    string keyPattern = key + "=\"";
    size_t idx = line.find(keyPattern);

    // No match (find() reports this as string::npos)
    if (idx == string::npos)
        return "";

    // Find the closing quote at the end of the pattern
    size_t start = idx + keyPattern.size();
    size_t end = line.find('"', start);

    // Unterminated value: give up rather than read past the end of the line
    if (end == string::npos)
        return "";

    // Extract [value] from the overall string and return it
    // We have (start, end); substr() requires,
    // so we must compute, (start, length).
    return line.substr(start, end - start);
}

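// User is not shown in the question; it is assumed here to be a struct
// with string members Id and DisplayName.
map<string, User> users;

void readUsers(const string &filename)
{
    ifstream fin(filename);
    if (!fin)
        return; // the file could not be opened

    string line;
    while (getline(fin, line))
    {
        // Only <row ...> lines carry user data; skipping the XML declaration
        // and the <users> wrapper avoids storing an entry with an empty Id.
        if (line.find("<row ") == string::npos)
            continue;

        User u;
        u.Id = parseFieldFromLine(line, "Id");
        u.DisplayName = parseFieldFromLine(line, "DisplayName");
        users[u.Id] = u;
    }
}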

As you can see, I call a function that finds a substring within the line. This is buggy in the sense that if the file (or a single line) is malformed, I would get unexpected values, leading to silent failures.
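The substring search can also go wrong on well-formed input: in the posts sample, a search for `DisplayName="` matches inside `LastEditorDisplayName="`, and a missing attribute cannot be told apart from an empty one. Below is a minimal sketch of a stricter approach that walks a row once, collects every attribute into a key-value map, and reports malformed input explicitly. The name `parseAttributes` is hypothetical, only double-quoted values are accepted (as in the sample data), and entity references such as `&lt;` are left undecoded, so this is an illustration and not a substitute for a real XML parser.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not from the question): collect every attribute of
// one element, e.g. <row Id="1" DisplayName="x" />, into `out`.
// Returns false on malformed input instead of silently returning "";
// `out` may be partially filled in that case.
bool parseAttributes(const std::string &line,
                     std::map<std::string, std::string> &out)
{
    out.clear();
    auto isSpace = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };

    std::size_t pos = line.find('<');
    if (pos == std::string::npos)
        return false; // no element on this line

    // Skip '<' and the element name.
    ++pos;
    while (pos < line.size() && !isSpace(line[pos]) &&
           line[pos] != '>' && line[pos] != '/')
        ++pos;

    for (;;)
    {
        while (pos < line.size() && isSpace(line[pos]))
            ++pos;
        if (pos >= line.size())
            return false; // the tag was never closed
        if (line[pos] == '>' || line[pos] == '/')
            return true; // end of the start tag

        // Attribute name: everything up to '='.
        std::size_t nameEnd = pos;
        while (nameEnd < line.size() && line[nameEnd] != '=' &&
               line[nameEnd] != '>' && line[nameEnd] != '"' &&
               !isSpace(line[nameEnd]))
            ++nameEnd;
        if (nameEnd == pos || nameEnd >= line.size() || line[nameEnd] != '=')
            return false; // missing name or '='

        // Attribute value: must be enclosed in double quotes in this sketch.
        std::size_t open = nameEnd + 1;
        if (open >= line.size() || line[open] != '"')
            return false;
        std::size_t close = line.find('"', open + 1);
        if (close == std::string::npos)
            return false; // unterminated value
        assert(close > open); // sanity check: closing quote follows the opening one

        out[line.substr(pos, nameEnd - pos)] =
            line.substr(open + 1, close - open - 1);
        pos = close + 1;
    }
}
```

Because the whole key is compared when the map is queried, `DisplayName` and `LastEditorDisplayName` stay separate entries. Lines that are not rows, such as the XML declaration, should be filtered out before calling the helper.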

I have read about using XML parsers, but being new to C++ I can't determine which one would work best for a key-value format, given my limited knowledge of how to test performance and efficiency. My current input data looks like this:

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="509" CreationDate="2009-04-30T06:49:01.807" Score="13" ViewCount="903" Body="&lt;p&gt;Our nightly full (and periodic differential) backups are becoming quite large, due mostly to the amount of indexes on our tables; roughly half the backup size is comprised of indexes.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;We're using the &lt;strong&gt;Simple&lt;/strong&gt; recovery model for our backups.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Is there any way, through using &lt;code&gt;FileGroups&lt;/code&gt; or some other file-partitioning method, to &lt;strong&gt;exclude&lt;/strong&gt; indexes from the backups?&lt;/p&gt;&#xA;&#xA;&lt;p&gt;It would be nice if this could be extended to full-text catalogs, as well.&lt;/p&gt;&#xA;" OwnerUserId="3" LastEditorUserId="919" LastEditorDisplayName="" LastEditDate="2009-05-04T02:11:16.667" LastActivityDate="2009-05-10T15:22:39.707" Title="How to exclude indexes from backups in SQL Server 2008" Tags="&lt;sql-server&gt;&lt;backup&gt;&lt;sql-server-2008&gt;&lt;indexes&gt;" AnswerCount="3" CommentCount="0" FavoriteCount="3" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="1238" CreationDate="2009-04-30T07:04:18.883" Score="18" ViewCount="1951" Body="&lt;p&gt;We've struggled with the RAID controller in our database server, a &lt;a href=&quot;http://www.pc.ibm.com/europe/server/index.html?nl&amp;amp;cc=nl&quot; rel=&quot;nofollow&quot;&gt;Lenovo ThinkServer&lt;/a&gt; RD120. It is a rebranded Adaptec that Lenovo / IBM dubs the &lt;a href=&quot;http://www.redbooks.ibm.com/abstracts/tips0054.html#ServeRAID-8k&quot; rel=&quot;nofollow&quot;&gt;ServeRAID 8k&lt;/a&gt;.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;We have patched this &lt;a href=&quot;http://www.redbooks.ibm.com/abstracts/tips0054.html#ServeRAID-8k&quot; rel=&quot;nofollow&quot;&gt;ServeRAID 8k&lt;/a&gt; up to the very latest and greatest:&lt;/p&gt;&#xA;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;RAID bios version&lt;/li&gt;&#xA;&lt;li&gt;RAID backplane bios version&lt;/li&gt;&#xA;&lt;li&gt;Windows Server 2008 driver&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&#xA;&lt;p&gt;This RAID controller has had multiple critical BIOS updates even in the short 4 month time we've owned it, and the &lt;a href=&quot;ftp://ftp.software.ibm.com/systems/support/system%5Fx/ibm%5Ffw%5Faacraid%5F5.2.0-15427%5Fanyos%5F32-64.chg&quot; rel=&quot;nofollow&quot;&gt;change history&lt;/a&gt; is just.. well, scary. &lt;/p&gt;&#xA;&#xA;&lt;p&gt;We've tried both write-back and write-through strategies on the logical RAID drives. &lt;strong&gt;We still get intermittent I/O errors under heavy disk activity.&lt;/strong&gt; They are not common, but serious when they happen, as they cause SQL Server 2008 I/O timeouts and sometimes failure of SQL connection pools.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;We were at the end of our rope troubleshooting this problem. Short of hardcore stuff like replacing the entire server, or replacing the RAID hardware, we were getting desperate.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;When I first got the server, I had a problem where drive bay #6 wasn't recognized. 
Switching out hard drives to a different brand, strangely, fixed this -- and updating the RAID BIOS (for the first of many times) fixed it permanently, so I was able to use the original &quot;incompatible&quot; drive in bay 6. On a hunch, I began to assume that &lt;a href=&quot;http://www.newegg.com/Product/Product.aspx?Item=N82E16822136143&quot; rel=&quot;nofollow&quot;&gt;the Western Digital SATA hard drives&lt;/a&gt; I chose  were somehow incompatible with the ServeRAID 8k controller.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Buying 6 new hard drives was one of the cheaper options on the table, so I went for &lt;a href=&quot;http://www.newegg.com/Product/Product.aspx?Item=N82E16822145215&quot; rel=&quot;nofollow&quot;&gt;6 Hitachi (aka IBM, aka Lenovo) hard drives&lt;/a&gt; under the theory that an IBM/Lenovo RAID controller is more likely to work with the drives it's typically sold with.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Looks like that hunch paid off -- we've been through three of our heaviest load days (mon,tue,wed) without a single I/O error of any kind. Prior to this we regularly had at least one I/O &quot;event&quot; in this time frame. &lt;strong&gt;It sure looks like switching brands of hard drive has fixed our intermittent RAID I/O problems!&lt;/strong&gt;&lt;/p&gt;&#xA;&#xA;&lt;p&gt;While I understand that IBM/Lenovo probably tests their RAID controller exclusively with their own brand of hard drives, I'm disturbed that a RAID controller would have such subtle I/O problems with particular brands of hard drives.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;So my question is, &lt;strong&gt;is this sort of SATA drive incompatibility common with RAID controllers?&lt;/strong&gt; Are there some brands of drives that work better than others, or are &quot;validated&quot; against particular RAID controller? 
I had sort of assumed that all commodity SATA hard drives were alike and would work reasonably well in any given RAID controller (of sufficient quality).&lt;/p&gt;&#xA;" OwnerUserId="1" LastActivityDate="2011-03-08T08:18:15.380" Title="Do RAID controllers commonly have SATA drive brand compatibility issues?" Tags="&lt;raid&gt;&lt;ibm&gt;&lt;lenovo&gt;&lt;serveraid8k&gt;" AnswerCount="8" FavoriteCount="2" />
  <row Id="3" PostTypeId="1" AcceptedAnswerId="104" CreationDate="2009-04-30T07:48:06.750" Score="26" ViewCount="692" Body="&lt;ul&gt;&#xA;&lt;li&gt;How do you keep your servers up to date?&lt;/li&gt;&#xA;&lt;li&gt;When using a package manager like &lt;a href=&quot;http://wiki.debian.org/Aptitude&quot; rel=&quot;nofollow&quot;&gt;Aptitude&lt;/a&gt;, do you keep an upgrade / install history, and if so, how do you do it?&lt;/li&gt;&#xA;&lt;li&gt;When installing or upgrading packages on multiple servers, are there any ways to speed the process up as much as possible?&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;" OwnerUserId="22" LastEditorUserId="22" LastEditorDisplayName="" LastEditDate="2009-04-30T08:05:02.217" LastActivityDate="2009-06-05T04:01:09.423" Title="Best practices for keeping UNIX packages up to date?" Tags="&lt;unix&gt;&lt;package-management&gt;&lt;server-management&gt;" AnswerCount="11" FavoriteCount="14" />
  <row Id="4" PostTypeId="2" ParentId="3" CreationDate="2009-04-30T07:49:58.027" Score="10" ViewCount="" Body="&lt;p&gt;Regarding your third question: I always run a local repository. Even if it's only for one machine, it saves time in case I need to reinstall (I generally use something like aptitude autoclean), and for two machines, it almost always pays off.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;For the clusters I admin, I don't generally keep explicit logs: I let the package manager do it for me. However, for those machines (as opposed to desktops), I don't use automatic installations, so I do have my notes about what I intended to install to all machines.&lt;/p&gt;&#xA;" OwnerUserId="28" LastActivityDate="2009-04-30T07:49:58.027" CommentCount="1" />
  <row Id="5" PostTypeId="2" ParentId="2" CreationDate="2009-04-30T07:56:20.070" Score="4" ViewCount="" Body="&lt;p&gt;I don't think it's common per se. However, as soon as you start using enterprise storage controllers, whether that be SAN's or standalone RAID controllers, you'll generally want to adhere to their compatibility list rather closely.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;You may be able to save some bucks on the sticker price by buying a cheap range of disks, but that's probably one of the last areas I'd want to save money on - given the importance of data in most scenarios.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;In other words, explicit incompatibility is very uncommon, but explicit compatibility adherence is recommendable.&lt;/p&gt;&#xA;" OwnerUserId="24" LastActivityDate="2009-04-30T07:56:20.070" />
  <row Id="6" PostTypeId="1" AcceptedAnswerId="537" CreationDate="2009-04-30T07:57:06.247" Score="8" ViewCount="2648" Body="&lt;p&gt;Our database currently only has one FileGroup, PRIMARY, which contains roughly 8GB of data (table rows, indexes, full-text catalog).&lt;/p&gt;&#xA;&#xA;&lt;p&gt;When is a good time to split this into secondary data files?  What are some criteria that I should be aware of?&lt;/p&gt;&#xA;" OwnerUserId="3" LastActivityDate="2009-07-08T07:23:49.527" Title="In SQL Server, when should you split your PRIMARY Data FileGroup into secondary data files?" Tags="&lt;sql-server&gt;&lt;files&gt;&lt;filegroups&gt;" AnswerCount="3" FavoriteCount="1" />
  <row Id="7" PostTypeId="1" AcceptedAnswerId="17" CreationDate="2009-04-30T07:57:09.117" Score="12" ViewCount="529" Body="&lt;p&gt;What enterprise virus-scanning systems do you recommend?&lt;/p&gt;&#xA;" OwnerUserId="32" LastActivityDate="2009-04-30T11:51:09.290" Title="What is the best enterprise virus-scanning system?" Tags="&lt;antivirus&gt;" AnswerCount="8" CommentCount="3" FavoriteCount="2" />
  <row Id="8" PostTypeId="2" ParentId="3" CreationDate="2009-04-30T07:57:15.653" Score="0" ViewCount="" Body="&lt;p&gt;You can have a local repository and configure all servers to point to it for updates. Not only you get speed of local downloads, you also get to control which official updates you want installed on your infrastructure in order to prevent any compatibility issues.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;On the Windows side of things, I've used &lt;a href=&quot;http://technet.microsoft.com/en-us/wsus/default.aspx&quot; rel=&quot;nofollow&quot;&gt;Windows Server Update Services&lt;/a&gt; with very satisfying results.&lt;/p&gt;&#xA;" OwnerUserId="36" LastActivityDate="2009-04-30T07:57:15.653" />

Another file:

<?xml version="1.0" encoding="utf-8"?>
<users>
  <row Id="1" Reputation="4220" CreationDate="2009-04-30T07:08:27.067" DisplayName="Jeff Atwood" EmailHash="51d623f33f8b83095db84ff35e15dbe8" LastAccessDate="2011-09-03T13:30:29.990" WebsiteUrl="http://www.codinghorror.com/blog/" Location="El Cerrito, CA" Age="40" AboutMe="&lt;p&gt;&lt;img src=&quot;http://img377.imageshack.us/img377/4074/wargames1xr6.jpg&quot; width=&quot;250&quot;&gt;&lt;/p&gt;&#xA;&#xA;&lt;p&gt;&lt;a href=&quot;http://www.codinghorror.com/blog/archives/001169.html&quot; rel=&quot;nofollow&quot;&gt;Stack Overflow Valued Associate #00001&lt;/a&gt;&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Wondering how our software development process works? &lt;a href=&quot;http://www.youtube.com/watch?v=08xQLGWTSag&quot; rel=&quot;nofollow&quot;&gt;Take a look!&lt;/a&gt;&lt;/p&gt;&#xA;" Views="3562" UpVotes="1995" DownVotes="31" />
  <row Id="2" Reputation="697" CreationDate="2009-04-30T07:08:27.067" DisplayName="Geoff Dalgas" EmailHash="b437f461b3fd27387c5d8ab47a293d35" LastAccessDate="2011-09-05T22:14:06.527" WebsiteUrl="http://stackoverflow.com" Location="Corvallis, OR" Age="34" AboutMe="&lt;p&gt;Developer on the StackOverflow team.  Find me on&lt;/p&gt;&#xA;&#xA;&lt;p&gt;&lt;a href=&quot;http://www.twitter.com/SuperDalgas&quot; rel=&quot;nofollow&quot;&gt;Twitter&lt;/a&gt;&#xA;&lt;br&gt;&lt;br&gt;&#xA;&lt;a href=&quot;http://blog.stackoverflow.com/2009/05/welcome-stack-overflow-valued-associate-00003/&quot; rel=&quot;nofollow&quot;&gt;Stack Overflow Valued Associate #00003&lt;/a&gt; &lt;/p&gt;&#xA;" Views="291" UpVotes="46" DownVotes="2" />
  <row Id="3" Reputation="259" CreationDate="2009-04-30T07:08:27.067" DisplayName="Jarrod Dixon" EmailHash="2dfa19bf5dc5826c1fe54c2c049a1ff1" LastAccessDate="2011-09-01T20:43:27.743" WebsiteUrl="http://stackoverflow.com" Location="New York, NY" Age="32" AboutMe="&lt;p&gt;&lt;a href=&quot;http://blog.stackoverflow.com/2009/01/welcome-stack-overflow-valued-associate-00002/&quot; rel=&quot;nofollow&quot;&gt;Developer on the Stack Overflow team&lt;/a&gt;.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Was dubbed &lt;strong&gt;SALTY SAILOR&lt;/strong&gt; by Jeff Atwood, as filth and flarn would oft-times fly when dealing with a particularly nasty bug!&lt;/p&gt;&#xA;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Twitter me: &lt;a href=&quot;http://twitter.com/jarrod_dixon&quot; rel=&quot;nofollow&quot;&gt;jarrod_dixon&lt;/a&gt;&lt;/li&gt;&#xA;&lt;li&gt;Email me: jarrod.m.dixon@gmail.com&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;" Views="210" UpVotes="259" DownVotes="4" />

1 Answer

0 votes
/ October 26, 2019

I guess you are looking for a SAX parser, which does not read the whole document at once (as a DOM parser would) but lets you define callbacks for certain events (e.g. the start of a new XML element). Because it lets you process the document element by element, it sounds like a perfect match for you.

I must admit that I have never done XML parsing in C++, but the following libraries sound like a good fit for your task:

  • expat
  • продолжение max
  • xerces: It used to be the de facto standard in Java in the early 2000s but has since been overtaken by other libraries. The C++ implementation is still maintained, though.
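To illustrate the callback model described above without committing to any of these libraries, here is a minimal hand-rolled sketch. It reads one tag at a time from a stream and passes it to a handler, so memory use is bounded by the largest single tag and not by the file size. The names `streamElements`, `StartHandler` and `countRows` are hypothetical, and the reader is deliberately not a conforming XML parser (no comments, CDATA, entity decoding or well-formedness checks); a real SAX library does this tokenizing for you and reports errors properly.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <string>

// Called once per start tag with the element name and the raw tag text.
using StartHandler =
    std::function<void(const std::string &name, const std::string &rawTag)>;

// Hypothetical event-driven reader illustrating the SAX idea.
// Returns false if the input ends in the middle of a tag.
bool streamElements(std::istream &in, const StartHandler &onStart)
{
    assert(onStart && "a handler must be supplied");
    char c;
    while (in.get(c))
    {
        if (c != '<')
            continue; // text between tags is ignored in this sketch

        std::string tag;
        char quote = 0; // non-zero while inside a quoted attribute value
        bool closed = false;
        while (in.get(c))
        {
            if (quote != 0)
            {
                if (c == quote)
                    quote = 0;
            }
            else if (c == '"' || c == '\'')
                quote = c;
            else if (c == '>')
            {
                closed = true;
                break;
            }
            tag += c;
        }
        if (!closed)
            return false; // truncated input

        // Skip declarations (<?...?>), <!...> constructs and end tags.
        if (tag.empty() || tag[0] == '?' || tag[0] == '!' || tag[0] == '/')
            continue;

        std::size_t nameEnd = tag.find_first_of(" \t\r\n/");
        onStart(tag.substr(0, nameEnd), tag);
    }
    return true;
}

// Usage example: count the <row> elements of an in-memory document.
std::size_t countRows(const std::string &xml)
{
    std::istringstream in(xml);
    std::size_t rows = 0;
    streamElements(in, [&rows](const std::string &name, const std::string &) {
        if (name == "row")
            ++rows;
    });
    return rows;
}
```

Passing a `std::ifstream` in place of the `std::istringstream` gives the same behaviour on a file. Because the reader tracks quotes, a row whose attribute value spans several lines still arrives as a single event, which a `getline`-based loop would split.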
...