We launched new forums in March 2019—join us there. In a hurry for help with your website? Get Help Now!
    • 50230
    • 2 Posts
    My site will have genealogical records. Old birh, death, marriage records transcribed from various sources. My goal is to have them display in various ways, the most important being a sortable table and I need advanced search capabilities as in the ability to search one or more fields/columns. The records are fairly small and simple. Surname, Given name, Date, Place etc. 10-20 fields. They are getting transcribed into excel so the universal output from there would be csv. There's a new standard being adopted for genealogical data and it's xml so at some point I'll be converting csv to xml. The standard is for a complete family tree but individuals/person is a part of that. Here's a sample of what a record would look like.
    <person id="BBB-BBBB">
            <source description="#EEE-EEEE"/>
            <gender type="http://gedcomx.org/Male"/>
            <name id="789">
                <nameForm>
                    <fullText>George Washington</fullText>
                    <part type="http://gedcomx.org/Given" value="George"/>
                    <part type="http://gedcomx.org/Surname" value="Washington"/>
                </nameForm>
            </name>
            <fact type="http://gedcomx.org/Birth" id="123">
                <date>
                    <original>February 22, 1732</original>
                    <formal>+1732-02-22</formal>
                </date>
                <place description="#888">
                    <original>pope's creek, westmoreland, virginia, united states</original>
                </place>
            </fact>
            <fact type="http://gedcomx.org/Death" id="456">
                <date>
                    <original>December 14, 1799</original>
                    <formal>+1799-12-14T22:00:00</formal>
                </date>
                <place description="#999">
                    <original>mount vernon, fairfax county, virginia, united states</original>
                </place>
            </fact>
        </person>

    So my question at last is: Would it be best to have that xml go into a database(MySQL) or could I have the xml docs on the server and work with them as they are? The issues are display in sortable tables on the site and be searchable by field. There will be tens or hundreds of thousands of these eventually. They can/would be categorized either by State or County. There's 3143 counties in the U.S. I would think it would be nice to leave them as xml since it is or will be the native format. It seems like advsearch, being based on solr would be capable of dealing with xml. I'm not a coder but I'm learning and I do realize it will take quite a bit no matter how I go. Serializing, indexing, schemas etc. Just looking for some ideas as to what direction to take so I don't spend months heading down the wrong path.
    Thanks, John

    PS, the above record is George Washington of course and there would be one for Martha and then one for relation which links them together and more relations for parent/child. It is possible that someday I would be using the relation records and at that point the whole thing becomes relational and that's what the entire spec/standard is for.
      • 44580
      • 189 Posts
      Based on "There will be tens or hundreds of thousands of these eventually" I would strongly suggest storing in a database. However the gedcomx xml schema looks to be quite complex and not necessarily all that "relational" so defining a suitable MySQL schema for your needs will be critical. This looks like it might be a good place to start: http://cgit.drupalcode.org/family/tree/schema.mysql?id=40234abb0fbacd0a5871477995a2af0a7d226854 (ignore that it is for Drupal, it's the schema that's important).

      Once you have your database, it should not be difficult to develop an ETL (Extract, Transform, Load) script to load your source data (be it XML or CSV) into the database.

      Good luck. It's a big job whichever way you look at it...
        • 3749
        • 24,544 Posts
        The advantage of XML is that a record can have a variable number of fields. You could just store it as XML or compressed-XML.

        It would be tempting to just store the XML in the content field of a resource (with the person's name as the pagetitle), then write a function that converts it for display.

        The down side of that would be that there would be no way to do an efficient search.

        With that many records coming down the road, I think you're going to want to design a complex relational DB and pull the data out into fields. I would put it in the DB.
          Did I help you? Buy me a beer
          Get my Book: MODX:The Official Guide
          MODX info for everyone: http://bobsguides.com/modx.html
          My MODX Extras
          Bob's Guides is now hosted at A2 MODX Hosting
          • 50230
          • 2 Posts
          I probably mispoke when I said there would be a great quantity of individual records. That's the way I've been thinking of it as that's how it's being transcribed and at some point displayed but in reality it could be in one big file as the gedcom spec is or in my case it could be in 51 files, one for each State plus DC or one for each County for a total of 3143 files. The sample I gave is actually two records. One for birth and one for death. I converted our tree to the new gedcomx and with 2500 individuals it ended up 4mb in size so I can't see doing one file. I think by State would be best. The site nav would be broken down by State at least anyway. Xml is actually used for huge datasets/datasources in the business world. Xquery is a querying language for it and similar to sql query language. Xslt can be used to transform xml into xhtml. It may be a matter of storing xml in a database as native xml databases are a whole technology and pretty much all run on java so jdk would have to be installed on a server for that. I'm not wanting to get into all that. I'm hoping Jason Coward, being a data guru will see this thread. I just see it as advantageous to store the data as gedcom xml to make it compatible with other genealogy solutions as they come available. Familysearch.org is already implementing it but I have no idea as to how they're going about it. I'm not a developer so I only understand about half of what I've read on xml, xslt, xpath, xquery and some others.
          Yes, it is going to be a big project no matter what. My poor sister, the genealogist will be doing all the transcribing/typing. She's got hundreds of books filled with these records.
            • 3749
            • 24,544 Posts
            Whatever format you keep the data in, you could probably export it to many other formats without too much trouble. There's an interesting comment about the gedcom spec here: http://www.familyhistorysa.info/dbDesign.html.

            My concern would be that the XML data would not lend itself to rapid searches, especially if it's in multiple files, unless it is indexed. There may be a tool that indexes it, but so far I haven't seen any reference to one. But then, maybe you don't need rapid searches.

            I wish you luck, my family tree on my father's side is a dead end about 3 generations back. No one in the family has gotten past that and I spent a couple days working on it in the Library of Congress and ended up at the same dead end.
              Did I help you? Buy me a beer
              Get my Book: MODX:The Official Guide
              MODX info for everyone: http://bobsguides.com/modx.html
              My MODX Extras
              Bob's Guides is now hosted at A2 MODX Hosting