Tuesday, March 15, 2016

Status Update on my Pubmed for HealthIT Standards Project

So my capstone project has taken several twists and turns since I first conceived of it, but is still a going concern.  I am very nearly at the point of having many IHE profiles loaded into my index, and very quickly thereafter will have a large number of HL7 standards and implementation guides loaded as well.

What have I learned thus far?

  • The lack of a standardized publication format for standards makes it challenging, but not impossible, to build an index of them.
  • You need a deep understanding of the content to parse anything useful from it.  I'm not the only person who could do this, but you'd certainly need someone with both standards expertise and a good deal of experience dealing with structured documentation.  Fortunately, I had a good bit of the latter in my background before I ever got deeply involved in standards.
  • PDF is no longer the bane of standards existence, but there is still enough pain that you need to invest pretty significantly to get anything useful out.  Fortunately, I managed to find Apache's PDFBox in time to rescue me from days of copy and paste (I traded them for days of coding instead, but I can guarantee I won't be sorry).
  • Agile is clearly the way to approach this, even to the point of building out the index content.  I'm starting with the basic stuff, and will support full text searching of titles and abstracts, but will probably take a bit longer to get to some more complex coded detail (such as coding system or actor types).  For that, I'm going to see what looks to be useful as I go.
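
To give a feel for what "full text searching of titles and abstracts" means at its simplest, here is a minimal sketch in Python.  It is not the project's actual code: the record ids, the abstracts, and the function names (tokenize, build_index, search) are all made up for illustration, and a real index would want stemming, ranking, and so on.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each token to the set of record ids whose title or abstract contains it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for token in tokenize(rec["title"] + " " + rec["abstract"]):
            index[token].add(rec_id)
    return index


def search(index, query):
    """Return the sorted ids of records that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return sorted(hits)


# Hypothetical sample records: the ids and abstracts are invented for this sketch.
records = {
    "profile-a": {
        "title": "Patient Identifier Cross-Referencing",
        "abstract": "Links patient identifiers across domains.",
    },
    "profile-b": {
        "title": "Cross-Enterprise Document Sharing",
        "abstract": "Shares clinical documents across enterprises.",
    },
}
index = build_index(records)
```

With these records, search(index, "cross") matches both entries, while search(index, "patient identifiers") matches only the first.  The coded detail (coding systems, actor types) would need additional fields beyond plain tokens.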

I expect in a couple of weeks to have the first build available for public testing, so that people can readily see what is available.  I had hoped to be finished with that at the end of this term (coming very soon), but automating extraction of metadata from the PDF took both longer than I thought, and shorter than I thought.

Originally, I never thought it would be possible for me to do that, but when I looked at copying and pasting data on a dozen or more profiles from a half dozen or more documents, I was daunted by the manual task.  Few would have the patience or experience to do that drudgery.  When I discovered PDFBox, and realized that I could automate much of it, I HAD to do it.  For one, the quality of the indexing would jump by at least an order of magnitude, and the speed by two orders (once I got everything coded up).  There's enough use of PDF in the standards world that this would be a godsend.
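
Once the text is out of the PDF (PDFBox's PDFTextStripper is the usual tool for that step), the rest is pattern matching over lines.  The sketch below is in Python for brevity and only illustrates the idea.  The heading layout it assumes, a section number, a title, and an acronym in parentheses, is hypothetical, as are the sample text and the name extract_profiles; real documents each need their own patterns.

```python
import re

# Assumed heading layout (hypothetical): "<number> <title> (<ACRONYM>)"
HEADING = re.compile(r"^(\d+)\s+(.+?)\s+\(([A-Za-z0-9.-]+)\)\s*$")


def extract_profiles(text):
    """Collect one record per heading, using the lines that follow as its abstract."""
    profiles = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = HEADING.match(line)
        if match:
            # A new heading starts a new record.
            current = {
                "section": match.group(1),
                "title": match.group(2),
                "acronym": match.group(3),
                "abstract": "",
            }
            profiles.append(current)
        elif current is not None and line:
            # Non-blank lines after a heading are appended to its abstract.
            current["abstract"] = (current["abstract"] + " " + line).strip()
    return profiles


# Made-up text standing in for what a PDF text extractor might return.
sample_text = """Front matter that should be ignored
1 Example Profile One (EP1)
This profile does the first thing.
It spans two lines.

2 Example Profile Two (EP2)
This profile does the second thing.
"""
profiles = extract_profiles(sample_text)
```

Text before the first heading is skipped, and wrapped lines are joined into a single abstract, which is most of what the copy and paste would otherwise have done by hand.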

The ideal solution (and the one I had originally envisioned) would be to get SDOs to agree to a standard format, but that's not really something I can fit into a six-month project.  I might get one to move in that direction, but four or more?  Not at the same time.  I have to give them a reason to do so, and thus, I changed my plans a bit.  Let's give people a taste of what such an index can do, and see if they can then be convinced.

So, stay tuned, I should be tossing it out there on the web real soon.
