Tuesday, August 18, 2015

Heuristic methods and ad hoc tools for little (big) data

I started off my career in software development writing mailing list data entry and management tools for a small marketing/communications firm in Tallahassee, Florida.  I never realized that the skills I learned back then to clean up a mailing list would be so useful later in my career.  Our process for cleaning up the lists was to use various ad hoc tools (sed, grep, sort, cut) and pattern matching to parse and organize the lists in various ways.  Then we'd look for patterns of things we could clean up, mostly ways to deduplicate the list or correct systemic data entry problems.  Later I would work alongside some folks who took documents (very large files) from electronically formatted reference works marked up with troff or nroff macros, and turned them into SGML (the predecessor to XML, for those of you who came along after my generation) using very similar techniques.

I would later see similar techniques used in natural language processing, patient matching (very similar to mailing list deduplication, almost identical in fact), and a variety of other areas.

Last night I used one of these techniques again: find an outlier, determine its cause, and then systematically look for others like it based on that cause.  Those other cases often don't stand out until you understand the problem.  I find myself amused that techniques I learned doing very simple computer programming, drudgery almost, very early in my career still find their way back into my daily work doin' high-falutin archy-teckture.
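To make the outlier technique concrete, here is a minimal sketch in Python.  The data and the cause are made up for illustration: suppose the outlier is a ZIP code of "2139", and the cause turns out to be a leading zero dropped somewhere along the way.  Once the cause is known, it can be written down as a pattern and used to find the other records with the same problem, which look perfectly ordinary one at a time.  The function name and sample values are assumptions, not anything from the actual data I was working on.

```python
import re

def find_similar(records, cause_pattern):
    """Return every record matching the pattern that describes the cause."""
    pattern = re.compile(cause_pattern)
    return [r for r in records if pattern.search(r)]

# Hypothetical ZIP code column; "2139" was the outlier that got noticed.
zips = ["32301", "2139", "32304", "6510", "30301"]

# Assumed cause: a leading zero was stripped, leaving exactly four digits.
suspects = find_similar(zips, r"^\d{4}$")

# Systematic fix for everything that shares the cause.
repaired = [z.zfill(5) for z in suspects]
```

The point is less the regular expression than the order of work: the pattern only gets written after the cause of one odd record is understood.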

It's such a simple process, really: use a simple and fast tool to fix 80% of the problem, do it again on the remaining mess, do it once more, and then manually review the last 0.8% for anything else.  Five thousand data items is quite a bit to process manually.  In the grander scheme of things it is still little data, but for the person (me) who has to do it, it can feel like big data.  After applying these techniques, though, you can often finish such a problem in much less time than you'd think.
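The arithmetic behind that last 0.8% is just three passes that each leave 20% behind (0.2 x 0.2 x 0.2), which on five thousand items comes to roughly 40 records for manual review.  Here is a rough sketch of the loop in Python, using made-up phone numbers as the messy data.  The fixer functions, their names, and the sample values are all hypothetical; any simple, fast rule that either cleans an item or gives up on it would do.

```python
import re

def ten_digits(raw):
    """Pass 1: accept anything that is ten digits once punctuation is removed."""
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == 10 else None

def leading_one(raw):
    """Pass 2: accept eleven digits with a leading country code of 1."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        return digits[1:]
    return None

def run_passes(items, passes):
    """Apply each fixer in turn to whatever the previous passes left behind."""
    fixed = {}
    remaining = list(items)
    for fixer in passes:
        still_messy = []
        for raw in remaining:
            cleaned = fixer(raw)
            if cleaned is None:
                still_messy.append(raw)   # leave it for the next pass
            else:
                fixed[raw] = cleaned
        remaining = still_messy
    return fixed, remaining

items = ["(850) 555-0101", "850.555.0102", "1-850-555-0103", "call the front desk"]
fixed, remaining = run_passes(items, [ten_digits, leading_one])

# Whatever is still in `remaining` goes to manual review.
leftover_estimate = round(5000 * 0.2 ** 3)
```

Each pass only has to be good enough to shrink the pile; the items no rule can handle are the ones worth a human's attention.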

   Keith
