Perl regex for Unicode

I work in the main library at BYU doing data manipulation. I recently finished a project in which I went through the authority database looking for every use of a non-Latin, vernacular script. In other words, if the Cyrillic alphabet was used, we wanted to know. The same went for Hebrew, Greek, and so on. My problem was figuring out how to read Unicode input into Perl and then use Perl's Unicode-specific regular expression features to parse the data.

I don’t know if a presentation is really in order on this (I don’t consider myself an expert, nor is there a large body of work to present), but I figured that the information should get out there somehow, especially because there is not an abundance of online resources covering it. The gist of it is that you have to decode the input first, then run your regex against the decoded text, and finally encode the output. I found that information in only one place. So please let me know if this is something of interest, or if I should merely write a little blurb on it elsewhere and we can all move on with our lives.
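
To make the decode, regex, encode order concrete, here is a minimal sketch. It is not the script I used on the authority database. It assumes the input is UTF-8, and the names %script_re, scripts_in, and the sample heading are all made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical table of scripts to look for; add more as needed.
my %script_re = (
    Cyrillic => qr/\p{Cyrillic}/,
    Greek    => qr/\p{Greek}/,
    Hebrew   => qr/\p{Hebrew}/,
);

# Return the names (sorted) of the scripts found in a string.
# The string should already be decoded into Perl characters.
sub scripts_in {
    my ($text) = @_;
    return grep { $text =~ $script_re{$_} } sort keys %script_re;
}

# Made-up sample input: raw UTF-8 bytes, as they would come from a file.
# The bytes after the slash spell a Cyrillic surname.
my $bytes = "Tolstoy, Leo / "
          . "\xD0\xA2\xD0\xBE\xD0\xBB\xD1\x81\xD1\x82\xD0\xBE\xD0\xB9";

# Step 1: decode the bytes into characters.
my $text = decode('UTF-8', $bytes);

# Step 2: run the Unicode-aware regexes on the decoded text.
my @found = scripts_in($text);

# Step 3: encode back to bytes on the way out.
print encode('UTF-8', "$text => @found\n");
```

When reading from a real file, the same decoding can be done by opening the handle with an encoding layer, for example open(my $in, '<:encoding(UTF-8)', $path), and likewise '>:encoding(UTF-8)' for the output handle. If step 1 is skipped, the regexes see individual bytes rather than characters, so the script properties have nothing to match.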

About Colby

BYU IT student. Interested in Perl, regex, and ancient Near Eastern languages (specifically Biblical Hebrew and Aramaic).