What is My Encoding?

December 13th, 2009

I’ve set up a neat little one-page website using the Bottle micro-framework. The general idea is that you paste in a URL or a string, and the site guesses which encoding it uses. The heavy lifting is done by Mark Pilgrim’s Universal Encoding Detector.
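To give a feel for the "guess the encoding" step, here is a minimal sketch. It assumes the Universal Encoding Detector is installed as the chardet package; the function names guess_encoding and guess_by_trial are made up for this illustration and are not the site's actual code. If chardet is missing, it falls back to a crude stdlib-only trial decode, which is a much simpler technique than what chardet does.

```python
def guess_by_trial(data, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate encoding that decodes `data` cleanly.

    Stdlib-only fallback for illustration; this is NOT how chardet works.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def guess_encoding(data):
    """Guess the encoding of a byte string, preferring chardet if installed."""
    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        return guess_by_trial(data)
    # chardet.detect returns a dict with "encoding" and "confidence" keys.
    return chardet.detect(data)["encoding"]
```

Note that latin-1 accepts any byte sequence, so it has to come last in the trial list, and short inputs are hard to guess reliably with either approach.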

The impetus for this was that I had strings stored in a latin-1 database but had no clue what they meant or what encoding they were in. In the end, I resolved my particular issue by searching for the character codes and finding them in various PHP sanitization scripts.

This was a great experiment with the whole micro-framework idea. Overall, the process was quite nice. There was no database integration to worry about; the app only had to serve a page through the template layer. In the end, I used Bottle’s built-in templating language, but if this site were any more than 2 pages, I would definitely have pulled in Jinja2 to replace it. I would have chosen Jinja2 because it’s designed to be a stand-alone templating language, and I’m still interested in what separates it from Django’s own templating engine (aside from speed).

One interesting issue I ran into while developing this: when I pasted in a string containing escaped bytes, e.g. “Caf\xc3\xa9”, the backslashes came through escaped, so the app saw “Caf\\xc3\\xa9”. I tried a few things that didn’t work, but with help from theatrus (I think) in #python on freenode, we found that decoding the string with the “string_escape” codec returned the unescaped string.
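The “string_escape” codec only exists in Python 2. For anyone hitting the same problem on Python 3, here is a rough equivalent using the “unicode_escape” codec. The helper name unescape_pasted is made up for this sketch, and it assumes the pasted text contains only latin-1 characters and \xNN-style escapes.

```python
def unescape_pasted(text):
    """Turn literal backslash escapes in pasted text into real bytes.

    Assumes `text` holds only latin-1 characters and \\xNN-style escapes.
    """
    # The pasted form contains real backslashes, i.e. the bytes b"Caf\\xc3\\xa9".
    raw = text.encode("latin-1")
    # unicode_escape interprets each \xNN as the code point U+00NN ...
    unescaped = raw.decode("unicode_escape")
    # ... and latin-1 maps those code points back to single bytes.
    return unescaped.encode("latin-1")


pasted = r"Caf\xc3\xa9"           # what the form received: 12 characters
data = unescape_pasted(pasted)    # b"Caf\xc3\xa9": 5 bytes
word = data.decode("utf-8")       # "Café"
```

The unescaped bytes are what you would then hand to the encoding detector.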


Comments are disabled. If that bothers you, please contact me on Twitter at @justinlilly and let me know.