Convert Unicode Symbols & Punctuation to ASCII using ColdFusion/Java

symbolsToASCII is a ColdFusion UDF (user-defined function) that converts Unicode symbols and punctuation to 7-bit ASCII. I was previously using ConvertSpecialChars from CFLib, but it didn’t map enough characters.

I found documentation on the NIH Lexical Systems Group website describing their approach to “Map Symbols & Punctuation to ASCII”. They state that “converting Unicode punctuation and symbols to ASCII punctuation and symbols is imperative in NLP for preserving the original documents.” Their Java implementation is simply to “perform mapping if the character is in the punctuation & symbols mapping table”.
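As a rough illustration of that table-lookup idea, here is a minimal Java sketch. It is not the NIH code: the class name `SymbolTableSketch`, the method `toAscii`, and the tiny eight-entry table are all assumptions made for the example, and the real mapping table is far larger.

```java
import java.util.HashMap;
import java.util.Map;

public class SymbolTableSketch {
    // Hypothetical, heavily abbreviated mapping table (illustration only).
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u2018', "'");    // left single quotation mark
        TABLE.put('\u2019', "'");    // right single quotation mark
        TABLE.put('\u201C', "\"");   // left double quotation mark
        TABLE.put('\u201D', "\"");   // right double quotation mark
        TABLE.put('\u2013', "-");    // en dash
        TABLE.put('\u2014', "-");    // em dash
        TABLE.put('\u2264', "<=");   // less-than or equal to
        TABLE.put('\u2153', "1/3");  // vulgar fraction one third
    }

    // Replace a character only if it appears in the table; copy it through otherwise.
    static String toAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String mapped = TABLE.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u201CI don\u2019t know\u201D \u2013 x \u2264 \u2153"));
    }
}
```

A single pass with a lookup table handles one-to-many mappings (such as a fraction becoming three characters) without any regular expressions. The UDF in this post reaches a similar result with a series of `replaceAll` calls instead.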

Their approach makes a lot of sense. When I’m performing a search using a SQL query or a Verity collection, the HTML5 input field doesn’t auto-corrupt “dumb quotes” to “smart quotes” like Microsoft Word does. If stored content has characters that are HTML-encoded, wouldn’t extra logic be required to account for potential substitutions containing high ASCII characters as well as ‘, ’, “ and ”?
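The mismatch described above can be shown with a small Java sketch. The class `QuoteSearchSketch` and the helper `foldQuotes` are hypothetical names, and only four quote characters are folded here. The assumption is that the same normalization is applied to both the stored content and the search term before they are compared.

```java
public class QuoteSearchSketch {
    // Fold curly single and double quotes to their ASCII forms (small subset only).
    static String foldQuotes(String s) {
        return s.replaceAll("[\u2018\u2019]", "'")
                .replaceAll("[\u201C\u201D]", "\"");
    }

    public static void main(String[] args) {
        String stored = "Alice said she didn\u2019t know."; // stored with a curly apostrophe
        String query = "didn't";                            // typed with a straight apostrophe

        // A direct comparison misses because the apostrophes differ.
        System.out.println(stored.contains(query));
        // After folding both sides, the search term is found.
        System.out.println(foldQuotes(stored).contains(foldQuotes(query)));
    }
}
```

Normalizing both sides keeps the comparison logic in one place, rather than adding a separate alternative for every quote variant in each query.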



Usage: symbolsToASCII(required string inputString)

<cfset testString = '#CHR(8220)#I don#CHR(8217)#t know what you mean by #CHR(8216)#glory,#CHR(8217)# #CHR(8221)# Alice said.'>

<cfoutput>
<textarea style="width:95%; height:300px;">
Original: #TestString#

symbolsToASCII: #symbolsToASCII(testString)#
</textarea>
</cfoutput>



Try it online at TryCF.com

https://trycf.com/gist/6f35220d47caa7fdbf75eb884ff1cec7



Source code


<cfscript>
/* 20200604 Map Symbols & Punctuation to ASCII
Converting Unicode punctuation and symbols to ASCII punctuation and symbols is imperative in natural language processing (NLP) for preserving the original documents.
Based on mapping from Lexical Systems Group: https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2013/docs/designDoc/UDF/unicode/NormOperations/mapSymbolToAscii.html
Blog: https://dev.to/gamesover/convert-symbols-punctuation-to-ascii-using-coldfusion-java-3l6a
TryCF: https://trycf.com/gist/6f35220d47caa7fdbf75eb884ff1cec7 */
string function symbolsToASCII(required string inputString){
/* Cast to a java.lang.String so that the Java replaceAll(regex, replacement) method is used. */
var TempContent = javacast("string", arguments.inputString);
TempContent = TempContent.replaceAll("[\u00B4\u02B9\u02BC\u02C8\u0301\u2018\u2019\u201B\u2032\u2034\u2037]", chr(39)); /* apostrophe (') */
TempContent = TempContent.replaceAll("[\u00AB\u00BB\u02BA\u030B\u030E\u201C\u201D\u201E\u201F\u2033\u2036\u3003\u301D\u301E]", chr(34)); /* quotation mark (") */
TempContent = TempContent.replaceAll("[\u00AD\u2010\u2011\u2012\u2013\u2014\u2212\u2015]", chr(45)); /* hyphen (-) */
TempContent = TempContent.replaceAll("[\u01C3\u2762]", chr(33)); /* exclamation mark (!) */
TempContent = TempContent.replaceAll("[\u266F]", chr(35)); /* music sharp sign (#) */
TempContent = TempContent.replaceAll("[\u066A\u2052]", chr(37)); /* percent sign (%) */
TempContent = TempContent.replaceAll("[\u066D\u204E\u2217\u2731\u00D7]", chr(42)); /* asterisk (*) */
TempContent = TempContent.replaceAll("[\u201A\uFE51\uFF64\u3001]", chr(44)); /* comma (,) */
TempContent = TempContent.replaceAll("[\u00F7\u0338\u2044\u2215]", chr(47)); /* slash (/) */
TempContent = TempContent.replaceAll("[\u0589\u05C3\u2236]", chr(58)); /* colon (:) */
TempContent = TempContent.replaceAll("[\u203D]", chr(63)); /* question mark (?) */
TempContent = TempContent.replaceAll("[\u27E6]", chr(91)); /* opening square bracket ([) */
/* A lone backslash is an escape character in a Java replacement string, so it is doubled here. */
TempContent = TempContent.replaceAll("[\u20E5\u2216]", chr(92) & chr(92)); /* backslash (\) */
TempContent = TempContent.replaceAll("[\u301B]", chr(93)); /* closing square bracket (]) */
TempContent = TempContent.replaceAll("[\u02C4\u02C6\u0302\u2038\u2303]", chr(94)); /* caret (^) */
TempContent = TempContent.replaceAll("[\u02CD\u0331\u0332\u2017]", chr(95)); /* underscore (_) */
TempContent = TempContent.replaceAll("[\u02CB\u0300\u2035]", chr(96)); /* grave accent (`) */
TempContent = TempContent.replaceAll("[\u2983]", chr(123)); /* opening curly bracket ({) */
TempContent = TempContent.replaceAll("[\u01C0\u05C0\u2223\u2758]", chr(124)); /* vertical bar / pipe (|) */
TempContent = TempContent.replaceAll("[\u2016]", chr(124) & chr(124)); /* double vertical bar / double pipe (||) */
TempContent = TempContent.replaceAll("[\u02DC\u0303\u2053\u223C\u301C]", chr(126)); /* tilde (~) */
TempContent = TempContent.replaceAll("[\u2039\u2329\u27E8\u3008]", chr(60)); /* less-than sign (<) */
TempContent = TempContent.replaceAll("[\u2264\u2266]", chr(60) & chr(61)); /* less-than or equal-to sign (<=) */
TempContent = TempContent.replaceAll("[\u203A\u232A\u27E9\u3009]", chr(62)); /* greater-than sign (>) */
TempContent = TempContent.replaceAll("[\u2265\u2267]", chr(62) & chr(61)); /* greater-than or equal-to sign (>=) */
TempContent = TempContent.replaceAll("[\u200B\u2060\uFEFF]", chr(32)); /* space ( ) */
/* Vulgar fractions are expanded to their ASCII digit/slash form. */
TempContent = TempContent.replaceAll("\u2153", "1/3");
TempContent = TempContent.replaceAll("\u2154", "2/3");
TempContent = TempContent.replaceAll("\u2155", "1/5");
TempContent = TempContent.replaceAll("\u2156", "2/5");
TempContent = TempContent.replaceAll("\u2157", "3/5");
TempContent = TempContent.replaceAll("\u2158", "4/5");
TempContent
