Stillnet Studios

Cleaning up non-ASCII chars for XML or other uses

Recently I’ve had a couple of occasions where I needed to clean up some data that had non-ASCII or high-ASCII characters in it. Usually this happens when the data originates from MS Word or Excel. The first time I was producing XML, and was getting errors when I tried to validate my feed. That’s when I noticed I wasn’t using XMLFormat(), which of course I should be.

I added XMLFormat() around my data, but was still getting errors. Evidently XMLFormat() still leaves in a lot of characters that are just plain illegal in XML. Here is a function I wrote to give me clean data.

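function MyXMLFormat(input) {
	var i = 1;
	var code = 0;
	var entity = "";

	input = XMLFormat(input);

	// then clean up the stuff XMLFormat doesn't fix.
	// Len(input) is re-checked on every pass because the string changes length.
	while (i LTE Len(input)) {
		code = Asc(Mid(input,i,1));
		// note: 9=tab, 10=line feed, 13=carriage return
		if (code LT 32 AND code NEQ 9 AND code NEQ 10 AND code NEQ 13) {
			// other control characters are illegal in XML 1.0, even as &#n; references, so drop them
			input = RemoveChars(input,i,1);
		} else if (code GT 126) {
			// swap the character for a numeric reference such as &#8220;
			entity = "&###code#;";
			input = RemoveChars(input,i,1);
			// Insert() places the substring AFTER the given position, hence i-1
			input = Insert(entity,input,i-1);
			// skip past the reference we just wrote
			i = i + Len(entity);
		} else {
			i = i + 1;
		}
	}

	return Trim(input);
}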
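
As a quick check of how the function might be called, here is a minimal sketch. The sample string is made up for illustration, and Chr() is used to build the curly quotes so the snippet does not depend on the template's file encoding. It assumes MyXMLFormat() is already defined on the page and that the code runs inside a cfscript block.

```cfscript
// hypothetical sample: a phrase wrapped in Word-style curly quotes
sample = Chr(8220) & "Fish & chips" & Chr(8221);

// run it through the function above
cleaned = MyXMLFormat(sample);

// HTMLEditFormat() lets the browser show the escaped text literally
writeOutput(HTMLEditFormat(cleaned));
```

The intent is for the ampersand to come back escaped by XMLFormat() and for the two curly quotes to come back as numeric character references.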

The most common characters I encountered were:

Character code (decimal)	Description
11	vertical tab
8220	left double quote
8221	right double quote
8216	left single quote
8217	right single quote
8211	en dash
8212	em dash
8226	bullet
8230	horizontal ellipsis
8482	trademark
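
To find out which of these codes actually occur in a given piece of data, a small diagnostic along the following lines may help. This is only a sketch: ListOddCodes is a hypothetical helper name, not something from ColdFusion itself, and it uses the same range test as the cleanup function (anything below 32 or above 126, except tab, line feed and carriage return).

```cfscript
// hypothetical helper: returns a comma-separated list of the distinct
// character codes in the input that fall outside plain printable ASCII
function ListOddCodes(input) {
	var i = 1;
	var code = 0;
	var found = "";

	for (i=1; i LTE Len(input); i=i+1) {
		code = Asc(Mid(input,i,1));
		// note: 9=tab, 10=line feed, 13=carriage return
		if ( (code LT 32 OR code GT 126) AND code NEQ 9 AND code NEQ 10 AND code NEQ 13 ) {
			// ListFind() returns 0 when the code is not in the list yet
			if (NOT ListFind(found, code)) {
				found = ListAppend(found, code);
			}
		}
	}

	return found;
}
```

Running this over a sample of the incoming data before cleaning it gives a list of codes that can be compared against the table above.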

This entry was posted on 14 September 2008 at 4:35 pm and is filed under ColdFusion, Web Development.

3 Comments

Phill says:

I have also encountered this same issue and hope that Adobe pick this up and correct it soon.
15 September 2008, 3:33 am
Sami Hoda says:

Nice, thanks!
15 September 2008, 1:27 pm
chris says:

Thanks very much for this. Saved me a lot of trouble. However, I think the "i" in the Insert statement needs to be "i-1" as we've just removed the "i"th character on the previous line.
6 April 2009, 10:39 am
