Recently I’ve had a couple of occasions where I needed to clean up data that had non-ASCII or high-ASCII characters in it. Usually this happens when the data originates from MS Word or Excel. The first time, I was producing XML and was getting errors when I tried to validate my feed. That’s when I noticed I wasn’t using XMLFormat(), which of course I should have been.

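I added XMLFormat() around my data, but was still getting errors. Evidently XMLFormat() still leaves in a lot of characters that are illegal in XML or that otherwise break validation. Here is a function I wrote to give me clean data. It replaces each of those characters with a numeric character reference such as &#8220;.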

function MyXMLFormat(input) {
	// keep the loop variables local to the function
	var i = 0;
	var code = 0;

	input = XMLFormat(input);

	// then clean up the stuff XMLFormat doesn't fix.
	// LTE (not LT) so the last character is checked too
	for (i=1;i LTE Len(input);i=i+1) {
		code = Asc(Mid(input,i,1));
		// note: 9=tab, 10=line feed, 13=carriage return
		if ( (code LT 32 OR code GT 126) AND (code NEQ 9 AND code NEQ 10 AND code NEQ 13) ) {
			//writeOutput("Just took out ascii code #code# in string #input#");
			input = RemoveChars(input,i,1);
			// insert at i-1 (as chris notes in the comments): the character at i was just removed
			input = Insert("&###code#;",input,i-1);
		}
	}

	return Trim(input);
}
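
The function above edits the string in place, so character positions shift as it goes. As a rough alternative sketch, the same idea can be written by building a new string one character at a time. The name XMLFormatStrict is made up for this example. It also assumes you would rather drop control characters such as vertical tab than encode them, since XML 1.0 does not allow most characters below 32 even when written as numeric references.

```cfml
<cfscript>
// Hypothetical variant, not the original function: builds a new string
// instead of editing the input in place, so positions never shift.
function XMLFormatStrict(input) {
	var result = "";
	var i = 0;
	var ch = "";
	var code = 0;

	input = XMLFormat(input);

	for (i = 1; i LTE Len(input); i = i + 1) {
		ch = Mid(input, i, 1);
		code = Asc(ch);
		if (code EQ 9 OR code EQ 10 OR code EQ 13 OR (code GTE 32 AND code LTE 126)) {
			// tab, line feed, carriage return and printable ASCII pass through
			result = result & ch;
		} else if (code GT 126) {
			// everything above 126 becomes a numeric character reference
			result = result & "&##" & code & ";";
		}
		// assumption: any other control character (code below 32) is dropped
	}

	return Trim(result);
}
</cfscript>
```

Calling XMLFormatStrict(myString) in place of MyXMLFormat(myString) should give the same result for the Word-style punctuation listed below, and differs only in that stray control characters disappear instead of being encoded.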

The most common characters I encountered were:

Character code   Description
11               vertical tab
8220             left double quote
8221             right double quote
8216             left single quote
8217             right single quote
8211             en dash
8212             em dash
8226             bullet
8230             horizontal ellipsis
8482             trademark
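
If you want to see which codes are hiding in your own data, something like the following sketch may help. ListNonAsciiCodes is a hypothetical helper name, not a built-in function. It reports every distinct code outside printable ASCII, including tab, line feed and carriage return.

```cfml
<cfscript>
// Hypothetical helper: returns a comma-separated list of the distinct
// character codes in a string that fall outside printable ASCII (32-126).
function ListNonAsciiCodes(input) {
	var codes = "";
	var i = 0;
	var code = 0;

	for (i = 1; i LTE Len(input); i = i + 1) {
		code = Asc(Mid(input, i, 1));
		// record each code only once
		if ((code LT 32 OR code GT 126) AND NOT ListFind(codes, code)) {
			codes = ListAppend(codes, code);
		}
	}

	return codes;
}
</cfscript>
```

For example, ListNonAsciiCodes(Chr(8220) & "quoted" & Chr(8221) & Chr(8212) & Chr(8220)) returns 8220,8221,8212.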

3 Comments

  1. Phill says:

    I have also encountered this same issue and hope that Adobe pick this up and correct it soon.

  2. Sami Hoda says:

    Nice, thanks!

  3. chris says:

    Thanks very much for this. Saved me a lot of trouble. However, I think the "i" in the Insert statement needs to be "i-1" as we've just removed the "i"th character on the previous line.