Recently I’ve had a couple of occasions where I needed to clean up data that had non-ASCII or high-ASCII characters in it. Usually this happens when the data originates from MS Word or Excel. The first time, I was producing XML and was getting errors when I tried to validate my feed. That’s when I noticed I wasn’t using XMLFormat(), which of course I should have been.
I added XMLFormat() around my data, but was still getting errors. Evidently XMLFormat() still leaves in a lot of characters that are just plain illegal in XML. Here is a function I wrote to give me clean data.
function MyXMLFormat(input) {
    var i = 0;
    var code = 0;
    input = XMLFormat(input);
    // then clean up the stuff XMLFormat doesn't fix.
    // LTE (not LT) so the last character is checked too
    for (i=1; i LTE Len(input); i=i+1) {
        code = Asc(Mid(input,i,1));
        // note: 9=tab, 10=line feed, 13=carriage return
        if ( (code LT 32 OR code GT 126) AND (code NEQ 9 AND code NEQ 10 AND code NEQ 13) ) {
            // swap the character for a numeric reference such as &#8220;
            input = RemoveChars(input,i,1);
            // Insert() adds after the given position, hence i-1
            input = Insert("&###code#;",input,i-1);
        }
    }
    return Trim(input);
}
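A quick way to try the function is to feed it a string built with Chr(). This is only a sketch: the variable names and the sample text are made up, and it assumes MyXMLFormat() is already defined on the page.

```cfscript
<cfscript>
// hypothetical sample: curly quotes and an em dash, as if pasted from Word
sampleTitle = Chr(8220) & "Quarterly results" & Chr(8221) & " " & Chr(8212) & " draft";

// wrap the cleaned value in an element, the way a feed item might
itemXml = "<title>" & MyXMLFormat(sampleTitle) & "</title>";

// HTMLEditFormat() so the markup is visible in the browser instead of rendered
writeOutput(HTMLEditFormat(itemXml));
</cfscript>
```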
The most common characters I encountered were:
Character code (decimal) | Description |
11 | vertical tab |
8220 | left double quote |
8221 | right double quote |
8216 | left single quote |
8217 | right single quote |
8211 | en dash |
8212 | em dash |
8226 | bullet |
8230 | horizontal ellipsis |
8482 | trademark |
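One caveat about code 11 in the list: as far as I know, XML 1.0 only allows tab, line feed and carriage return below code 32, even when they are written as numeric references, so &#11; can still fail validation. Here is a rough sketch of a variant that drops those characters instead of encoding them. The name MyXMLFormatStrict is made up for this example, it assumes that silently dropping control characters is acceptable for your data, and it does nothing special for characters outside the Basic Multilingual Plane.

```cfscript
function MyXMLFormatStrict(input) {
    var result = "";
    var ch = "";
    var code = 0;
    var i = 0;
    // escape the XML special characters first
    input = XMLFormat(input);
    // build a new string rather than editing the input in place
    for (i = 1; i LTE Len(input); i = i + 1) {
        ch = Mid(input, i, 1);
        code = Asc(ch);
        if (code EQ 9 OR code EQ 10 OR code EQ 13 OR (code GTE 32 AND code LTE 126)) {
            // printable ASCII or allowed whitespace: keep as-is
            result = result & ch;
        } else if (code GT 126) {
            // high characters: write as a numeric character reference
            result = result & "&##" & code & ";";
        }
        // anything left is a control character below 32: drop it
    }
    return Trim(result);
}
```

Building a separate result string also sidesteps the position bookkeeping that RemoveChars() and Insert() need when the input is modified while looping over it.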
Phill says:
I have also encountered this same issue and hope that Adobe pick this up and correct it soon.
15 September 2008, 3:33 am
Sami Hoda says:
Nice, thanks!
15 September 2008, 1:27 pm
chris says:
Thanks very much for this. Saved me a lot of trouble. However, I think the "i" in the Insert statement needs to be "i-1" as we've just removed the "i"th character on the previous line.
6 April 2009, 10:39 am