My name
is
Jon Skeet

How to find out if a stream complies with the charset encoding ISO-8859-1

I have a problem whereby I need to be able to detect whether a byte array contains characters which comply with ISO-8859-1 encoding.

I found the question "Java : How to determine the correct charset encoding of a stream" useful; however, none of the answers appear to fully answer my question.

I have attempted to use the TikaEncodingDetector as shown below:

public static Charset guessCharset(final byte[] content) throws IOException {
    final InputStream isx = new ByteArrayInputStream(content);
    return Charset.forName(new TikaEncodingDetector().guessEncoding(isx));
}

Unfortunately, this approach makes different predictions depending on the content of the byte array. For example, an array containing 'h','e','l','l','o' is determined to be ISO-8859-1, 'w','o','r','l','d' comes out as IBM500, and 'a','b','c','d','e' results in UTF-8.

All I want to know is: does my byte array correctly validate against the ISO-8859-1 standard? I would be grateful for suggestions on the best way to carry out this task.

I have a problem whereby I need to be able to detect whether a byte array contains characters which comply with ISO-8859-1 encoding.

Well, every stream of binary data can be viewed as "valid" in ISO-8859-1, as it's simply a single-byte-per-character scheme mapping bytes 0-255 to U+0000 to U+00FF in a trivial way. Compare that with UTF-8 or UTF-16, where certain byte sequences are simply invalid.

So a method to determine whether a stream contained valid ISO-8859-1 could just return true - but that doesn't mean that the original text was encoded in ISO-8859-1... it may be meaningless to a human when decoded with ISO-8859-1, but still valid.
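To illustrate the point about validity, here is a minimal sketch that decodes all 256 possible byte values with a strict decoder, once as ISO-8859-1 and once as UTF-8. The class name StrictDecodeDemo and the helper decodesCleanly are made up for this example; only the standard java.nio.charset API is used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    // Hypothetical helper: returns true if the bytes decode without any
    // malformed-input or unmappable-character error in the given charset.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Every possible byte value, 0x00 through 0xFF.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        // ISO-8859-1 accepts every byte, so this check can never fail.
        System.out.println(decodesCleanly(all, StandardCharsets.ISO_8859_1)); // true
        // UTF-8 rejects this input, e.g. the lone continuation byte 0x80.
        System.out.println(decodesCleanly(all, StandardCharsets.UTF_8));      // false
    }
}
```

Because the ISO-8859-1 check succeeds for any input, it says nothing about whether the bytes were originally written in that encoding.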

If you know that the original plain text won't include certain characters (e.g. unprintable control characters) you could detect that quite simply just by checking whether any byte in the stream was blacklisted. More advanced detection might check for unexpected patterns - but it becomes very heuristic, and may be tightly coupled to what the original source text is expected to be like.
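The byte-blacklist idea above could be sketched roughly as follows. This is only one possible heuristic, and the choice of blacklist is an assumption: it treats the C0 controls (0x00-0x1F and 0x7F, except tab, line feed and carriage return) and the C1 controls (0x80-0x9F) as unexpected. The class and method names are made up for the example.

```java
class Latin1ControlCheck {
    // Hypothetical heuristic. Assumption: genuine ISO-8859-1 text may contain
    // tab, LF and CR, but no other C0 control (0x00-0x1F, 0x7F) and no
    // C1 control (0x80-0x9F).
    static boolean containsUnexpectedControl(byte[] data) {
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            boolean allowedWhitespace = v == '\t' || v == '\n' || v == '\r';
            boolean c0 = v < 0x20 || v == 0x7F;
            boolean c1 = v >= 0x80 && v <= 0x9F;
            if ((c0 || c1) && !allowedWhitespace) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "h\u00e9llo" followed by a newline, encoded as ISO-8859-1: no blacklisted bytes.
        byte[] latin1 = {'h', (byte) 0xE9, 'l', 'l', 'o', '\n'};
        System.out.println(containsUnexpectedControl(latin1)); // false

        // Windows-1252 curly quotes (0x93, 0x94) fall in the C1 range.
        byte[] cp1252 = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        System.out.println(containsUnexpectedControl(cp1252)); // true
    }
}
```

A false result does not prove that the data is ISO-8859-1. For instance, the UTF-8 encoding of U+00E9 is the bytes 0xC3 0xA9, neither of which is blacklisted here, so such input would pass this check unnoticed.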

See more on this question at Stack Overflow