Invalid byte 1 of 1-byte UTF-8 sequence in Hadoop

I'm parsing XML using Hadoop, and I got the code from here.

But I'm getting the following error:

FINISH_TIME="1385387129970" HOSTNAME="DEV140" ERROR="java.io.IOException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[18,3] Message: Invalid byte 1 of 1-byte UTF-8 sequence.

But my XML is encoded as UTF-8. How can I handle this?

Jon Skeet

I suspect this is the problem - it's at least a problem:

XMLStreamReader reader = XMLInputFactory.newInstance()
    .createXMLStreamReader(
        new ByteArrayInputStream(document.getBytes()));

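That call to getBytes() will use the platform default encoding, rather than UTF-8. If the default is anything else, any non-ASCII characters in the document are handed to the parser as bytes that aren't valid UTF-8.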
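
To see the difference for yourself, here is a minimal sketch that compares the bytes produced by a given charset with the UTF-8 bytes for the same text. The class name, method name and sample string are made up for illustration; ISO-8859-1 just stands in for a non-UTF-8 default. Note that on Java 18 and later the default charset is UTF-8 unless it has been overridden, so the first check may well report a match there.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetDemo {

    // Hypothetical helper: true when encoding the text with the given charset
    // yields exactly the same bytes as encoding it with UTF-8.
    static boolean matchesUtf8(String text, Charset charset) {
        return Arrays.equals(text.getBytes(charset),
                             text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // contains one non-ASCII character

        // What the no-argument getBytes() would use on this machine.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Default matches UTF-8: "
                + matchesUtf8(text, Charset.defaultCharset()));

        // A single-byte encoding produces different bytes for the last character.
        System.out.println("ISO-8859-1 matches UTF-8: "
                + matchesUtf8(text, StandardCharsets.ISO_8859_1));
    }
}
```

Pure ASCII text encodes identically in both charsets, which is why this kind of bug often only shows up once a document contains an accented or otherwise non-ASCII character.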

You could specify "utf-8" as the encoding name - but it would be simpler to create a StringReader:

XMLStreamReader reader = XMLInputFactory.newInstance()
    .createXMLStreamReader(new StringReader(document));
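
For comparison, here is a rough, self-contained sketch of both options side by side: encoding explicitly as UTF-8, and skipping the byte conversion with a StringReader. The class and method names and the sample document are invented for the example, and it assumes Java 7 or later for StandardCharsets. The XMLStreamException is wrapped in an unchecked exception only to keep the sketch short.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XmlReaderSketch {

    // Option 1: keep the byte stream, but name the encoding explicitly.
    static String textViaUtf8Bytes(String document) {
        try {
            byte[] bytes = document.getBytes(StandardCharsets.UTF_8);
            return readText(XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(bytes)));
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    // Option 2: no byte conversion at all, so no encoding to get wrong.
    static String textViaStringReader(String document) {
        try {
            return readText(XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(document)));
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    // Concatenates all character data in the document, then closes the reader.
    private static String readText(XMLStreamReader reader) throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Sample document with one non-ASCII character (illustrative only).
        String document =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>caf\u00e9</name>";
        System.out.println(textViaUtf8Bytes(document));
        System.out.println(textViaStringReader(document));
    }
}
```

Both methods should give the same text back. The StringReader version has fewer moving parts, since the characters never leave the String.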

Of course that may not be the only error, but it's at least something to look at.

