I have come across a situation where I am reading a log file and counting the number of lines in it with the following code snippet.
byte[] c = new byte[1024];
long count = 0;
int readChars = 0;
while ((readChars = is.read(c)) != -1) {
    for (int i = 0; i < readChars; ++i) {
        if (c[i] == '\n') {
            ++count;
        }
    }
}
My problem is that when I read most files (CSV, syslog, or any other format), it runs fine and gives me the right result. But when I run it on a file that was generated on a Mac, it goes haywire and reports that only a single line was read.
My log file is large, and I know it has a few thousand lines of logs, but the code counts only a single line. When I opened the file in Sublime I could see all the separate lines; however, when I viewed it in Vim, it displayed the whole file as a single line with a '^M' character at the end of each record (my guess is that the file is using this as the line terminator).
A sample of two lines is below. You can see that Vim displays the ^M character where there should have been a new line.
15122,25Dec2013,19:42:25,192.168.5.1,log,allow,,eth0,outbound,Application Control,,Network,Bob(+),Bob(+),,,,59857d77,,,,,,,,570033,,,,,,,,,,,,,192.168.5.7,176.32.96.190,tcp,80,56305,15606,554427,60461741,**,,,,,,,1,**,**,**,**,**,**,**,**,**,Other: Wget/1.13.4 (linux-gnu),Other: Server,192.168.5.7,60461741:1,,,,,,**,**,**,,,**,,,,^M359,23Dec2013,18:54:03,192.168.5.1,log,allow,,eth0,outbound,Application Control,,Network,Charlie(+),Charlie(+),,,,c0fa2dac,,,,,,,,1171362,,,,,,,,,,,,,192.168.5.6,205.251.242.54,tcp,80,45483,31395,1139967,60340847,**,,,,,,,2,**,**,**,**,**,**,**,**,**,Other: Wget/1.13.4 (linux-gnu),Other: Server,192.168.5.6,60340847:1,,,,,,,**,**,**,,,**,,,,^M
Any suggestions on how to tackle this problem?
The first problem, even before you get to line breaks, is that you're reading bytes and then treating those as characters. You're effectively assuming an encoding of ISO-8859-1, which may well not be correct. You should be using an InputStreamReader instead.
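To illustrate why scanning raw bytes can mislead, here is a minimal sketch. The class name `CharsetDemo`, its two helper methods, and the choice of UTF-16BE are assumptions made for the demonstration and are not taken from the question. The character U+010A encodes in UTF-16BE as the bytes 0x01 0x0A, so a byte-level scan sees a newline byte where the decoded text has none.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class CharsetDemo {
    // Counts raw 0x0A bytes, the same test the loop in the question performs.
    static int countNewlineBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == '\n') {
                count++;
            }
        }
        return count;
    }

    // Decodes the bytes with an explicit charset, then counts '\n' characters.
    static int countNewlineChars(byte[] data, Charset charset) throws IOException {
        int count = 0;
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                if (ch == '\n') {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // U+010A is a single letter, but its UTF-16BE encoding is 0x01 0x0A.
        byte[] data = "\u010A".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(countNewlineBytes(data));                            // 1
        System.out.println(countNewlineChars(data, StandardCharsets.UTF_16BE)); // 0
    }
}
```

For ASCII-compatible encodings such as UTF-8 the byte scan happens to agree with the decoded text, which is probably why the original loop worked on the other files.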
Then there's the issue of operating systems having different line breaks... use BufferedReader.readLine() to read a line in a way that handles line breaks of \n, \r or \r\n.
So your code would become:
int count = 0;
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(is, charset))) {
    while (reader.readLine() != null) {
        count++;
    }
}
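As a self-contained check of the readLine() approach, the sketch below runs the same loop over three in-memory samples, one per line-break style. The ^M that Vim shows is a carriage return (\r), which is the second case. The class name `LineCounter`, the method `countLines`, and the sample strings are assumptions made for illustration; UTF-8 is used only because the samples are ASCII, and a real log file may need a different charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper around the loop from the answer.
class LineCounter {
    // Counts lines terminated by \n, \r or \r\n; closes the stream when done.
    static long countLines(InputStream is, Charset charset) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, charset))) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // The same two records with Unix, carriage-return-only and Windows line breaks.
        String[] samples = { "a\nb\n", "a\rb\r", "a\r\nb\r\n" };
        for (String sample : samples) {
            InputStream is = new ByteArrayInputStream(
                    sample.getBytes(StandardCharsets.UTF_8));
            System.out.println(countLines(is, StandardCharsets.UTF_8)); // 2 for each sample
        }
    }
}
```

One difference from the byte loop is worth knowing: readLine() also returns a final line that has no terminator, so "a\nb" counts as two lines here, whereas counting '\n' bytes would report one.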