My application uses CRC32 to check whether two contents or two files are the same. But when I tried to use it to generate a unique ID, I ran into a problem: two different strings can produce the same CRC32. Here is my Java code. Thanks in advance.
    import java.util.zip.CRC32;
    import java.util.zip.Checksum;

    public static String getCRC32(String content) {
        byte[] bytes = content.getBytes();
        Checksum checksum = new CRC32();
        checksum.update(bytes, 0, bytes.length);
        return String.valueOf(checksum.getValue());
    }

    public static void main(String[] args) {
        System.out.println(getCRC32("b5a7b602ab754d7ab30fb42c4fb28d82"));
        System.out.println(getCRC32("d19f2e9e82d14b96be4fa12b8a27ee9f"));
    }
Yes, that's what CRCs are like. They're not unique IDs. They're likely to be different for different inputs, but they don't have to be. After all, you're providing more than 32 bits of input, so you can't expect more than 2^32 different inputs to all produce different CRCs.
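To get a feel for how quickly this bites in practice, here is a rough sketch that generates random 32-character hex strings until two of them share a CRC32. By the usual birthday estimate, a 32-bit checksum has about a 50% chance of a collision after roughly 77,000 random inputs. The class name, the findCollision helper, the seed and the cap on attempts are all made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.zip.CRC32;

// Hypothetical demo class; not part of the original question.
class Crc32CollisionDemo {

    // CRC-32 of the UTF-8 bytes of the given text.
    public static long crc32(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    // Generates random 32-character hex strings until two different ones
    // share a CRC-32. Returns the colliding pair, or null if no collision
    // shows up within maxTries attempts.
    public static String[] findCollision(long seed, int maxTries) {
        Random random = new Random(seed);
        Map<Long, String> seen = new HashMap<>();
        for (int i = 0; i < maxTries; i++) {
            String candidate =
                    String.format("%016x%016x", random.nextLong(), random.nextLong());
            String previous = seen.put(crc32(candidate), candidate);
            if (previous != null && !previous.equals(candidate)) {
                return new String[] { previous, candidate };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Seed and cap are arbitrary choices for this sketch.
        String[] pair = findCollision(42L, 1_000_000);
        if (pair == null) {
            System.out.println("No collision found within the cap");
            return;
        }
        System.out.println(pair[0] + " -> " + crc32(pair[0]));
        System.out.println(pair[1] + " -> " + crc32(pair[1]));
    }
}
```

On an ordinary machine this should finish almost immediately, which is why a CRC32 on its own isn't a safe unique ID once you have more than a few thousand items.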
A longer cryptographic hash (e.g. SHA-256) is far more likely to give different outputs for different inputs, but it's still not impossible (and can't be, due to the amount of input data vs output data). The big difference between a CRC and a cryptographic hash is that a CRC is relatively easy to "steer" if you want to - it's not terribly hard to find collisions, and it's used to protect against accidental data corruption. Cryptographic hashes are designed to protect against deliberate data corruption by some attacker - so it's hard to deliberately create a value targeting a specific hash.
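If you do want to go down the hash route, a minimal sketch using the JDK's MessageDigest might look like the following. The class and method names (Sha256Demo, sha256Hex) are just placeholders, and the UTF-8 encoding is an assumption about how you want to turn the string into bytes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class; not part of the original question.
class Sha256Demo {

    // Hex-encoded SHA-256 digest of the UTF-8 bytes of the content.
    public static String sha256Hex(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // %02x renders each byte as two lowercase hex digits.
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("b5a7b602ab754d7ab30fb42c4fb28d82"));
        System.out.println(sha256Hex("d19f2e9e82d14b96be4fa12b8a27ee9f"));
    }
}
```

The result is 256 bits (64 hex characters) rather than 32 bits, so accidental collisions are vastly less likely, though still not impossible in principle.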
As an aside, your use of String.getBytes() without specifying a charset is problematic - it uses the platform-default encoding, so if you run the same code on two machines with the same input, you can get different results. I would strongly encourage you to use a fixed encoding (e.g. UTF-8).
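As a rough illustration of the encoding point, here is a variant of your method that takes the charset explicitly. The class name and the extra Charset parameter are my own additions for the sake of the example. For pure ASCII text the two encodings below give the same bytes, but a character such as "é" is two bytes in UTF-8 and one byte in ISO-8859-1, so the checksum is computed over different data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical demo class; not part of the original question.
class CharsetCrcDemo {

    // Same logic as the question's getCRC32, but with the encoding made explicit.
    public static String getCRC32(String content, Charset charset) {
        byte[] bytes = content.getBytes(charset);
        Checksum checksum = new CRC32();
        checksum.update(bytes, 0, bytes.length);
        return String.valueOf(checksum.getValue());
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", escaped so the source file encoding doesn't matter

        // The byte arrays differ in length, so the CRC input differs.
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);      // 5 bytes
        System.out.println(text.getBytes(StandardCharsets.ISO_8859_1).length); // 4 bytes

        System.out.println(getCRC32(text, StandardCharsets.UTF_8));
        System.out.println(getCRC32(text, StandardCharsets.ISO_8859_1));
    }
}
```

Passing StandardCharsets.UTF_8 everywhere means the same string always maps to the same bytes, whatever the platform default happens to be.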