I see words such as súbito, autónomo. Why aren't they displayed properly? I had a similar problem when inserting Russian characters into the MySQL database via JDBC: the characters appeared as ???? instead of the actual words. That was fixed when I changed the JDBC URL to specify UTF-8 encoding:
jdbc:mysql://localhost/metaphor_repository?characterEncoding=utf8
Doing the same does not fix the problem here.
public void readPatterns() throws FileNotFoundException, IOException, InstantiationException, ClassNotFoundException, IllegalAccessException, SQLException {
//Code to initialize database and stuff
PreparedStatement preparedStatement = null;
String key1 = null;
String databaseURL = "jdbc:mysql://localhost/metaphor_repository?characterEncoding=utf8";
String databaseUser = "root";
String databasePassword = "D0samrD9";
String dbName = "metaphor_repository";
Connection conn = null;
Class.forName("com.mysql.jdbc.Driver").newInstance();
conn = DriverManager.getConnection(databaseURL, databaseUser, databasePassword);
System.out.println("CONNECTED");
String insertTableSQL = "INSERT INTO source_domain_spanish_oy2_jul2014_2(filename, seed, words, frequency, type, after_before) VALUES(?,?,?,?,?,?);";
String foldername = "/Desktop/Espana/AdjectiveBefore/";
File Folder = new File(foldername);
File[] ListOfFiles = Folder.listFiles();
for (int x = 0; x < ListOfFiles.length; x++) {
File file = new File(ListOfFiles[x].getAbsolutePath());
InputStream in = new FileInputStream(file);
InputStreamReader reader1 = new InputStreamReader(in);
BufferedReader br = new BufferedReader(reader1);
String fileData = new String();
String filename = ListOfFiles[x].getName().toUpperCase();
int total;
BufferedWriter out;
FileWriter fstream;
BufferedWriter outLog;
String fileName = new String("/Desktop/Espana/AdjectiveBeforeResult/" + ListOfFiles[x].getName());
fstream = new FileWriter(fileName);
out = new BufferedWriter(fstream);
while ((fileData = br.readLine()) != null) {
Map<String, Integer> sortedMapDesc = searchDatabase(fileData);
//Code Written By Aniruth to extract some info: seed, before_after
String seed = fileData;
String before_after = seed.split("\\[")[0];
seed = seed.replaceAll("\\(v.\\)", "");
seed = seed.replaceAll("\\(n.\\)", "");
seed = seed.substring(seed.indexOf("]") + 1, seed.indexOf("."));
seed = seed.substring(seed.indexOf("[") + 1, seed.indexOf("]"));
seed = seed.replaceAll("'", "");
seed = seed.trim();
seed = seed.toUpperCase();
Set<String> keySet = sortedMapDesc.keySet();
total = 0;
Iterator<String> keyItr = keySet.iterator();
out.write("++++++++++++++++++++++++++++++++++++++++++\n");
if (sortedMapDesc.isEmpty()) {
out.write(fileData + "\n");
out.write(fileData + "returned zero results \n");
out.flush();
} else {
out.write(fileData + "\n");
int i = 1;
String spaceString = " ";
while (keyItr.hasNext()) {
key1 = keyItr.next();
for (int k = 0; k < 40 - key1.length(); k++) {
spaceString = spaceString + " ";
}
total = total + sortedMapDesc.get(key1);
out.write(i + ":" + "'" + filename + "'" + ":" + "'" + seed + "'" + ":" + "'" + key1.replaceAll("'", "") + "'" + ":" + sortedMapDesc.get(key1) + ":" + "'" + "ADJ" + "'" + ":" + "'" + before_after + "'" + "\n");
//Code to add to the databases
preparedStatement = conn.prepareStatement(insertTableSQL);
preparedStatement.setString(1, filename);
preparedStatement.setString(2, seed);
preparedStatement.setString(3, key1);
if (sortedMapDesc.get(key1) != null) {
preparedStatement.setInt(4, sortedMapDesc.get(key1));
} else {
preparedStatement.setInt(4, 0);
}
preparedStatement.setString(5, "ADJ");
preparedStatement.setString(6, before_after);
System.out.println("Checking Prepared Statement:" + preparedStatement);
preparedStatement.executeUpdate();
System.out.println("Record Inserted :| ");
preparedStatement.close();
//System.out.println(out.toString());
i++;
spaceString = " ";
}
out.flush();
}
}
}
conn.close();
}
Well, this is probably the first problem:
InputStreamReader reader1 = new InputStreamReader(in);
That's loading the file using the platform default encoding, which may or may not be appropriate for the file in question.
Likewise later:
fstream = new FileWriter(fileName);
Again, that will use the platform default encoding.
Always be explicit about your encoding - UTF-8 is usually a good choice, if you're in a position to choose.
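As a minimal sketch of what being explicit could look like for the reader and writer above, assuming the input files really are UTF-8 and that UTF-8 is acceptable for the result files (neither is confirmed by the question). The class name and the roundTrip helper are hypothetical and exist only to make the example self-contained:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingExample {

    // Decodes the file's bytes as UTF-8 instead of the platform default.
    // Assumption: the file was actually saved as UTF-8.
    static BufferedReader openUtf8Reader(File file) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
    }

    // Encodes characters as UTF-8; a replacement for new FileWriter(fileName),
    // which would fall back to the platform default encoding.
    static BufferedWriter openUtf8Writer(File file) throws IOException {
        return new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8));
    }

    // Hypothetical helper: writes one line to a temp file and reads it back.
    static String roundTrip(String line) {
        try {
            File tmp = File.createTempFile("encoding-demo", ".txt");
            tmp.deleteOnExit();
            try (BufferedWriter out = openUtf8Writer(tmp)) {
                out.write(line);
                out.newLine();
            }
            try (BufferedReader br = openUtf8Reader(tmp)) {
                return br.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The word from the question, with U+00FA written as an escape so the
        // example does not depend on the encoding of this source file.
        String original = "s\u00fabito";
        System.out.println(original.equals(roundTrip(original)));
    }
}
```

Because the same charset is named on both the reading and the writing side, the result no longer depends on which machine the program runs on.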
Next, work out where issues are actually coming up. Log the exact UTF-16 code units in your strings, as integers, and try to spot when they go from "good" to "bad" (if they're ever good in the first place). See my blog post on diagnosing this sort of issue for more details. Something like this is useful:
public static void dumpString(String text) {
for (int i = 0; i < text.length(); i++) {
int codeUnit = text.charAt(i);
System.out.printf("%d: %c %04x%n", i, (char) codeUnit, codeUnit);
}
}
(Adjust to your logging infrastructure etc, of course.)
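To illustrate what "bad" can look like in such a dump, here is a small sketch that deliberately decodes UTF-8 bytes with the wrong charset. ISO-8859-1 is only an assumed stand-in for a mismatched platform default; the actual default on your machine may differ. The class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class DumpStringDemo {

    // Formats each UTF-16 code unit as "index: hex", one per line.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("%d: %04x%n", i, (int) text.charAt(i)));
        }
        return sb.toString();
    }

    // Simulates the bug: encode as UTF-8, then decode with a different charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00FA is written as an escape to avoid source-encoding problems.
        String good = "s\u00fabito";
        String bad = misdecode(good);
        System.out.println("good:");
        System.out.print(describe(good));
        System.out.println("bad:");
        System.out.print(describe(bad));
    }
}
```

In the good string, index 1 holds the single code unit 00fa. In the misdecoded one, that position is taken by two code units, 00c3 and 00ba, which are the two UTF-8 bytes of the character each read as a separate character. A pattern like that in your own dump suggests UTF-8 data was read with a single-byte encoding somewhere upstream.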