Ok, I optimized to code and removed those cumbersome extra space. Now my code reads the strings from the file and generate the table directly.
public class StringGenerator2 { private class Frequency{ String s;//hashcode of the string int count;//occurrence public Frequency(String s, int c){ this.s = s; this.count = c; } } Map> table; Random r; public StringGenerator2(String fileName) throws FileNotFoundException, IOException{ table = new HashMap > (); r = new Random(); generateTable(fileName); assert(isUniqueList()); } public void generateTable(String fileName) throws FileNotFoundException, IOException{ BufferedReader reader = new BufferedReader(new FileReader(fileName)); String line; String last = null; while((line = reader.readLine()) != null){ String[] row = line.split(" "); for(String s : row){ String puntua = null; if(!Character.isLetter(s.charAt(s.length() - 1))){ puntua = s.substring(s.length() - 1); s = s.substring(0, s.length() - 1); } if(!table.containsKey(s)) table.put(s, new ArrayList ()); if(last != null) add(table.get(last), s); if(puntua != null) add(table.get(s), puntua); last = s; } } reader.close(); mapFrequency(); } private void add(List next, String s){ int index = -1; for(int i = 0; i < next.size(); i++){ if(next.get(i).s.equals(s)){ index = i; break; } } if(index != -1) next.get(index).count++; else next.add(new Frequency(s, 1)); } private void mapFrequency(){ for(Entry > e : table.entrySet()){ List next = e.getValue(); for(int i = 1; i < next.size(); i++){ next.get(i).count += next.get(i - 1).count; } } } private boolean isUniqueList(){ Set words = new HashSet (); for(List next : table.values()){ words.clear(); for(Frequency f : next){ if(!words.add(f.s)) return false; } } return true; } public void generator(String outputName, int length) throws IOException{ if(table.size() == 0){ System.out.println("No training set found!"); return; } FileWriter writer = new FileWriter(outputName); int index = 0; String last = null; int countWord = 0;//number of words in one line while(index < length){ String s = null; if(last == null || !table.containsKey(last)){ //generate a random string from the key set Object[] keys = table.keySet().toArray(); s = (String) keys[r.nextInt(keys.length)]; } else s = getNext(table.get(last)); writer.append(s).append(" "); countWord++; if(countWord == 15){ writer.append("\n"); countWord = 0; } last = s; index++; } writer.append(".\n"); writer.flush(); writer.close(); } private String getNext(List next){ int size = next.size(); int nextIndex = r.nextInt(next.get(size - 1).count) + 1; for(Frequency f : next){ if(nextIndex <= f.count) return f.s; } return next.get(r.nextInt(size)).s; } }
Too much fun for this morning. I saw this problem posted on Career Cup yesterday. I couldn't understand what the problem really meant (see here), especially after I took a look at Wikipedia's couldn't-be-more-confusing explanation on Markov chain. But Google never let me down. I found this blog, which explained it in a clearer way. In short, using Markov chain to generate a new String is like using Bayesian to predict when the first snow in next year will happen: based on the probability of when the first snow in last couple years happened.
In short, given a training set of strings, we create a table of the probability of all the next strings after a given string. Then when we need to generate a new string, we randomly select the first string from our training set, then pick the next string based on the probability of each "next" string in the table given the last generated string.
To get the next string based on the pre-calculated probability, I used this method. Basically when you get the probability distribution. You generate a random number, and based on the corresponding range in the PDF, you select the string.
I have to say that my implementation is not a good one. First, I used a list to store all strings into the memory, it definitely will cause a problem if there are too many strings. However, I don't know how to check if the current string is already in the table if I don't put everything into the memory.
Next, I calculate the frequency, to do that, I used another map, which is used to count the occurrence of each "next" string, then use two arrays to sum up, and finally use a list of structure (string, accumulated frequency) to store all "next" strings, given one string key in the table. See how much extra space I have used, this is definitely not good.
Moreover, when we hit a punctuation, I added the punctuation into the "next" list but didn't include it as a key.
To actually generate the string is easy, as I mentioned before, just keep randomly generate the next string in the "next" string list based on the frequency.
There are lots of ways to optimize this implementation. But I think this is enough for interview/learning purpose.
package markovChain; import java.io.BufferedReader; import java.io.FileNotFoundException; import java.io.FileReader; import java.io.FileWriter; import java.io.IOException; import java.util.*; import java.util.Map.Entry; public class StringGenerator { ListtrainingSet; String file; public StringGenerator(String file) throws FileNotFoundException, IOException{ this.file = file; trainingSet = new ArrayList (); read(); } /** * read training set and buffered it into a list * @throws FileNotFoundException * @throws IOException */ public void read() throws FileNotFoundException, IOException{ BufferedReader reader = new BufferedReader(new FileReader(file)); String line; while((line = reader.readLine()) != null){ String[] row = line.split(" "); for(String s : row) trainingSet.add(s); } reader.close(); } /** * Generate frequency table * @param output * @param length * @throws IOException */ private Map > generateTable(){ Map > frequencyTable = new HashMap > (); frequencyTable.put(trainingSet.get(0), new ArrayList ()); for(int i = 1; i < trainingSet.size(); i++){ String s = trainingSet.get(i); String p = null; if(!Character.isLetter(s.charAt(s.length() - 1))){ p = s.substring(s.length() - 1, s.length()); s = s.substring(0, s.length() - 1); } if(!frequencyTable.containsKey(s)){ frequencyTable.put(s, new ArrayList ()); if(p != null) frequencyTable.get(s).add(p); } String last = trainingSet.get(i - 1); if(!Character.isLetter(last.charAt(last.length() - 1))) last = last.substring(0, last.length() - 1); frequencyTable.get(last).add(s); } Map > nextFrequency = getFrequency(frequencyTable); return nextFrequency; } private Map > getFrequency(Map > frequencyTable){ Map > countFre = new HashMap >(); for(Entry > e : frequencyTable.entrySet()){ String key = e.getKey(); List next = e.getValue(); Map f = new HashMap (); for(String s : next){ if(!f.containsKey(s)) f.put(s, 1); else f.put(s, f.get(s) + 1); } List nextFrequency = mapFrequency(f); countFre.put(key, nextFrequency); } return countFre; } private List mapFrequency(Map f){ String[] array = new String[f.size()]; int[] c = new int[f.size()]; int index = 0; for(Entry ec : f.entrySet()){ array[index] = ec.getKey(); c[index] = ec.getValue(); index++; } for(int i = 1; i < c.length; i++){ c[i] += c[i - 1]; } List rst = new ArrayList (); for(int i = 0; i < c.length; i++) rst.add(new Frequency(array[i], c[i])); return rst; } /** * generate the string * @param output * @param length * @throws IOException */ public void Generator(String output, int length) throws IOException{ if(trainingSet.size() == 0){ System.out.println("No training set found!"); return; } Map > nextFrequency = generateTable(); Random r = new Random(); FileWriter writer = new FileWriter(output); int i = 0; String last = null; int countWord = 0; while(i < length){ String s = null; if(last == null || !nextFrequency.containsKey(last)) s = trainingSet.get(r.nextInt(trainingSet.size())); else s = getNext(nextFrequency.get(last)); writer.append(s).append(" "); countWord++; if(countWord == 15){ writer.append("\n"); countWord = 0; } last = s; i++; } writer.append(".\n"); writer.flush(); writer.close(); } private String getNext(List nextFre){ int size = nextFre.size(); Random r = new Random(); int next = r.nextInt(nextFre.get(size - 1).count) + 1; for(Frequency f : nextFre){ if(next <= f.count) return f.s; } return nextFre.get(r.nextInt(size)).s; } /** * the structure that contains the string and its count * @author shirleyyoung * */ private class Frequency{ String s; int count; public Frequency(String s, int c){ this.s = s; count = c; } } }
Test
In order to show my averseness to the research that I am working on. I decide to use some paragraphs from the papers I am reading as the test case.
"The dynamics of flexible polymers in shear is of great practical interest because this type of flow occurs whenever a fluid flows past a surface.Macroscopic, non-Newtonian rheological properties of polymer solutions, suchas flow-dependent viscosities and normalstresses, result from microscopic stresses thatarise when polymeric molecules are stretchedand affect the solvent motion. Thus, mucheffort has been directed at predicting the moleculardynamics of polymers in shear flows. However, it has been difficult to rigorouslytest these predictions because the dynamicsof a polymer molecule in shear havenot been observed directly. Experimental effortshave mainly focused on measuring bulkrheological properties or on measuring thescattering of light or neutrons by polymersolutions. Here we describe how single-molecule imaging techniques can beused to study the configurations of polymersin shear flow so that the detailed molecular predictions of theories can be tested.In short, Shirley doesn't care about how polymer tumbles in shear flow!Polymer dynamics are of central importance in materials science,mechanical engineering, biology and medicine. The dynamics ofmacromolecular solutions and melts in shear flow are typicallystudied using bulk experimental methods such as light andneutron scattering and birefringence. But the effect of shearon the conformation and dynamics of individual polymers isstill not well understood. Here we describe observations of thereal-time dynamics of individual, flexible polymers fluorescentlylabelled DNA molecules under a shear flow. The shearedpolymers exhibit many types of extended conformation with anoverall orientation ranging from parallel to perpendicular withrespect to the flow direction. For shear rates much smaller thanthe inverse of the relaxation time of the molecule, the relativepopulations of these two main types of conformation arecontrolled by the rate of the shear flow. These results questionthe adequacy of assumptions made in standard models of polymerdynamics."
Here is the 300 strings generated from the input.
"mainly focused on measuring bulk rheological properties or on measuring the rate of extended conformation
and dynamics of individual flexible polymers in shear is still not been difficult to rigorously
test these predictions of shear on measuring bulk rheological properties or on measuring bulk rheological
properties of assumptions made in shear is still not been difficult to rigorously test these
two main types of polymer solutions Here we describe how polymer solutions , the detailed
molecular dynamics of the molecular predictions of light and normal stresses result from microscopic stresses
that the shear on measuring the molecular predictions of polymers exhibit many types of individual
flexible polymers fluorescently labelled DNA molecules are controlled by the shear is of extended conformation
with an overall orientation ranging from microscopic stresses result from microscopic stresses that arise when
polymeric molecules are of polymers in shear flow Polymer dynamics of flow occurs whenever a
shear flows However , by the molecule in shear rates much effort has been observed
directly Experimental efforts have mainly focused on the effect of conformation with an overall orientation
ranging from microscopic stresses that arise when polymeric molecules under a surface . flexible polymers
exhibit many types of light and normal stresses , rheological properties of the adequacy of
a shear flow so that the detailed molecular dynamics of conformation are of polymers in
shear flow occurs whenever a surface . under a shear flow Polymer dynamics of assumptions
made in shear flow are controlled by the dynamics of extended conformation and neutron scattering
and affect the molecule the effect of great practical interest because the scattering and affect
the scattering of polymers is still not been difficult to study the solvent motion Thus
, much smaller than the inverse of shear flow Polymer dynamics of conformation and medicine."
Well, not that bad.
No comments:
Post a Comment