Thursday, March 26, 2015

Markov String Generator

Update 2015 - 03 - 31:

Ok, I optimized to code and removed those cumbersome extra space. Now my code reads the strings from the file and generate the table directly.


public class StringGenerator2 {
 private class Frequency{
  String s;//hashcode of the string
  int count;//occurrence
  public Frequency(String s, int c){
   this.s = s;
   this.count = c;
  }
 }
 Map> table;
 Random r;
 public StringGenerator2(String fileName) throws FileNotFoundException, IOException{
  table = new HashMap> ();
  r = new Random();
  generateTable(fileName);
  assert(isUniqueList());
 }
 public void generateTable(String fileName) throws FileNotFoundException, IOException{
  BufferedReader reader = new BufferedReader(new FileReader(fileName));
  String line;
  String last = null;
  while((line = reader.readLine()) != null){
   String[] row = line.split(" ");
   for(String s : row){
    String puntua = null;
    if(!Character.isLetter(s.charAt(s.length() - 1))){
     puntua = s.substring(s.length() - 1);
     s = s.substring(0, s.length() - 1);
    }
    if(!table.containsKey(s))
     table.put(s, new ArrayList ());
    if(last != null)
     add(table.get(last), s);
    if(puntua != null)
     add(table.get(s), puntua);
    last = s;
   }
  }
  reader.close();
  mapFrequency();
 }
 private void add(List next, String s){
  int index = -1;
  for(int i = 0; i < next.size(); i++){
   if(next.get(i).s.equals(s)){
    index = i;
    break;
   }
  }
  if(index != -1)
   next.get(index).count++;
  else
   next.add(new Frequency(s, 1));
 }
 private void mapFrequency(){
  for(Entry> e : table.entrySet()){
   List next = e.getValue();
   for(int i = 1; i < next.size(); i++){
    next.get(i).count += next.get(i - 1).count;
   }
  }
 }
 private boolean isUniqueList(){
  Set words = new HashSet ();
  for(List next : table.values()){
   words.clear();
   for(Frequency f : next){
    if(!words.add(f.s))
     return false;
   }
  }
  return true;
 }
 public void generator(String outputName, int length) throws IOException{
  if(table.size() == 0){
   System.out.println("No training set found!");
   return;
  }
  FileWriter writer = new FileWriter(outputName);
  int index = 0;
  String last = null;
  int countWord = 0;//number of words in one line
  while(index < length){
   String s = null;
   if(last == null || !table.containsKey(last)){
    //generate a random string from the key set
    Object[] keys = table.keySet().toArray();
    s = (String) keys[r.nextInt(keys.length)];
   } else
    s = getNext(table.get(last));
   writer.append(s).append(" ");
   countWord++;
   if(countWord == 15){
    writer.append("\n");
    countWord = 0;
   }
   last = s;
   index++;
  }
  writer.append(".\n");
  writer.flush();
  writer.close();
 }
 private String getNext(List next){
  int size = next.size();
  int nextIndex = r.nextInt(next.get(size - 1).count) + 1;
  for(Frequency f : next){
   if(nextIndex <= f.count)
    return f.s;
  }
  return next.get(r.nextInt(size)).s;
 }
  
}

Too much fun for this morning. I saw this problem posted on Career Cup yesterday. I couldn't understand what the problem really meant (see here), especially after I took a look at Wikipedia's couldn't-be-more-confusing explanation on Markov chain. But Google never let me down. I found this blog, which explained it in a clearer way. In short, using Markov chain to generate a new String is like using Bayesian  to predict when the first snow in next year will happen: based on the probability of when the first snow in last couple years happened.

In short, given a training set of strings, we create a table of the probability of all the next strings after a given string. Then when we need to generate a new string, we randomly select the first string from our training set, then pick the next string based on the probability of each "next" string in the table given the last generated string.

To get the next string based on the pre-calculated probability, I used this method. Basically when you get the probability distribution. You generate a random number, and based on the corresponding range in the PDF, you select the string.

I have to say that my implementation is not a good one. First, I used a list to store all strings into the memory, it definitely will cause a problem if there are too many strings. However, I don't know how to check if the current string is already in the table if I don't put everything into the memory.

Next, I calculate the frequency, to do that, I used another map, which is used to count the occurrence of each "next" string, then use two arrays to sum up, and finally use a list of structure (string, accumulated frequency) to store all "next" strings, given one string key in the table. See how much extra space I have used, this is definitely not good.

Moreover, when we hit a punctuation, I added the punctuation into the "next" list but didn't include it as a key.

To actually generate the string is easy, as I mentioned before, just keep randomly generate the next string in the "next" string list based on the frequency.

There are lots of ways to optimize this implementation. But I think this is enough for interview/learning purpose.


package markovChain;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.*;
import java.util.Map.Entry;
public class StringGenerator {
 List trainingSet;
 String file;
 public StringGenerator(String file) throws FileNotFoundException, IOException{
  this.file = file;
  trainingSet = new ArrayList ();
  read();
 }
 /**
  * read training set and buffered it into a list
  * @throws FileNotFoundException
  * @throws IOException
  */
 public void read() throws FileNotFoundException, IOException{
  BufferedReader reader = new BufferedReader(new FileReader(file));
  String line;
  while((line = reader.readLine()) != null){
   String[] row = line.split(" ");
   for(String s : row)
    trainingSet.add(s);
  }
  reader.close();  
 }
 /**
  * Generate frequency table
  * @param output
  * @param length
  * @throws IOException
  */
 private Map> generateTable(){
  Map> frequencyTable = new HashMap> ();
  frequencyTable.put(trainingSet.get(0), new ArrayList ());
  for(int i = 1; i < trainingSet.size(); i++){
   String s = trainingSet.get(i);
   String p = null;
   if(!Character.isLetter(s.charAt(s.length() - 1))){
    p = s.substring(s.length() - 1, s.length());
    s = s.substring(0, s.length() - 1);
   }
   if(!frequencyTable.containsKey(s)){
    frequencyTable.put(s, new ArrayList ());
    if(p != null)
     frequencyTable.get(s).add(p);
   }
   String last = trainingSet.get(i - 1);
   if(!Character.isLetter(last.charAt(last.length() - 1)))
    last = last.substring(0, last.length() - 1);
   frequencyTable.get(last).add(s); 
  }
  Map> nextFrequency = getFrequency(frequencyTable);
  return nextFrequency;
 }
 private Map> getFrequency(Map> frequencyTable){
  Map> countFre = new HashMap>();
  for(Entry> e : frequencyTable.entrySet()){
   String key = e.getKey();
   List next = e.getValue();
   Map f = new HashMap();
   for(String s : next){
    if(!f.containsKey(s))
     f.put(s, 1);
    else
     f.put(s, f.get(s) + 1);
   }
   List nextFrequency = mapFrequency(f);
   countFre.put(key, nextFrequency); 
  }
  return countFre;
 }
 private List mapFrequency(Map f){
  String[] array = new String[f.size()];
  int[] c = new int[f.size()];
  int index = 0;
  for(Entry ec : f.entrySet()){
   array[index] = ec.getKey();
   c[index] = ec.getValue();
   index++;
  }
  for(int i = 1; i < c.length; i++){
   c[i] += c[i - 1];
  }
  List rst = new ArrayList ();
  for(int i = 0; i < c.length; i++)
   rst.add(new Frequency(array[i], c[i]));
  return rst;
 }
 
 
 /**
  * generate the string
  * @param output
  * @param length
  * @throws IOException
  */
 public void Generator(String output, int length) throws IOException{
  if(trainingSet.size() == 0){
   System.out.println("No training set found!");
   return;
  }
  Map> nextFrequency = generateTable();
  Random r = new Random();
  FileWriter writer = new FileWriter(output);
  int i = 0;
  String last = null;
  int countWord = 0;
  while(i < length){
   String s = null;
   if(last == null || !nextFrequency.containsKey(last))
    s = trainingSet.get(r.nextInt(trainingSet.size()));
   else
    s = getNext(nextFrequency.get(last));
   writer.append(s).append(" ");
   countWord++;
   if(countWord == 15){
    writer.append("\n");
    countWord = 0;
   }
   last = s;
   i++;
  }
  writer.append(".\n");
  writer.flush();
  writer.close();
 }
 private String getNext(List nextFre){
  int size = nextFre.size();
  Random r = new Random();
  int next = r.nextInt(nextFre.get(size - 1).count) + 1;
  for(Frequency f : nextFre){
   if(next <= f.count)
    return f.s;
  }
  return nextFre.get(r.nextInt(size)).s;
 }
 /**
  * the structure that contains the string and its count
  * @author shirleyyoung
  *
  */
 private class Frequency{
  String s;
  int count;
  public Frequency(String s, int c){
   this.s = s;
   count = c;
  }
 } 
}


Test
In order to show my averseness to the research that I am working on. I decide to use some paragraphs from the papers I am reading as the test case.

"The dynamics of flexible polymers in shear is of great practical interest because this type of flow occurs whenever a fluid flows past a surface. 
Macroscopic, non-Newtonian rheological properties of polymer solutions, such
as flow-dependent viscosities and normal
stresses, result from microscopic stresses that
arise when polymeric molecules are stretched
and affect the solvent motion. Thus, much
effort has been directed at predicting the molecular
dynamics of polymers in shear flows. However, it has been difficult to rigorously
test these predictions because the dynamics
of a polymer molecule in shear have
not been observed directly. Experimental efforts
have mainly focused on measuring bulk
rheological properties or on measuring the
scattering of light or neutrons by polymer
solutions. Here we describe how single-molecule imaging techniques can be
used to study the configurations of polymers
in shear flow so that the detailed molecular predictions of theories can be tested.
In short, Shirley doesn't care about how polymer tumbles in shear flow!
Polymer dynamics are of central importance in materials science,
mechanical engineering, biology and medicine. The dynamics of
macromolecular solutions and melts in shear flow are typically
studied using bulk experimental methods such as light and
neutron scattering and birefringence. But the effect of shear
on the conformation and dynamics of individual polymers is
still not well understood. Here we describe observations of the
real-time dynamics of individual, flexible polymers fluorescently
labelled DNA molecules under a shear flow. The sheared
polymers exhibit many types of extended conformation with an
overall orientation ranging from parallel to perpendicular with
respect to the flow direction. For shear rates much smaller than
the inverse of the relaxation time of the molecule, the relative
populations of these two main types of conformation are
controlled by the rate of the shear flow. These results question
the adequacy of assumptions made in standard models of polymer
dynamics."

Here is the 300 strings generated from the input.

"mainly focused on measuring bulk rheological properties or on measuring the rate of extended conformation 
and dynamics of individual flexible polymers in shear is still not been difficult to rigorously 
test these predictions of shear on measuring bulk rheological properties or on measuring bulk rheological 
properties of assumptions made in shear is still not been difficult to rigorously test these 
two main types of polymer solutions Here we describe how polymer solutions , the detailed 
molecular dynamics of the molecular predictions of light and normal stresses result from microscopic stresses 
that the shear on measuring the molecular predictions of polymers exhibit many types of individual 
flexible polymers fluorescently labelled DNA molecules are controlled by the shear is of extended conformation 
with an overall orientation ranging from microscopic stresses result from microscopic stresses that arise when 
polymeric molecules are of polymers in shear flow Polymer dynamics of flow occurs whenever a 
shear flows However , by the molecule in shear rates much effort has been observed 
directly Experimental efforts have mainly focused on the effect of conformation with an overall orientation 
ranging from microscopic stresses that arise when polymeric molecules under a surface . flexible polymers 
exhibit many types of light and normal stresses , rheological properties of the adequacy of 
a shear flow so that the detailed molecular dynamics of conformation are of polymers in 
shear flow occurs whenever a surface . under a shear flow Polymer dynamics of assumptions 
made in shear flow are controlled by the dynamics of extended conformation and neutron scattering 
and affect the molecule the effect of great practical interest because the scattering and affect 
the scattering of polymers is still not been difficult to study the solvent motion Thus 
, much smaller than the inverse of shear flow Polymer dynamics of conformation and medicine."

Well, not that bad. 

No comments:

Post a Comment