The Fake Geek's blog: Markov String Generator

Update 2015 - 03 - 31:

Ok, I optimized to code and removed those cumbersome extra space. Now my code reads the strings from the file and generate the table directly.

public class StringGenerator2 {
 private class Frequency{
  String s;//hashcode of the string
  int count;//occurrence
  public Frequency(String s, int c){
   this.s = s;
   this.count = c;
  }
 }
 Map> table;
 Random r;
 public StringGenerator2(String fileName) throws FileNotFoundException, IOException{
  table = new HashMap> ();
  r = new Random();
  generateTable(fileName);
  assert(isUniqueList());
 }
 public void generateTable(String fileName) throws FileNotFoundException, IOException{
  BufferedReader reader = new BufferedReader(new FileReader(fileName));
  String line;
  String last = null;
  while((line = reader.readLine()) != null){
   String[] row = line.split(" ");
   for(String s : row){
    String puntua = null;
    if(!Character.isLetter(s.charAt(s.length() - 1))){
     puntua = s.substring(s.length() - 1);
     s = s.substring(0, s.length() - 1);
    }
    if(!table.containsKey(s))
     table.put(s, new ArrayList ());
    if(last != null)
     add(table.get(last), s);
    if(puntua != null)
     add(table.get(s), puntua);
    last = s;
   }
  }
  reader.close();
  mapFrequency();
 }
 private void add(List next, String s){
  int index = -1;
  for(int i = 0; i < next.size(); i++){
   if(next.get(i).s.equals(s)){
    index = i;
    break;
   }
  }
  if(index != -1)
   next.get(index).count++;
  else
   next.add(new Frequency(s, 1));
 }
 private void mapFrequency(){
  for(Entry> e : table.entrySet()){
   List next = e.getValue();
   for(int i = 1; i < next.size(); i++){
    next.get(i).count += next.get(i - 1).count;
   }
  }
 }
 private boolean isUniqueList(){
  Set words = new HashSet ();
  for(List next : table.values()){
   words.clear();
   for(Frequency f : next){
    if(!words.add(f.s))
     return false;
   }
  }
  return true;
 }
 public void generator(String outputName, int length) throws IOException{
  if(table.size() == 0){
   System.out.println("No training set found!");
   return;
  }
  FileWriter writer = new FileWriter(outputName);
  int index = 0;
  String last = null;
  int countWord = 0;//number of words in one line
  while(index < length){
   String s = null;
   if(last == null || !table.containsKey(last)){
    //generate a random string from the key set
    Object[] keys = table.keySet().toArray();
    s = (String) keys[r.nextInt(keys.length)];
   } else
    s = getNext(table.get(last));
   writer.append(s).append(" ");
   countWord++;
   if(countWord == 15){
    writer.append("\n");
    countWord = 0;
   }
   last = s;
   index++;
  }
  writer.append(".\n");
  writer.flush();
  writer.close();
 }
 private String getNext(List next){
  int size = next.size();
  int nextIndex = r.nextInt(next.get(size - 1).count) + 1;
  for(Frequency f : next){
   if(nextIndex <= f.count)
    return f.s;
  }
  return next.get(r.nextInt(size)).s;
 }
  
}

Too much fun for this morning. I saw this problem posted on Career Cup yesterday. I couldn't understand what the problem really meant (see here), especially after I took a look at Wikipedia's couldn't-be-more-confusing explanation on Markov chain. But Google never let me down. I found this blog, which explained it in a clearer way. In short, using Markov chain to generate a new String is like using Bayesian to predict when the first snow in next year will happen: based on the probability of when the first snow in last couple years happened.

In short, given a training set of strings, we create a table of the probability of all the next strings after a given string. Then when we need to generate a new string, we randomly select the first string from our training set, then pick the next string based on the probability of each "next" string in the table given the last generated string.

To get the next string based on the pre-calculated probability, I used this method. Basically when you get the probability distribution. You generate a random number, and based on the corresponding range in the PDF, you select the string.

I have to say that my implementation is not a good one. First, I used a list to store all strings into the memory, it definitely will cause a problem if there are too many strings. However, I don't know how to check if the current string is already in the table if I don't put everything into the memory.

Next, I calculate the frequency, to do that, I used another map, which is used to count the occurrence of each "next" string, then use two arrays to sum up, and finally use a list of structure (string, accumulated frequency) to store all "next" strings, given one string key in the table. See how much extra space I have used, this is definitely not good.

Moreover, when we hit a punctuation, I added the punctuation into the "next" list but didn't include it as a key.

To actually generate the string is easy, as I mentioned before, just keep randomly generate the next string in the "next" string list based on the frequency.

There are lots of ways to optimize this implementation. But I think this is enough for interview/learning purpose.

package markovChain;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.*;
import java.util.Map.Entry;
public class StringGenerator {
 List trainingSet;
 String file;
 public StringGenerator(String file) throws FileNotFoundException, IOException{
  this.file = file;
  trainingSet = new ArrayList ();
  read();
 }
 /**
  * read training set and buffered it into a list
  * @throws FileNotFoundException
  * @throws IOException
  */
 public void read() throws FileNotFoundException, IOException{
  BufferedReader reader = new BufferedReader(new FileReader(file));
  String line;
  while((line = reader.readLine()) != null){
   String[] row = line.split(" ");
   for(String s : row)
    trainingSet.add(s);
  }
  reader.close();  
 }
 /**
  * Generate frequency table
  * @param output
  * @param length
  * @throws IOException
  */
 private Map> generateTable(){
  Map> frequencyTable = new HashMap> ();
  frequencyTable.put(trainingSet.get(0), new ArrayList ());
  for(int i = 1; i < trainingSet.size(); i++){
   String s = trainingSet.get(i);
   String p = null;
   if(!Character.isLetter(s.charAt(s.length() - 1))){
    p = s.substring(s.length() - 1, s.length());
    s = s.substring(0, s.length() - 1);
   }
   if(!frequencyTable.containsKey(s)){
    frequencyTable.put(s, new ArrayList ());
    if(p != null)
     frequencyTable.get(s).add(p);
   }
   String last = trainingSet.get(i - 1);
   if(!Character.isLetter(last.charAt(last.length() - 1)))
    last = last.substring(0, last.length() - 1);
   frequencyTable.get(last).add(s); 
  }
  Map> nextFrequency = getFrequency(frequencyTable);
  return nextFrequency;
 }
 private Map> getFrequency(Map> frequencyTable){
  Map> countFre = new HashMap>();
  for(Entry> e : frequencyTable.entrySet()){
   String key = e.getKey();
   List next = e.getValue();
   Map f = new HashMap();
   for(String s : next){
    if(!f.containsKey(s))
     f.put(s, 1);
    else
     f.put(s, f.get(s) + 1);
   }
   List nextFrequency = mapFrequency(f);
   countFre.put(key, nextFrequency); 
  }
  return countFre;
 }
 private List mapFrequency(Map f){
  String[] array = new String[f.size()];
  int[] c = new int[f.size()];
  int index = 0;
  for(Entry ec : f.entrySet()){
   array[index] = ec.getKey();
   c[index] = ec.getValue();
   index++;
  }
  for(int i = 1; i < c.length; i++){
   c[i] += c[i - 1];
  }
  List rst = new ArrayList ();
  for(int i = 0; i < c.length; i++)
   rst.add(new Frequency(array[i], c[i]));
  return rst;
 }
 
 
 /**
  * generate the string
  * @param output
  * @param length
  * @throws IOException
  */
 public void Generator(String output, int length) throws IOException{
  if(trainingSet.size() == 0){
   System.out.println("No training set found!");
   return;
  }
  Map> nextFrequency = generateTable();
  Random r = new Random();
  FileWriter writer = new FileWriter(output);
  int i = 0;
  String last = null;
  int countWord = 0;
  while(i < length){
   String s = null;
   if(last == null || !nextFrequency.containsKey(last))
    s = trainingSet.get(r.nextInt(trainingSet.size()));
   else
    s = getNext(nextFrequency.get(last));
   writer.append(s).append(" ");
   countWord++;
   if(countWord == 15){
    writer.append("\n");
    countWord = 0;
   }
   last = s;
   i++;
  }
  writer.append(".\n");
  writer.flush();
  writer.close();
 }
 private String getNext(List nextFre){
  int size = nextFre.size();
  Random r = new Random();
  int next = r.nextInt(nextFre.get(size - 1).count) + 1;
  for(Frequency f : nextFre){
   if(next <= f.count)
    return f.s;
  }
  return nextFre.get(r.nextInt(size)).s;
 }
 /**
  * the structure that contains the string and its count
  * @author shirleyyoung
  *
  */
 private class Frequency{
  String s;
  int count;
  public Frequency(String s, int c){
   this.s = s;
   count = c;
  }
 } 
}

Test
In order to show my averseness to the research that I am working on. I decide to use some paragraphs from the papers I am reading as the test case.

"The dynamics of flexible polymers in shear is of great practical interest because this type of flow occurs whenever a fluid flows past a surface.

Macroscopic, non-Newtonian rheological properties of polymer solutions, such

as flow-dependent viscosities and normal

stresses, result from microscopic stresses that

arise when polymeric molecules are stretched

and affect the solvent motion. Thus, much

effort has been directed at predicting the molecular

dynamics of polymers in shear flows. However, it has been difficult to rigorously

test these predictions because the dynamics

of a polymer molecule in shear have

not been observed directly. Experimental efforts

have mainly focused on measuring bulk

rheological properties or on measuring the

scattering of light or neutrons by polymer

solutions. Here we describe how single-molecule imaging techniques can be

used to study the configurations of polymers

in shear flow so that the detailed molecular predictions of theories can be tested.

In short, Shirley doesn't care about how polymer tumbles in shear flow!

Polymer dynamics are of central importance in materials science,

mechanical engineering, biology and medicine. The dynamics of

macromolecular solutions and melts in shear flow are typically

studied using bulk experimental methods such as light and

neutron scattering and birefringence. But the effect of shear

on the conformation and dynamics of individual polymers is

still not well understood. Here we describe observations of the

real-time dynamics of individual, flexible polymers fluorescently

labelled DNA molecules under a shear flow. The sheared

polymers exhibit many types of extended conformation with an

overall orientation ranging from parallel to perpendicular with

respect to the flow direction. For shear rates much smaller than

the inverse of the relaxation time of the molecule, the relative

populations of these two main types of conformation are

controlled by the rate of the shear flow. These results question

the adequacy of assumptions made in standard models of polymer

dynamics."

Here is the 300 strings generated from the input.

"mainly focused on measuring bulk rheological properties or on measuring the rate of extended conformation

and dynamics of individual flexible polymers in shear is still not been difficult to rigorously

test these predictions of shear on measuring bulk rheological properties or on measuring bulk rheological

properties of assumptions made in shear is still not been difficult to rigorously test these

two main types of polymer solutions Here we describe how polymer solutions , the detailed

molecular dynamics of the molecular predictions of light and normal stresses result from microscopic stresses

that the shear on measuring the molecular predictions of polymers exhibit many types of individual

flexible polymers fluorescently labelled DNA molecules are controlled by the shear is of extended conformation

with an overall orientation ranging from microscopic stresses result from microscopic stresses that arise when

polymeric molecules are of polymers in shear flow Polymer dynamics of flow occurs whenever a

shear flows However , by the molecule in shear rates much effort has been observed

directly Experimental efforts have mainly focused on the effect of conformation with an overall orientation

ranging from microscopic stresses that arise when polymeric molecules under a surface . flexible polymers

exhibit many types of light and normal stresses , rheological properties of the adequacy of

a shear flow so that the detailed molecular dynamics of conformation are of polymers in

shear flow occurs whenever a surface . under a shear flow Polymer dynamics of assumptions

made in shear flow are controlled by the dynamics of extended conformation and neutron scattering

and affect the molecule the effect of great practical interest because the scattering and affect

the scattering of polymers is still not been difficult to study the solvent motion Thus

, much smaller than the inverse of shear flow Polymer dynamics of conformation and medicine."

Well, not that bad.

The Fake Geek's blog

AdSense

Thursday, March 26, 2015

Markov String Generator

No comments:

Post a Comment