
Monday, March 30, 2015

Input & output in Java

I don't write code that needs to read from files or streams very often, so I am not very familiar with the input/output methods in Java. Especially when I try to review the readN questions on LeetCode, I realize I need to brush up on my knowledge in this area.

Scanner
The Scanner class can parse primitive types and strings using regular expressions. A Scanner breaks its input into tokens using a delimiter pattern, which by default matches white space. The resulting tokens may then be converted into values of different types using various next methods.

A Scanner can read text from any object which implements the Readable interface. If an invocation of the underlying readable's Readable.read() method throws an IOException, then the Scanner assumes the end of the input has been reached. The most recent IOException thrown by the underlying readable can be retrieved via the ioException() method.
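For example, a minimal sketch of the token-by-token parsing (the "1 fish 2 fish" input is just an illustration):

import java.util.Scanner;
public class ScannerDemo {
 public static void main(String[] args) {
  Scanner sc = new Scanner("1 fish 2 fish");
  System.out.println(sc.nextInt()); // 1
  System.out.println(sc.next());    // fish
  System.out.println(sc.nextInt()); // 2
  System.out.println(sc.next());    // fish
  sc.close();
 }
}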

ByteArrayInputStream
This class reads from a byte array that is supplied when the stream is constructed.
The read() method reads the next byte of data from this input stream.


Source: http://www.codatlas.com/github.com/openjdk-mirror/jdk7u-jdk/master/src/share/classes/java/io/ByteArrayInputStream.java?keyword=ByteArrayInputStream&line=143
The read(byte b[], int off, int len) method, on the other hand, reads up to len bytes from the stream into the byte array b.

Source: http://www.codatlas.com/github.com/openjdk-mirror/jdk7u-jdk/master/src/share/classes/java/io/ByteArrayInputStream.java?keyword=ByteArrayInputStream&line=176
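A small sketch of both read methods (the byte values here are arbitrary):

import java.io.ByteArrayInputStream;
public class BAISDemo {
 public static void main(String[] args) {
  byte[] data = {10, 20, 30, 40, 50};
  ByteArrayInputStream in = new ByteArrayInputStream(data);
  int first = in.read();    // 10, the next byte of the stream
  byte[] b = new byte[4];
  int n = in.read(b, 0, 4); // copies 20, 30, 40, 50 into b; n == 4
  System.out.println(first + " " + n);
 }
}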


FileReader & FileInputStream
FileReader: for reading character files.
Its read() method uses StreamDecoder's read() method, which reads bytes from the file and then converts them to chars.
FileInputStream: for reading streams of raw bytes.
Both can be used to instantiate a Scanner class.
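For example (with a hypothetical file name, inside a method that declares FileNotFoundException), both of these give a working Scanner:

Scanner charBased = new Scanner(new FileReader("input.txt"));      // FileReader implements Readable
Scanner byteBased = new Scanner(new FileInputStream("input.txt")); // InputStream overload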



BufferedReader
Reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines.
In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream. It is therefore advisable to wrap a BufferedReader around any Reader whose read() operations may be costly, such as FileReaders and InputStreamReaders.

BufferedReader in = new BufferedReader(new FileReader(fileName));

The above example will buffer the input from the specified file. Without buffering, each invocation of read() or readLine() could cause bytes to be read from file, converted into characters, and then returned, which can be very inefficient. 

The read() method reads a single character, returned as an integer in the Unicode range. The private method fill() fills the buffer array from the input stream when necessary.
Source: http://www.codatlas.com/github.com/openjdk-mirror/jdk7u-jdk/master/src/share/classes/java/io/BufferedReader.java?keyword=bufferedreader&line=170

The private method read1() is essentially the read4() of the LeetCode problem. char[] cbuf is the destination buffer. It fills the internal buffer array from the input stream, copies it to the destination array, and returns the number of characters read.

Source: http://www.codatlas.com/github.com/openjdk-mirror/jdk7u-jdk/master/src/share/classes/java/io/BufferedReader.java?keyword=bufferedreader&line=195

Ok, here comes the actual read(char cbuf[], int off, int len) method which, if you ignore int off (the offset of the starting point), is pretty much the readN() implementation.
This method iteratively calls read1() until len characters are read.

Source: http://www.codatlas.com/github.com/openjdk-mirror/jdk7u-jdk/master/src/share/classes/java/io/BufferedReader.java?keyword=bufferedreader&line=269

readLine() is the most commonly used method. It fills the buffer array (char[] cb) and returns a String. Reading is terminated by a line feed ('\n') or a carriage return ('\r').
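To tie this back to LeetCode: here is a minimal readN sketch in the same spirit as read(char cbuf[], int off, int len), assuming a read4(char[]) API that fills up to 4 chars and returns the count (note that a real "Read N Characters Given Read4 II" solution must also buffer leftover chars between calls):

public int readN(char[] buf, int n) {
 char[] buf4 = new char[4];
 int total = 0;
 while (total < n) {
  int count = read4(buf4);               // like read1(): grab the next chunk
  if (count == 0)                        // end of input
   break;
  for (int i = 0; i < count && total < n; i++)
   buf[total++] = buf4[i];               // copy into the destination buffer
 }
 return total;
}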





Sunday, March 29, 2015

Valid words


You are given a string 's' and a dictionary of English words. Your goal is to write an algorithm that returns all words from the dictionary that can be formed by characters from that string 's'.
Example:
s = "ogeg"
following words can be formed from 's': go egg ego . . .
Further, it is given that the string 's' always consists of four lower case characters. Let's say that the dictionary is contained in a file and can be fitted in memory. It is up to you which data structure you use to represent the dictionary.
How would you design an efficient algorithm? Follow up: What if the dictionary file cannot be fitted in memory?


This is a very interesting question, though the problem statement is kind of confusing.
Here is a question: why do we need to put the whole dictionary into memory? The goal is to output all words that can be formed from s, and from the example we know that order doesn't matter. So we only need to construct a hash map that stores the character counts of s, then read the dictionary from the stream and compare each word against those counts. I used another hash map for the word being checked. Since both strings have at most four characters, we don't need much extra space, do we?

If the question is, how to store the output words in an efficient way, I guess Trie would be a good one.


public static void validWords(String s, String dict) throws FileNotFoundException, IOException{
  BufferedReader br = new BufferedReader(new FileReader(dict));
  String currLine;
  Map<Character, Integer> sMap = new HashMap<Character, Integer>();
  for(int i = 0; i < s.length(); i++){
   char c = s.charAt(i);
   if(!sMap.containsKey(c))
    sMap.put(c, 1);
   else
    sMap.put(c, sMap.get(c) + 1);
  }
  Map<Character, Integer> word = new HashMap<Character, Integer>();
  while((currLine = br.readLine()) != null){
   if(currLine.length() > 4)
    continue;
   word.clear();
   boolean valid = true;
   for(int i = 0; i < currLine.length(); i++){
    char c = currLine.charAt(i);
    if(!sMap.containsKey(c)){
     valid = false;
     break;
    }
    if(!word.containsKey(c))
     word.put(c, 1);
    else
     word.put(c, word.get(c) + 1);
    if(word.get(c) > sMap.get(c)){
     valid = false;
     break;
    }
   }
   if(valid)
    System.out.println(currLine);
  }
  br.close();
 }

Generate Random ID


Write a function that receives a stream of pairs: id (some unique integer) + weight (positive real number).
The stream ends with the special pair {0,0}.
The function should return one of the unique ids at random, with probability proportional to its weight (relative to the total weight sum when the stream has ended).
Stream size and weights sum are unknown in advance, and you are allowed O(1) memory consumption.
Example: If this was the stream: {1,2},{2,2},{3,4},{4,10}, then id 2 would be returned with probability 1/9.

The hard part of this question is that no extra space is allowed, which means we cannot store the pairs. The method to generate a random number based on weights can be found here. Basically, for each weight we calculate the accumulated weight; then we generate a random number from 1 to the total accumulated weight and return the id whose range covers that number. Using the above array as an example:

1, 2 -> 2
2, 2 -> 4
3, 4 -> 8
4, 10 -> 18

So if we generate a random number between 1 and 18 and it is 3 or 4, then we should return id 2. The probability is 2 / 18 = 1 / 9.
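If we were allowed to store the pairs, the baseline method would be a sketch like this: accumulate the weights, draw a uniform number in [1, total], and return the first id whose cumulative weight covers it (weightedPick is a made-up name; ids and weights are parallel arrays):

public int weightedPick(int[] ids, int[] weights) {
 int total = 0;
 for (int w : weights)
  total += w;
 int p = new java.util.Random().nextInt(total) + 1; // uniform in [1, total]
 int cum = 0;
 for (int i = 0; i < ids.length; i++) {
  cum += weights[i];
  if (p <= cum)                                     // p falls in this id's range
   return ids[i];
 }
 return -1;                                         // unreachable for positive weights
}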

Now here comes the problem: we are not allowed to go back over the already-read pairs, so what should we do?

One way is to keep generating random numbers and updating the id we will output, until we reach the end of the stream. So if we are at {3, 4} and we generate a random number, say 6, this number is greater than the previous accumulated weight (4), so we update the id we will output to 3. If the number we generated were 4 or less, we would keep the previous id. The probability of getting each id is then proportional to its weight (assuming a uniform random number from 1 to the accumulated weight).


public int random(String fileName) throws FileNotFoundException{
  Scanner in = new Scanner(new FileReader(fileName));
  Random r = new Random();
  int cum = 0;
  int sample = 0;
  int last = 0;
  while(in.hasNext()){
   int val = in.nextInt();
   int weight = in.nextInt();
   if(val == 0 && weight == 0)
    break;
   cum += weight;
   int p = r.nextInt(cum) + 1; //uniform in [1, cum]
   if(p > last) //p falls in the new pair's range (last, cum]
    sample = val;
   last = cum;
  }
  return sample;
 }

Minimum bits required to send a sequence of a deck of cards


Consider the 52 cards of a deck. You generated a random sequence for these cards and want to send that sequence to a receiver. You want to minimize the communication between you and the receiver, i.e., minimize the number of bits required to send the sequence.
What is the minimum number of bits required to send the sequence?
Hint: It is not 6 x 52
So first, how do we come up with 6 * 52? Each card is in the range 1 - 52, so if we use 6 bits per card, there are 2^6 = 64 possibilities, enough to encode every card value.

Now, the bits sent must be able to distinguish all possible sequences of the 52 cards, and there are 52! of them. This means we need n bits such that 2^n >= 52!, so the theoretical answer is n = ceil(log2(52!)) = ceil(225.58) = 226 bits.

So how to get that limit?

The first approach (besides the 6 * 52 one):
We know that the first card has 52 possibilities, the second has 51, ..., and the 20th has 33 possibilities, so for each of the first 20 cards we need at least 6 bits.
Then the 21st has 32 possibilities, ..., and the 36th has 17 possibilities, so for the next 16 cards we need at least 5 bits each.
Then for the next 8 cards we need 4 bits each.
Then 4 cards, 3 bits.
Then 2 cards, 2 bits.
Then 1 card, 1 bit.
And after we know 51 cards, we know what the 52nd is, so we need no bits for it. Thus in total:

20 * 6 + 16 * 5 + 8 * 4 + 4 * 3 + 2 * 2 + 1 * 1 = 249 bits. This is still larger than the theoretical minimum.

The second one:
Consider we first send out k cards, which needs at most log2(Permu(k, 52)) bits, and now we have 52 - k cards. So we compute from 1 to 52 the k that can give us the minimum total bits required. We use recursion. Moreover, at each calling, the smaller number is already calculated, so we use an array to store at each n, the minimum bits required, thus can reduce some recursions.

However, my original logFact had a bug: it accumulated Math.log(n) instead of Math.log(i), which inflated the estimates (e.g., logPerm(1, 6) should round up to 3, but I got 4), so I could not reach the optimized answer. The code below fixes that. See here for the C++ solution that can get to 227 bits.


public class SendingCards {
 static int N = 52;
 //Given n cards, best number of cards sent
 //static int[] cardsSent = new int[N + 1];
 static int[] minBits = new int[N + 1];
 static{
  minBits[1] = 0;
  minBits[2] = 1;
  minBits[3] = 3;
  //given 2 cards, sent 1 card first
  //cardsSent[1] = 1;
  //cardsSent[2] = 1;
  //cardsSent[3] = 3;
  
 }
 public static double logFact(int n){
  if(n == 0)
   return 0;
  double rst = 0;
  for(int i = 1; i <= n; i++)
   rst += Math.log((double)i) / Math.log(2.0); //log2(i); summing i = 1..n gives log2(n!)
  return rst;
 }
 public static double logPerm(int k, int n){
  //System.out.format("%d, %d: %.4f\n", k, n, logFact(n) - logFact(n - k));
  return logFact(n) - logFact(n - k);
 }
 //maxSent can be changed, after certain recursions, the number should converge
 //so we don't need to calculate from 1 to 52
 public static int send(int n, int maxSent){
  if(n <= 1)
   return 0;
  if(n == 2)
   return 1;
  if(n == 3)
   return 3;
  //if minBits have been calculated before, return the previous solution
  if(minBits[n] > 0)
   return minBits[n];
  int min = Integer.MAX_VALUE;
  int bits = 0;
  int kmin = 0;
  
  for(int k = 1; k <= Math.min(maxSent, n); k++){
   bits = (int)Math.ceil(logPerm(k, n)) + send(n - k, maxSent);
   if(min > bits){
    min = bits;
    kmin = k;
   }
  }
  System.out.println("n: " + n + ", min: " + min);
  //cardsSent[n] = kmin;
  minBits[n] = min;
  return min;
  
 }
 public static void main(String[] args) {
  System.out.println(send(N, 16));

 }

}

Find popular items


Find the popular item in a sorted array of natural numbers.
An item is popular if it is repeated n/4 times or more.
For example:
Input: 1 2 3 4 4 4 4 4 5 6 7 8
The popular item is 4.
A linear scan is O(n), but the solution needs to be O(log n) time complexity and O(1) space complexity.

This is actually a very interesting problem. Since a popular item is repeated n/4 times or more and the array is sorted, a popular item must occur at index 0, n/4, n/2, or 3n/4. The rest is just binary search on each candidate to get its range.


public class PopularNumber {
 public static void popular(int[] n){
  if(n == null || n.length == 0)
   return;
  int len = n.length;
  int[] check = {0, len / 4, len / 2, 3 * len / 4};
  for(int i = 0; i < 4; i++){
   if(i > 0 && n[check[i]] == n[check[i - 1]])
    continue;
   int l = check[i];
   int start = binarySearchStart(n, n[l]);
   int end = binarySearchEnd(n, n[l]);
   //need to be larger than the ceil in case len / 4.0 is not an integer
   if(end - start + 1 >= Math.ceil(len / 4.0))
    System.out.println(n[l]);
  }
 }
 private static int binarySearchEnd(int[] n, int target){
  int len = n.length;
  int start = 0;
  int end = len - 1;
  while(start + 1 < end){
   int mid = (start + end) / 2;
   if(n[mid] <= target)
    start = mid;
   else
    end = mid;
  }
  if(n[end] == target)
   return end;
  else return start;
 }
 private static int binarySearchStart(int[] n, int target){
  int len = n.length;
  int start = 0;
  int end = len - 1;
  while(start + 1 < end){
   int mid = (start + end) / 2;
   if(n[mid] >= target)
    end = mid;
   else
    start = mid;
  }
  if(n[start] == target)
   return start;
  else
   return end;
 }

 public static void main(String[] args) {
  //int[] n = {1, 2, 3, 4, 4, 4, 4, 4, 5, 6, 7, 8};
  int[] n = {1, 1, 2, 3, 4};
  popular(n);
 }
}

Integer to String


Write a function that takes an integer as input and produces an output string.
Some sample Input/outputs are as follows
i/p o/p
1 - A
2 - B
3 - AA
4 - AB
5 - BA
6 - BB
7 - AAA
8 - AAB
9 - ABA
10 - ABB
11 - BAA
12 - BAB
13 - BBA
14 - BBB
15 - AAAA

This looks like an integer-to-binary conversion; however, instead of using the 0/1 representation, it is easier to understand if we convert this way:

A       1
B       2
AA    11
AB    12
BA    21


If we use binary representations, we will have:

A       1
B       10
AA    11
AB    100
BA    101

It is then natural that when n % 2 == 1 we append "A". However, when n % 2 == 0 we append "B", and since the plain binary form carries one extra digit in that case, we need to subtract 1 before we divide by 2.




public class IntegerToString {
 public static String convert(int n){
  if(n <= 0)
   return null;
  String rst = "";
  while(n > 0){
   if((n & 1) != 0)
    rst = "A" + rst;
   else {
    rst = "B" + rst;
    n -= 1;
   }
   n = (n >> 1);
  }
  return rst;
 }
 public static void main(String[] args) {
  for(int i = 1; i <= 15; i++)
   System.out.println(convert(i));
 }
}

Saturday, March 28, 2015

Maximum days to light candles


You are given the heights of n candles.
On the first day you light one candle,
on the second day you need to light two candles,
on the third day you need to light three candles,
..........
........
for as long as possible.
After lighting, the height of each lit candle is reduced by 1 that day. You can also extinguish any candle you want, but only at the end of a day.
You need to tell the maximum number of days you can carry on lighting the candles.
Example: there are three candles of heights {2, 2, 2}
Answer: 3
1. You light the first candle on day one. heights -> {1, 2, 2}
2. You light the second and third and extinguish the first one. heights -> {1, 1, 1}
3. You light all the candles. heights -> {0, 0, 0}


This is actually a tricky question. The tricky part is not the actual implementation (in fact, there is no fancy algorithm here) but how you think about the question.

In fact, the last paragraph was written 90 minutes ago; then, when I tried to use an example to explain my first approach, I realized it was wrong.

Then came the second and third approaches...

Then I realized the problem is as easy as the interview type suggests: a written test.

This is a simple Greedy approach.

Sort the array. Start from the maximum; each day we add one more candle and subtract the number of candles needed that day. So given

2 2 2

Day 1: 2 - 1 = 1
Day 2: 1 + 2 - 2 = 1
Day 3: 1 + 2 - 3 = 0

That's it. Simple. Yeah, simple....


public static int maxDays(int[] candles){
  if(candles == null || candles.length == 0)
   return 0;
  Arrays.sort(candles);
  int lit = candles[candles.length - 1] - 1;
  int count = 1;
  
  for(int i = candles.length - 2; i >= 0; i--){
   if(lit + (candles[i] - (count + 1)) < 0)
    break;
   lit += candles[i] - (++count);
  }
  return count;
 }





Task Scheduler

Given the interface below, implement a task scheduler.
interface Task {
    void Run();
    Set<Task> GetDependencies();
}

Additionally, the task scheduler should follow two rules.
1. Each task may only be executed once.
2. The dependencies of a task should be executed before the task itself.

This one is quite straightforward: given a set of tasks, use a DFS approach.


public class TaskScheduler {
 Set<Task> executed;
 Set<Task> allTasks;
 Set<Task> inProcess;
 public TaskScheduler(Set<Task> tasks){
  allTasks = tasks;
  executed = new HashSet<Task>();
  inProcess = new HashSet<Task>();
 }
 public void schedule(Set<Task> allTasks){
  for(Task t : allTasks){
   if(executed.contains(t))
    continue;
   if(!inProcess.isEmpty() && inProcess.contains(t)){
    t.Run();
    inProcess.remove(t);
    executed.add(t);
    continue;
   }
   inProcess.add(t);
   schedule(t.GetDependencies());
   t.Run();
   inProcess.remove(t);
   executed.add(t);
  }
 }
}


Now comes the second one:


implement the same task scheduler with parallel task execution.

So I am thinking maybe it needs a concurrency approach, so I went back to my old posts and tried to come up with one. Since all dependencies should be executed first, I use a stack to schedule the tasks: the scheduler first pushes the task and all of its (transitive) dependencies onto the stack. Then I use a releaseCount to track the available resources. If releaseCount == 0, or if the current task is not on top of the stack, it should wait for its turn. Pop the task and execute it; while executing, the task has acquired a resource, so releaseCount is decremented by one, and after executing, the task releases the resource, so releaseCount is incremented by one.

However, I am not sure if my approach is correct, so I have an open question on Stackoverflow.


public class TaskSchedulerParallel {
 Set<Task> executed;
 Stack<Task> scheduler;
 int releaseCount;
 //number of parallel nodes
 public TaskSchedulerParallel(int N){
  executed = new HashSet<Task>();
  scheduler = new Stack<Task>();
  releaseCount = N;
 }
 public synchronized void schedule(Task t) throws InterruptedException {
  scheduler.push(t);
  for(Task dep : t.GetDependencies()){
   if(!executed.contains(dep) && !scheduler.contains(dep))
    schedule(dep);
  }
  if(releaseCount == 0 || (!scheduler.isEmpty() && scheduler.peek() != t))
   t.wait();
  releaseCount--;
  scheduler.pop();
  t.Run();
  executed.add(t);
  releaseCount++;
 }
 

}

Check if a graph contains a cycle

This post is also derived from last night's post "Construct tree with minimum weight from an acyclic undirected unweighted graph": for that algorithm to work, the assumption is that the graph must be a tree or forest, so I wrote a simple method to check the assumption.

The algorithm is based on the union-find (disjoint set) data structure. This is my third version of it, since I wanted to create a generic class.

The UnionFind class:

import java.util.*;
public class UnionFind<T> {
 Map<T, Element> elements;
 int count;
 public UnionFind(){
  elements = new HashMap<T, Element> ();
 }
 public UnionFind(int size){
  elements = new HashMap<T, Element> (size);
 }
 public void add(T e){
  elements.put(e, new Element(e, e, 1));
  count++;
 }
 public void add(Collection<T> collection){
  for(T e : collection)
   add(e);
 }
 public T find(T e){
  Element curr = elements.get(e);
  Element root = elements.get(curr.root);
  if(curr.root != root.root){
   curr.root = find(root.root);
  }
  return curr.root;
 }
 public boolean union(T e1, T e2){
  if(count <= 1)
   return false;
  //union by rank must operate on the set roots, not on e1/e2 themselves
  T r1 = find(e1);
  T r2 = find(e2);
  if(r1 == r2)
   return false;
  Element root1 = elements.get(r1);
  Element root2 = elements.get(r2);
  if(root1.rank < root2.rank)
   root1.root = r2;
  else if(root1.rank > root2.rank)
   root2.root = r1;
  else{
   root2.root = r1;
   root1.rank++;
  }
  count--;
  return true;
 }
 private class Element{
  T element;
  T root;
  int rank;
  public Element(T element, T root, int rank){
   this.element = element;
   this.root = root;
   this.rank = rank;
  }
 }
}


The actual method to check whether a cycle exists is quite straightforward: union the two endpoints of every edge; if they are already in the same set, there is a cycle. One catch: in an undirected graph each edge appears in both adjacency lists, so we must make sure not to union the same edge twice.


private boolean isForest(){
  UnionFind<UndirectedGraphNode> uf = new UnionFind<UndirectedGraphNode> ();
  uf.add(forest.vertices.values());
  //each undirected edge appears in both adjacency lists,
  //so only union an edge the first time we see it
  Set<UndirectedGraphNode> processed = new HashSet<UndirectedGraphNode>();
  for(UndirectedGraphNode n : forest.vertices.values()){
   for(UndirectedGraphNode nei : n.neighbors){
    if(processed.contains(nei))
     continue;
    if(!uf.union(nei, n))
     return false;
   }
   processed.add(n);
  }
  return true;
 }

Diameter of a graph

So after last night's two posts, I decided to practice and write code to calculate the diameter of a graph.

The graph is an undirected weighted graph (my same old Graph class), so the way I do it is to use Dijkstra to calculate the maximum shortest distance from one node to all other nodes, updating the max shortest distance each time. The maximum shortest distance over all source nodes is the diameter.


public class Diameter {
 private Graph g;
 public Diameter(Graph g){
  this.g = g;
 }
 public double maxShortestDistance(Node n){
  g.chooseSource(n);
  PriorityQueue<Node> pq = new PriorityQueue<Node>(new NodeComparator());
  //Set visited = new HashSet ();
  pq.offer(n);
  double maxDist = 0.0;
  while(!pq.isEmpty()){
   Node curr = pq.poll();
   //a node's distance is final when it is polled, so take the max here
   //rather than at relaxation time (a relaxed distance may still shrink)
   maxDist = Math.max(maxDist, curr.dist);
   for(Node nei : g.neighbors.get(curr)){
    double distance = g.getEdge(curr.label, nei.label).weight + curr.dist;
    if(distance < nei.dist){
     pq.remove(nei);
     nei.dist = distance;
     pq.offer(nei);
    }
   }
  }
  return maxDist;
 }
 public double diameter(){
  double diameter = 0.0;
  for(Node v : g.vertices.values()){
   diameter = Math.max(diameter, maxShortestDistance(v));
  }
  return diameter;
 }
 private class NodeComparator implements Comparator<Node>{
  
  public int compare(Node a, Node b){
   if(a.dist - b.dist > 0)
    return 1;
   else if (a.dist - b.dist < 0)
    return -1;
   return 0;
  }
 }
}

Friday, March 27, 2015

Diameter of a tree

My last post prompted me to calculate the diameter of a tree. And there is actually such a question:


The diameter of a tree is the number of nodes on the longest path between two leaves in the tree. The diagram below shows a tree with diameter nine, the leaves that form the ends of a longest path are shaded (note that there is more than one path in each tree of length nine, but no path longer than nine nodes).
In particular, note that the diameter of a tree T is the largest of the following quantities:
the diameter of T's left subtree
the diameter of T's right subtree
the longest path between leaves that goes through the root of T
Given the root node of the tree, return the diameter of the tree


public static int maxDepth(TreeNode root){
  if(root == null)
   return 0;
  if(root.left == null && root.right == null)
   return 1;
  return Math.max(maxDepth(root.left), maxDepth(root.right)) + 1;
 }
 public static int Diameter(TreeNode root){
  if(root == null)
   return 0;
  if(root.left == null && root.right == null)
   return 1;
  //the longest path through the root has maxDepth(left) + maxDepth(right) + 1 nodes
  //(the problem counts nodes, not edges)
  return Math.max(Math.max(Diameter(root.left), Diameter(root.right)), maxDepth(root.left) + maxDepth(root.right) + 1);
 }

Construct tree with minimum weight from an acyclic undirected unweighted graph


Given an acyclic undirected unweighted connected graph, find its representation as a tree with the least height. Brute force is O(n^2). Come up with an O(n) solution
Well, the problem is the same as saying: given a tree, find its root. An acyclic connected graph is, in fact, a tree; the problem is that we don't know where the root is. So the way we do it is to find the diameter of the graph. The diameter of the graph is defined as:



The diameter d of a graph is the maximum eccentricity of any vertex in the graph. That is, d is the greatest distance between any pair of vertices or, alternatively, d = max_{v in V} ε(v). To find the diameter of a graph, first find the shortest path between each pair of vertices. The greatest length of any of these paths is the diameter of the graph.



To do it in O(n), we first pick an arbitrary node from the graph and find the farthest node it can reach; call it farthest1. Then find the farthest node farthest1 can reach; call it farthest2. The diameter of the graph is the distance between farthest1 and farthest2. The reason this works is that the given graph is a tree, so the two farthest nodes must be leaves with only one neighbor; if one were not, we could go farther down and find a farther node.

So how to find the root? Given the two farthest nodes, the root must be midway between them, so from one node toward the other, find the middle one. The way I do it is brute force: expand from both nodes to the appropriate distances, then find the common node. So the question is: is it possible that there exists a second common node at those distances? The answer is definitely no, since the given graph is a tree.


public class ConstructTree {
 UndirectedGraph forest;
 int diameter;
 public ConstructTree(UndirectedGraph g){
  forest = g;
  //System.out.println("Tree built");
 }
 private UndirectedGraphNode getFarthest(UndirectedGraphNode n){
  Queue<UndirectedGraphNode> q = new LinkedList<UndirectedGraphNode>();
  q.add(n);
  Map<UndirectedGraphNode, Integer> distance = new HashMap<UndirectedGraphNode, Integer> ();
  distance.put(n, 0);
  //Set visited = new HashSet ();
  int maxDist = 0;
  //int currDist = 0;
  UndirectedGraphNode farthest = null;
  while(!q.isEmpty()){
   UndirectedGraphNode curr = q.poll();
   int d = distance.get(curr);
   for(UndirectedGraphNode neighbor : curr.neighbors){
    if(!distance.containsKey(neighbor)){
     distance.put(neighbor, d + 1);
     q.offer(neighbor);
    }
   }
  }
  for(Entry<UndirectedGraphNode, Integer> e : distance.entrySet()){
   if(e.getValue() > maxDist){
    maxDist = e.getValue();
    farthest = e.getKey();
   }
  }
  diameter = maxDist;
  return farthest;
 }
 private UndirectedGraphNode getRoot(UndirectedGraphNode n1, UndirectedGraphNode n2){
  int d1 = diameter / 2;
  int d2 = diameter - d1;
  //expand level by level (with visited sets) so that the frontiers hold
  //exactly the nodes at distance d1 from n1 and d2 from n2
  Queue<UndirectedGraphNode> q1 = frontierAt(n1, d1);
  Queue<UndirectedGraphNode> q2 = frontierAt(n2, d2);
  for(UndirectedGraphNode node1 : q1){
   for(UndirectedGraphNode node2 : q2){
    if (node1 == node2)
     return node1;
   }
  }
  return null;
 }
 private Queue<UndirectedGraphNode> frontierAt(UndirectedGraphNode n, int depth){
  Queue<UndirectedGraphNode> q = new LinkedList<UndirectedGraphNode>();
  Set<UndirectedGraphNode> visited = new HashSet<UndirectedGraphNode>();
  q.add(n);
  visited.add(n);
  while(depth > 0){
   int size = q.size();
   for(int i = 0; i < size; i++){
    UndirectedGraphNode curr = q.poll();
    for(UndirectedGraphNode nei : curr.neighbors){
     if(visited.add(nei))
      q.offer(nei);
    }
   }
   depth--;
  }
  return q;
 }
 
 public UndirectedGraphNode construct(){
  UndirectedGraphNode random = forest.vertices.entrySet().iterator().next().getValue();
  UndirectedGraphNode farthest = getFarthest(random);
  UndirectedGraphNode farthest2 = getFarthest(farthest);
  return getRoot(farthest, farthest2);
 }
}

House coloring problem

I ask my friends to throw me a practicing problem, here it is:

There are a row of houses, each house can be painted with three colors red, blue and green. The cost of painting each house with a certain color is different. You have to paint all the houses such that no two adjacent houses have the same color. You have to paint the houses with minimum cost. How would you do it?

No doubt the first thought is DP. However, the trick is that we cannot keep just one array of costs, because we cannot determine which color to paint except for the first house.

The idea is:
cost[color0][house] = Math.min(cost[color1][house - 1], cost[color2][house - 1]) + houseCost[color0][house].

Now here is a follow-up question: what if there are n colors? All I could think of at first is, for each color, to go through all the costs of house - 1 and find the minimum among the other colors, but that makes the whole complexity O(mn^2), where m is the number of houses and n is the number of colors.
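One known refinement (a hedged sketch; minCostK is a made-up name, and n >= 2 colors is assumed): keep only the smallest and second-smallest totals of the previous house, so each color does an O(1) lookup instead of scanning all other colors, which gives O(mn) overall. The original three-color implementation follows after it.

//sketch: n-color version in O(mn); cost[i][j] = cost of painting house j with color i
public static int minCostK(int[][] cost) {
 if (cost == null || cost.length < 2 || cost[0].length == 0)
  return -1;
 int n = cost.length, m = cost[0].length;
 int min1 = 0, min2 = 0, minColor = -1; //best and second-best totals of the previous house
 for (int j = 0; j < m; j++) {
  int newMin1 = Integer.MAX_VALUE, newMin2 = Integer.MAX_VALUE, newColor = -1;
  for (int i = 0; i < n; i++) {
   //reuse the previous best unless it used this color; then take the second best
   int total = cost[i][j] + (i == minColor ? min2 : min1);
   if (total < newMin1) {
    newMin2 = newMin1;
    newMin1 = total;
    newColor = i;
   } else if (total < newMin2)
    newMin2 = total;
  }
  min1 = newMin1; min2 = newMin2; minColor = newColor;
 }
 return min1;
}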


public class HouseColoring {
 //assume rows are the cost of each color and columns are for each house
 public static int minCost(int[][] house){
  if(house == null || house.length == 0 ||house[0].length == 0)
   return -1;
  int cols = house[0].length;
  int[][] cost = new int[3][cols];
  for(int i = 0; i < 3; i++)
   cost[i][0] = house[i][0];
  for(int j = 1; j < cols; j++){
   cost[0][j] = Math.min(cost[1][j - 1], cost[2][j - 1]) + house[0][j];
   cost[1][j] = Math.min(cost[0][j - 1], cost[2][j - 1]) + house[1][j];
   cost[2][j] = Math.min(cost[0][j - 1], cost[1][j - 1]) + house[2][j];
  }
  return Math.min(cost[0][cols - 1], Math.min(cost[1][cols - 1], cost[2][cols - 1]));
 }
 public static void main(String[] args) {
  int[][] house = new int[3][];
  house[0] = new int[] {1, 3, 2, 6, 7, 8, 9};
  house[1] = new int[] {5, 4, 1, 3, 9, 8, 10};
  house[2] = new int[] {7, 6, 1, 5, 8, 2, 3};
  System.out.println(minCost(house));
 }
}

Design Hotel Reservation


Design a hotel reservation system. To make it simple, we assume that the hotel has only one building and the building has only one floor. Design your objects so that they work better with a non-sql database, say a document-oriented database.

The reservation system needs at least two other classes. The first one is the Room class, which is identified by the room number and has several methods such as book() and unbook().

The second one is the Customer class. This class contains the information of a customer who has made a reservation.

I also use two maps to store all rooms and customers. The reason to use maps is that retrieval is easy: I can use the name and room number to find the customer and room I want.

Two public methods: makeReservation and cancelReservation.

makeReservation() will create a Customer object, find the next available room, and assign that to the customer, so when the customer checks in, he/she can easily find the room. If there is no available room, then no reservation can be made.

cancelReservation() will remove the Customer object c out of the customers map and make the room available.

There are a couple of private helper methods that I won't go into in detail.


public class ReservationSystem{
 final Map<Integer, Room> rooms;
 Map<String, Customer> customers;
 
 public ReservationSystem(List<Room> builtRoom){
  rooms = new HashMap<Integer, Room> ();
  //when the hotel is built, all the rooms are added into the system
  for(Room r : builtRoom)
   rooms.put(r.roomNumber, r);
  customers = new HashMap<String, Customer> ();
 }
 
 public boolean makeReservation(String name, String id, String information){
  int room = findNextAvailableRoom();
  if(room == -1){
   System.out.println("No available room!");
   return false;
  }
  Customer c = new Customer(name, id, information, room);
  rooms.get(room).book();
  customers.put(name, c);
  return true;
 }
 
 public boolean cancelReservation(String name){
  Customer c = findCustomer(name);
  if(c == null){
   System.out.println("No reservation found!");
   return false;
  }
  customers.remove(name);
  makeAvailable(c.roomBooked);
  return true;
 }
 
 private Customer findCustomer(String name){
  if(!customers.containsKey(name))
   return null;
  return customers.get(name);
 }
 
 private void makeAvailable(int roomNumber){
  rooms.get(roomNumber).unbook();
 }
 private int findNextAvailableRoom(){
  for(Room r : rooms.values()){
   if(!r.isBooked())
    return r.roomNumber;
  }
  return -1;
 }
 
 
 class Room{
  final int roomNumber;
  private boolean available;
  
  public Room(int rN){
   roomNumber = rN;
   available = true;
  }
  
  public void book(){
   available = false;
  }
  void unbook(){
   available = true;
  }
  public boolean isBooked(){
   return !available;
  }
 }
 
 class Customer{
  String name;
  String id;
  String information;
  int roomBooked;
  
  public Customer(String name, String id, String information, int roomNumber){
   this.name = name;
   this.id = id;
   this.information = information;
   roomBooked = roomNumber;
  }
 }
}

Thursday, March 26, 2015

Swap sequence


You are given two arrays - A & B. The length of the arrays is the same - N. The values in the arrays are integers from 0 to N - 1 in random order and no number is repeated. You need to list a sequence of two-element swaps required to reorder the array A into the order of the array B. Additionally, at each step of the sequence you are allowed to swap two elements only if one of them is 0.

The idea is very simple.
1. Find the index of element 0 in A; in B we get a corresponding element.
2. If it is not zero, find the index of that element in A, and swap.
3. If it is zero, then find the index of the first element in A that is not equal to the corresponding element in B, and swap it with the 0 in A. Repeat 2.


public class SwapSequence {
 public static void printSequence(int[] A, int[]B){
  int indexZ  = findIndex(A, 0);
  if(B[indexZ] == 0){
   indexZ = getIndex(A, B, indexZ);
   if(indexZ == -1)
    return;
  }
  while(B[indexZ] != 0){
   int index = findIndex(A, B[indexZ]);
   System.out.println(A[indexZ] + " " + A[index]);
   swap(A, indexZ, index);
   indexZ = B[index] == 0 ? getIndex(A, B, index) : index;
   if(indexZ == -1)
    return;
  }
  
 }
 private static int getIndex(int[] A, int[] B, int indexZ){
  for(int i = 0; i < A.length; i++){
   if(A[i] != B[i]){
    //print this swap too: A[indexZ] is the 0
    System.out.println(A[indexZ] + " " + A[i]);
    swap(A, i, indexZ);
    return i;
   }
  }
  return -1;
 }
 private static void swap(int[] array, int i, int j){
  int tmp = array[i];
  array[i] = array[j];
  array[j] = tmp;
 }
 private static int findIndex(int[] array, int element){
  for(int i = 0; i < array.length; i++){
   if(array[i] == element)
    return i;
  }
  return -1;
 }

 public static void main(String[] args) {
  int[] A = {0, 4, 8, 7, 1, 2, 5, 3, 6};
  int[] B = {1, 7, 6, 5, 8, 4 ,2, 3 ,0};
  printSequence(A, B);
  System.out.println(Arrays.toString(A));

 }

}

Read bytes


Given an API ReadFromDisk() which returns a block of 256 bytes from disk. Implement the API Read(SizeInBytes) which reads SizeInBytes bytes from the disk by calling only ReadFromDisk. Notice that every call to ReadFromDisk() will read the next 256 bytes from disk.
Update 2015 - 03 - 30
I realized that I had understood the ReadN problem wrong: the char[] buff is the output array. Here is the updated version. I updated the ReadN solution too.


public int read2(byte[] buff, int sizeInBytes){
  int index = 0, next = 0;
  //assume return -1 if reach the end of the file
  byte[] tmp = new byte[256];
  while(index < sizeInBytes && (next = readFromDisk(tmp))!= -1)
   for(int i = 0; i< next && index < sizeInBytes; buff[index++] = tmp[i++]);
  return index;
 }
 public int readFromDisk(byte[] buff){
  return 256;
 }


This is similar to LeetCode's ReadN question. I hijacked the code from the Java source (see here if you want the original).

public void read(int sizeInBytes) throws IOException{
  //readFromDisk() will read 256 bytes and put into 
  //the byte array, so we only need the size / 256
  int size = sizeInBytes / 256;
  byte[] b = new byte[size];
  int c = readFromDisk();
  //assume readFromDisk() will return -1 if the 
  //length is less than 256
  if(c == -1)
   return;
  b[0] = (byte)c;
  int i = 1;
  for(; i < size; i++){
   c = readFromDisk();
   //reach the end of the file
   if(c == -1)
    break;
   b[i] = (byte)c;
  }
 }
 public int readFromDisk(){
  return 255;
 }

Snakes and Ladders


Design an optimized algorithm to solve snakes and ladders with the least number of rolls possible, under the assumption that you get whatever roll you want when you roll the dice.
I didn't even know about the game at first. As usual, if the question asks about the "minimum" of something, there is a high chance that it is a DP.

Just to describe the game for short:

It is a common board game, where you roll the dice and advance by the amount you get;
If you hit the lower side of a ladder, you are lucky: you move up to the upper side of the ladder;
If you face a snake (the upper side of the snake), congratulations, go back to the tail (lower side) of the snake;
If you are the first one to reach the destination, you win.

If you are so interested in playing the game, consult Wikipedia.

Ok, back to the problem. Create a 1D array of length n * n holding the minimum rolls needed to reach each square. If index < 6, then as long as it's not the upper side of a snake, it takes a minimum of 1 roll. If index >= 6, it takes the minimum over squares index - 6 to index - 1, plus 1 roll, under normal conditions. If the square is the upper side of a ladder, it also takes the rolls needed to reach the lower side of the ladder, whichever is smaller. If it is the upper side of a snake, then every time you hit this square you are sent back, so it effectively takes infinity to reach the destination through it.

I use a struct to store the information on the board; the test-case board is set up in main() below.





import java.util.*;
public class SnakesNLadders {
 public static int roles(Struct[][] board){
  if(board == null || board.length == 0 || board[0].length == 0 || board.length != board[0].length)
   return 0;
  int n = board.length;
  int[] roles = new int[n * n];
  for(int i = 0; i < n * n; i++){
   int x = i / n;
   int y = (x % 2 != 0) ? (n - 1 - i % n) : i % n;
   if(i < 6){
    roles[i] = board[x][y].s.equals("SU") ? Integer.MAX_VALUE / 2 : 1;
   }
   else{
    roles[i] = Integer.MAX_VALUE;
    for(int j = i - 6; j < i; j++)
     roles[i] = Math.min(roles[i], roles[j] + 1);
    if(board[x][y].s.equals("LU"))
     //in case the lower end of the ladder is the upper end of the
     //snake
     roles[i] = Math.min(ladder(board, x, y, roles),  roles[i]);
    else if(board[x][y].s.equals("SU"))
     roles[i] = Integer.MAX_VALUE / 2;
   }
  }
  
  return roles[n * n - 1];
 }
 private static int ladder(Struct[][] board, int x, int y, int[] roles){
  int n = board.length;
  int xc = board[x][y].x;
  return xc % 2 != 0 ? roles[xc * n + (n - 1 - board[x][y].y)]
     : roles[xc * n + board[x][y].y];
 }
 /**
  * consists the string which represents the status of the square:
  * "SU": snake upper side
  * "SL": snake lower side
  * "LU": ladder upper side
  * "LL": ladder lower side
  * as well as the coordinates of its corresponding end
  * e.g., "SU", 1, 1 indicates the lower side of the snake is at x = 1, y = 1 
  * @author shirleyyoung
  *
  */
 private static class Struct{
  String s;
  //the coordinate of the corresponding square
  int x;
  int y;
  public Struct(String s, int x, int y){
   this.s = s;
   this.x = x;
   this.y = y;
  }
 }
 public static void main(String[] args) {
  Struct[][] board = new Struct[4][4];
  for(int i = 0; i < 4; i++)
   //indicate nothing in the square
   Arrays.fill(board[i], new Struct("", -1, -1));
  board[0][1] = new Struct("SL", 1, 2);
  board[1][2] = new Struct("SU", 0, 1);
  board[0][3] = new Struct("LL", 2, 3);
  board[2][3] = new Struct("LU", 0, 3);
  board[1][1] = new Struct("SL", 3, 2);
  board[3][2] = new Struct("SU", 1, 1);
  board[1][0] = new Struct("LL", 2, 0);
  board[2][0] = new Struct("LU", 1, 0);
  System.out.println(roles(board));
 }
}

Markov String Generator

Update 2015 - 03 - 31:

Ok, I optimized the code and removed that cumbersome extra space. Now my code reads the strings from the file and generates the table directly.


import java.io.*;
import java.util.*;
import java.util.Map.Entry;
public class StringGenerator2 {
 private class Frequency{
  String s;//the string itself
  int count;//cumulative occurrence count
  public Frequency(String s, int c){
   this.s = s;
   this.count = c;
  }
 }
 Map<String, List<Frequency>> table;
 Random r;
 public StringGenerator2(String fileName) throws FileNotFoundException, IOException{
  table = new HashMap<String, List<Frequency>> ();
  r = new Random();
  generateTable(fileName);
  assert(isUniqueList());
 }
 public void generateTable(String fileName) throws FileNotFoundException, IOException{
  BufferedReader reader = new BufferedReader(new FileReader(fileName));
  String line;
  String last = null;
  while((line = reader.readLine()) != null){
   String[] row = line.split(" ");
   for(String s : row){
    String puntua = null;
    if(!Character.isLetter(s.charAt(s.length() - 1))){
     puntua = s.substring(s.length() - 1);
     s = s.substring(0, s.length() - 1);
    }
    if(!table.containsKey(s))
     table.put(s, new ArrayList<Frequency> ());
    if(last != null)
     add(table.get(last), s);
    if(puntua != null)
     add(table.get(s), puntua);
    last = s;
   }
  }
  reader.close();
  mapFrequency();
 }
 private void add(List<Frequency> next, String s){
  int index = -1;
  for(int i = 0; i < next.size(); i++){
   if(next.get(i).s.equals(s)){
    index = i;
    break;
   }
  }
  if(index != -1)
   next.get(index).count++;
  else
   next.add(new Frequency(s, 1));
 }
 private void mapFrequency(){
  for(Entry<String, List<Frequency>> e : table.entrySet()){
   List<Frequency> next = e.getValue();
   for(int i = 1; i < next.size(); i++){
    next.get(i).count += next.get(i - 1).count;
   }
  }
 }
 private boolean isUniqueList(){
  Set<String> words = new HashSet<String> ();
  for(List<Frequency> next : table.values()){
   words.clear();
   for(Frequency f : next){
    if(!words.add(f.s))
     return false;
   }
  }
  return true;
 }
 public void generator(String outputName, int length) throws IOException{
  if(table.size() == 0){
   System.out.println("No training set found!");
   return;
  }
  FileWriter writer = new FileWriter(outputName);
  int index = 0;
  String last = null;
  int countWord = 0;//number of words in one line
  while(index < length){
   String s = null;
   if(last == null || !table.containsKey(last)){
    //generate a random string from the key set
    Object[] keys = table.keySet().toArray();
    s = (String) keys[r.nextInt(keys.length)];
   } else
    s = getNext(table.get(last));
   writer.append(s).append(" ");
   countWord++;
   if(countWord == 15){
    writer.append("\n");
    countWord = 0;
   }
   last = s;
   index++;
  }
  writer.append(".\n");
  writer.flush();
  writer.close();
 }
 private String getNext(List<Frequency> next){
  int size = next.size();
  int nextIndex = r.nextInt(next.get(size - 1).count) + 1;
  for(Frequency f : next){
   if(nextIndex <= f.count)
    return f.s;
  }
  return next.get(r.nextInt(size)).s;
 }
  
}

Too much fun for this morning. I saw this problem posted on Career Cup yesterday. I couldn't understand what the problem really meant (see here), especially after I took a look at Wikipedia's couldn't-be-more-confusing explanation of Markov chains. But Google never lets me down. I found this blog, which explained it in a clearer way. In short, using a Markov chain to generate a new string is like using Bayesian inference to predict when the first snow will happen next year: based on when the first snow happened in the last couple of years.

In short, given a training set of strings, we create a table of the probabilities of all the "next" strings following each given string. Then, when we need to generate a new string, we randomly select the first string from our training set, and pick each following string based on the probability of each "next" string in the table, given the last generated string.

To get the next string based on the pre-calculated probability, I used this method: once you have the cumulative distribution, you generate a random number and select the string whose range in the CDF it falls into.

I have to say that my implementation is not a good one. First, I used a list to load all strings into memory, which will definitely cause a problem if there are too many strings. However, I don't know how to check whether the current string is already in the table without putting everything into memory.

Next, to calculate the frequencies I used another map to count the occurrences of each "next" string, then two arrays to accumulate the sums, and finally a list of structures (string, accumulated frequency) to store all "next" strings for each key in the table. See how much extra space I have used; this is definitely not good.

Moreover, when we hit a punctuation mark, I add it into the "next" list but don't include it as a key.

Actually generating the string is easy: as I mentioned before, just keep randomly picking the next string from the "next" list based on the frequencies.

There are lots of ways to optimize this implementation. But I think this is enough for interview/learning purpose.


package markovChain;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.*;
import java.util.Map.Entry;
public class StringGenerator {
 List<String> trainingSet;
 String file;
 public StringGenerator(String file) throws FileNotFoundException, IOException{
  this.file = file;
  trainingSet = new ArrayList<String> ();
  read();
 }
 /**
  * read training set and buffered it into a list
  * @throws FileNotFoundException
  * @throws IOException
  */
 public void read() throws FileNotFoundException, IOException{
  BufferedReader reader = new BufferedReader(new FileReader(file));
  String line;
  while((line = reader.readLine()) != null){
   String[] row = line.split(" ");
   for(String s : row)
    trainingSet.add(s);
  }
  reader.close();  
 }
 /**
  * Generate frequency table
  * @param output
  * @param length
  * @throws IOException
  */
 private Map<String, List<Frequency>> generateTable(){
  Map<String, List<String>> frequencyTable = new HashMap<String, List<String>> ();
  frequencyTable.put(trainingSet.get(0), new ArrayList<String> ());
  for(int i = 1; i < trainingSet.size(); i++){
   String s = trainingSet.get(i);
   String p = null;
   if(!Character.isLetter(s.charAt(s.length() - 1))){
    p = s.substring(s.length() - 1, s.length());
    s = s.substring(0, s.length() - 1);
   }
   if(!frequencyTable.containsKey(s)){
     frequencyTable.put(s, new ArrayList<String> ());
    if(p != null)
     frequencyTable.get(s).add(p);
   }
   String last = trainingSet.get(i - 1);
   if(!Character.isLetter(last.charAt(last.length() - 1)))
    last = last.substring(0, last.length() - 1);
   frequencyTable.get(last).add(s); 
  }
  Map<String, List<Frequency>> nextFrequency = getFrequency(frequencyTable);
  return nextFrequency;
 }
 private Map<String, List<Frequency>> getFrequency(Map<String, List<String>> frequencyTable){
  Map<String, List<Frequency>> countFre = new HashMap<String, List<Frequency>>();
  for(Entry<String, List<String>> e : frequencyTable.entrySet()){
   String key = e.getKey();
   List<String> next = e.getValue();
   Map<String, Integer> f = new HashMap<String, Integer>();
   for(String s : next){
    if(!f.containsKey(s))
     f.put(s, 1);
    else
     f.put(s, f.get(s) + 1);
   }
   List<Frequency> nextFrequency = mapFrequency(f);
   countFre.put(key, nextFrequency); 
  }
  return countFre;
 }
 private List<Frequency> mapFrequency(Map<String, Integer> f){
  String[] array = new String[f.size()];
  int[] c = new int[f.size()];
  int index = 0;
  for(Entry<String, Integer> ec : f.entrySet()){
   array[index] = ec.getKey();
   c[index] = ec.getValue();
   index++;
  }
  for(int i = 1; i < c.length; i++){
   c[i] += c[i - 1];
  }
  List<Frequency> rst = new ArrayList<Frequency> ();
  for(int i = 0; i < c.length; i++)
   rst.add(new Frequency(array[i], c[i]));
  return rst;
 }
 
 
 /**
  * generate the string
  * @param output
  * @param length
  * @throws IOException
  */
 public void Generator(String output, int length) throws IOException{
  if(trainingSet.size() == 0){
   System.out.println("No training set found!");
   return;
  }
  Map<String, List<Frequency>> nextFrequency = generateTable();
  Random r = new Random();
  FileWriter writer = new FileWriter(output);
  int i = 0;
  String last = null;
  int countWord = 0;
  while(i < length){
   String s = null;
   if(last == null || !nextFrequency.containsKey(last))
    s = trainingSet.get(r.nextInt(trainingSet.size()));
   else
    s = getNext(nextFrequency.get(last));
   writer.append(s).append(" ");
   countWord++;
   if(countWord == 15){
    writer.append("\n");
    countWord = 0;
   }
   last = s;
   i++;
  }
  writer.append(".\n");
  writer.flush();
  writer.close();
 }
 private String getNext(List<Frequency> nextFre){
  int size = nextFre.size();
  Random r = new Random();
  int next = r.nextInt(nextFre.get(size - 1).count) + 1;
  for(Frequency f : nextFre){
   if(next <= f.count)
    return f.s;
  }
  return nextFre.get(r.nextInt(size)).s;
 }
 /**
  * the structure that contains the string and its count
  * @author shirleyyoung
  *
  */
 private class Frequency{
  String s;
  int count;
  public Frequency(String s, int c){
   this.s = s;
   count = c;
  }
 } 
}


Test
In order to show my averseness to the research that I am working on, I decided to use some paragraphs from the papers I am reading as the test case.

"The dynamics of flexible polymers in shear is of great practical interest because this type of flow occurs whenever a fluid flows past a surface. 
Macroscopic, non-Newtonian rheological properties of polymer solutions, such
as flow-dependent viscosities and normal
stresses, result from microscopic stresses that
arise when polymeric molecules are stretched
and affect the solvent motion. Thus, much
effort has been directed at predicting the molecular
dynamics of polymers in shear flows. However, it has been difficult to rigorously
test these predictions because the dynamics
of a polymer molecule in shear have
not been observed directly. Experimental efforts
have mainly focused on measuring bulk
rheological properties or on measuring the
scattering of light or neutrons by polymer
solutions. Here we describe how single-molecule imaging techniques can be
used to study the configurations of polymers
in shear flow so that the detailed molecular predictions of theories can be tested.
In short, Shirley doesn't care about how polymer tumbles in shear flow!
Polymer dynamics are of central importance in materials science,
mechanical engineering, biology and medicine. The dynamics of
macromolecular solutions and melts in shear flow are typically
studied using bulk experimental methods such as light and
neutron scattering and birefringence. But the effect of shear
on the conformation and dynamics of individual polymers is
still not well understood. Here we describe observations of the
real-time dynamics of individual, flexible polymers fluorescently
labelled DNA molecules under a shear flow. The sheared
polymers exhibit many types of extended conformation with an
overall orientation ranging from parallel to perpendicular with
respect to the flow direction. For shear rates much smaller than
the inverse of the relaxation time of the molecule, the relative
populations of these two main types of conformation are
controlled by the rate of the shear flow. These results question
the adequacy of assumptions made in standard models of polymer
dynamics."

Here are the 300 strings generated from the input.

"mainly focused on measuring bulk rheological properties or on measuring the rate of extended conformation 
and dynamics of individual flexible polymers in shear is still not been difficult to rigorously 
test these predictions of shear on measuring bulk rheological properties or on measuring bulk rheological 
properties of assumptions made in shear is still not been difficult to rigorously test these 
two main types of polymer solutions Here we describe how polymer solutions , the detailed 
molecular dynamics of the molecular predictions of light and normal stresses result from microscopic stresses 
that the shear on measuring the molecular predictions of polymers exhibit many types of individual 
flexible polymers fluorescently labelled DNA molecules are controlled by the shear is of extended conformation 
with an overall orientation ranging from microscopic stresses result from microscopic stresses that arise when 
polymeric molecules are of polymers in shear flow Polymer dynamics of flow occurs whenever a 
shear flows However , by the molecule in shear rates much effort has been observed 
directly Experimental efforts have mainly focused on the effect of conformation with an overall orientation 
ranging from microscopic stresses that arise when polymeric molecules under a surface . flexible polymers 
exhibit many types of light and normal stresses , rheological properties of the adequacy of 
a shear flow so that the detailed molecular dynamics of conformation are of polymers in 
shear flow occurs whenever a surface . under a shear flow Polymer dynamics of assumptions 
made in shear flow are controlled by the dynamics of extended conformation and neutron scattering 
and affect the molecule the effect of great practical interest because the scattering and affect 
the scattering of polymers is still not been difficult to study the solvent motion Thus 
, much smaller than the inverse of shear flow Polymer dynamics of conformation and medicine."

Well, not that bad. 

Wednesday, March 25, 2015

Print frequency


Given an unsorted array of natural numbers where numbers repeat, output the numbers in order of frequency, where the number of outputs is passed as a parameter.
For example:
Array -> [0, 0, 100, 3, 5, 4, 6, 4, 2, 100, 2, 100]
n -> 2
Output -> [100, 0] or [100, 2]
Since 100 is repeated 3 times and 0 and 2 are repeated 2 times, and we output up to the 2 most frequent elements, the program should output either 100, 0 or 100, 2.


It doesn't look like there is a way (at least not off the top of my head) to avoid extra space. The PriorityQueue I created is fixed to size n, and it is a min heap. Each time, I poll out the head if it is smaller than the entry that is about to be put in, and add the new entry into the queue. In the end, iterate through the queue and print out all the elements in it.


public static void frequencyPrinter(int[] num, int n){
  if(num == null || num.length == 0)
   return;
  Map<Integer, Integer> count = new HashMap<Integer, Integer> ();
  for(int i : num){
   if(!count.containsKey(i))
    count.put(i, 1);
   else
    count.put(i, count.get(i) + 1);
  }
  PriorityQueue<Entry<Integer, Integer>> pq = new PriorityQueue<Entry<Integer, Integer>> (n, new countComparator());
  for(Entry<Integer, Integer> e : count.entrySet()){
   if(pq.size() < n)
    pq.add(e);
   else{
    if(e.getValue() > pq.peek().getValue()){
     pq.poll();
     pq.offer(e);
    }
   }
  }
  Iterator<Entry<Integer, Integer>> it = pq.iterator();
  while(it.hasNext()){
   System.out.println(it.next().toString());
  }
 }
 private static class countComparator implements Comparator<Entry<Integer, Integer>>{
  public int compare(Entry<Integer, Integer> a, Entry<Integer, Integer> b){
   return a.getValue() - b.getValue();
  }
 }

Common sense (?) about YouTube

The Web Servers

Source: https://www.youtube.com/watch?v=w5WVu624fY8
NetScaler
load balancing: marks machines down/up
caches static content: Apache is not very good at handling very large static content

mod_fastcgi: requests that look dynamic follow certain rules

Run linux
can usually scale by adding machines
Python is fast enough
Web code speed is usually not the bottleneck
More time is often spent waiting on databases, caches, RPC (remote procedure call) calls.
Development code is important

<<100 ms page service times
dynamic Python -> C compiler, use psyco to speed up Python
selectively write optimizations, such as C extensions
pre-generated cached HTML

Serving Video
Issues: bandwidth, hardware, limited power in a data center (hardware shouldn't consume too much power)
Each video is hosted by "mini-cluster"
mini-cluster: a small number of machines that serve exactly the same set of videos, with a bit of redundancy;
couple machines in one cluster:
1. scalability: more disks serving each content
2. head room: if one machine goes down, the other machines can take up the slack
3. online backups of everything

from Apache -> lighttpd
single process -> multi-process

Serving Video
Put the most popular content on to the CDNs
The moderately played and least played videos go to YouTube servers.
The traffic profile that goes to the YouTube servers: some videos have 10 - 20 views/day, but aggregated over all those videos, it is still a lot of traffic. Rate control and caching become more important. You want to tune the amount of memory on each system so that it won't be too small, but not too large either, because otherwise, when you expand the number of machines, it will cost too much money without much benefit.

Keep it simple
simple network path: not too many network devices are in the path between the client and video servers
commodity hardware
simple common tools
Bundles of Linux, build layers on top of them, can be modified if needed
OS page cache size is critical
Handle random seeks(SATA tweaks, etc. )

Serving Thumbnails
Thumbnails are the images that correspond to each video, around 100 x 90 pixels, so ~5 KB each. There are a lot of them: each video has multiple thumbnails, so there are 4 times as many thumbnails as videos. Videos are spread across thousands of machines, but thumbnails are concentrated on a small number of machines, which means dealing with lots of small objects: inode (INO) caches, dentry (directory entry) caches, page caches, all at the OS level
High requests/second: on any given web page, when you search a keyword, there are lots of thumbnails, but only one video.
Started with Apache: a unified pool (the same machines that serve the web traffic also serve the static images (thumbnails)) -> a separate pool: high load, low performance
-> squid (reverse proxy): far more efficient with CPU, but as load increases, performance decreases over time
-> lighttpd (modified)
Source:https://www.youtube.com/watch?v=w5WVu624fY8
Main thread: epoll, serving content that is already in memory
Disk reads are assigned to different worker threads
As long as all the fast stuff stays on the main thread, it serves lots of requests per second; all the slow stuff can come a little bit later

-> BTFE
Google's BTFE
Based on Bigtable(Google distributed data store)
avoids small file problem
set up new machines becomes faster and easier
fast, fault-tolerant
various forms of caching: cache at multiple levels so that the latency between server and user is not so high

Databases
MySQL
only stores metadata (users, video descriptions, titles, tags, etc.)
In the beginning: one main DB and one backup
the main database was swapping even though it was not using more memory than the system had: in the Linux 2.4 kernels, the page cache is treated as more important than applications. If the page cache is 1.6 GB and the application is 3.5 GB, with only 4 GB of physical memory, rather than reduce the size of the page cache, the kernel swapped out MySQL instead. But MySQL needs to be in memory, so it went back and forth, swapping in and out very frequently.

DB Optimizations
Query optimizations
Batch jobs whenever they are too expensive to do
memcache: instead of going to the database all the time, use memcached, a hashtable over a TCP socket that allows you to get and set data based on a string key; using memory is better than using the disk
App server caching: database caches, database machine OS caches, memcache, and the memory space of the app server process itself: for data accessed very often (hundreds of times/second), sometimes precalculate the data and store it in memory

Replication
Spread read load across other databases as the traffic grows; these replicas replay the transactions happening on the master database. All the writes go to the master database and spread to each replica attached to it, as an asynchronous process.

Source: https://www.youtube.com/watch?v=w5WVu624fY8

All the writes happen in multiple places, and in some situations the writes exhaust the databases;
If there are too many replicas, there are management issues;

Replica Lag (downside)
Replication is asynchronous
If there are multiple machines with different CPU speeds, there are slightly different lags on each machine;
On the master database, updates come from multiple threads, each executing a single update.
On the replica database, MySQL serializes everything onto one thread, which can be a bottleneck: update 3 cannot execute until update 2 is finished, assuming update 2 is a long update.

Source: https://www.youtube.com/watch?v=w5WVu624fY8

Unhealthy Replication Thread
Consider 5 updates where 3 touch pages that are not in memory, resulting in cache misses. Cache misses stall the replication thread, and at a certain point the replica cannot keep up with the master database. The master database uses all the CPUs and all the disks, but the replica can only use 1 CPU and 1 disk, causing a reduction in replication speed.

DB updates involve:
1. reading the affected DB pages
2. Applying the changes

MySQL uses one thread to buffer the SQL from the master database, and another to execute the SQL.
The solution: since replication is behind, a cache primer thread can see the not-yet-replayed SQL queries and pre-fetch the affected rows.

Replica Pools
Since a single replica set cannot deal with all the database load, video watching was prioritized above everything else. Split the replica databases into two pools:
Source: https://www.youtube.com/watch?v=w5WVu624fY8

Database Partitions
A gigantic database: split it into different pieces, each independent
partition by user
spread writes AND reads
much better cache locality -> less I/O
30% hardware reduction with same amount of traffic

Scalable systems
simple
solve the problem at hand:
product of evolution: the complexity will come over time

Need the flexibility

Scalable techniques
divide and conquer: partition things. distribute things
approximate correctness: the state of the system is whatever it is reported to be, so if you can't tell that your system is fundamentally skewed and inconsistent, then it isn't
e.g., if you write a comment and another person on the other side of the world loads the watch page, he or she may not care about what you wrote, but you want to see your own comment. Inconsistency is acceptable as long as the writer sees the comment first, and that can save some cost.
expert knob twiddling: consistency, durability
jitter: introduce randomness, because things have a tendency to stack up
e.g., cache expirations: if there is a popular video, we want to cache it. The usual cache timeout might be an hour or 10 minutes, but for a very popular video (the most popular one yesterday), the cache timeout may be 24 hours. If everything expires at one given time, then every machine will observe that expiration at exactly the same time, go back, and try to recompute it, and this is awful. By jittering, the desired expiration time is about 24 hours, but we randomly jitter (add variance) around 18 to 30 hours, and that prevents things from stacking up, because systems have a tendency to synchronize and line up (see the sketch after this list).
cheating: when you have monotonically increasing counters, like a view count, you could do a full transaction every time you update it, or you can do a transaction once in a while and update by a random amount; as long as the count changes, people will believe it is real.
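As a hedged illustration of the jitter idea above (not YouTube's actual code), spreading a nominal ~24-hour TTL over 18-30 hours might look like:

//spread the TTL uniformly over 18-30 hours so cached items don't all
//expire (and get recomputed) at the same moment -- illustrative sketch only
long jitteredTtlMillis(java.util.Random r) {
 long base = 18L * 60 * 60 * 1000;   // 18 hours
 long spread = 12L * 60 * 60 * 1000; // + up to 12 hours
 return base + (long) (r.nextDouble() * spread);
}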

Scalable components
Well defined inputs:
Well defined dependencies
frequently end up behind an RPC (remote procedure call)
leverage local optimizations

Efficiency
uncorrelated with scalability
The most efficient thing would be to cram everything into one process
focus on algorithms: focus on the macro level; the way individual components break out doesn't matter,
libraries:
wiseguy
pycurl
spitfire
serialization formats

tools
Apache
linux
mySQL
vitess
zookeeper