AdSense

Saturday, September 19, 2015

Scala note 3: OOP (TweetSet)

Week 3 class focuses on classes and objects in Scala. The OOP concepts are similar to that in Java. However, Scala uses traits instead of interface in Java to serve similar functions.

One feature in Scala that I have never met in Java is lazy val. The difference between val and lazy val is that the later will only be executed at the first time it is accessed. Lazy val is executed once and only once. This means next time the lazy val is called, it will return the same value. Details can be found here and here.

Moreover, def defines a function and every time the function is called, it creates new function. val defines a value, it returns a value that is evaluated from a function. See this post for more details.

The homework asks us to implement an abstract class called TweetSet. It should be implemented as a binary tree structure and contains two concrete subclass Empty and NonEmpty. I am not going to put the whole assignment here due to the length, but you can always go here for the problem.

package objsets

import common._
import TweetReader._

/**
 * A class to represent tweets.
 */
class Tweet(val user: String, val text: String, val retweets: Int) {
  def moreRetweets(that: Tweet) : Tweet =
    if(this.retweets >= that.retweets) this
    else that
  override def toString: String =
    "User: " + user + "\n" +
    "Text: " + text + " [" + retweets + "]"
}

/**
 * This represents a set of objects of type `Tweet` in the form of a binary search
 * tree. Every branch in the tree has two children (two `TweetSet`s). There is an
 * invariant which always holds: for every branch `b`, all elements in the left
 * subtree are smaller than the tweet at `b`. The elements in the right subtree are
 * larger.
 *
 * Note that the above structure requires us to be able to compare two tweets (we
 * need to be able to say which of two tweets is larger, or if they are equal). In
 * this implementation, the equality / order of tweets is based on the tweet's text
 * (see `def incl`). Hence, a `TweetSet` could not contain two tweets with the same
 * text from different users.
 *
 *
 * The advantage of representing sets as binary search trees is that the elements
 * of the set can be found quickly. If you want to learn more you can take a look
 * at the Wikipedia page [1], but this is not necessary in order to solve this
 * assignment.
 *
 * [1] http://en.wikipedia.org/wiki/Binary_search_tree
 */
abstract class TweetSet {

  /**
   * This method takes a predicate and returns a subset of all the elements
   * in the original set for which the predicate is true.
   *
   * Question: Can we implment this method here, or should it remain abstract
   * and be implemented in the subclasses?
   */
  def filter(p: Tweet => Boolean): TweetSet = filterAcc(p, new Empty)

  /**
   * This is a helper method for `filter` that propagetes the accumulated tweets.
   */
  def filterAcc(p: Tweet => Boolean, acc: TweetSet): TweetSet

  /**
   * Returns a new `TweetSet` that is the union of `TweetSet`s `this` and `that`.
   *
   * Question: Should we implment this method here, or should it remain abstract
   * and be implemented in the subclasses?
   */
   def union(that: TweetSet): TweetSet 

  /**
   * Returns the tweet from this set which has the greatest retweet count.
   *
   * Calling `mostRetweeted` on an empty set should throw an exception of
   * type `java.util.NoSuchElementException`.
   *
   * Question: Should we implment this method here, or should it remain abstract
   * and be implemented in the subclasses?
   */
  def mostRetweeted: Tweet

  /**
   * Returns a list containing all tweets of this set, sorted by retweet count
   * in descending order. In other words, the head of the resulting list should
   * have the highest retweet count.
   *
   * Hint: the method `remove` on TweetSet will be very useful.
   * Question: Should we implment this method here, or should it remain abstract
   * and be implemented in the subclasses?
   */
  def descendingByRetweet: TweetList


  /**
   * The following methods are already implemented
   */

  /**
   * Returns a new `TweetSet` which contains all elements of this set, and the
   * the new element `tweet` in case it does not already exist in this set.
   *
   * If `this.contains(tweet)`, the current set is returned.
   */
  def incl(tweet: Tweet): TweetSet

  /**
   * Returns a new `TweetSet` which excludes `tweet`.
   */
  def remove(tweet: Tweet): TweetSet

  /**
   * Tests if `tweet` exists in this `TweetSet`.
   */
  def contains(tweet: Tweet): Boolean

  /**
   * This method takes a function and applies it to every element in the set.
   */
  def foreach(f: Tweet => Unit): Unit
  
  /**
   * if the set is empty
   */
  def isEmpty: Boolean
}

class Empty extends TweetSet {

  def filterAcc(p: Tweet => Boolean, acc: TweetSet): TweetSet = acc
  
  def union(that: TweetSet): TweetSet = that
  
  def mostRetweeted: Tweet = throw new NoSuchElementException("Empty TweetSet")
 
  def isEmpty: Boolean = true
  
  def descendingByRetweet: TweetList = Nil

  /**
   * The following methods are already implemented
   */

  def contains(tweet: Tweet): Boolean = false

  def incl(tweet: Tweet): TweetSet = new NonEmpty(tweet, new Empty, new Empty)

  def remove(tweet: Tweet): TweetSet = this

  def foreach(f: Tweet => Unit): Unit = ()
}

class NonEmpty(elem: Tweet, left: TweetSet, right: TweetSet) extends TweetSet {

  def filterAcc(p: Tweet => Boolean, acc: TweetSet): TweetSet = {
    if (p(this.elem)) 
      left.filterAcc(p, right.filterAcc(p, acc.incl(this.elem)))
    else
     left.filterAcc(p, right.filterAcc(p, acc))  
  }
  def union(that: TweetSet): TweetSet =
    (left.union(right)).union(that).incl(elem)
  
  def isEmpty: Boolean = false
   
  def mostRetweeted: Tweet = 
    if (left.isEmpty && right.isEmpty) elem
    else if (left.isEmpty) elem.moreRetweets(right.mostRetweeted)
    else if (right.isEmpty) elem.moreRetweets(left.mostRetweeted)
    else elem.moreRetweets(left.mostRetweeted).moreRetweets(right.mostRetweeted)
    
 def descendingByRetweet: TweetList = 
   new Cons(mostRetweeted, remove(mostRetweeted).descendingByRetweet)

  /**
   * The following methods are already implemented
   */

  def contains(x: Tweet): Boolean =
    if (x.text < elem.text) left.contains(x)
    else if (elem.text < x.text) right.contains(x)
    else true

  def incl(x: Tweet): TweetSet = {
    if (x.text < elem.text) new NonEmpty(elem, left.incl(x), right)
    else if (elem.text < x.text) new NonEmpty(elem, left, right.incl(x))
    else this
  }

  def remove(tw: Tweet): TweetSet =
    if (tw.text < elem.text) new NonEmpty(elem, left.remove(tw), right)
    else if (elem.text < tw.text) new NonEmpty(elem, left, right.remove(tw))
    else left.union(right)

  def foreach(f: Tweet => Unit): Unit = {
    f(elem)
    left.foreach(f)
    right.foreach(f)
  }
}

trait TweetList {
  def head: Tweet
  def tail: TweetList
  def isEmpty: Boolean
  def foreach(f: Tweet => Unit): Unit =
    if (!isEmpty) {
      f(head)
      tail.foreach(f)
    }
}

object Nil extends TweetList {
  def head = throw new java.util.NoSuchElementException("head of EmptyList")
  def tail = throw new java.util.NoSuchElementException("tail of EmptyList")
  def isEmpty = true
}

class Cons(val head: Tweet, val tail: TweetList) extends TweetList {
  def isEmpty = false
}


object GoogleVsApple {
  val google = List("android", "Android", "galaxy", "Galaxy", "nexus", "Nexus")
  val apple = List("ios", "iOS", "iphone", "iPhone", "ipad", "iPad")

  //TweetReader.allTweets foreach println
  lazy val googleTweets: TweetSet = TweetReader.allTweets.filter(tweet => google.exists(keyWord => tweet.text.contains(keyWord)))
  lazy val appleTweets: TweetSet = TweetReader.allTweets.filter(tweet => apple.exists(keyWord => tweet.text.contains(keyWord)))


  /**
   * A list of all tweets mentioning a keyword from either apple or google,
   * sorted by the number of retweets.
   */
  lazy val trending: TweetList = googleTweets.union(appleTweets).descendingByRetweet
}

object Main extends App {
  // Print the trending tweets
 GoogleVsApple.trending foreach println
}

I only include the part that we need to implement here. Due to the functional nature, it is fairly easy to understand. However, I do realize that due to intensive recursion, the program is slow, like really slow if we test for all tweets included in the test program. Or probably just my bad implementation?

The whole assignment can be found on git.

Thursday, September 10, 2015

B-Tree Java implementation

Two days of coding. One afternoon to build a tree, three nights to destroy the tree in different ways, I finally make the code.

Sort of...

The code works for M >= 4, where M is the number of maximum key-value pairs in one node (In fact, this number can never be reached because the node will be split into two once this number is reached.

For rules on how B-Tree works, refer to this post. I may update this blog when I fix the deletion issue with M = 3 (probably rewrite the code to make it cleaner). Moreover, it turns out Cassandra has a B-Tree implementation. It's a little bit to complicated for me (plus I couldn't find its deletion implementation), but it's interesting to take a look. Check here.


package bTree;


/**
 * M: maximum degree/number of key-value pairs in a node
 * HT:height of the B-tree
 * N: number of key-value pairs in the tree
 * @author shirleyyoung
 *
 * @param 
 * @param 
 */
public class BTree <Key extends Comparable<Key>, Value>{
	private int M;
	private Node root;
	private int HT;
	private int N;
	
	/******************************************************
	 * The structure to store the key-value pairs
	 * @author shirleyyoung
	 *
	 ******************************************************/
	private class Entry <Key extends Comparable<Key>, Value> implements Comparable{
		private Comparable key;
		private Object value;
		public Entry(Comparable key, Object value){
			this.key = key;
			this.value = value;
		}
		
		public int compareTo(Entry e) {
			return key.compareTo(e.key);
		}
		public String toString(){
			return "(" + key.toString() + ", " + value.toString() + ")";
		}
	}
	
	
	/***************************************************************
	 * tree node class
	 * Each node is a list of Entry objects
	 * Entry list is used for all key - value pairs stored in the node
	 * index: the current number of key-value pairs,
	 * as well as the index where next key-value pair will be added
	 * since it's 0 index, the node contains m + 1 key - value pairs
	 * entryList: the entryList
	 * children: children list of the node, each node may have at most
	 * M + 1 children
	 * @author shirleyyoung
	 *
	 ***************************************************************/
	private class Node  indexToAdd; i--)
				entryList[i] = entryList[i - 1];
			entryList[indexToAdd] = e;
			index++;
		}
		
		public String toString(){
			String s = "";
			for (int i = 0; i < index; i++){
				s += entryList[i].toString() + ", ";
				
			}
			s = s.substring(0, s.length() - 2);
			s += String.format("\nindex: %d\n", index);
			return s;
		}
	}

	/*******************************************************
	 * constructor
	 * @param M
	********************************************************/
	public BTree(){
		M = 4;
		HT = 0;
		N = 0;
		root = new Node();
	}
	public BTree(int M){
		this();
		this.M = M;
		root = new Node();
	}
	
	/********************************************************
	 * search for a value given the key
	 * @param key
	 * @return
	 *******************************************************/
	public Object get(Comparable key){
		Entry found = search(root, key, HT);
		if (found == null){
			System.out.println("No such key!");
			return null;
		}
		else return found.value;
	}
	private Entry search(Node x, Comparable key, int ht){
		if (x == null)
			return null;
		Entry[] entryList = x.entryList;
		if(ht == 0){//leaf node
			for(int j = 0; j < x.index; j++){
				if(key.equals(entryList[j].key))
					return entryList[j];
			}
		} else{
			for(int j = 0; j < x.index; j++){
				if (key.equals(entryList[j].key))
					return entryList[j];
				else if(j == 0 && key.compareTo(entryList[j].key) < 0)
					return search(x.children[j], key, ht - 1);
				else if(j == x.index - 1|| key.compareTo(entryList[j + 1].key) < 0)
					return search(x.children[j + 1], key, ht - 1);
			}
		}
		return null;
	}
	
	
	/*******************************************************
	 * Insertion
	 * @param key
	 * @param value
	 *******************************************************/
	public void insert(Comparable key, Object value){
		Entry exists = search(root, key, HT);
		if(exists != null){
			exists.value = value;
			System.out.println("Key already exists, modify the value to the input value.");
			return;
		}
		Node newRoot = insert(root, key, value, HT);
		N++;
		if(newRoot == null) return;
		root = newRoot;
		HT++;
	}
	private Node insert(Node curr, Comparable key, Object value, int ht){
		int indexToInsert;
		if(ht == 0){//leaf nodes level
			//find the correct place to insert the value			
			for(indexToInsert = 0; indexToInsert < curr.index; indexToInsert++){
				if(key.compareTo(curr.entryList[indexToInsert].key) < 0) break;
			}
			//shift the larger nodes right
			curr.add(new Entry(key, value), indexToInsert);
			
			//the key value pair is added to the existing node, no new node is inserted, return null
			if(curr.index < M) return null;
			return split(curr);
		}else {
			for(indexToInsert = 0; indexToInsert < curr.index; indexToInsert++){
				if (indexToInsert == 0 && key.compareTo(curr.entryList[indexToInsert].key) < 0) break;			
				else if((indexToInsert == curr.index - 1) || key.compareTo(curr.entryList[indexToInsert + 1].key) < 0){
					//children[j].next, j++
					indexToInsert += 1;
					break;
				}
			}
			Node u = insert(curr.children[indexToInsert], key, value, ht - 1);
			if(u == null) return null;
			return combine(curr, u);
		}
	}
	/*********************************************************
	 * split the node if it reaches maximum degree
	 * choose middle node as the new root 
	 * e.g., if M = 4 or 5, choose entryList[2] as new root
	 * @param curr
	 * @return
	 *********************************************************/
	private Node split(Node curr){
		Node newRoot = new Node();
		newRoot.entryList[0] = curr.entryList[M/2];
		newRoot.index++;
		Node right = new Node();
		int j = 0, i = M/2 + 1;
		while(j < M && i < M){
			right.entryList[j] = curr.entryList[i];
			right.children[j++] = curr.children[i++];
		}
		right.children[j] = curr.children[i];
		right.index = j;
		curr.index = M/2;
		newRoot.children[0] = curr;
		newRoot.children[1] = right;
		return newRoot;
	}
	/****************************************************
	 * combine two nodes when total elements 
	 * are less than M
	 * split the node if the entry list of the 
	 * combined node reaches M
	 * @param left
	 * @param right
	 * @return
	 ****************************************************/
	private Node combine(Node left, Node right){
		int j = 0;
		int total = left.index + right.index;
		while (left.index < total){
			left.entryList[left.index] = right.entryList[j];
			left.children[left.index++] = right.children[j++];
		}
		left.children[left.index] = right.children[j];
		right.children = null;
		right = null;
		if (left.index < M) {return null;}
		else return split(left);
	}
	
	/****************************************************
	 * Deletion
	 * @param key
	 * @return
	 ****************************************************/
	public Object delete(Comparable key){
		Entry toDelete = search(root, key, HT);
		if(toDelete == null) {
			System.out.println("No such key!");
			return null;
		}
		N--; 
		return delete(root, key, HT).value;
	}
	private Entry delete(Node n, Comparable key, int ht){
		int index = n.searchEntry(key);
		if(ht == 0){ // if leaf nodes
			if(n.entryList[index].key != key)
				return null;
			else
				return n.removeEntry(index);
		}else {
			if (key.equals(n.entryList[index].key)){ //if key is in internal node n
				//if left children has a size larger than M/2
				//replace entry that has the key to be deleted in n with the largest 
				//entry in left children, then recursively delete children entry
				//System.out.println("found index: " + index);
				Entry toDelete = n.entryList[index];
				int h = ht - 1;
				if(n.children[index].index >= M/2){ 
					Node child = n.children[index];
					while (h != 0){
						child = child.children[child.index];
						h--;
					}
					n.entryList[index] = child.entryList[child.index - 1];
					delete(n.children[index], child.entryList[child.index - 1].key, ht - 1);
				//else if right children has a size larger than M/2
				//replace entry has the key to be deleted in n with the smallest
				//entry in right children, then recursively delete children entry
				} else if (n.children[index + 1].index >= M/2){
					Node child = n.children[index + 1];
					while(h != 0){
						child = child.children[0];
						h--;
					}
					n.entryList[index] = child.entryList[0];
					delete(child, child.entryList[0].key, h);
				//case 3 will take care of the case when node n has too few entries to be moved
				} else {
					n = moveEntryDown(n.children[index], n, index, n.children[index + 1]);
					if (ht == HT && root.index == 0){
						root = n;
						HT--;
						ht--;
					}
					delete(n, key, ht);
				}
				return toDelete;
			} else { // if key is not in internal node n
				//find the root of the subtree where the desired key is 
				//supposed to present
				//System.out.println("index: " + index);
				//System.out.println("children[index] has index: " + n.children[index].index);
				//Node child = n.children[index];
				//System.out.println("Largest children key: " + child.entryList[child.index - 1].key);
				if(index == (n.index - 1) && key.compareTo(n.entryList[n.index - 1].key) > 0){
					//child = n.children[index + 1];
					index += 1;
				}
				//if key is in the children
				boolean chk = key == n.children[index].entryList[n.children[index].searchEntry(key)].key;
				if (n.children[index].index < M/2 && chk){
					//case 3a, left sibling has size at least M/2
					//remove the last entry from the left sibling, put into n[index - 1]
					//add n[index - 1] to the n.children[index]
					if(index > 0 && n.children[index - 1].index >= M/2){
						Node helper = n.children[index - 1];
						n.children[index].add(n.entryList[index - 1], 0);
						n.entryList[index - 1] = helper.entryList[helper.index - 1];
						helper.removeEntry(helper.index - 1);
					//case 3a, right sibling has size at least M/2
					} else if (index < n.index && n.children[index + 1].index >= M/2){
						Node helper = n.children[index + 1];
						n.children[index].add(n.entryList[index], n.children[index].index);
						n.entryList[index] = helper.entryList[0];
						helper.removeEntry(0);
					//case 3b, merge two nodes
					} else {
						Node mergeChild = n.children[index];
						if(index == n.index)
							n = moveEntryDown(n.children[index - 1], n, index - 1, mergeChild);
						else n = moveEntryDown(mergeChild, n, index, n.children[index + 1]);
						if (ht == HT && root.index == 0){
							root = n;
							HT--;
							ht--;
						}
						return delete(n, key, ht);
					}
				}
				if (index == n.index - 1 && key.compareTo(n.entryList[index].key) > 0)
					return delete(n.children[index + 1], key, ht - 1);
				else return delete(n.children[index], key, ht - 1);
			}
		}
	}
	/**
	 * combine left and right children and move the entry (parent[index]) down
	 * to serve as median node
	 * only be called when both left and right has size < M/2
	 * and parent has size >= M/2
	 * @param left
	 * @param parent
	 * @param index
	 * @param right
	 */
	private Node moveEntryDown(Node left, Node parent, int index, Node right){
		left.add(parent.entryList[index], left.index);
		parent.removeEntry(index);
		combine(left, right);
		if (parent.index == 0) 
			parent = left;
		else parent.children[index] = left;
		return parent;
	}
	
	public String toString(){
		return toString(root, HT) + "\n";
	}
	private String toString(Node n, int ht){
		String s = "";
		if(ht == 0){
			for(int i = 0; i < n.index; i++)
				s += i + ": " + n.entryList[i].toString() +  "\n";
		} else{
			for(int i = 0; i < n.index; i++){
				s += i + ": " + n.entryList[i].toString() + "\nChildren " + i + ": \n";
				s += toString(n.children[i], ht - 1) + "\n";
			}
			s += toString(n.children[n.index], ht - 1);
		}
		return s;
	}
	
	
	public int size(){
		return N;
	}
	public int height(){
		return HT + 1;
	}
}

Check my git for test code.

Monday, September 7, 2015

Scala note 2: Functional Sets

Mathematically, we call the function which takes an integer as argument and which returns a boolean indicating whether the given integer belongs to a set, the characteristic function of the set. For example, we can characterize the set of negative integers by the characteristic function (x: Int) => x < 0.
Therefore, we choose to represent a set by its characterisitc function and define a type alias for this representation:
type Set = Int => Boolean
Using this representation, we define a function that tests for the presence of a value in a set:
def contains(s: Set, elem: Int): Boolean = s(elem)

2.1 Basic Functions on Sets

Let’s start by implementing basic functions on sets.
  1. Define a function which creates a singleton set from one integer value: the set represents the set of the one given element. Its signature is as follows:
    def singletonSet(elem: Int): Set
    Now that we have a way to create singleton sets, we want to define a function that allow us to build bigger sets from smaller ones.
  2. Define the functions unionintersect, and diff, which takes two sets, and return, respectively, their union, intersection and differences.diff(s, t) returns a set which contains all the elements of the set s that are not in the set t. These functions have the following signatures:
    def union(s: Set, t: Set): Set
    def intersect(s: Set, t: Set): Set
    def diff(s: Set, t: Set): Set
  3. Define the function filter which selects only the elements of a set that are accepted by a given predicate p. The filtered elements are returned as a new set. The signature of filter is as follows:
    def filter(s: Set, p: Int => Boolean): Set

2.2 Queries and Transformations on Sets

In this part, we are interested in functions used to make requests on elements of a set. The first function tests whether a given predicate is true for all elements of the set. This forall function has the following signature:
def forall(s: Set, p: Int => Boolean): Boolean
Note that there is no direct way to find which elements are in a set. contains only allows to know whether a given element is included. Thus, if we wish to do something to all elements of a set, then we have to iterate over all integers, testing each time whether it is included in the set, and if so, to do something with it. Here, we consider that an integer x has the property -1000 <= x <= 1000 in order to limit the search space.
  1. Implement forall using linear recursion. For this, use a helper function nested in forall. Its structure is as follows (replace the ???):
    def forall(s: Set, p: Int => Boolean): Boolean = {
     def iter(a: Int): Boolean = {
       if (???) ???
       else if (???) ???
       else iter(???)
     }
     iter(???)
    }
  2. Using forall, implement a function exists which tests whether a set contains at least one element for which the given predicate is true. Note that the functions forall and exists behave like the universal and existential quantifiers of first-order logic.
    def exists(s: Set, p: Int => Boolean): Boolean
  3. Finally, write a function map which transforms a given set into another one by applying to each of its elements the given function. map has the following signature:
    def map(s: Set, f: Int => Int): Set


Coursera Scala course week 2 homework. The original question can be found here.

The question asks as two define an object Set, which has a characteristic function that projects an integer to a boolean.

1. define Singleton
This question requires us to create a set with one integer, i.e., given an integer, projects it to a boolean to claim the set contains the integer:

def singletonSet(elem: Int): Set = (x: Int) => x == elem

This means projects an integer x to if x equals given parameter elem. It can also be considered as given any integer, if x equals elem, then set contains x, which defines the singleton set that it only contains elem.

2. define Union, Intersect and Difference

  def union(s: Set, t: Set): Set = (x: Int) => s(x) || t(x)
  def intersect(s: Set, t: Set): Set = (x: Int) => s(x) && t(x)
  def diff(s: Set, t: Set): Set = (x: Int) => s(x) && !t(x)

s(x) indicates contains. Thus, for union, it means either for any x, if it satisfies the condition that either s contains x or t contains x, then x is in union. For intersect, the condition becomes both s and t should contains x. For difference, it means s should contain x but t should not.

3. define filter

def filter(s: Set, p: Int => Boolean): Set = (x: Int) => s(x) && p(x)

filter is a set that for any x in filter, it should satisfies that s contains x and x satisfies predicate p.

4. forall (∀ )

val bound = 1000

  def forall(s: Set, p: Int => Boolean): Boolean = {
    def iter(a: Int): Boolean = {
      if (a > bound) true
      else if (s(a) && !p(a)) false
      else iter(a+1)
    }
    iter(-bound)
  }

This is nothing special. Starts from -bound, if any a in s doesn't satisfy p, then return false. Iterate until bound, then return true.

5. exists(∃ )

def exists(s: Set, p: Int => Boolean): Boolean = !forall(s, x => !p(x))

This one is tricky. The solution means not all elements in s satisfies !p, which in turn indicates there is at least one element in s satisfies p.

6. map

def map(s: Set, f: Int => Int): Set = (y: Int) => exists(s, x => f(x) == y)

This is similar as how we define a singleton set: For any y, if there exists an element x in s that satisfies the condition f(x) equals y, then y is in new Set map.


Source on git.