Saturday, June 13, 2015

Baking a π can teach you a bit of Parametricity

Even though I got my copy of Prof. Eugenia Cheng's awesome How to Bake π a couple of weeks back, I started reading it only over this weekend. I am only on page 19, enjoying all the stuff regarding cookies that Prof. Cheng uses to explain abstraction. It's a beautiful piece of explanation, and if you are a programmer you may get an extra mile out of the concepts that she explains here. Let's see if we can unravel a few of them ..

She starts with a real life situation such as:

If Grandma gives you five cookies and Grandpa gives you five cookies, how many cookies will you have ?

Let's model this as a box of cookies that you get from your Grandma and another that you get from your Grandpa - you need to count the cookies and find the total. In Scala we may have something like the following ..

case class CookieBox(count: Int)

and we can define a function that gives you a CookieBox containing the total number of cookies from the 2 boxes that we pass to the function ..

def howManyCookies(gm: CookieBox, gp: CookieBox) = {
  CookieBox(gm.count + gp.count)
}

and we use howManyCookies to find the count ..

scala> val gm = CookieBox(5)
gm: CookieBox = CookieBox(5)

scala> val gp = CookieBox(5)
gp: CookieBox = CookieBox(5)

scala> howManyCookies(gm, gp)
res5: CookieBox = CookieBox(10)

.. so we have 10 cookies from our Grandma & Grandpa .. Perfect!

The problem is .. the child answers: "None, because I'll eat them all". To model this let's add a function eat to our CookieBox abstraction ..

case class CookieBox(count: Int) {
  // let's assume n < count for simplicity
  def eat(n: Int): CookieBox = CookieBox(count - n)
}

So instead of the correct way to answer the question, the child cheats and implements howManyCookies as ..

def howManyCookies(gm: CookieBox, gp: CookieBox) = {
  CookieBox(gm.eat(gm.count).count + gp.eat(gp.count).count)
}

and we get the following ..

scala> howManyCookies(gm, gp)
res6: CookieBox = CookieBox(0)

Prof. Cheng continues ..

The trouble here is that cookies do not obey the rules of logic, so using math to study them doesn't quite work. .. We could impose an extra rule on the situation by adding "... and you're not allowed to eat the cookies". If you're not allowed to eat them, what's the point of them being cookies ?

This is profound indeed. When we are asked to count some stuff, it really doesn't matter whether they are cookies or stones or pastries. The only capability we need here is to be able to add together the 2 things that we are handed. The fact that we have implemented howManyCookies in terms of CookieBox gives the little child the opportunity to cheat by using the eat function. More information is actually hurting us here - being concrete with the data type creates more avenues for incorrect implementations.

Prof. Cheng is succinct here when she explains ..

We could treat the cookies as just things rather than cookies. We lose some resemblance to reality, but we gain scope and with it efficiency. The point of numbers is that we can reason about "things" without having to change the reasoning depending on what "thing" we are thinking about.

Yes, she is talking about generalization, being polymorphic over what we count. We just need the ability to add 2 "things", be it cookies, monkeys or anchovies. In programming we model this with parametric polymorphism, and use a universal quantification over the set of types for which we implement the behavior.

def howMany[A](gm: A, gp: A) = //..

We have made the implementation parametric and got rid of the concrete data type CookieBox. But how do we add the capability to sum the 2 objects and get the result ? You got it right - we already have an abstraction that makes this algebra available to a generic data type. Monoids FTW .. and it doesn't get simpler than this ..

trait Monoid[T] {
  def zero: T
  def append(t1: T, t2: T): T
}

zero is the identity element and append is an associative binary function over 2 objects of the type. So given a monoid instance for our data type, we can model howMany in a completely generic way irrespective of whether A is a CookieBox or a Monkey.

def howMany[A : Monoid](gm: A, gp: A): A = gm append gp

Implementing a monoid for CookieBox is also simple ..

object CookieBox {
  implicit val CookieBoxMonoid = new Monoid[CookieBox] {
    val zero = CookieBox(0)
    def append(i: CookieBox, j: CookieBox) = CookieBox(i.count + j.count)
  }
}
 
With the above implementation of howMany, the little child will not be able to cheat. By asking for less - any type with a Monoid instance instead of a concrete CookieBox - we have made the implementation more robust and reusable across multiple data types.
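For completeness, here's one minimal way to wire up everything so that the above howMany compiles and the gm append gp syntax works. The implicit MonoidOps syntax class is my own assumption for this sketch - libraries like scalaz or cats provide equivalent syntax out of the box ..

trait Monoid[T] {
  def zero: T
  def append(t1: T, t2: T): T
}

// syntax helper so that we can write gm append gp
implicit class MonoidOps[T](t1: T)(implicit m: Monoid[T]) {
  def append(t2: T): T = m.append(t1, t2)
}

def howMany[A : Monoid](gm: A, gp: A): A = gm append gp

case class CookieBox(count: Int)

object CookieBox {
  implicit val CookieBoxMonoid: Monoid[CookieBox] = new Monoid[CookieBox] {
    val zero = CookieBox(0)
    def append(i: CookieBox, j: CookieBox) = CookieBox(i.count + j.count)
  }
}

howMany(CookieBox(5), CookieBox(5))   // CookieBox(10)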

Next time someone wants me to explain parametricity, I will point them to Page 19 of How to Bake π.

Thursday, March 26, 2015

Randomization and Probabilistic Techniques to scale up Machine Learning

Some time back I blogged about the possibilities that probabilistic techniques and randomization bring to the paradigm of stream computing. Big data architectures are not only about high volume storage, but also about low latency velocities, and this is exactly where stream computing has a role to play. I discussed a few data structures like bloom filters, count-min sketch and hyperloglog, and algorithms like locality sensitive hashing, that use probabilistic techniques to reduce the search and storage space while processing huge volumes of data.

Of late, I have been studying some of the theory behind machine learning algorithms and how they can be used in conjunction with the petabytes of data that we generate every day. And the same theme strikes here - there are algorithms that can model a near perfect classifier, but you need randomization and probabilistic techniques to make them scale, even at the expense of a small amount of inaccuracy creeping into your model. In most cases we will see that the small inaccuracy introduced by the probabilistic bounds is compensated by the ability to process more data within the specified computational budget. This is true even for some of the basic operations like matrix multiplication that form the core of machine learning models.

The contents of this post are nothing original or new. It's just to share some of my thoughts on learning the usage of approximation techniques in building machine learning classifiers.

Matrix Multiplication


And it's not only these specialized data structures or algorithms - randomization has been found to be quite effective for processing large data sets even in standard problems like matrix multiplication, polynomial identity verification or min cut identification in large graphs. In all such cases the best available deterministic algorithms have a computational complexity that works well for a small data set but doesn't scale well enough with large volumes of data.

Consider a case where we are given 3 matrices, $A$, $B$ and $C$ and we need to verify if $AB = C$. The standard algorithm for matrix multiplication takes $\Theta(n^3)$ operations and there's also a sophisticated algorithm that works in $\Theta(n^{2.37})$ operations. Instead let's consider some randomization and choose a random vector $\bar{r} = (r_1, r_2, .. r_n) \in \{0, 1\}^n$. Now we can compute $AB\bar{r}$ by first computing $B\bar{r}$ and then $A(B\bar{r})$. And then we compute $C\bar{r}$. If we find $A(B\bar{r}) \neq C\bar{r}$, then $AB \neq C$. Otherwise we return $AB = C$. Instead of matrix-matrix multiplication our randomized algorithm uses matrix-vector multiplication, which can be done in $\Theta(n^2)$ operations the standard way.

Obviously a $\Theta(n^2)$ algorithm has a lower computational complexity than $\Theta(n^3)$ and scales better with larger data sets. Now the question is how accurate is this algorithm ? Is it guaranteed to give the correct answer every time we run it ? As with other probabilistic algorithms, there's a chance that our algorithm will return a wrong result. But as long as we can show that the chance is minimal and can be reduced by tuning some parameters, we should be fine.

It can be shown that if $AB \neq C$ and if $\bar{r}$ is chosen uniformly at random from $\{0, 1\}^n$ then $Pr(AB\bar{r} = C\bar{r}) \leq 1/2$. But the trick is that we can run our randomized algorithm many times, choosing $\bar{r}$ with replacement from $\{0, 1\}^n$. If for any of these trials we get $AB\bar{r} \neq C\bar{r}$, then we can conclude $AB \neq C$. And the probability that we get $AB\bar{r} = C\bar{r}$ for all $k$ trials despite $AB \neq C$ is at most $2^{-k}$. So for $100$ trials, the chance of error is at most $2^{-100}$, which is really small. The detailed proof of this analysis can be found in the excellent book Probability and Computing by Michael Mitzenmacher & Eli Upfal.

Matrix multiplication is something that's used heavily especially in implementing machine learning classifiers. And if we can tolerate that little chance of error we get an algorithm with lower computational complexity that scales much better.
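Just to make the idea concrete, here's a small Scala sketch of the randomized check described above. The matrix representation and names are my own assumptions for illustration, not a tuned implementation ..

import scala.util.Random

object RandomizedMatrixCheck {
  type Matrix = Vector[Vector[Int]]   // row-major

  // matrix-vector multiplication, Θ(n^2) the standard way
  def matVec(m: Matrix, v: Vector[Int]): Vector[Int] =
    m.map(row => row.zip(v).map { case (a, b) => a * b }.sum)

  // one trial: pick r uniformly from {0,1}^n and compare A(Br) with Cr
  def trial(a: Matrix, b: Matrix, c: Matrix): Boolean = {
    val r = Vector.fill(a.length)(Random.nextInt(2))
    matVec(a, matVec(b, r)) == matVec(c, r)
  }

  // run k independent trials; any failed trial proves AB != C, and if all pass
  // we report AB == C with error probability at most 2^(-k)
  def probablyEqual(a: Matrix, b: Matrix, c: Matrix, k: Int = 100): Boolean =
    (1 to k).forall(_ => trial(a, b, c))
}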

Stochastic Gradient Descent


Consider another use case from core machine learning classifier design. Gradient descent is a standard way to minimize the empirical risk, the measure of training set performance. The empirical risk is given by the following equation:
$$E_n(f_w) = \frac{1}{n}\sum_{i=1}^n l(f_w(x_i), y_i)$$
where $l$ is the loss function that measures the cost of predicting $f_w(x_i)$ when the actual answer is $y_i$, $f_w(x)$ is the function parameterized by the weight vector $w$, and $n$ is the number of training examples. Each iteration of gradient descent updates the weights $w$ on the basis of the gradient of $E_n(f_w)$ according to the following iterative step:

$$w_{t+1} = w_t - \gamma \frac{1}{n} \sum_{i=1}^n \nabla_w Q(z_i, w_t)$$
where $Q(z_i, w) = l(f_w(x_i), y_i)$ is the loss on the training example $z_i = (x_i, y_i)$ and $\gamma$ is an adequately chosen gain (the learning rate). Note that a single update step for the parameters runs through all the training examples, and this gets repeated for every update step that you do before convergence. Compare this with Stochastic Gradient Descent (SGD) where the update step is given by the following:

$$w_{t+1} = w_t - \gamma \nabla_w Q(z_t, w_t)$$
Note that instead of running through all the examples and computing the exact gradient, SGD computes the gradient based on one randomly picked example $z_t$. So SGD does a noisy approximation to the true gradient. But since it does not have to process all the examples in every iteration, it scales better with a large data set. In his paper on Large Scale Machine Learning With Stochastic Gradient Descent, Leon Bottou classifies the error in building the classifier into 3 components:

  • Approximation Error, which comes from the fact that the family of functions we choose to optimize over may not contain the optimal function $f^*$, so the best we can pick is only an approximation of it

  • Estimation Error, which comes from the fact that we have a finite number of training examples - it would vanish if we had an infinite number of them

  • Optimization Error, which comes from the fact that we are using an inferior algorithm to estimate the gradient

With normal gradient descent we will have a low optimization error, since we run through all the training examples in every iteration to compute the gradient, which is clearly superior to SGD's noisy approximation. But SGD will report lower approximation and estimation errors, since we will be able to process a larger dataset within the stipulated computation time. So it's a tradeoff that we make using SGD, but clearly we scale better with larger data sets. The sketch below contrasts the two update rules.
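The following is a minimal sketch of the two update rules for a linear model with squared loss, where $\nabla_w Q(z, w) = (w \cdot x - y)\,x$. The data representation and names are assumptions made just for this illustration ..

import scala.util.Random

object GradientDescentSketch {
  type Vec = Array[Double]
  final case class Example(x: Vec, y: Double)

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (p, q) => p * q }.sum

  // gradient of the squared loss Q(z, w) = (w.x - y)^2 / 2 on a single example
  def gradient(w: Vec, z: Example): Vec = {
    val err = dot(w, z.x) - z.y
    z.x.map(_ * err)
  }

  // batch gradient descent: one update step touches every training example
  def batchStep(w: Vec, data: Seq[Example], gamma: Double): Vec = {
    val grads = data.map(gradient(w, _))
    val avg = w.indices.map(i => grads.map(_(i)).sum / data.size)
    w.zip(avg).map { case (wi, gi) => wi - gamma * gi }
  }

  // stochastic gradient descent: one update step uses a single random example
  def sgdStep(w: Vec, data: IndexedSeq[Example], gamma: Double): Vec = {
    val z = data(Random.nextInt(data.size))
    w.zip(gradient(w, z)).map { case (wi, gi) => wi - gamma * gi }
  }
}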

Singular Value Decomposition


Singular Value Decomposition is a dimensionality reduction technique to unearth a smaller number of intrinsic concepts from a high dimensional matrix by removing unnecessary information. It does so by projecting the original matrix on to lower dimensions such that the reconstruction error is minimized. What this means is that given a matrix $A$ we decompose it into lower dimensional matrices by removing the less important information. And we do this in such a way that we can reconstruct a fairly close approximation to $A$ from those lower dimensional matrices. In theory SVD gives the best possible projection in terms of reconstruction error (optimal low rank approximation). But in practice it suffers from scalability problems with large data sets. It generates dense singular vectors even if the original matrix is a sparse one and hence is computationally inefficient, taking cubic time in the size of the data.

This can be addressed by another algorithm, the CUR decomposition, which allows a larger reconstruction error in exchange for less computation time. CUR decomposes the original matrix into matrices of lower dimensions, but uses a randomized algorithm for selecting the columns and rows based on their probability distribution. It can be shown that the CUR reconstruction is just an additive term away from the SVD reconstruction, and it's a probabilistic bound, subject to the condition that we select a sufficient number of columns and rows from $A$. The computational bound of CUR is linear in the size of the data set, which is much less than that of SVD (which, as I mentioned earlier, is cubic). This is yet another example where we apply randomization and probabilistic techniques to scale our algorithm better for larger data sets in exchange for a little amount of inaccuracy - a sketch of the column selection step follows below.
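Here's a rough sketch of the randomized column selection step that such decompositions use, under the assumption that columns are picked with probability proportional to their squared norm and rescaled so that the sample approximates $A$ in expectation. The rest of the CUR reconstruction is omitted, and the representation is purely illustrative ..

import scala.util.Random

object ColumnSampling {
  type Matrix = Vector[Vector[Double]]   // row-major

  private def column(a: Matrix, j: Int): Vector[Double] = a.map(_(j))

  // sample c columns of A with replacement, with probability proportional to
  // their squared norms, rescaling each picked column by 1 / sqrt(c * p_j)
  def sampleColumns(a: Matrix, c: Int): Matrix = {
    val nCols = a.head.length
    val normsSq = Vector.tabulate(nCols)(j => column(a, j).map(x => x * x).sum)
    val probs = normsSq.map(_ / normsSq.sum)
    val cumulative = probs.scanLeft(0.0)(_ + _).tail

    val picked = Vector.fill(c) {
      val r = Random.nextDouble()
      val j = cumulative.indexWhere(_ >= r) match {
        case -1  => nCols - 1
        case idx => idx
      }
      column(a, j).map(_ / math.sqrt(c * probs(j)))
    }
    picked.transpose   // back to row-major: same number of rows, c columns
  }
}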

These are only a few instances of probabilistic bounds being applied to solve real world machine learning problems. There are lots more. In fact I find that the scalability of machine learning has a very direct correlation with the application of probabilistic techniques to the model. As I mentioned earlier, the point of this post is to share some of my thoughts as I continue to learn techniques to scale up machine learning models. Feel free to share your ideas, thoughts and discussions in the comments.

Wednesday, February 11, 2015

Functional Patterns in Domain Modeling - Composing a domain workflow with statically checked invariants

I have been doing quite a bit of domain modeling using functional programming, mostly in Scala. And as it happens when you work on something for a long period of time, you tend to identify more and more patterns that come up repeatedly within your implementations. You may ignore these as patterns the first time, get a feeling of mere coincidence the next time, but the third time really gives you that aha! moment and you feel like documenting it as a design pattern. In the course of my learnings I have started blogging on some of these patterns - you can find the earlier ones in the series here:

  • Functional Patterns in Domain Modeling - The Specification Pattern

  • Functional Patterns in Domain Modeling - Immutable Aggregates and Functional Updates

  • Functional Patterns in Domain Modeling - Anemic Models and Compositional Domain Behaviors

In this continuing series of functional patterns in domain modeling, I will go through yet another idiom which has been a quite common occurrence in my explorations across various domain models. You will find many of these patterns explained in detail in my upcoming book on Functional and Reactive Domain Modeling, the early access edition of which has already been published by Manning.

One of the things that I strive to achieve in implementing domain models is to use the type system to encode as much domain logic as possible. If you can use the type system effectively then you get the benefits of parametricity, which not only makes your code generic, concise and polymorphic, but also makes it self-testing. But that's another story which we can discuss in another post. In this post I will talk about a pattern that helps you design domain workflows compositionally, and also enables implementing domain invariants within the workflow, all done statically with a little help from the type system.

As an example let's consider a loan processing system (simplified for illustration purposes) of the kind typically followed by banks issuing loans to customers. The workflow takes a loan application through a sequence of steps - application, approval and enrichment.

The Domain Model


The details of each step are not important - we will focus on how we compose the sequence and ensure that the API verifies statically that the correct sequence is followed. Let's start with a domain model for the loan application - we will keep enriching it as we traverse the workflow.

case class LoanApplication private[Loans](
  // date of application
  date: Date,
  // name of applicant
  name: String,
  // purpose of loan
  purpose: String,
  // intended period of repayment in years
  repayIn: Int,
  // actually sanctioned repayment period in years
  actualRepaymentYears: Option[Int] = None,
  // actual start date of loan repayment
  startDate: Option[Date] = None,
  // loan application number
  loanNo: Option[String] = None,
  // emi
  emi: Option[BigDecimal] = None
)

Note we have a bunch of attributes that are defined as optional and will be filled out later as the loan application traverses the sequence of workflow steps. Also we have declared the constructor private[Loans], and we will have a smart constructor to create an instance of the class.

Wiring the workflow with Kleisli


Here are the various domain behaviors modeling the stages of the workflow .. I will be using the scalaz library for the Kleisli implementation.

import scalaz._, Scalaz._

// today and calculateEMI are assumed to be helpers defined elsewhere in the module
def applyLoan(name: String, purpose: String, repayIn: Int, 
  date: Date = today) =
  LoanApplication(date, name, purpose, repayIn)

def approve = Kleisli[Option, LoanApplication, LoanApplication] { l => 
  // .. some logic to approve
  l.copy(
    loanNo = scala.util.Random.nextString(10).some,
    actualRepaymentYears = 15.some,
    startDate = today.some
  ).some
}

def enrich = Kleisli[Option, LoanApplication, LoanApplication] { l => 
  //.. may be some logic here
  val x = for {
    y <- l.actualRepaymentYears
    s <- l.startDate
  } yield (y, s)

  l.copy(emi = x.map { case (y, s) => calculateEMI(y, s) }).some
}

applyLoan is the smart constructor that creates the initial instance of LoanApplication. The other 2 functions approve and enrich perform the approval and enrichment steps of the workflow. Note both of them return an enriched version of the LoanApplication within a Kleisli, so that we can use the power of Kleisli composition and wire them together to model the workflow ..

val l = applyLoan("john", "house building", 10)
val op = approve andThen enrich
op run l

When you have a sequence to model that takes an initial object and then applies a chain of functions, you can use plain function composition like h(g(f(x))) or, using the point free notation, (h compose g compose f), or using the more readable order (f andThen g andThen h). But in the above case we need to have effects along with the composition - we are returning Option from each stage of the workflow. So here, instead of plain composition, we need effectful composition of functions and that's exactly what Kleisli offers. The andThen combinator in the above code snippet is actually a Kleisli composition, aka function composition with effects.
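To see what this effectful composition does under the hood, here's a minimal hand-rolled version for Option-returning functions (scalaz's Kleisli generalizes this to any monad F[_]) - approve andThen enrich behaves essentially like composeK applied to the underlying functions ..

// composition with the Option effect: the next stage runs only if the
// previous one produced a Some
def composeK[A, B, C](f: A => Option[B], g: B => Option[C]): A => Option[C] =
  a => f(a) flatMap g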

So we have everything the workflow needs, and clients use our API to construct workflows for processing loan applications. But one of the qualities of good API design is to make it difficult for the client to use the API in the wrong way. Consider what happens with the above design of the workflow if we invoke the sequence as enrich andThen approve. This violates the domain invariant that says enrichment is a process that happens after the approval - approval of the application generates some information which the enrichment process needs to use. But because our types align, the compiler will be perfectly happy to let this semantically invalid composition pass through, and we will have the error reported only at run time.

Remembering that we have a static type system at our disposal, can we do better ?

Phantom Types in the Mix


Let's throw in some more types and see if we can tag in some more information for the compiler to help us. Let's tag each state of the workflow with a separate type ..

trait Applied
trait Approved
trait Enriched

Finally make the main model LoanApplication parameterized on a type that indicates which state it is in. And we have some helpful type aliases ..

case class LoanApplication[Status] private[Loans]( //..

type LoanApplied  = LoanApplication[Applied]
type LoanApproved = LoanApplication[Approved]
type LoanEnriched = LoanApplication[Enriched]

These types will have no role in modeling domain behaviors - they will just be used to dispatch to the correct state of the sequence that the domain invariants mandate. The workflow functions need to be modified slightly to take care of this ..

def applyLoan(name: String, purpose: String, repayIn: Int, 
  date: Date = today) =
  LoanApplication[Applied](date, name, purpose, repayIn)

def approve = Kleisli[Option, LoanApplied, LoanApproved] { l => 
  l.copy(
    loanNo = scala.util.Random.nextString(10).some,
    actualRepaymentYears = 15.some,
    startDate = today.some
  ).some.map(identity[LoanApproved])
}

def enrich = Kleisli[Option, LoanApproved, LoanEnriched] { l => 
  val x = for {
    y <- l.actualRepaymentYears
    s <- l.startDate
  } yield (y, s)

  l.copy(emi = x.map { case (y, s) => calculateEMI(y, s) }).some.map(identity[LoanEnriched])
}

Note how we use the phantom types within the Kleisli and ensure statically that the sequence can flow only in one direction - the one that the domain invariant mandates. So now an invocation of enrich andThen approve will result in a compilation error because the types don't match. Once again, yay! for having the correct encoding of domain logic with proper types.
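Here's how the invariant shows up at the use site - the valid sequence composes as before, while the invalid one is rejected by the compiler (the exact error message will vary) ..

val l  = applyLoan("john", "house building", 10)

val op = approve andThen enrich      // Kleisli[Option, LoanApplied, LoanEnriched]
op run l                             // ok

// val bad = enrich andThen approve  // does not compile: enrich produces a
//                                   // LoanEnriched but approve needs a LoanApplied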

Thursday, January 01, 2015

Probabilistic techniques, data streams and online learning - Looking forward to a bigger 2015

I look forward to 2015 as the year when randomized algorithms, probabilistic techniques and data structures become more pervasive and mainstream. The primary driving factors for this will be the growing prevalence of big data and the necessity to process it in near real time using minimal (or constant) memory bandwidth. You are given data streams where possibly you will see each data element only once in your lifetime, and you need to churn out analytics from them in real time. You cannot afford to store all of them in a database on disk, since that will incur an unrealistic performance penalty for serving queries in real time. And you cannot afford to store all the information in memory even if you add RAM at will. You need to find clever ways to optimize your storage, and employ algorithms and data structures that use sublinear space and yet deliver information in real time.

Many such data structures - bloom filters, count-min sketch and hyperloglog, to name a few - are already being used quite heavily for specialized processing of data streams ..


These data structures are becoming more and more useful as we prepare to embrace and process larger data sets with fairly strict online requirements. And they have started making a difference. Take for example Impala, the open source analytic database from Cloudera that works on top of Hadoop. Impala's NDV aggregate function (number of distinct values) uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space. This blog post has the details of the performance improvement that it offers in comparison to the standard distinct count. The immensely popular NoSQL store Redis also offers a HyperLogLog implementation that you can use to get an approximation of the cardinality of a set using randomization. Salvatore has the details here on the implementation of the HyperLogLog algorithm in Redis.
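The basic intuition behind this family of algorithms can be captured in a few lines - hash every element and keep track of the maximum number of leading zero bits seen. This toy sketch is nowhere close to the real HyperLogLog (which splits the hash space into many registers and applies bias correction), it's just to show the flavor of the idea ..

import scala.util.hashing.MurmurHash3

object LeadingZeroSketch {
  // rank = 1-based position of the leftmost 1-bit in the 32-bit hash
  private def rank(h: Int): Int = Integer.numberOfLeadingZeros(h) + 1

  // with n distinct uniformly hashed values the maximum rank is around log2(n),
  // so 2^maxRank is a crude (high variance) estimate of the cardinality
  def estimateDistinct(items: Iterator[String]): Double = {
    val maxRank = items.map(s => rank(MurmurHash3.stringHash(s))).foldLeft(0)(math.max)
    math.pow(2, maxRank)
  }
}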

The most important reason these algorithms and data structures are becoming popular is the increased focus on our "online" requirements. We are not only processing bigger and bigger data sets, we need results faster too. We just cannot afford to push all analytics to batch mode and expect results to come out of an overnight batch run. Various architectural paradigms like the lambda architecture also aim to address this niche area. But before investing in such complex architectures, often some neat data structures that use probabilistic techniques and randomization may offer the much lighter weight solution that you are looking for.

Consider processing the Twitter stream and generating analytics (of whatever form) online. This means that immediately after seeing one tweet you must be able to predict something and update your model at the same time. Which means you need to use the data that you see in the feed to update your model, and yet you cannot store the entire hose that you have seen so far. This is online learning, and it is the essence of techniques like stochastic gradient descent that help you do this - the model is capable of making up to date predictions after every data point that it sees. John Myles White has an excellent presentation on this topic.

Consider this other problem of detecting similarities between documents. When you are doing this at Web scale you will have to deal with millions of documents to find the similar sets. There are techniques like minhash which enable you to compress documents into signature matrices. But even then the scale becomes too big to be processed and reported to the user in a meaningful amount of time. As an example (from Mining Massive Datasets), if you process 1 million documents using signatures of length 250, you still have to use 1000 bytes per document - the total comes to 1 gigabyte, which very well fits into the memory of a standard laptop. But when you check for similar pairs, you need to process (1,000,000 choose 2) or half a trillion pairs of documents, which will take almost 6 days to compute on a laptop. Enter probabilistic techniques, and the locality sensitive hashing (LSH) algorithm fits this problem like a charm. Detecting similarity is a problem that arises in recommender systems with collaborative filtering, and LSH can be used there as well. The basic idea of LSH as applied to similarity detection is to hash the items multiple times and identify candidate pairs that qualify for similarity checking. The idea is to reduce the search space using probabilistic techniques so that we can eliminate the class of candidates that have a very low chance of being similar.
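As a small illustration of the minhash idea mentioned above, here's a sketch that compresses a document (viewed as a set of tokens or shingles) into a short signature, so that the fraction of agreeing signature positions estimates the Jaccard similarity of the underlying sets. Using MurmurHash3 seeded with the row index as the hash family is an assumption made just for this sketch ..

import scala.util.hashing.MurmurHash3

object MinHashSketch {
  // one minhash per hash function: the minimum hash value over all tokens
  // (assumes a non-empty document)
  def signature(doc: Set[String], numHashes: Int): Vector[Int] =
    Vector.tabulate(numHashes) { seed =>
      doc.map(token => MurmurHash3.stringHash(token, seed)).min
    }

  // fraction of matching positions estimates the Jaccard similarity
  def estimatedSimilarity(s1: Vector[Int], s2: Vector[Int]): Double =
    s1.zip(s2).count { case (a, b) => a == b }.toDouble / s1.length
}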

Here I have only scratched the surface of the areas where we apply randomization and probabilistic techniques to solve problems that are very real today. There are plenty of other areas in data mining, graph clustering, machine learning and big data processing where similar techniques are employed to reduce the curse of dimensionality and provide practical solutions at scale. 2014 has already seen a big surge in the popularization of these techniques. I expect 2015 to be bigger and more mainstream in terms of their usage.

Personally I have been exploring data stream algorithms a lot and have prepared a collection of some useful references. Feel free to share it in case you find it useful. I hope to do something more meaningful with stream processing data structures and online learning in 2015. Have a very happy and joyous new year ..