Out of the Box

Agile, Scrum, TDD, Software Craftsmanship, Systems Thinking

Sunday, October 9, 2016

Simple decision trees using XML in F# and the FSharp.Data.Xsd package

This post is about the new XML-based "Open" and "Save" features, and is a follow-up to my latest post about the "animal quiz" game
 (http://tonyxzt.blogspot.it/2016/09/adding-simple-gtk-gui-in-console-f.html). The "game" is a simple dialog-based yes/no decision tree builder running on the console or on gtk (enabled by the "gui" command-line parameter), written in F# (under Mac OS X, using Xamarin and Visual Studio Code+Ionide).

Sources are on github: https://github.com/tonyx/fsharpanimalquizgtk

Here I share some ideas about the solution I implemented for those "Open" and "Save" features.

I used an XML/XSD package available via NuGet:

PM> Install-Package FSharp.Data.Xsd

(webpage: https://giacomociti.github.io/FSharp.Data.Xsd )

For me, a convenient way to represent the tree in XML is illustrated by the following examples.

A single-node decision tree containing just one element:

<node>
    <animal>
        elephant
    </animal>
</node>

A two-leaf decision tree, with a single yes/no question at the root to distinguish the two leaf (i.e. single-animal) nodes:


<node>
    <question>
        is it big?
    </question>       
    <yesBranch>
       <node>
          <animal>
             elephant
          </animal>
       </node>
    </yesBranch>
    <noBranch>
        <node>
           <animal>
              cat
           </animal>
        </node>
    </noBranch>
</node>


The following XML schema can validate the previous instances (and more, of course, which I verified with some NUnit-based tests):


<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified"
    attributeFormDefault="unqualified">
    <xs:element name="node">
        <xs:complexType>
            <xs:sequence>
                <xs:choice>
                    <xs:element name="animal" type="xs:string"/>
                    <xs:sequence>
                        <xs:element name="question" type="xs:string"/>
                        <xs:element name="yesBranch">
                            <xs:complexType>
                                <xs:sequence>
                                    <xs:element ref="node"/>
                                </xs:sequence>
                            </xs:complexType>
                        </xs:element>
                        <xs:element name="noBranch">
                            <xs:complexType>
                                <xs:sequence>
                                    <xs:element ref="node"/>
                                </xs:sequence>
                            </xs:complexType>
                        </xs:element>
                    </xs:sequence>
                </xs:choice>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>


(I don't know if this implementation is the simplest possible, but I know that it works)

According to this schema, a node is one of two choices:
a single element named animal, or a sequence of the following three elements:
- a question (to distinguish categories of animals by yes or no)
- a "yesBranch" (which contains the subtree related to the "yes" answer to that question)
- a "noBranch" (which contains the subtree related to the "no" answer to that question)

yesBranch and noBranch of course use recursion, so they contain the "node" element again, which is the root of this XML type.

I need to convert the XML tree from and to the internal representation of the tree. The two are logically equivalent (though field names may differ slightly from the corresponding XML elements, which is still fine for me):

type KnowledgeTree = AnimalName of string | SubTree of Tree
and Tree = {Question: string; YesBranch: KnowledgeTree; NoBranch: KnowledgeTree}
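
For instance, the two-leaf XML example above corresponds to this internal value (a sketch added here for illustration; the name bigAnimalTree is not from the repository):

let bigAnimalTree =
    SubTree { Question = "is it big?"
              YesBranch = AnimalName "elephant"
              NoBranch = AnimalName "cat" }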



Converting the internal tree representation to the corresponding XML string is based on producing plain XML text recursively:

let rec treeToXml tree =
    match tree with 
        | AnimalName name -> "<node><animal>" + name + "</animal></node>"
        | SubTree {Question=question; YesBranch=yBranch; NoBranch=nBranch} ->
           "<node><question>" + question + "</question>" + "<yesBranch>" + treeToXml yBranch + "</yesBranch><noBranch>" + treeToXml nBranch + "</noBranch></node>"



(This function does not depend on the schema: if I want to change the schema, I need to change treeToXml accordingly; it is only a few lines of code anyway.)

To convert the XML representation into an object, I followed the official documentation, which introduces an XmlProvider type that needs to see the schema.

For instance I have

 type KnowledgeBaseXmlSchema = XmlProvider<Schema="""
        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
            elementFormDefault="qualified" attributeFormDefault="unqualified">
            <xs:element name = "node">
                <xs:complexType>      

  ....


and then use the Parse method from the "KnowledgeBaseXmlSchema" as follows:

let xmlWrapped = KnowledgeBaseXmlSchema.Parse """<node>
....


Now the xmlWrapped object is coherent, in terms of its fields and their types, with the XML datatype defined by the schema. In particular, optional nodes are represented by option types.

Let's go through this a little:
The Animal obtained from KnowledgeBaseXmlSchema.Parse is an option because the corresponding element in the schema is declared optional as well. The Visual Studio Code+Ionide editor helps detect this kind of thing statically.





The other way around is also true: if an element is mandatory, it is not "wrapped" in an option.
So in the following lines of code, myNode is not an option; its type is
XmlProvider<...>.Node, and in fact it is a mandatory element according to the XML schema.
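
A minimal sketch of that kind of code (this snippet is mine, added for illustration; the member names follow those used in the xmlToTree function below):

// the parsed root node is mandatory, so myNode itself is not an option...
let myNode = KnowledgeBaseXmlSchema.Parse "<node><animal>elephant</animal></node>"
// ...while the <animal> child is one of two choices in the schema, so it is a string option
match myNode.Animal with
| Some name -> printfn "leaf node: %s" name
| None -> printfn "inner node with a question"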






All that said, let's take a look at the xmlToTree function:

let rec xmlToTree (tree:KnowledgeBaseXmlSchema.Node) =      
    match tree.Animal with
    | Some x -> AnimalName x
    | None ->  match (tree.Question,tree.YesBranch,tree.NoBranch) with 
      | (Some y1,Some y2,Some y3) -> SubTree {Question=y1; YesBranch=xmlToTree y2.Node; NoBranch=xmlToTree y3.Node}
      | _ -> failwith "error wrapping xml tree"  
    



I had to specify the type (KnowledgeBaseXmlSchema.Node) of the parameter tree, because type inference does not work in some cases (and this is one of those cases).

(This is not terrible: in the rare cases where type inference doesn't work, I can just inspect the type of the argument we are going to pass and then adjust the function signature accordingly.)

As I mentioned, the "Animal" is an instance of an option type so I have to pattern match it by the "Some X" syntax, and if it the match ha success, then the corresponding internal tree can be simply instantiated by:

AnimalName 

(single node tree)

If the root is not an animal (leaf node), then it matches the "None" branch of the pattern matching instead, and in that case I assume there must be a sequence of the following elements: a Question, a YesBranch and a NoBranch, which are options as well.
So I match them as a triple like (Some y1, Some y2, Some y3) and simply build a "SubTree" from the Question and the recursive calls of the same function on the yesBranch and noBranch subnodes.
(note that this implementation is recursive, but it is not tail-recursive)
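
Putting the two directions together, a quick round-trip sketch (my own, assuming the functions defined above):

let original = SubTree { Question = "is it big?"
                         YesBranch = AnimalName "elephant"
                         NoBranch = AnimalName "cat" }
// serialize to XML text, parse it back through the provider, and rebuild the internal tree
let roundTripped = xmlToTree (KnowledgeBaseXmlSchema.Parse (treeToXml original))
// roundTripped should be structurally equal to original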

(NUnit tests are available in this file on GitHub if you want to play more with what I share in this post: https://github.com/tonyx/fsharpanimalquizgtk/blob/master/fsharpAnimalQuizKata/XmlTests.fs)

In conclusion:
- option types are useful
- the fact that optional nodes are represented by option types when using an XML/XSD processing library is very useful
- kudos to the static analysis provided by editors; it helps check these things before compiling or testing

Possible new features I may add to this toy project:
- adding scrollbars to the main window (to easily visualize trees as they grow)
- indenting the XML output files, to make them more human readable
- better error handling in the XML processing, and asking for confirmation before overwriting an existing file, for obvious reasons.

Thanks for reading. Enjoy.
Toni.




Saturday, September 24, 2016

Adding a simple gtk# GUI to a console F# application

In this post I talk briefly about how I added a simple gtk# interface to a program I wrote some time ago as an exercise in F#.
(Animal Quiz kata: http://tonyxzt.blogspot.it/2014/02/animal-quiz-problem-in-f.html)

The first version was command-line based, and was developed with extensive use of unit tests and interaction tests (using the Rhino Mocks framework).

That version is still included in this new version (git: https://github.com/tonyx/fsharpanimalquizgtk), except that it is now possible to run it in a gtk# GUI widget by adding the parameter "gui" to the command line.

For instance, under mono, I will run it as
> mono fsharpAnimalQuizKata.exe gui

and the following window will appear:




The interaction with the user is still text based, as in the console version: the user dialogues with the "engine" through the textbox and the textbox's label, instead of standard input/standard output.

Now the knowledge tree can be visualised, hidden, or expanded using the three available buttons.

The knowledge base starts from a single node with the animal "elephant" in it.

An example of an interaction that makes the engine add a new node with a "cat" is:















After this dialogue, the knowledge tree can be expanded and we can see the following:





Now the loop starts again, and in a new dialogue we can feed the engine so that it stores a new node "ant", which is small and is an insect; expanding the knowledge tree we can then see the following:





Essentially, the logic of using gtk# in F# is the same as using it in C#.
I've found that translating existing C#-based gtk# tutorials to F# requires:
- declaring as mutable any objects that will be reassigned
- rewriting the event handling: when it has the following form in C#:

// adding handler
btn.Clicked += on_btn_clicked;

// declaration of the handler
void on_btn_clicked (object obj, EventArgs args) {
// method body
}

An equivalent in F# is 

// adding handler
do btn.Clicked.AddHandler(fun i e -> this.on_btn_clicked(i,e))


// declaration of the handler
member this.on_btn_clicked(i,e: EventArgs) =
// method body


Using MonoDevelop/Xamarin I can create an empty F# gtk# template project that starts with an empty window. In my experience, starting from that and consulting any C# gtk# tutorial to add buttons, menus, etc. is good enough to play with a gtk# GUI.
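
As a reference, here is a minimal "window with one button" sketch in F# (my own illustration, not code taken from the repository):

// minimal gtk# program: one window, one button, quit on window close
open Gtk

Application.Init()
let win = new Window("animal quiz")
win.DeleteEvent.Add(fun _ -> Application.Quit())   // close the app when the window is closed
let btn = new Button("Show tree")
btn.Clicked.Add(fun _ -> printfn "button clicked") // handler attached the F# way
win.Add(btn)
win.ShowAll()
Application.Run()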

(I'm sure it should also work on .NET and Windows, but I have not tried it yet.)

To "wrap" a knowledge tree to a gtk# TreeStore view I used this function

let rec treeToTreeStore tree (store: TreeStore) iter prefix =
    match tree with 
        | AnimalName name -> 
             do store.AppendValues(iter, [|prefix + name|]) |> ignore 
        | SubTree {Question=quest; YesBranch=yBranch; NoBranch=nBranch} -> 
             let innerIter = store.AppendValues(iter, [|prefix + quest|])
             do treeToTreeStore yBranch store innerIter "YES: " 
             do treeToTreeStore nBranch store innerIter "NO:  "
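
A hypothetical usage sketch (mine, for illustration; names like sampleTree are not from the repository), showing how the function can fill a single-column TreeStore displayed by a TreeView:

let store = new Gtk.TreeStore(typeof<string>)
let root = store.AppendValues("knowledge tree")   // top-level row acting as the root of the view
let sampleTree =
    SubTree { Question = "is it big?"
              YesBranch = AnimalName "elephant"
              NoBranch = AnimalName "cat" }
treeToTreeStore sampleTree store root ""
let view = new Gtk.TreeView(store)
view.AppendColumn("node", new Gtk.CellRendererText(), "text", 0) |> ignore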



Just as a reminder, the knowledge tree type is defined as follows:

type KnowledgeTree = AnimalName of string | SubTree of Tree
and Tree = {Question: string; YesBranch: KnowledgeTree; NoBranch: KnowledgeTree}


which means that a knowledge tree is either a leaf node consisting of a string (the name of the animal), or a yes/no question and two subtrees (one leading to the animals for which the answer is "yes" and the other to the animals for which the answer is "no").

I found it reasonable to make extensive use of automated testing of the console-based version, by mocking standard input and standard output, while avoiding tests tied to the GUI.

Mocking standard output and standard input only requires abstractions of this kind:

type OutStream =
    abstract Write: string -> unit

type InStream =
    abstract Input: unit -> string
that will be implemented in the real application as follows:

type ConsoleOutput() =
    interface OutStream with
       member this.Write(x) = printf "%s\n" x

type ConsoleInput() =
    interface InStream with
        member this.Input() = Console.ReadLine()


whereas mocks can be used to simulate the input/output interaction (by simulating that the user writes some input values, and by adding expectations about the values that the engine will write):


[<Test>]
    let ``with a one leaf node knowledge tree, engine asks if it is the animal on that node, the user says yes, and so the engine answer with "yeah!"``() =
        // setup 
        let mockrepo = new MockRepository()
        let inStream = mockrepo.StrictMock<InStream>()

        let outStream = mockrepo.StrictMock<OutStream>()

        // setup expectations
        Expect.Call(outStream.Write("think about an animal")) |> ignore
        Expect.Call(inStream.Input()).Return("whatever") |> ignore

        Expect.Call(outStream.Write("is it a cat?")) |> ignore     
        Expect.Call(inStream.Input()).Return("yes") |> ignore

        Expect.Call(outStream.Write("yeah!")) |> ignore 
        Expect.Call(inStream.Input()).Return("anything") |> ignore
          
        mockrepo.ReplayAll()

        // arrange
        let tree = AnimalName "cat"
        let initState = Welcome
        let numOfLoops = 3

        // act
        runUntilIndex outStream inStream tree initState numOfLoops |> ignore

        // verify expectations
        mockrepo.VerifyAll()



I guess it would be more complicated to set up a similar mocking mechanism for the GUI as well, and with less value, because the GUI is just a thin layer on top of the engine, which is already extensively tested.

A special mention goes to Visual Studio Code with Ionide as an editor, with some autocompletion features that are missing in MonoDevelop/Xamarin.

Future work could be related to
-saving/loading trees, and
-editing tree nodes in the tree view.

Toni.













Friday, March 21, 2014

Venn diagrams and Categorical Propositions in F#

Categorical propositions express specific relations between elements of different sets, in terms of a quantifier, which can be Universal (like All, or None) or Particular (Some), and of a quality, positive or negative ("is" or "is not").

Examples: All Mammals are Animals, No Mineral is Alive, Some people are Smart, Some people are Not Smart.
In this post I am going to show some F# code I wrote representing categorical propositions, and in particular their union, with an internal structure that reflects the representation based on Venn diagrams.
This can be used to verify the compatibility between different propositions in an automated way, in the same way we use the simple rules of Venn diagrams.
Example: "All S is P" and "Some S is not P" are not compatible. Venn diagrams show in this case that the two diagrams can't be overlapped.
A region of the diagrams can be blank, can have a '*', or can be black-filled, but we can't have a '*' (meaning: there is something here) and a black area (meaning: there is nothing here) at the same time.

The building blocks are the four basic forms of categorical propositions, classified as Affirmative or Negative, and Universal or Particular, as follows:



They have a representation based on Venn diagrams:




Venn diagrams in this case consist of two partially overlapping circles indicating two sets, which form three regions. The left region indicates what is in S but not in P, the middle region what the two have in common, and the right region what is in P but not in S.
To indicate that a region is empty we black-color it, and to indicate that a region must contain something we put an asterisk in it. It is not possible to black-color a region and put an asterisk in it at the same time, because that would lead to a contradiction (like "All S is P" and "Some S is not P", which is impossible).

Symmetric basic-form diagrams (E and I) represent propositions that are also valid in reverse: Some S is P also means Some P is S, and No S is P also means No P is S.

As long as we avoid placing an asterisk and black color in the same region (which leads to a contradiction), we can have diagrams that are more complex than the basic forms and still valid (consistent), representing more propositions, for example as follows:

This diagram means: Some S is not P, No S is P, No P is S, and Some P is not S.

I want to write some F# code solving the problem of translating any Venn diagram like the previous one, involving two sets, into the list of all the categorical propositions that follow from it.

I'll use records to represent both the diagram and a categorical proposition.
To show what I am aiming for, this is a possible automated NUnit test related to the "A" proposition:

[<Test>]
    let ``A Venn diagram to proposition``()=
        let allSIsP = {S=BlackFilled; SP=Empty;P=Empty}
        let expectedProposition = [{quantifier1=All;category1=S;appartenence=Is;category2=P}]
        let actual = VennToPropositions allSIsP
        Assert.AreEqual(expectedProposition, actual)




I need to define:
a type for any diagram involving two categories, which will be a record:

type twoSetsVennDiagram = {S: SectionStatus; SP: SectionStatus; P: SectionStatus}

A "SectionStatus" to set each of its region with the appropriate status:

type SectionStatus = Empty | BlackFilled | Starred

A "Propositions" type (again, as a record):

type twoTermsProposition = {quantifier1: Quantifier; category1: CategoryRef; appartenence: Appartenence; category2: CategoryRef}

Quantifier will be:

type Quantifier = All | Somes | No

("Some" is an F# keyword, so to avoid confusion I chose "Somes")

Appartenence (i.e. "belonging") will be:

type Appartenence = Is | IsNot

CategoryRef will simply be a possible name of the set involved:

type CategoryRef = S | M | P 

("S", "M", "P" are typical names used to model categorical propositions and syllogisms  - which I'm not going to write about this time, so they are just "typical names" given to sets involved in categorical forms. For instance in the classical syllogism "Socrates is human, all humans are mortal, Socrates is Mortal, S is Socrates (or "everything that is Socrate"), M is "everything that is human" and P is "everything that is mortal").


The implementation aimed just at passing this test is simply:

let VennToPropositions diagram = [{quantifier1=All;category1=S;appartenence=Is;category2=P}]

I expect that I will need a sort of lookup table for the basic diagrams (those with only one non-empty region, in BlackFilled or Starred state), and that I can add one element at a time to this lookup, by adding a new test for each case:

       


let VennToPropositions diagram =
    match diagram with 
        | {S=Empty; SP=Empty; P=Empty} -> []
        | {S=BlackFilled; SP=Empty; P=Empty} -> [{quantifier1=All;category1=S;appartenence=Is;category2=P}]
        | {S=Empty; SP=Empty; P=BlackFilled} -> [{quantifier1=All;category1=P;appartenence=Is;category2=S}]
        | {S=Empty; SP=BlackFilled; P=Empty} -> [{quantifier1=No;category1=S;appartenence=Is;category2=P};{quantifier1=No;category1=P;appartenence=Is;category2=S}]
        | {S=Empty; SP=Starred; P=Empty} -> [{quantifier1=Somes;category1=S;appartenence=Is;category2=P};{quantifier1=Somes;category1=P;appartenence=Is;category2=S}]
        | {S=Starred; SP=Empty; P=Empty} -> [{quantifier1=Somes;category1=S;appartenence=IsNot;category2=P}]
        | {S=Empty; SP=Empty; P=Starred} -> [{quantifier1=Somes;category1=P;appartenence=IsNot;category2=S}]






       
 
Basically, the cases considered in this "lookup" table are the four basic forms, the reverses of the ones that are not symmetric, and the completely empty diagram:







But the match against more complex diagrams, like the following, is still missing:


I can see that this diagram is the union of the following basic forms:





















The resulting propositions are the union of them.

(Extreme case: when all the regions are filled.
The result may look weird: "All M is P - No M is P - No P is M - All P is M", but it can be sound anyway, because it applies when M and P are both empty: the "All" quantifier does not commit to existence, but the "Some" does.)

The test for a function that decomposes such a diagram could be as follows:

[<Test>]
    let ``complex decomposition``()=
         let SomeSIsNoP = {S=Starred; SP=BlackFilled;P=Starred}
         let actual = basicCategoricalDecomposition SomeSIsNoP
         let expected = [{S=Starred;SP=Empty;P=Empty};{S=Empty;SP=BlackFilled;P=Empty};{S=Empty;SP=Empty;P=Starred}]
         Assert.AreEqual(expected,actual)



So, any diagram that is more complex than those six forms can still be decomposed into basic forms; these can then be matched, returning the basic propositions, and finally all the propositions can be joined together.

The basicCategoricalDecomposition function is:


       



let basicCategoricalDecomposition diagram =
        match diagram with
            | {S=s_pattern; SP=Empty; P=Empty} when s_pattern <> Empty -> [{S=s_pattern; SP=Empty; P=Empty}] // A or O
            | {S=Empty; SP=sp_pattern; P=Empty} when sp_pattern <> Empty -> [{S=Empty; SP=sp_pattern; P=Empty}] // E or I
            | {S=Empty; SP=Empty; P=p_pattern} when p_pattern <> Empty -> [{S=Empty; SP=Empty; P=p_pattern}]
            | {S=s_pattern; SP=sp_pattern; P=Empty} -> [{S=s_pattern; SP=Empty; P=Empty};{S=Empty; SP=sp_pattern; P=Empty}]
            | {S=Empty; SP=sp_pattern; P=p_pattern} -> [{S=Empty; SP=sp_pattern; P=Empty};{S=Empty; SP=Empty; P=p_pattern}]
            | {S=s_pattern; SP=Empty; P=p_pattern} -> [{S=s_pattern; SP=Empty; P=Empty};{S=Empty; SP=Empty; P=p_pattern}]
            | {S=s_pattern; SP=sp_pattern; P=p_pattern} -> [{S=s_pattern; SP=Empty; P=Empty};{S=Empty; SP=sp_pattern; P=Empty};{S=Empty; SP=Empty; P=p_pattern}]
            | _ -> []


       
 



So far, if VennToPropositions matches a basic diagram, it works as a lookup table.
If I pass it a more complex diagram, every match fails, so I add a default matching rule that calls basicCategoricalDecomposition and then recursively applies VennToPropositions to get the union of all the propositions.

Here is the complete VennToPropositions function:




       


let rec VennToPropositions diagram =
    match diagram with 
        | {S=Empty; SP=Empty; P=Empty} -> []
        | {S=BlackFilled; SP=Empty; P=Empty} -> [{quantifier1=All;category1=S;appartenence=Is;category2=P}]
        | {S=Empty; SP=Empty; P=BlackFilled} -> [{quantifier1=All;category1=P;appartenence=Is;category2=S}]
        | {S=Empty; SP=BlackFilled; P=Empty} -> [{quantifier1=No;category1=S;appartenence=Is;category2=P};{quantifier1=No;category1=P;appartenence=Is;category2=S}]
        | {S=Empty; SP=Starred; P=Empty} -> [{quantifier1=Somes;category1=S;appartenence=Is;category2=P};{quantifier1=Somes;category1=P;appartenence=Is;category2=S}]
        | {S=Starred; SP=Empty; P=Empty} -> [{quantifier1=Somes;category1=S;appartenence=IsNot;category2=P}]
        | {S=Empty; SP=Empty; P=Starred} -> [{quantifier1=Somes;category1=P;appartenence=IsNot;category2=S}]
        | _ -> List.fold (fun acc item -> acc @ (VennToPropositions item)) [] (basicCategoricalDecomposition diagram)



      
 



This last line is the one which decomposes the diagram into basic forms and then recursively calls VennToPropositions.

So far, any valid two-set Venn diagram corresponds to a set of one or more basic categorical propositions that are consistent with each other.
Using the F# REPL it is possible to test the code immediately, or to write new tests.
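
For example, a quick REPL check of the "all regions filled" extreme case mentioned earlier (the expected result is my own reading of the lookup table and the decomposition, not output copied from a session):

let allFilled = {S = BlackFilled; SP = BlackFilled; P = BlackFilled}
VennToPropositions allFilled
// expected: the union of the propositions of the three basic diagrams it decomposes into,
// i.e. "All S is P", "No S is P", "No P is S" and "All P is S"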
If needed, the sources are available on GitHub.

Page 184 of "Understanding Arguments" provides problems that can be solved using this "VennToPropositions" function (though of course they are not supposed to be solved with an automated tool).

A possible extension could be evaluating categorical syllogisms, i.e. particular inferences involving two categorical propositions and three sets, but that may be something for another post.










Tuesday, March 11, 2014

kMeans in F#

This post is about the k-means algorithm, and a simple implementation of it in F#.

A quick definition of the problem: given a set of points and an integer n, divide the set into n partitions in a way that minimises the within-cluster sum of squares (WCSS).

According to the literature, the algorithm is fast but suboptimal, i.e. it converges to a local minimum.
It takes an initial guess of centroids, then calculates the clusters associated with each guessed centroid, then recalculates the centroid of each of those clusters, and so on, until there is no difference between the set of clusters and the one of the next iteration.

You should run it several times, and with different values of n, and then take the best result (according to the objective function minimising the within-cluster sum of squares).

For simplicity, I assumed only two-dimensional points, but real problems are more complex: they have more features, which need to be represented by more dimensions.
Anyway, this is the definition of a point:

type Point = {X: float; Y: float}

I describe here functions useful for the final algorithm.

Measure of a distance between two points:

let quadraticDistance point1 point2 = 
    Math.Pow((point1.X-point2.X),2.0)+ Math.Pow((point1.Y-point2.Y),2.0)


Any point has to be associated with the cluster represented by its closest centroid:

let closestCentroid point centroids  =
    let precondition = match centroids with | [] -> failwith "Centroid list should be not empty" | _ -> true
    List.fold (fun acc item -> if ((quadraticDistance item point) < (quadraticDistance acc point)) then item else acc) (List.head centroids) centroids


(never mind the "precondition" check: it just creates an exception with a description, avoiding having a more obscure error of calling List.head on an empty list)

Given a segment, I need to find its real centroid:

let centroid segment = 
    let checkPrecondition = match segment with |[] -> failwith "precondition error, list of points expected to be not empty" | _ -> true
    let Xs = List.sumBy (fun item -> item.X) segment
    let Ys = List.sumBy (fun  item -> item.Y) segment
    let centroid = {X=(Xs/(float (List.length segment)));Y=(Ys/(float (List.length segment)))}
    centroid


(Again: the "precondition" make sure that if there is an error condition from empty segment we get the specific message, instead of a "division by zero")

Getting the centroids of several segments all at once is easy using the map function:

let centroids segments =
  List.map (fun x -> centroid x) segments



Now I have almost all the elements to build up the algorithm, so I can try to implement it. This time I use mutable variables in the loop (so that I "reuse" the same variables, instead of effectively creating new ones at each iteration).

I keep track of the centroids of the current iteration and the associated clusters, and of the centroids of the next iteration. If the clusters do not change, we stop. List comparison works as long as the lists are sorted first.


let iterateFindingFinalCentroids points initialCentroids =
    let mutable currentCentroids = initialCentroids
    let mutable segments = segmentPointsByClosestCentroid points currentCentroids
    let mutable nextCentroids = centroids segments 
    let mutable nextSegments = segmentPointsByClosestCentroid points nextCentroids
    while ((List.sort nextSegments) <> (List.sort segments)) do
        currentCentroids <- nextCentroids
        segments <- nextSegments
        nextCentroids <- centroids nextSegments
        nextSegments <- segmentPointsByClosestCentroid points nextCentroids
    nextCentroids



One missing piece here is a function that separates all the points into different clusters on the basis of their proximity to the given centroids.

I found it convenient to do this in two steps: tagging each point with its closest centroid, and then aggregating the points tagged with the same centroid.

A tagged-point type can be useful, and is perhaps better than a simple pair, because it is more expressive:

type TaggedPoints = {Point: Point; AssignedCentroid: Point}


let tagPointsWithClosestCentroid centroids segment =
    List.fold (fun acc item -> {Point=item; AssignedCentroid = (closestCentroid item centroids)}::acc) [] segment


To put the points in different clusters:

let segmentPointsByClosestCentroid points centroids =
    let taggedPoints = tagPointsWithClosestCentroid centroids points
    List.fold (fun acc item -> (List.filter (fun p -> p.AssignedCentroid = item) taggedPoints |> List.map (fun taggedPoint -> taggedPoint.Point))::acc) [] centroids
   

(A next step could be making tagPointsWithClosestCentroid internal to this last function.)

For a description of the functions used here, see MSDN (for example, this is the entry about how to use List.fold: http://msdn.microsoft.com/en-us/library/ee353894.aspx).

I played a little bit with artificial data: I generated four clusters from two-dimensional Gaussians, expecting the centroids found to be the centers (the 'μ') of those distributions, and that is actually what happened (except when I introduced a subtle "bug" into the algorithm).

For that experiment I wrote some code in unit tests.

The Normal distributions have their 'μ' (their centers) equal to (1,1), (1,8), (8,8), (8,1) respectively, and their variances (σ^2) all equal to 2.

The algorithm actually converges to values close to the 'μ' of the random normal distributions used to generate the points.

I'll show some code for generating clusters from Normal distributions. This array contains the four sources of random data:

 let rnd = System.Random()
            let generators = [|
              {XGauss=new Normal(1.0,2.0);YGauss= new Normal(1.0,2.0)};
              {XGauss=new Normal(1.0,2.0);YGauss= new Normal(8.0,2.0)};
              {XGauss=new Normal(8.0,2.0);YGauss= new Normal(8.0,2.0)};
              {XGauss=new Normal(8.0,2.0);YGauss= new Normal(1.0,2.0)}
            |]
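
(The record type of these generators is not shown in the post; presumably it is something like the following, with Normal coming from a library such as MathNet.Numerics. This sketch, including the name PointGenerator, is an assumption on my part:)

open MathNet.Numerics.Distributions

// assumed shape: one one-dimensional Gaussian sampler per coordinate
type PointGenerator = {XGauss: Normal; YGauss: Normal}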



To check whether the algorithm is affected by the choice of the initial points, the strategy is to take one point from each of the four generators:

            let sampleFirsts = [
                {X=generators.[0].XGauss.Sample();Y=generators.[0].YGauss.Sample()};
                {X=generators.[1].XGauss.Sample();Y=generators.[1].YGauss.Sample()};
                {X=generators.[2].XGauss.Sample();Y=generators.[2].YGauss.Sample()};
                {X=generators.[3].XGauss.Sample();Y=generators.[3].YGauss.Sample()}
            ]
              

I can change the configuration of the sources simply by changing their indexes. For example, in the following case the initial centroids are all taken from the first source:

            let sampleFirsts = [
                {X=generators.[0].XGauss.Sample();Y=generators.[0].YGauss.Sample()};
                {X=generators.[0].XGauss.Sample();Y=generators.[0].YGauss.Sample()};
                {X=generators.[0].XGauss.Sample();Y=generators.[0].YGauss.Sample()};
                {X=generators.[0].XGauss.Sample();Y=generators.[0].YGauss.Sample()}
            ]


To sample one thousand points from the different sources (with the first four serving as the initial centroids, i.e. using sampleFirsts):

            let sample = sampleFirsts @ (List.map (fun x -> {X=x.XGauss.Sample();Y=x.YGauss.Sample()}) (List.map (fun x -> generators.[x % 4 ]) [0..995]))


            let initialCentroids = [sample.[0]; sample.[1]; sample.[2];sample.[3]]

            let finalCentroids = iterateFindingFinalCentroids sample initialCentroids



Actually the code returns the centroids (not the segments), and not the "variability" (dispersion) of the segments either, but these can be calculated afterwards:

            let segment0 = List.fold (fun acc item -> if closestCentroid item finalCentroids = finalCentroids.[0] then item::acc else acc) [] sample
            let segment1 = List.fold (fun acc item -> if closestCentroid item finalCentroids = finalCentroids.[1] then item::acc else acc) [] sample
            let segment2 = List.fold (fun acc item -> if closestCentroid item finalCentroids = finalCentroids.[2] then item::acc else acc) [] sample
            let segment3 = List.fold (fun acc item -> if closestCentroid item finalCentroids = finalCentroids.[3] then item::acc else acc) [] sample

            let quadraticDist0 = sumOfQuadraticDistances finalCentroids.[0] segment0
            let quadraticDist1 = sumOfQuadraticDistances finalCentroids.[1] segment1
            let quadraticDist2 = sumOfQuadraticDistances finalCentroids.[2] segment2
            let quadraticDist3 = sumOfQuadraticDistances finalCentroids.[3] segment3
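
(The sumOfQuadraticDistances helper is not defined in the post; here is a minimal sketch consistent with how it is used above, i.e. the within-cluster sum of squared distances of a segment's points from a centroid:)

let sumOfQuadraticDistances centroid segment =
    // WCSS contribution of one segment: sum of squared distances from the given centroid
    List.sumBy (fun point -> quadraticDistance centroid point) segment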




The result is that the algorithm is not affected by the different combinations of sources for the initial centroids (which correspond to the first four points of the sample); it still converges to the four expected points, i.e. the centers of the generating Normal (Gaussian) two-dimensional distributions.