19 February 2015

Tidy data set: an "Orange" data mining try out

Dear friends,

This is my first post of 2015. Administrative tasks at the Uni have consumed my blogging hours. The following post is actually a series of my tweets, following up on my earlier tweets about using the Orange data mining software. It's free and open source software based on Python. This amazing point-and-click (pnp) piece is developed and maintained by the Bioinformatics Lab at the University of Ljubljana, Slovenia, in collaboration with the open source community. I use R most of the time, but getting to know this pnp implementation has been interesting since my students are not familiar with programming. So you might want to check it out.

The first three tweets (from my account @dasaptaerwin) were:

1st tweet
Orange data mining app: python-based, free, runs on Linux, Mac, Win. Point-click approach ()

2nd tweet
Orange data mining app: python-based,free,runs on Linux,Mac,Win. Point-click approach ()

3rd tweet
And here's the look of Orange data mining app on my Linux desktop  

I am using Ubuntu 14.04, so it will look like this.


Then @belajargeologi replied to the tweets:

Thank you for this useful information. We hv downloaded it. Would you like to teach us how to use this program?

So then I decided to blog it. 

Orange offers a visual analytics environment. You just drag any of the analyses (or Widgets, as they are called in Orange's language) from the toolbox on the left to the work-space (which Orange calls the Canvas). Then you connect the widgets to one another to produce outputs in a separate sheet (window). You can save any of the resulting charts and plots. If you see a "red minus sign", that means Orange can't run your analysis flow and asks you to review it.


Orange is interactive: it shows directions and clues about what inputs each widget needs and what outputs it will produce. So you have to pay attention to the directions before connecting the widgets. Each node in the above image is a process that takes inputs on its left and sends outputs to the right.

Then I tweeted some of the results of my analysis on water quality, as follows:  

Distance matrix


Hierarchy cluster 

Regression tree


Then my tweets moved on to the data preparation stage, or how to make a tidy data set (#tidydataset).

#tidydataset: start by preparing your data set, containing only variables in columns & cases/observations in rows.

#tidydataset: the data set has to be free from table titles, merged columns and/or rows. So it is ready to be analyzed.

#tidydataset: numbers in the data set must not be mixed w/ strings or characters, e.g. "151.180", "151,180" and "1,511.80" are different.

#tidydataset: use the same coordinate projection system, strike/dip direction and units of measurement throughout.

#tidydataset: you can use text (char) for: sampleid and classifier/category data, e.g. rock type, high-med-low, in the data set. (end)

#tidydataset: left table (not tidy) -> unnecessary formatting; right table (tidy).
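
To illustrate the tweet about numbers mixed with characters, here is a minimal Python sketch (standard library only) that lists the cells a program cannot read as plain numbers. The rows, the `tds` column name and the helper `find_non_numeric` are made up for illustration; they are not part of Orange or of my water quality data.

```python
def find_non_numeric(rows, column):
    """Return (row_index, value) pairs that float() cannot parse."""
    problems = []
    for i, row in enumerate(rows):
        value = row[column]
        try:
            float(value)  # "151.180" parses; "151,180" does not
        except (TypeError, ValueError):
            problems.append((i, value))
    return problems


# Hypothetical rows: the same kind of measurement typed three different ways.
rows = [
    {"sampleid": "S1", "tds": "151.180"},
    {"sampleid": "S2", "tds": "151,180"},
    {"sampleid": "S3", "tds": "1,511.80"},
]

problems = find_non_numeric(rows, "tds")
print(problems)  # rows 1 and 2 need cleaning before analysis
```

Only the first value is read as a number. The other two contain commas, so they would be treated as text until you fix them in the spreadsheet.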

This is an example of a not-tidy data table. You can see how the cells are merged, with too much unnecessary formatting. This table is nice for final visualization but not for data analysis. You can also see that the rows are for variables/measurements and the columns are for cases or observations. The standard format for a data table is cases/observations in rows and variables/measurements in columns. However, you might want to consider another layout when you have "long" format data (read The Wide and Long Data Format for Repeated Measures Data and Is the wide or long format data more efficient? - Stack Overflow).
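
If your table has variables in rows and samples in columns, like the not-tidy example, flipping it is a simple transpose. Below is a rough sketch in plain Python, assuming the table has already been stripped of titles and merged cells. The values, the sample names and the `transpose` helper are hypothetical.

```python
# Hypothetical table: one variable per row, one sample per column.
variables_in_rows = [
    ["variable", "S1", "S2", "S3"],
    ["pH", "7.1", "6.8", "7.4"],
    ["tds", "151.18", "149.2", "160.5"],
]


def transpose(table):
    """Swap rows and columns of a rectangular list-of-lists table."""
    return [list(column) for column in zip(*table)]


tidy = transpose(variables_in_rows)
tidy[0][0] = "sampleid"  # the first column now identifies each case

for row in tidy:
    print(row)
```

After the transpose each row is one sample and each column is one variable, which is the layout described above.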

Not-tidy data table 

Tidy data table

So that's all for now. I might re-visit and add more information. Let me know what you think.
