Would be interesting to compare writing styles, formats, etc. of users on forums. Would be interesting if we could identify users with multiple accounts. Also, it would be interesting to look at the change in style over time per person, to see if an account was hacked or something else out of the ordinary happened. Plot emotions over time. It would be interesting to see who has the emotional sway (can cause others to become emotional, etc.) over a forum or post. I wonder how much work there is in this area. I would imagine a lot, since we have had forums for a long time, and I think this is something Facebook has been (or should be) doing.
JUnit and TestNG
I just finished some code refactoring and a major re-architecting of bounties to make them more amenable to unit tests. Which I, for lack of time, unfortunately didn't write. But now that I have a bit of time, and it's for a class, I'm going to start adding them.
JUnit has always been my go-to unit testing framework. It was the one I was taught. However, I just found a new unit testing framework called TestNG. According to this article, TestNG has more features than JUnit! That link also compares the two and gives examples of how it's used.
So, I’m going to try it out. See how I like it.
Robins and cryptochrome
I was sitting in my chair when I saw a robin out the window, and I wondered whether it could see me (as in, how far can robins see?). I of course googled it. But I got sidetracked and found a Discovery article explaining that robins can see the Earth's magnetic field. According to the article, the effect involves the right eye, the left half of the brain, and a molecule called cryptochrome (the best name for such a molecule, in my opinion). The article discusses an experiment showing that the birds can't orient correctly if the vision in their right eye is obscured. Of course this didn't answer my original question, but it is much cooler.
This reminds me of Gary from Alphas (“respect the badge!”).
Comparison Continued
I took a look at the R source code for the Tukey test (the HSD test) in the agricolae library (http://cran.r-project.org/web/packages/agricolae/index.html). The interesting thing is that they purposefully call round:
round(1-ptukey(abs(dif[k])*sqrt(2)/sdtdif,ntr,DFerror),6)
Also, they assume an lm or aov object. So, we know that they don't care about big numbers. However, the nice thing is that the R file shows how to use the ptukey function, which is where I was a bit iffy. So I think I could get away with writing it in R myself, except instead of requiring lm or aov I would just take the raw data and perform the requisite calculations.
I think I've convinced myself that I'll trust Mathematica and deal with how long it takes to get the data. I don't like how inefficient it is, but I can sacrifice time if it means I get accurate results. I don't think I'm qualified, or have the time, to create such a library. It is interesting, though, that we don't have many papers on how many significant digits are necessary for accurate tests, or that the R libraries don't report the error in their calculations. This must be why people use Mathematica or Matlab: we trust them to carry enough significant figures that the remaining error is insignificant.
So, I just need a nice desktop.
New computer, is it worth it?
Or I could just ask for a computer with more RAM; then I could use slow Mathematica or R or whatever and not care. Of course it might not even be that slow in R, since R isn't so much slow as RAM-hungry, which makes sense given how memory-inefficient it is. That would be so much easier. But a computer like that would probably cost 2k. Crazy that if only they had coded the tests better I could get results instantly!
Then I'd also get the nice pretty graphs Mathematica has, and I wouldn't have the headache of ensuring the tests are correct. It would be terrifying to code the tests wrong and report incorrect results. So, probably best to just get a new computer… I think he said he has money to buy machines.
Handcoded Tukey
So, it seems that if I want a fast Tukey test I need a fast ANOVA, which is where the bottleneck is. If I had time I would code the ANOVA myself and borrow a Tukey implementation in C. I think all of these languages are doing it wrong for what I need: in each of them I have to load all of the numbers, perform the ANOVA, and then do the post hoc Tukey test. What eats the RAM, I think, is that these algorithms keep all of the numbers in memory rather than loading, summing, and releasing, keeping only the running sums. The sums could be accumulated while the file is being read, so you would really have at most n sets of accumulators in RAM (where n is the number of groups), which is tiny. So the ANOVA really should be super fast. Which means that maybe it's the Tukey test itself that takes a while… I don't think so, though; the code for it looks pretty fast. So it must just be poor memory management and large numbers tripping up Mathematica and R, because it seems like it could be a super fast calculation.
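The streaming idea is quick to sketch. This is my own Python, not code from any of these packages; the one-integer-per-line file format is an assumption, and using Fraction keeps the sums of squares exact so the big numbers can't lose digits:

```python
# Streaming one-way ANOVA: accumulate per-group sums while reading each
# file, so only one (count, sum, sum-of-squares) triple per group stays
# in memory instead of the raw data.
from fractions import Fraction  # exact arithmetic; huge squares lose no digits

def group_stats(path):
    """Return (count, sum, sum of squares) for one file of integers,
    one integer per line (an assumed format)."""
    n, s, q = 0, 0, 0
    with open(path) as f:
        for line in f:
            x = int(line)   # Python ints are arbitrary precision
            n += 1
            s += x
            q += x * x
    return n, s, q

def streaming_anova(stats):
    """One-way ANOVA from per-group (n, sum, sumsq) tuples.
    Returns (F statistic, SS_within, error degrees of freedom)."""
    k = len(stats)
    N = sum(n for n, _, _ in stats)
    G = sum(s for _, s, _ in stats)          # grand total
    correction = Fraction(G * G, N)
    ss_total = sum(q for _, _, q in stats) - correction
    ss_between = sum(Fraction(s * s, n) for n, s, _ in stats) - correction
    ss_within = ss_total - ss_between
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (N - k)
    return float(ms_between / ms_within), ss_within, N - k

# Usage (hypothetical file names):
# stats = [group_stats(p) for p in ["group1.txt", "group2.txt"]]
# F, ss_within, df_error = streaming_anova(stats)
```

With ten groups that is ten triples of integers in RAM, however large the files are; MS_within and the error degrees of freedom are exactly the numbers the Tukey step needs next.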
http://www.graphpad.com/support/faqid/1517/ has a link to the C code that R uses for the Tukey test. It uses double values, so I don't think I can just plug and play with it, since the values I have are too large for a double to represent exactly.
large numbers
Well, I've been looking into using R to calculate the Tukey test. I'm doing this because I have 10 files of 200,000 integers each (all around 5000 and above). That is about 200MB per file. So right now the ANOVA with a post hoc Tukey test takes about 1hr to run in Mathematica. THIS IS SO SLOWWWW.
To do a Tukey test you first get the numbers from an ANOVA (http://web.mst.edu/~psyworld/anovaexample.htm), then run each pair through the Tukey formula (http://web.mst.edu/~psyworld/tukeysexample.htm). It really doesn't seem that hard, and it doesn't seem like it should take an hour to run. In Mathematica it takes forever to even load the files.
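For what it's worth, the per-pair arithmetic from those linked examples is tiny. A sketch in my own Python (it assumes equal group sizes, and that q_crit is looked up from a studentized-range table by the caller, since computing that distribution, what R's ptukey does, is the only genuinely hard part):

```python
# Tukey HSD pairwise comparisons from ANOVA outputs (equal group sizes
# assumed). q_crit is the critical studentized-range value for the chosen
# alpha, number of groups, and error df, taken from a table.
import math
from itertools import combinations

def tukey_pairs(means, ms_within, n_per_group, q_crit):
    """Return (i, j, q, significant) for every pair of group means."""
    se = math.sqrt(ms_within / n_per_group)  # standard error of a group mean
    results = []
    for i, j in combinations(range(len(means)), 2):
        q = abs(means[i] - means[j]) / se    # studentized range statistic
        results.append((i, j, q, q > q_crit))
    return results

# Usage with made-up numbers:
# tukey_pairs([5012.3, 5020.1, 5009.8], ms_within=250.0,
#             n_per_group=200000, q_crit=3.31)
```

Everything here is a handful of subtractions and divisions per pair, which is why the hour-long runtime looks like a data-loading problem rather than a math problem.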
So, I've been looking into using R. Loading the numbers into a list is super fast; it takes no time at all. However, running the ANOVA uses so much RAM I can't do it on my computer. I'm starting to think it might be that the numbers get so big after squaring that it drives the memory use. I don't know….
But the fun thing was that in the process I tried out Rmpfr, a library that is supposed to provide arbitrary-precision calculations. However, when I try doing 1246863692^2 using Rmpfr I get 1554669066427870976, which is wrong! It should be 1554669066427870864, which I triple-checked by hand, with Wolfram, and with a scientific calculator! So I'm not going to use that package.
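As a sanity check in Python (my own aside, not part of the R experiments): the wrong answer above is exactly the 53-bit double rounding of the true product, which suggests the value got squeezed through a double somewhere along the way rather than the multiplication itself being broken:

```python
# Python ints are arbitrary precision, so the square comes out exact;
# routing the same numbers through 64-bit floats reproduces the bad value.
x = 1246863692

exact = x * x                          # exact integer arithmetic
as_double = int(float(x) * float(x))   # IEEE double multiply, then truncate

print(exact)      # 1554669066427870864  (the correct square)
print(as_double)  # 1554669066427870976  (the value Rmpfr reported)
```

Doubles near 1.5e18 are spaced 256 apart, so the true square simply isn't representable and gets rounded to the nearest neighbor, 112 away.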
Also, I tried http://cran.r-project.org/web/packages/Brobdingnag/vignettes/brobpaper.pdf. Which also doesn’t do what I want!
So, I found gmp which works!!!
Touchscreen for touchpads
Would be neat to have a color touchscreen that acted as the touchpad for a laptop. It could make gestures easier, let you see your clipboard, or show your password there instead of on your monitor (harder for someone to observe that way). I'm sure there would be a lot of superficial and gimmicky things you could do with it too.
Man vs Machine
Some people (not me) believe that we are the result of random chemical interactions. I wonder what probability they assign to us existing, because it has to be less than the probability of machine men existing. The existence of such machines would fit better with the notion of evolving randomly by fitness, since they would be more durable and require a smaller variety of resources. The number of microorganisms and organs and whatnot that must work together for humans to function, compared to what a machine man would need, would most likely astound us.
http://learn.genetics.utah.edu/content/cells/scale/ really neat slider app that puts the size of a cell into perspective.
On a side note I found this neat PopSci article, http://www.popsci.com/have-we-found-alien-life, haven’t read it yet but it looks interesting.
Just a note: I've been sick for the past few days, most likely with the flu. That's probably why I've been writing such odd posts… Oh well.
Bounty Hunting AAMAS
Bounty Hunting and Multiagent Task Allocation, a paper I co-authored with David and my professor, was accepted into AAMAS as a full paper! This is the first paper I have published where I'm the first author, and it's at a respected AI conference. I'll post a link to the full paper once everything is finalized. It's very exciting. I get to go to Istanbul in May to present the paper, where I'll meet and talk to all kinds of people who are interested in the same things I am, and hopefully get some good ideas and advice.