Category Archives: Programming

Code Monkey Monday- The Meme Generator for R

News you can use: There is a meme generator package in R. Because that’s what we all need.

Check it out here. You can use built-in meme photos on the web at sites like http://memecaptain.com/ or you can link to your own image of interest. Here are some of my creations:

Me coding:
[Meme image: code poorly]

In honor of the World Cup:
[Meme image: whineabout]

And one for Maria:
[Meme image: sneakattack]

Theory Tuesday- Statistics’ Place in Big Data

Interesting, but long, talk about statistics’ place in the Big Data world:

I’d suggest watching from about 10 minutes in to about 40 minutes.

“Statistics”, “data mining”, and “bioinformatics” are all on the decline according to Google Trends, while “Big Data” is booming. Many big data people don’t see the need for statisticians because of their seemingly antiquated/belligerent/unhelpful opinions on model validity, result confidence, and experiment design. However, people who ignore statistics are condemned to re-create statistics.

In my experience, the people who don’t see value in statistics are action-oriented and typically mathematically ignorant. These people want to do something, and they are not especially interested in how accurate their actions are. More responsible big data teams will be built from people who together cover three skill sets: programming, math/statistics, and domain knowledge.

Code Monkey Monday- Setting Up Self-Version Control

I program a lot. Mostly by myself. Sometimes at the office and sometimes at home. Last week, I suggested you set up Dropbox to store your personal files as you move from home to work. This week, I’ll show my solution to version control on these programming files.

While I do program by myself most of the time, version control keeps me from being an idiot and messing up my projects. It saves versions every time I “check in” my code and allows me to revert to a previous version if something goes wrong. I learned version control with Subversion, so I will be using that here. I know Git and Mercurial are also popular, so you may want to check out those instead.

I installed TortoiseSVN, a Windows client for Subversion, on both my work computer and personal laptop. This allows me to check in files from Windows Explorer by right-clicking on them. But first, we must set up a repository. Go into your Dropbox folder and create a folder titled “Repo”. Right-click this new folder and click TortoiseSVN->Create Repository Here.

Also in Dropbox, you are going to check out your code twice: once from your personal computer and once from your work computer. I will explain why we check out two sets of the code in a couple of paragraphs. Create a folder in Dropbox called “Repo-Checked out from home” (or work if you’re at work). Right-click on this folder and select SVN Checkout. The URL of your repository will be something like “file:///C:/Users/computerName/Dropbox/Repo”. Do a Fully Recursive checkout from the HEAD revision. Do the same thing at both work and home, adjusting the last word of the folder name accordingly.

This will set up your checked out code folders to have sub-folders “branches”, “tags”, and “trunk”. I’m working by myself, so I store all my code in the “trunk” sub-folder. Create a folder in “trunk” for each project and store your code in there. When you want to check in, right-click on the project folder and select “SVN Commit”. This will send your code to the Repo. When you next change computers, you’ll need to right-click on the “Repo-Checked out from home/work” folder for your location and select “SVN Update”. This will update the files in that checked-out folder with your work from the other location. Always update/commit from the correct folder on the correct computer.

I use two checked-out copies like this because I use a development environment (Eclipse) that makes you select your workspace. I select one of the checked out code folders for home and one for work. If I ever lock my computer with Eclipse still running, I cannot select that same folder to be my workspace in Eclipse from another computer. So I can’t have both computers using the same checked out code folder as workspace if I plan on ever leaving Eclipse open when I travel. Having two folders, one for work and one for home, gets around this issue.

Theory Thursday- Simulation of a Poisson Process

You are using discrete-event simulation to analyze a process or system. Imagine that your arrivals occur according to a Poisson process. This post shows how to code up a Poisson arrival process without using specialized simulation software. Y_n will represent the nth arrival time.

Stationary arrivals:
If your arrival rate does not vary over time, then your task is easy. You have two main options:
1. Generate exponential random variables X_1, X_2, … with rate \lambda, the arrival rate (so each has mean 1/\lambda), representing interarrival times. Set Y_1 = X_1, Y_2 = Y_1+X_2, Y_3 = Y_2+X_3, … Stop generating random variables when Y_m>T for some m, where T is the length of the time interval that you want to simulate, and discard Y_m since it falls outside the interval.
2. Generate a Poisson random variable N(T) with mean \lambda T, representing the number of arrivals over a time interval of length T. Then generate N(T) uniform[0,T] random variables, representing arrival times. Sort the arrival times in ascending order to obtain Y_1, Y_2, …
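
The two stationary options above can be sketched in plain Python roughly as follows. This is a minimal sketch using only the standard library, not a definitive implementation. The function names, the `rate` and `horizon` parameters (standing in for \lambda and T), and the example values are all assumptions made for illustration. Because the standard library has no Poisson sampler, option 2 uses Knuth's multiplication method, which is only suitable for moderate values of \lambda T.

```python
import math
import random


def arrivals_from_interarrivals(rate, horizon, rng):
    """Option 1: add up exponential interarrival times until we pass T."""
    arrivals = []
    t = rng.expovariate(rate)          # Y_1 = X_1
    while t <= horizon:
        arrivals.append(t)             # keep Y_n while it is inside [0, T]
        t += rng.expovariate(rate)     # Y_{n+1} = Y_n + X_{n+1}
    return arrivals


def poisson_variate(mean, rng):
    """Sample a Poisson count with Knuth's method (assumes a moderate mean)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def arrivals_from_count(rate, horizon, rng):
    """Option 2: draw N(T), then sort N(T) uniform arrival times on [0, T]."""
    n = poisson_variate(rate * horizon, rng)
    return sorted(rng.uniform(0.0, horizon) for _ in range(n))


# Hypothetical example: 2 arrivals per time unit over 10 time units.
rng = random.Random(42)
print(arrivals_from_interarrivals(2.0, 10.0, rng)[:3])
print(arrivals_from_count(2.0, 10.0, rng)[:3])
```

Both functions return the sorted arrival times Y_1, Y_2, … that fall in [0, T]. Over many runs, the average number of arrivals from either one should be close to \lambda T.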

Non-stationary arrivals:
If your arrival rate varies over time, you’ll need an extra step or two of logic to generate your arrival times. Here are two options for simulation:
1. Find the highest arrival rate in your time interval, \lambda_{max}. Generate a stationary Poisson process with rate \lambda_{max} as in option 1 in the stationary arrival section above. For each arrival simulated, generate a uniform[0,1] random variable. If this uniform random variable is less than \lambda(t)/\lambda_{max}, where \lambda(t) is the arrival rate at the generated arrival time, accept the arrival as “real” and keep it. If not, discard the arrival. The “real” arrivals make up an accurate non-stationary Poisson arrival process.
2. Divide the time period [0,T] into small time increments of size dt. For each time increment, generate a uniform[0,1] random variable. If the uniform variable is less than or equal to \lambda(t) dt, then an arrival occurs during that increment. Assign the arrival time to be a uniformly random time in the increment. This method is only approximately correct, since it allows at most one arrival per increment and needs dt small enough that \lambda(t) dt is well below 1, but it is good enough in most cases and may be faster to simulate in certain cases.
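
The two non-stationary options above might look something like this in plain Python. Treat it as a sketch under stated assumptions: the function names, the `example_rate` function, and the chosen values of \lambda_{max}, T, and dt are all made up for illustration. The time-step version also assumes that dt divides T evenly and evaluates \lambda(t) at the start of each increment.

```python
import math
import random


def thinning_arrivals(rate_fn, rate_max, horizon, rng):
    """Option 1: simulate at rate lambda_max, keep each arrival
    with probability lambda(t) / lambda_max."""
    arrivals = []
    t = rng.expovariate(rate_max)
    while t <= horizon:
        if rng.random() < rate_fn(t) / rate_max:
            arrivals.append(t)         # accepted as a "real" arrival
        t += rng.expovariate(rate_max)
    return arrivals


def time_step_arrivals(rate_fn, horizon, dt, rng):
    """Option 2 (approximate): at most one arrival per increment of size dt."""
    arrivals = []
    steps = int(round(horizon / dt))   # assumes dt divides the horizon
    for i in range(steps):
        start = i * dt
        if rng.random() <= rate_fn(start) * dt:
            arrivals.append(start + rng.random() * dt)
    return arrivals


def example_rate(t):
    """Hypothetical arrival rate that oscillates between 0.5 and 3.5."""
    return 2.0 + 1.5 * math.sin(t)


rng = random.Random(7)
print(len(thinning_arrivals(example_rate, 3.5, 20.0, rng)))
print(len(time_step_arrivals(example_rate, 20.0, 0.01, rng)))
```

For thinning, `rate_max` must be at least as large as \lambda(t) everywhere on [0,T], or the acceptance probability is no longer valid. For the time-step version, a smaller dt gives a better approximation at the cost of more random draws.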

There are other options, but these will get you started in simulating your Poisson processes.

Code Monkey Monday- Dropbox for moving files between home and work

I got frustrated by the unreliability of IU’s shared drive in the fall. I also got tired of forgetting files at home or work when I needed them at the other location. So I took the plunge and set up a Dropbox account. I can now store all of my files on Dropbox and not worry about bringing them with me every time I go home. I downloaded the desktop application to my work computer and personal laptop, which automatically syncs any updates and keeps me from having to go to dropbox.com every time I want to access a file. This works well for personal and homework files; I doubt it would be advisable for anything proprietary or confidential.

Next week, I’ll show how I set up version control in my Dropbox so that I have version control on all my programming files.

Code Monkey Monday- LaTeX for WordPress

If you like to write math and run a blog with WordPress, it may be useful to know how to use LaTeX in your posts. First question: do you host your own blog and use WordPress.org as the formatting system or does WordPress.com host your blog?

Self-hosting with WordPress.org formatting:

This is what I use for my site. Go to Plugins and search for LaTeX. Install the plugin WP LaTeX. You can now add equations and math to your posts by typing “latex mathy-LaTeX-code” with $ instead of quotes. So, for example, “latex e=mc^2” would become e=mc^2 if I had used the dollar signs instead of the quotes on the outside of the expression. To allow similar expressions to be used in your comments, go to the Plugin settings and enable the comments parsing.
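
For concreteness, here is what the raw text typed into the post editor looks like with the dollar signs in place. The first line is the example from the paragraph above; the second is just an extra illustrative expression, not one taken from a real post.

```latex
$latex e=mc^2$
$latex \int_0^T \lambda(t)\,dt$
```

Everything between the word latex and the closing dollar sign is treated as ordinary LaTeX math.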

Blog hosted by WordPress.com:

You don’t need to install any plugins. Just type LaTeX as described above and it will auto-parse.

Code Monkey Monday- Other Versions for Python Packages

If you program in Python on Windows, you know that there are a variety of versions of Python. Version 2.7 is still the standard, but the 3.x versions are probably getting better. You also need to worry about whether you are using 32-bit or 64-bit Python. Whenever you want to install a library, you need to be sure to get the build of the library that matches your Python version. That’s impossible half the time if you rely on the standard package download site, the Python Package Index (PyPI). The developers of the library may or may not offer a build for your version. If they don’t use your version themselves and can’t be troubled with testing every version, then they won’t offer it.

The best work-around that I’ve found is a site by Christoph Gohlke. He offers unofficial Windows binaries for Python packages. There, you will typically find the version of the package that matches your Python version. Usually 32-bit and 64-bit options are available for recent releases of Python. Be sure to check there if you ever need a hard-to-find version of a package. It’s way easier than building it yourself.
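
If you are not sure which build you need, a short script like the one below can report your Python version and whether the interpreter is 32-bit or 64-bit. It is a minimal sketch using only the standard library. The function name `describe_interpreter` is made up for this example, and the pointer-size check is one common way to detect bitness, not the only one.

```python
import platform
import struct
import sys


def describe_interpreter():
    """Return (major.minor version string, interpreter bitness)."""
    version = "{}.{}".format(*sys.version_info[:2])
    bits = struct.calcsize("P") * 8    # size of a C pointer, in bits
    return version, bits


version, bits = describe_interpreter()
print("Python {} ({}-bit) on {}".format(version, bits, platform.system()))
```

Note that the bitness reported is that of the Python interpreter, which is what the package has to match. A 32-bit Python can be installed on 64-bit Windows.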

Code Monkey Monday- Eclipse and PyDev

If you program in Python, you’ll eventually want to use a development environment. Eclipse is a well-known IDE (integrated development environment) for Java programming. You can co-opt Eclipse to use with Python by using PyDev.

Download Eclipse, saving it to your desktop. I don’t install it to my C: drive or anything; I’m not even sure if you can do that. Download PyDev. Drag the folders “features” and “plugins” to your Eclipse folder, thus merging the contents of the PyDev folders and the Eclipse folders. Open Eclipse and select a working directory. In Eclipse, go to Window->Open Perspective and PyDev should be listed there. Open it up and now you’ll be able to program in Python in Eclipse and run Python programs in Eclipse.

Reds Caravan Reveals Team Employs 3 Programmers

The Reds Caravan rolled through Bloomington yesterday. This edition of the traveling sideshow featured Marty Brennaman, Eric Davis, Assistant GM Bob Miller, new guy Brayan Pena, minor leaguer Tucker Barnhart, and Big Red Machine glue-guy Doug Flynn. After Marty introduced everyone, he opened up the floor to questions. I asked Bob Miller about the status of the Reds’ Analytics efforts. He tried to convey the vast amount of data that the team collects, including over 90 data points for every pitch thrown. The Director of Baseball Research/Analysis is Sam Grossman, who heads a team of three programmers. The team also employs over 20 scouts, who are especially necessary for understanding high school and foreign talent where the data on the player’s performance is sparser or nonexistent.

While I appreciate the honest and helpful answer from Mr. Miller, I wonder whether having three people doing analytics for a team that is going to spend $100M+ each year on player payroll is enough. Do other teams have more analytics professionals? The Reds, under the Dusty Baker regime, tended to ignore a lot of largely accepted analytics wisdom:
-Baker consistently batted his shortstop in the top 2 spots in the order, despite the Reds not having an above-average bat playing shortstop
-The Reds left Aroldis Chapman, one of the most dominant pitchers in baseball, to languish in the closer role for 2 years, where he pitched a total of 135 innings, having a minimal effect on the game. Mike Leake, the Reds’ 5th starter, registered 371 innings in those 2 years.
-Baker wanted his hitters to be aggressive at the plate, which lowered their walk rate, sometimes to comical levels. Getting on base is important, and walks are a way to get on base.

I’d like to see the Reds become more cutting-edge in accepting data-driven wisdom that will improve their team’s performance. As a skilled analytics developer, it’s frustrating for me to see my team frequently mocked by those individuals who work full-time in baseball analytics. Maybe they’ll hire me as a consultant. I can fix them.

Maria wrote a wrap-up of the Bloomington Caravan stop for Redleg Nation. You should check it out here!