Code Monkey, by Jonathan Coulton
Very catchy. Very appropriate. Very much the anthem to my Monday morning posts.
Code Monkey, by Jonathan Coulton
Very catchy. Very appropriate. Very much the anthem to my Monday morning posts.
Why are you using Access? It’s a pain in the butt and I don’t like it. Transfer your database to MySQL for easier compatibility with… everything.
I had to make this transition a few weeks ago, and here’s what I did. Remember that I am not a database expert, but this worked for me.
1. Download MySQL on your computer. Click on the Enterprise Edition link here. Oracle apparently controls MySQL nowadays, so sign in or register with them to get to the download. You can insert fake data in every data field for the registration except the email address. Use a real email address so you can confirm your registration. Once you get to the download page in Oracle, select MySQL and your computer type (i.e. Windows 64 bit). Input that query, then download the “MySQL Installer 22.214.171.124 Package” (or whatever version they’re currently on).
2. Run the installer .msi. Install the “Developer Default” setting unless you know any better. When you start installing, MySQL will probably tell you that you’re missing some pre-requisite programs. Just install them through MySQL like it wants you to. Install the program. After the install, there will be some configuration. On the “MySQL Server Configuration” page, I’d leave the defaults (Port=3306). On the next page, you’ll have to specify a password. Remember that password for future use. I’d leave the defaults on the other pages as well, unless you want to uncheck the box for starting MySQL Server at computer startup.
3. Download the “Access to MySQL” tool from here. The download link is at the top of the page.
4. Install the Access to MySQL program.
5. If your computer is like mine, when you try to run the Access to MySQL program, a box will pop up and complain about drivers. It will say you can only create dump files instead of doing the transition in one step. Whatever, just continue.
6. Click “Next” on the first page. In the Filename field of the second page, input the path to the .mdb file of your Access database. If you have security settings (username/password), enter those. On the third page, create a dump file (probably your only option). Type the name that you want the MySQL database to have and the path that you want the dump file saved at. Click “Next” or “Run” on any remaining pages. This will create a dump file of all of your tables in the Access database that you specified. Now we’ll import that dump file into MySQL.
7. Open MySQL Workbench. Double click on your “Local instance MySQL56” box on the front page. Enter your password that you just specified during installation.
8. Goto Server->Data Import. Select the bubble for “Import from Self-Contained File” and enter the path to your dump file. Click Start Import at the bottom.
9. If all went well, your Access database will now be available in MySQL under the database name that you specified in step 6.
You can now query your database from Python using my instructions from last week.
1. Set up a MySQL Database on your computer.
2. Install the Python library MySQL-python from http://www.lfd.uci.edu/~gohlke/pythonlibs/
3. In your code, add the import statement “import MySQLdb” at the top.
4. Find the database host, username, password, and database of interest.
5. Set up a database connection in Python with a line similar to
“db_connection = MySQLdb.connect(host=’localhost’, user=’root’, passwd=’FluffyBunnies1234′, db=’MyAwesomeDatabase’)”
Here, my host is “localhost” because the database is stored locally on my desktop at work. I access the database from the “root” user account. You can also connect to external databases as long as you frame the connection string properly.
6. Set up a database cursor that you will use to execute queries. “cursor = db_connection.cursor()”
7. Structure up your SQL query. The query should be a string. All white space is treated equivalently, so you can have a space or a newline between parts of the SQL. Example “SQL = SELECT alpha, beta FROM parameter_table WHERE alpha>.5 ORDER BY beta ASC”.
8. Execute the query with the cursor. “cursor.execute(SQL)”
9. Fetch the output. “output = cursor.fetchall()”. If your query will return too many results to grab all at once, you can instead use the function fetchone() instead.
10. Your “output” variable will be a list of lists, where each inner list represents one line returned by query and each element of the inner list represents a column referenced in the SELECT statement. If my query above returned 500 parameter combinations, then I would have a list of 500 lists, where each of the 500 inner lists has two elements: alpha and beta.
Hope this gets you started. For a full documentation of the MySQLdb class, you could start here. Pay attention to make sure you’re reading the parts about the MySQLdb wrapper.
If you like to read data in from text files or save lots of data to text files, you’ve probably discovered that Window’s default text reading program, Notepad, sucks. It can’t open large (MB or larger) text files, sucks at formatting, and seems slow. I prefer the freely downloaded program Notepad++. It can handle large files with ease, allows multiple text files to be open in the same program, highlights all equivalent words if a word is highlighted, gives line numbers, and probably has a ton of other capabilities that I’m not familiar with. You should upgrade to Notepad++.
You want to include a little bit of dark magic in your Python coding, right? Well, read on.
More times than I care to admit, I have situations in which I want my code to write more code for me. Because either I am lazy or I need to be tricky. A recent example: I needed to create and populate a list of lists for each week of the NFL season to contain some of my analysis results. With normal coding, I saw two options: copy paste my code 21 times (17 regular season weeks and 4 playoff weeks) or create a confusing list of list of lists of results. Because the analysis was consistent over the weeks, there was a third option that I preferred: using the exec() and eval() statements.
exec(string) takes a string and executes it like it was a full line of code. So exec(“list”+str(week)+”=”) would execute different things depending on the value of week:
If week==1, the line would be: list1=
If week==2, the line would be: list2=
etc. It basically substitutes the value of week into the string and then executes what the string says.
By cycling through the weeks of the season, I created an empty list for each week using exec:
for week in weeks:
2 lines of code to create 21 lists instead of 21.
I later populated the list with something like: exec(“list”+str(week)+”.append(results)”)
The eval(string) turns the string into code and then allows it to be used in a larger code statement. For example: output+=eval(“list”+week+”[i]”)
This creates the code line
if week==1. In this way, I can add the i’th component of each week’s list to some output string that I saved.
There are more exotic uses of exec() and eval() that I’ve used in the past. Someday I’ll write about how we created code to parse and evaluate any Google Protocol Buffer (kind of like xml) at Booz Allen.
Disclaimer: exec() and eval() may not be “good practice” for a professional coder, but I’m not a professional coder and just create code that works for me.
Quick tip: When you use the RAND function in Excel to generate a random number, the number tends to update everytime you change something in Excel. To stop the auto-updating, copy the random number(s), and then re-paste them into the same cell(s) using Paste Special -> Values. This will save the random value and get rid of the auto-updating random function.
I worked on a project in a Complex Systems class where I wanted to know if there was any value in looking at the network of at-bats in baseball. To create this network, I assumed that if a batter got on base via hit, HBP, or walk, the batter won the at-bat and I drew a link from the pitcher to the batter. If the pitcher got the batter out, I said the pitcher won the at-bat and drew a link from the batter to the pitcher in the network. This created a directed graph that I could run networked statistics on, such as PageRank. I wanted to know how well the rank of a player in summary statistics (ERA, AVG, OBP, WAR, etc) matched up with the rank of the player in PageRank. PageRank, in this context, puts value upon beating other players with high PageRank. So, when playing the Dodgers, getting a hit off of Clayton Kershaw last year was worth more than getting a hit off of Chris Capuano. Did good players overperform or underperform against other good players? Does this have value for predicting playoff success? These were some of my questions as I started my study.
To get results of batter/pitcher matchups, I crawled Baseball-Reference.com. Their Play Index Tool lets you look up the results of any players’ batting/pitching matchups, possibly filtered by year. I wanted to download every at-bat for the year 2013.
To begin with, I’m not sure Baseball-Reference.com wanted me to crawl their records. They have disclaimers against this sort of bulk downloading, but I was using the data for a personal project and didn’t profit from it, so I went ahead. They didn’t kick off my IP as I went about crawling/downloading these matchups.
My code for this project is in Python. I used the screen-scraping package Beautiful Soup.
I first had to grab the usernames for all players in the majors in 2013. I went to this page to get the batters and this page to get the pitchers. Looking at the page source for the pitching page, you notice that the usernames start around line 1727. Download the page source for those pages and use some logic to grab all the usernames for pitchers and batters. Here is my ugly code to parse the usernames.
Once you have the usernames, you’ll want to crawl Baseball-Reference.com to get matchup data from every batter. Unfortunately, a batter’s matchup data (like this for Barry Larkin) creates the same page source whether you filter by year or not. Filtering by year only dynamically changes what is shown on the screen; it doesn’t change the page source, which is what we are going to crawl. So we have to use three steps to get only 2013 data:
-Parse the page source for a batter’s alltime matchups to see which pitchers he ever faced
-For each pitcher, see if that pitcher is in the list of 2013 pitchers
-If it is, crawl ‘http://www.baseball-reference.com/play-index/batter_vs_pitcher.cgi?batter=’+batter+’&pitcher=’+pitcher to get the line related to 2013. Add this line to your statistics that you are keeping.
Here is my Python code to download all at-bats from 2013. You’ll notice that I import urlopen from urllib2 to tell Python to open the webpages of interest. Then I use Beautiful Soup to parse the page source. Throughout the code, I added in lines like “time.sleep(random.random()*10)” from the time package to make the code delay a random amount of time. This kept me from overloading Baseball-Reference.com with requests and hopefully kept me from pissing them off. If you’re interested in using the code, note that you’ll obviously need to change your input/output folders to match your computer.
Hope this helps. I know it’s not 100% complete in its description, but post in the comments if you’re confused in some way.
News you can use: There is a meme generator package in R. Because that’s what we all need.
Check it out here. You can use built-in meme photos on the web at sites like http://memecaptain.com/ or you can link to your own image of interest. Here are some of my creations:
Interesting, but long, talk about statistics place in the Big Data world:
I’d suggest watching from about 10 minutes in to about 40 minutes.
“Statistics”, “data mining”, and “bioinformatics” are all on the decline according to Google Trends, while “Big Data” is booming. Many big data people don’t see the need for statisticians because of their seemingly antiquated/belligerent/unhelpful opinions on model validity, result confidence, and experiment design. However, people who ignore statistics are condemned to re-create statistics.
In my experience, the people who don’t see value in statistics are action-oriented and typically mathematically-ignorant. These people want to do something, and they are not especially interested in how accurate their actions are. More responsible big data teams will be built with people with three skill sets: programming, math/statistics, and domain knowledge.
I program a lot. Mostly by myself. Sometimes at the office and sometimes at home. Last week, I suggested you set up Dropbox to store your personal files as you move from home to work. This week, I’ll show my solution to version control on these programming files.
While I do program by myself most of the time, version control keeps me from being an idiot and messing up my projects. It saves versions every time I “check in” my code and allows me to revert to a previous version if something goes wrong. I learned version control with Subversion, so I will be using that here. I know Git and Mercurial are also popular, so you may want to check out those instead.
I downloaded Subversion to both my work computer and personal laptop. This allows me to check in files from Windows Explorer by right-clicking on them. But first, we must set up a repository. Go into your Dropbox folder and create a folder titled “Repo”. Right-click this new folder and click TortoiseSVN->Create Repository Here.
Also in Dropbox, you are going to check out your code twice. Once from your personal computer and once from your work computer. I will explain why we check out two sets of the code in a couple paragraphs. Create a folder in Dropbox called “Repo-Checked out from home” (or work if you’re at work). Right-click on this folder and select SVNCheckout. The URL of your repository will be something like “file:///C:/Users/computerName/Dropbox/Repo”. Do a Fully Recursive checkout from the HEAD revision. Do the same thing at both work and home, adjusting the last word of the folder name accordingly.
This will set up your checked out code folders to have sub-folders “branches”, “tags”, and “trunk”. I’m working by myself, so I store all my code in the “trunk” sub-folder. Create a folder in “trunk” for each project and store your code in there. When you want to check in, right-click on the project folder and select “SVN Commit”. This will send your code to the Repo. When you next change computers, you’ll need to right click on the “Repo-Checked out from home/work” folder of your location and select “SVN Update”. This will update the files in this other repository with your work from the other location. Always update/commit from the correct folder on the correct computer.
I use two repositories like this because I use a development environment (Eclipse) that makes you select your workspace. I select one of the checked out code folders for home and one for work. If I ever lock my computer with Eclipse still running, I cannot select that same folder to be my workspace in Eclipse from another computer. So I can’t have both computers using the same checked out code folder as workspace if I plan on ever leaving Eclipse open when I travel. Having two folders, one for work and one for home, gets around this issue.