
Using word clouds to get a bird's-eye view of a large professional programmer's notebook

(Image courtesy of Theen Moy.)

tl;dr: Running a word cloud program on a multi-year professional log is entertaining and useful for a quick understanding of past projects.

Background: My Big-Arse Text File

In 2005 I wrote My Big-Arse Text File - a Poor Man's Wiki+Blog+PIM, where I described the simple setup I use for keeping my professional ProgrammersNotebook, something I've been doing for decades. (If you're a programmer and don't keep one, I suggest you experiment with it. You don't have to go old school Emacs like I did - there are wonderful tools like Evernote.) This practice has been undeniably helpful, especially when combined with separate outline files for individual projects. I use the journal to:

  • Understand where I've been spending my time, including meetings (I use CamelCase to name people and projects, like GraphxEvaluation),
  • track the lifespan of individual projects,
  • appreciate just how much work I've done,
  • record code snippets,
  • log sites/algorithms/programs/tools that are exciting or that might be useful,
  • capture ideas (of course!),
  • save great quotes [1] I come across,
  • record account information (be careful of security, though), and especially
  • provide a master index to other project files (done simply by naming the file or by inserting a faux hypertext link where I put the relevant text in square brackets, e.g., "see [analyze log of failed graphx 5-step path]")

I find that all of this (including Emacs tools like Occur), combined with Mac OS X's Spotlight, is effective.

The problem: Getting a bird's-eye view

In preparing for a projects meeting with my boss, I needed to look over the file to get a bird's-eye view of the last 3 1/2 years. But I couldn't get a higher-level perspective on 2,500 entries in 40K lines of text just by reading through them. So I tried out an idea: Would creating a word cloud from the file help to identify useful patterns? I found the answer is yes; in just a few minutes it can remind you of what you worked on. The one catch was that a single word cloud for the whole file wasn't useful. I had to split the file into chunks to get the right temporal granularity, with 1000-line splits (basically monthly pieces) being about right.
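For anyone who'd rather do the chunking in a script, here is a minimal Python sketch of the same idea. It is not what I ran (I used the split command), and the function name and the 1000-line default are just illustrative choices.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_into_chunks(lines: Iterable[str], chunk_size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive lists of at most chunk_size lines (hypothetical helper)."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Stand-in for a real notebook file: 2,500 numbered lines.
    fake_log = [f"line {i}\n" for i in range(2500)]
    print([len(chunk) for chunk in split_into_chunks(fake_log)])  # [1000, 1000, 500]
```

With a real file you would pass the open file object instead of the fake list, and write each chunk out (or feed it straight to whatever draws the cloud).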


Here are a few examples, followed by the steps I took to generate them.

#1 - A few main projects

Yep - I've been evaluating GraphX with its Scala API for performance on an essential path-based algorithm the lab needs. (I'll share results in a future post.) Two months earlier you'd see basically the same pattern, but with Giraph and its outline text file as the focus (yes, The Apache Software Foundation is amazing), with Giraph, Vertica, and Impala before that.

#2 - Mixture of projects

This one shows a time that was a little more diverse, including a relational data generator tool and some cluster (https://en.wikipedia.org/wiki/Computer_cluster) improvements.

#3 - A single giant focus

I really like Postgres, which has been our go-to SQL RDBMS for some time. During this time period I was writing an SQL data store for our "causal database."

#4 - Layout issues

Python development with TDD (it's how I write code - XP rocks). Obviously the layout was skewed by a single very long line - maybe from some program's output log. The text is too small to read, so I couldn't add it to the stop words file below. Unfortunately the program I used doesn't seem to have a line limit feature.
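One possible workaround, which I did not try, would be to filter out overly long lines before handing a chunk to the word cloud program. Here's a rough Python sketch; the helper name and the 500-character cutoff are assumptions you would tune for your own file.

```python
from typing import Iterable, Iterator


def drop_long_lines(lines: Iterable[str], max_chars: int = 500) -> Iterator[str]:
    """Yield only lines whose length, ignoring the trailing newline, is at most max_chars.

    Hypothetical pre-processing step; the 500-character default is arbitrary.
    """
    for line in lines:
        if len(line.rstrip("\n")) <= max_chars:
            yield line


if __name__ == "__main__":
    sample = ["wrote failing test first\n", "x" * 5000 + "\n", "refactored parser\n"]
    for kept in drop_long_lines(sample):
        print(kept, end="")
```

Dropping the line outright loses whatever words were on it, so truncating instead of dropping might suit some logs better.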


Generating these was straightforward once I found a workable tool, which took about two minutes of searching. I didn't want to use an online one (Wordle is popular) for privacy reasons, but fortunately IBM Word Cloud Generator fit the bill. It's a Java program with reasonable arguments and a config file. I ran split to get the chunks, then ran the jar in a Bash for loop, and that's it. All I had to do was create a stop words file to remove some distracting words ("system", "new", "INFO", etc.).
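If you don't have a word cloud program handy, plain word counting gets you a text-only approximation of the same information. The Python sketch below is a substitute technique, not IBM Word Cloud Generator and not what produced the images above; the function name, the regular expression, and the sample lines are all made up for illustration. It keeps the original capitalization so CamelCase names stay recognizable, and compares lowercased words against the stop list.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

# A word starts with a letter and may contain letters, digits, or apostrophes.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z0-9']*")


def top_words(lines: Iterable[str], stop_words: Set[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent words in lines, skipping stop words (hypothetical helper).

    stop_words must be lowercase; matching against it is case-insensitive.
    """
    counts: Counter = Counter()
    for line in lines:
        for word in WORD_RE.findall(line):
            if word.lower() not in stop_words:
                counts[word] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    stop = {"system", "new", "info"}
    chunk = [
        "INFO new system started",
        "GraphxEvaluation run on GraphX",
        "GraphX path results, GraphX timing",
    ]
    print(top_words(chunk, stop, n=1))  # [('GraphX', 3)]
```

Run over each 1000-line chunk in turn, the top handful of words per chunk gives a rough month-by-month list of what dominated the log.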

What do you think? Have you had to examine your log file for projects? Cheers!

[1] A few quotes

"When you're a student, you're judged by how well you answer questions. But in life, you're judged by how good your questions are." The art of entrepreneurship

(Regarding Heinlein's quote, "Specialization is for insects.") "[companies currently] hire people specialized to know some very narrow system. They want them to come running out of the box. [disappointingly] They want to cast them aside when they’re done. [instead] We need people who can think and change and learn what they need to learn." Masterminds of Programming: Conversations with the Creators of Major Programming Languages (Amazon link).

"Then he told me, very tenderly, that it can be dangerous to believe things just because you want them to be true. You can get tricked if you don’t question yourself and others, especially people in a position of authority. He told me that anything that’s truly real can stand up to scrutiny." Lessons of Immortality and Mortality From My Father, Carl Sagan

There are a ton of thought-provoking programming quotes here, such as these (which are not jokes):

  • When in doubt, use brute force.
  • The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.
  • The essence of XML is this: the problem it solves is not hard, and it does not solve the problem well.
  • Nobody should start to undertake a large project .. start small, and think about the details. Don't think about some big picture and fancy design. If it doesn't solve some fairly immediate need, it's almost certainly over-designed.

Reader Comments (2)

I love it! Emacs & Evernote in the same sentence! :-)

I have 12k notes in Evernote after passing through the eye of the needle in 2013 & digitizing 40 years of my life - I look forward to donating my data to science as it makes sense ;-)

I started a running accomplishment register for 2015 based on a Franklin-Covey compass method I'm experimenting with. I'm clear the note is meant to be grepped.

I'm particularly overdue playing w the Evernote API to see what I can do - have you dived in yourself yet? Be curious to see any utilities you've developed.

A lot of my focus around measures is in the realm of well being & where I am most interested is quantifying beyond the raw data from wearables. I have an note I've been meaning to revisit of a huge brainstorm of possible measures to pull from Evernote if one were to architect one's notes in a very object/outcome-oriented way - still experimenting.

Well, just want to let ya know I'm available as a guinea pig for crazy techniques you conjure up :)
March 15, 2015 | Unregistered Commenter Chinarut

I really like what you describe you're doing in Evernote - a tool you're using for self-improvement. This is highly honorable in my book. I hadn't thought about the API, but I agree it opens up a lot. Glad they have a ton of languages supported, esp. Python, which I've come to really enjoy. Cheers, and thanks for writing!
March 22, 2015 | Registered Commenter Matthew Cornell
