CSP 301, Fall 2012: Design Practices in Computer Science -- Information Visualization
This is an introductory course to initiate undergraduate students into working on large projects in a team, and to get them familiar with various tools that will come in handy in future courses. This time we will do some very interesting work in information visualization. We will start with an object-oriented toolkit called Prefuse, and later in the course we will move to a functional programming toolkit called D3. You can expect to do a lot of programming in Java, JavaScript, and scripting languages such as Perl or Python. You will also have to learn to use a database such as MySQL, and paper-writing tools such as LaTeX.
Thanks to in Bangalore, we will have access to some cool datasets about Indian politics and trade relationships. Here is a peek at several cool visualizations that can be done using these platforms.
Small world networks
Military as a profession
2004 US election choropleth
Week 1: Get set up organizationally
- You will work in teams of 3-4 students. About 8 teams will be assigned to each TA for supervision. The first step will be to form your groups using the following Google doc. TAs include:
- Daniel J. Mathew (lead TA), with Lalchand: Office hours Mon 3-4pm in the SIT lab (Bharti 203)
- Ravee Malla (lead TA), with Minhaj Ahmad: Office hours Mon 3-4pm in the GCL lab
- Sandeep Kumar Bindal (lead TA), with Saurabh Mangal: Office hours Wed 3-4pm in the SIT lab (Bharti 203)
- Rahul Goyal (lead TA), with Himanshu Panwar: Office hours Mon 3-4pm in the GCL lab
- Abhinav (lead TA), with Manjeet Dahiya: Office hours Wed 3-4pm in the GCL lab
- We will use the Piazza platform to exchange notes where you can ask questions, help answer questions for your friends, report any interesting results, etc. Bonus marks will be awarded based on Piazza participation.
- Go to https://piazza.com/
- Click on "Enroll for a class"
- Search for "Indian Institute of Technology Delhi"
- Search for "CSP 301"
- Enroll as a student
- Grading will be done based on the following criteria:
- Assignments
- Completion, innovation & creativity, aesthetics
- Code documentation
- Code cleanliness, including indentation and modularization
- Wiki how-to on installation and execution of code
- Viva & demo: Each group will be questioned on their assignment submissions. Unsatisfactory answers will be rewarded with negative marks. Any evidence of plagiarism or free riding on other team partners will lead to an instant F grade in the course.
- Reports: A report in Latex is compulsory with each assignment
- Attendance: Each group will meet their TA mentor once a week to report progress and discuss ideas
- Participation and collaboration with others in the course via Piazza or otherwise
- Assignments
- The best three submissions for each assignment will also be featured on the course web page, along with the TA mentor who helped the team
Week 2: Get your coding infrastructure in place
- The initial assignments will be in Java, so you need to pick up the language asap if you don't know it already. Please indicate in the Google doc whether any team members in your group do not know Java, so that we can time your progress and grading accordingly. There are plenty of books and online tutorials on Java programming, so once you understand the basics of object oriented design the rest should come easily.
- We will use the Eclipse IDE for development. Download and install it, and write some basic programs in consultation with your TA mentor. But make sure that you also understand Java compilation and execution from the command line, something that will come in handy in the future.
- We will use GitHub as the code repository. GitHub internally uses the Git version control system to allow multiple people to collaborate on a project.
- Here is a good Git tutorial to understand the concepts
- And here is a good tutorial to integrate a Git plugin into Eclipse. You will have to make a free account on GitHub for your team, add team members, download the EGit plugin for Eclipse, and follow the instructions to sync with your GitHub repository directly through Eclipse. Be careful to specify the proxy settings in Eclipse to be able to communicate with GitHub over HTTPS. Also experiment with command line Git using SSH.
- Add your TA mentor to your GitHub repo as well. Remember to code your project well, document it, and maintain a how-to wiki on GitHub -- you will be evaluated on these aspects.
- Finally, download and compile Prefuse. You can compile through the command line or via Eclipse. Also compile the API docs so that you have a local version with you. Run the demo programs.
Week 3 - 4: Assignment 1 -- affiliation networks of political blogs and books
- There are several tutorials on Prefuse but for most tasks, you will have to consult the API documentation
- A modest Prefuse tutorial
- Another Prefuse tutorial
- Paper describing the Prefuse design philosophy: Very important!
- Prefuse manual
- A dataset on purchase patterns of politics-related books on Amazon. The books are nodes, each labeled as left leaning, right leaning, or neutral in its political stance. An edge exists between two books if some people have purchased both. We want you to build a visualization, using the various inbuilt layout algorithms in Prefuse, that shows whether people like to read diverse books that touch upon several different affiliations, or whether they would rather read books that resonate with their own affiliation.
- Hint: Use a visualization algorithm which finds clusters in a graph, so that nodes with interlinkages between them will be placed close together. You can even colour nodes based on the affiliation.
- Once you are done with that, you should visualize a similar dataset on blog affiliations.
- You should state any interesting observations in your report. For example, you can try to derive some graph statistics to quantify the degree of polarization of viewpoints.
- You could do something simple like find the ratio of [the number of edges between nodes of the same type] : [the total number of edges].
- You could even do something more complicated like compare the given graph with an equivalent random graph, where the same number of edges are laid out between pairs of nodes selected randomly. You can do this as follows:
- Create about 30 random graphs. Use the same node definitions but pick (number of edges) pairs of nodes completely at random, i.e. for each graph start with a different seed value, generate a random number between 1 and n to choose the first node, then a second random number between 1 and n-1 to choose the second node from the remaining nodes, and join the two nodes with an edge.
- For each random graph, compute the same ratio as earlier of [the number of edges between nodes of the same affiliation]:[total number of edges]
- Then plot a histogram of these ratios. Does it look like a normal bell curve? Find out where the statistic for the actual blog and books datasets lies -- is it to the left of the peak or to the right, etc?
- You can write additional Java programs to compute these ratios, or even better would be to use Perl or Python scripts. To plot histograms, you could simply use Excel.
- You could also try to understand the degree of clustering in a graph, for example, whether a lot of triads exist where there are edges between nodes A-B, B-C, and C-A. The existence of these triads would indicate that people tend to buy books or reference each other's blogs in clusters. You could then compute a statistic called the clustering coefficient, defined as the ratio of [number of such triads noticed] : [n C 2, i.e. the number of ways in which you can choose 2 nodes out of n = n!/(2!(n-2)!)]. Do the same exercise as before -- find the clustering coefficients for the random graphs you created above, plot a histogram of the clustering coefficients, then see where the statistic for the blog and book datasets lies on this histogram. You could also do things like find the number of triads where (all nodes are of the same affiliation) or (2 nodes are of the same affiliation and 1 is different), etc.
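The ratio and the random-graph comparison described above can be sketched in a short Python script. This is only a minimal sketch under assumptions: the toy graph, the label codes ("l", "r", "n") and the function names are made up for illustration, and the random graphs here reject duplicate edges, which the steps above do not require.

```python
import random

def same_affiliation_ratio(edges, affiliation):
    """Fraction of edges whose two endpoints carry the same affiliation label."""
    same = sum(1 for u, v in edges if affiliation[u] == affiliation[v])
    return same / len(edges)

def random_edges(nodes, num_edges, rng):
    """Choose num_edges distinct node pairs uniformly at random (no self-loops)."""
    chosen = set()
    while len(chosen) < num_edges:
        u, v = rng.sample(nodes, 2)          # two different nodes
        chosen.add((min(u, v), max(u, v)))   # undirected: store each pair one way round
    return sorted(chosen)

# Toy data, made up for illustration: l = left, r = right, n = neutral.
affiliation = {0: "l", 1: "l", 2: "l", 3: "r", 4: "r", 5: "r", 6: "n"}
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (2, 3), (5, 6)]

observed = same_affiliation_ratio(edges, affiliation)

# 30 random graphs on the same nodes, one seed per graph.
nodes = sorted(affiliation)
null_ratios = [
    same_affiliation_ratio(random_edges(nodes, len(edges), random.Random(seed)), affiliation)
    for seed in range(30)
]

above = sum(1 for r in null_ratios if r >= observed)
print(f"observed ratio = {observed:.3f}")
print(f"random graphs with a ratio at least as high: {above} of {len(null_ratios)}")
```

The list null_ratios is what you would feed into the histogram; if very few random graphs reach the observed ratio, the real graph is more polarized than chance would suggest.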
- You could also try to examine more closely the nodes that connect with other nodes of different affiliations. Are these nodes from among those with a conservative affiliation or a liberal affiliation or neutral? Think of similar statistics you could compute here.
- Deadline: Week of August 27th (the specific deadline for each group will be the date they are scheduled to meet their TA mentor). During the evaluation, you should give a demo of your visualizations, submit a hardcopy of your report, and take the TA through your code and documentation and how-to.
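The triad statistic suggested for Assignment 1 can be computed along the following lines. This is a sketch, not a reference solution: the toy edge list and function names are invented, the graph is assumed to have no self-loops, and the normalization follows the definition given above (triads divided by n C 2). Swapping in another denominator is a one-line change.

```python
from itertools import combinations
from math import comb

def count_triads(edges):
    """Count closed triads: node triples A, B, C with edges A-B, B-C and C-A."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    found = 0
    for adjacent in neighbours.values():
        # Two neighbours of the same node that are also linked close a triad.
        for a, b in combinations(adjacent, 2):
            if b in neighbours[a]:
                found += 1
    return found // 3  # every triad is seen once from each of its three corners

def clustering_statistic(edges, num_nodes):
    """Triads divided by n C 2, as defined in the assignment text."""
    return count_triads(edges) / comb(num_nodes, 2)

# Toy graph, made up for illustration: two triangles sharing node 2, plus a tail.
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4), (4, 5)]
triads = count_triads(toy_edges)
statistic = clustering_statistic(toy_edges, num_nodes=6)
print(triads, statistic)
```

Running the same two functions over each of the random graphs gives the values for the second histogram.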
Week 5 - 7: Assignment 2 -- state of the Indian Lok Sabha
- Use the MP Track data from PRS website on the 15th Lok Sabha to show a series of visualizations that capture the state of the Indian Lok Sabha. The data is in Excel and contains several things such as the political party to which the MPs belong, their educational qualifications, age, and most interestingly their activity in parliament -- attendance, debates in which they participated, questions they raised in parliament, etc.
- One idea is to start with making a webpage that contains an applet for summary graphs and visualizations:
- State-wise histogram on the educational qualification of MPs
- State-wise histogram on the age of MPs
- State-wise histogram on the attendance of MPs
- State-wise histogram on the fraction of seats occupied by the leading political party in that state
- The same kind of histograms can also be plotted for major political parties. How will you identify "major" political parties? One idea here is to plot a graph of the size of the political party (according to number of elected MPs) against their rank and take the top 20% maybe. What does the distribution of the party size look like?
- You could make the webpage dynamic by allowing the user to select the state or political party, and replot the applet. If you want to get more adventurous, instead of applets you could even use one of the many JavaScript libraries to plot charts, such as Google Charts, YUI, etc.
- Some more interesting insights may be brought out by mashing the state and political parties together on the same applet visualization. An idea is to plot a map with states on the X axis and political parties on the Y axis, and colour each (x, y) square with a shade depending on the average attendance of MPs for that particular (x, y) pair of state and political party. For clarity, you can think about ordering the states in order of the number of MPs from that state, and ordering the political parties such that the largest parties for each state lie on the x=y line.
- See if there are any areas in the map that stand out from the rest, i.e. whether they are brighter or darker, or whether it all looks random. A bright band along the x=y line could indicate that the largest party from a state typically maintains a good attendance in parliament, a bright area to the left or right of the map could indicate a bias towards the larger or smaller of the states, and vertical or horizontal bands could indicate that a particular state or political party respectively maintains good attendance.
- You are of course open to try other ideas. Instead of squares, you could show circles with their size dependent upon the attendance.
- For greater clarity, you could even add filters to include only some selective political parties in the map. Feel free to use external data such as the coalition parties that clubbed to form the NDA or the UPA, etc.
- Instead of attendance, you could even replot the same map for educational qualifications, or questions asked in parliament, or the age of MPs.
- Keep in mind that the data is in Excel, so you will have to export it to CSV or another format for external processing in Prefuse or your own scripts.
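Once the data is exported to CSV, the state versus party map boils down to one average per (state, party) cell. Here is a minimal sketch of that aggregation. The column names (state, party, attendance) and the sample rows are assumptions made up for illustration; the real export will have its own headers.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows standing in for the exported spreadsheet.
SAMPLE_CSV = """state,party,attendance
StateA,PartyX,0.80
StateA,PartyX,0.60
StateA,PartyY,0.90
StateB,PartyY,0.50
"""

def average_attendance(rows):
    """Map each (state, party) pair to the mean attendance of its MPs."""
    totals = defaultdict(lambda: [0.0, 0])  # (state, party) -> [sum, count]
    for row in rows:
        cell = totals[(row["state"], row["party"])]
        cell[0] += float(row["attendance"])
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

grid = average_attendance(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for (state, party), value in sorted(grid.items()):
    print(state, party, round(value, 2))
```

Each value in grid then becomes the shade (or circle size) of one square in the map; pairs that never occur simply have no entry.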
- In your report, you should mention any interesting trends you noticed. As before, you can then do some interesting statistical tests to demonstrate whether mathematically too you can make the same claims. In statistics, a number of hypothesis tests are available, such as the t-test, the Wald test, etc.
- Hypothesis: MPs above a certain age have low attendance. First define an age threshold and an attendance threshold, maybe based on the averages. Then split the MPs into four groups: {above the age threshold, below the age threshold} X {high attendance, low attendance}. Your intuition holds if MPs are over-represented in the quadrants {high age, low attendance} and {low age, high attendance} compared with the other two quadrants. You can verify this statistically using the Welch t-test, which compares samples with different sizes and different variances, on each pair of samples. You could even show this visually with a scatterplot of age against attendance for each MP.
- Hypothesis: MPs from small states maintain better attendance in parliament. Similar to the above, obtain samples for small states vs large states and do a hypothesis test.
- Hypothesis: MPs from UPA (in power) or NDA (opposition) maintain better attendance.
- Hypothesis: MPs from North India maintain a better attendance than MPs from South India.
- Hypothesis: More educated MPs maintain better attendance.
- Hypothesis: Female MPs are better educated than male MPs.
- Your main goal in the visualization part of the assignment should be to allow exploration and mashup of different parameters together. And the main goal in the statistics part should be to mathematically describe any interesting trends you have spotted in the visualizations.
- Deadline: Week of September 24th. Same procedure as before -- meet your TAs regularly, keep them posted of your progress, during the evaluation be prepared with a report and demo. All group members should meet the TAs.
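The Welch t-test mentioned in the hypotheses for Assignment 2 is simple enough to compute by hand. The sketch below uses only the Python standard library and returns the t statistic and the Welch-Satterthwaite degrees of freedom; the attendance figures are invented and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / sample size).
    se_a = variance(sample_a) / na
    se_b = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df

# Invented attendance fractions for two groups of different sizes.
older_mps = [0.60, 0.70, 0.65, 0.75]
younger_mps = [0.80, 0.85, 0.90, 0.95, 0.88]

t_stat, dof = welch_t(older_mps, younger_mps)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.2f}")
```

To turn t and the degrees of freedom into a p-value you need the t distribution, either from a printed table or from a library; if SciPy is available, scipy.stats.ttest_ind with equal_var=False performs the same test in one call.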
Week 8 - 11: Assignment 3 -- analytics dashboard for a social media website
- In this final assignment, you will work with live data from a hypothetical social media website, and build a web-based analytics dashboard that gives a bird's-eye view of the activity on the website, similar to the features provided by Google Analytics. This hypothetical website has 2,500 users with approximately 65,000 edges between them. They come from interesting places such as Heaven and Hell and Asgard! And they don't sleep or eat or drink, they only gossip with each other on a wide variety of topics!
- The data is as follows.
- You will find a file called log-graph.out on http://10.22.4.33/csp301/ which contains the list of nodes and their mythological locations, and the list of edges between these nodes. The edges are assumed to be undirected. This therefore represents the social network graph of the users.
- On http://10.22.4.33/csp301/ you will also find a new file released each day, called log-comm.00.out, log-comm.01.out, etc. Each of these files contains a log of communication that happened on the social networking site over approximately one week. Each line has the following CSV fields:
- timestamp in milliseconds
- Human readable timestamp
- Edge (pair of nodes) that communicated with each other
- A tag for the communication, with topics such as films, robbery, obama, disney, GI Joe, etc
- Over the duration of the assignment, over 40 such files will be released, one each day, spanning a log of over 4 million communication activities over a simulated period of almost one year.
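Reading one of the communication logs could look like the sketch below. The exact column layout is an assumption: it supposes five comma-separated columns (milliseconds, readable timestamp, first node, second node, tag) and that the millisecond timestamp counts from the Unix epoch. The sample lines are invented, so check the real files before relying on any of this.

```python
import csv
from collections import namedtuple
from datetime import datetime, timezone

Message = namedtuple("Message", "when edge topic")

def parse_log(lines):
    """Yield one Message per non-empty log line."""
    for fields in csv.reader(lines):
        if not fields:
            continue
        millis, _readable, node_a, node_b, topic = (f.strip() for f in fields)
        when = datetime.fromtimestamp(int(millis) / 1000, tz=timezone.utc)
        # Edges are undirected, so store each pair in a fixed order.
        edge = tuple(sorted((node_a, node_b)))
        yield Message(when, edge, topic)

# Invented sample lines in the assumed layout.
SAMPLE_LOG = [
    "1341100800000,2012-07-01 00:00:00,17,42,films",
    "1341100860000,2012-07-01 00:01:00,42,8,obama",
]

messages = list(parse_log(SAMPLE_LOG))
for message in messages:
    print(message.when.date(), message.edge, message.topic)
```

Because parse_log is a generator, you can stream a large file through it line by line and update your summary tables without holding the whole log in memory.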
- You have to build a web analytics dashboard that does at least the following, plus more ideas that you can think of
- Automatically downloads the new log file released each day, and adds the log to the database
- Rather than store the log as such in the database, you should store summary values that capture the characteristics you want to report. Imagine what would happen if instead of 2,500 users there were 2,500,000 users! Think of a suitable schema for this, such as:
- A hierarchical filter that stores the volume of communication each day for the last 7 days, the communication each week for the last 4 weeks, the communication each month for the last 3 months, etc. This hierarchical filter will help you keep track of summary values that are precise for the near past but get imprecise over longer term history
- The communication per day on each topic
- Cluster the social network graph and capture the communication within each cluster, across pairs of clusters, etc
- Communication activity in different locations
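One way to picture the hierarchical filter from the schema ideas above is a cascade of small buffers: days that fall out of the 7-day window are merged into weeks, and weeks that fall out of the 4-week window are merged into months. The sketch below is only an illustration under assumptions: the class names are invented, a "month" is taken to be 4 weeks, and in a real system these buckets would live in database tables rather than in memory.

```python
from collections import deque

class RollingLevel:
    """Keeps the newest `keep` buckets; merges every `group` expired ones."""

    def __init__(self, keep, group):
        self.keep = keep
        self.group = group
        self.buckets = deque()
        self.pending = []  # expired buckets waiting to be merged

    def push(self, value):
        """Add a bucket; return a merged bucket for the next level, or None."""
        self.buckets.append(value)
        if len(self.buckets) > self.keep:
            self.pending.append(self.buckets.popleft())
        if len(self.pending) == self.group:
            merged = sum(self.pending)
            self.pending = []
            return merged
        return None

class HierarchicalCounter:
    """7 daily buckets, then 4 weekly buckets, then 3 monthly buckets."""

    def __init__(self):
        self.days = RollingLevel(keep=7, group=7)
        self.weeks = RollingLevel(keep=4, group=4)   # assumption: 4 weeks per month
        self.months = RollingLevel(keep=3, group=1)
        self.older = 0  # everything that has aged out of the last level

    def add_day(self, count):
        carry = count
        for level in (self.days, self.weeks, self.months):
            carry = level.push(carry)
            if carry is None:
                return
        self.older += carry

# Feed in 100 days with one message per day.
counter = HierarchicalCounter()
for _ in range(100):
    counter.add_day(1)
print(list(counter.days.buckets), list(counter.weeks.buckets), list(counter.months.buckets))
```

The storage stays constant no matter how long the site runs, which is the point of the exercise. Note that buckets sitting in pending are not visible at any level until they are merged; a dashboard could report them as a partial week or month.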
- Display the aggregate communication activity on the social media website, maybe shown as a timeline graph
- Per-topic communication activity, maybe shown as a stacked bar graph to see rising and falling per-topic communication. Or, shown on a social network graph to see the communication spreading over the network.
- Top-10 topics of discussion
- Trending topics, identified as the topics with the highest growth rate
- Social network visualization of the communication activity. You may not be able to visualize all the 2,500 nodes together, but you can try to cluster the nodes and visualize the clusters. Think how you can make the dashboard interactive, by allowing the selection of topics or the selection of locations for which to display the communication activity
- You can optionally even allow time based navigation, to see past activity
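For the top-10 and trending topics in the list above, you need a definition of growth rate. The sketch below takes one possible definition as an assumption: relative change between two consecutive periods, with a small smoothing constant so that brand new topics do not divide by zero. The topic counts and the function name are invented.

```python
from collections import Counter

def trending(previous, current, top_n=3, smoothing=1):
    """Topics ordered by relative growth from the previous period to the current one."""
    growth = {}
    for topic in set(previous) | set(current):
        before = previous.get(topic, 0) + smoothing
        after = current.get(topic, 0) + smoothing
        growth[topic] = (after - before) / before
    # Highest growth first; ties broken alphabetically for a stable order.
    return sorted(growth, key=lambda t: (-growth[t], t))[:top_n]

# Invented per-topic message counts for two consecutive days.
yesterday = {"films": 100, "obama": 50, "disney": 10}
today = {"films": 110, "obama": 40, "disney": 30, "robbery": 5}

top_topics = Counter(today).most_common(2)
hot_topics = trending(yesterday, today)
print("top:", top_topics)
print("trending:", hot_topics)
```

Notice that the most popular topic and the fastest growing topic are different here, which is why the dashboard should show both lists. A larger smoothing value makes the ranking less sensitive to tiny topics.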
- Remember, the most interesting part here is that you are working with a dataset that is continuously growing. Your backend will need to process this incrementally growing data offline, and store summary values that will reduce the online processing required while rendering the dashboard. In your report, you should therefore describe the overall architecture of your dashboard system, and how you split the online/offline processing tasks so that the dashboard renders quickly using pre-processed data.
- You should build this dashboard as a website using platforms such as Django, or using PHP/CGI/JSP scripts that allow interactivity. You already know how to plot graphs with dynamic data, so try to use that knowledge in this assignment.
- The dataset was actually generated artificially using certain underlying models. You should see if you can do the reverse -- figure out the model that was used to generate this data. Some things to check out for:
- Detect clusters in the social network. Then try checking the correlation between nodes in the same cluster and their location -- test a hypothesis that users in the same cluster are likely to be from the same location as well
- Do some pairs of clusters communicate more with each other than with other clusters? You can check this by computing a cluster X cluster matrix, with each cell containing a correlation value = the volume of communication between the two clusters divided by the total cumulative communication of these two clusters
- Is there a large ratio of intra-cluster communication to inter-cluster communication?
- What is the topic popularity distribution? Are some topics very popular while many other topics are less popular?
- Do some clusters like to talk more about some specific topics?
- What is the span of different topics on the social network? Do some topics spread to a large part of the network, while others remain restricted to a few clusters?
- What is the duration of the bulk of activity of different topics?
- Do different topics exhibit different kinds of time profiles, e.g. some peak quickly and decay slowly, some grow slowly, then peak, then decay slowly, etc?
- You should try to develop interesting metrics to quantify these ideas, and then test your hypotheses using the statistical tools you have learned so far.
- You will be evaluated on different aspects: the cleanliness of your architecture and how well it separates quick online rendering from time-consuming offline analysis, the informativeness of your dashboard, and the insights you have drawn from analyzing the data and uncovering the underlying model that generated it.
- Deadline: Week of November 6th. Same procedure as before -- meet your TAs regularly, keep them posted of your progress, during the evaluation be prepared with a report and demo. All group members should meet the TAs.
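The cluster X cluster matrix suggested for Assignment 3 can be sketched as follows. The denominator is one possible reading of "total cumulative communication of these two clusters": here it is the number of messages that touch either cluster, so every value lies between 0 and 1. That reading, the toy cluster assignment and the function name are all assumptions.

```python
from collections import Counter

def cluster_affinity(messages, cluster_of):
    """Share of two clusters' combined traffic that flows between them."""
    between = Counter()    # messages per unordered pair of clusters
    involving = Counter()  # messages touching each cluster
    for a, b in messages:
        ca, cb = cluster_of[a], cluster_of[b]
        between[frozenset((ca, cb))] += 1
        for cluster in {ca, cb}:
            involving[cluster] += 1
    affinity = {}
    for pair, volume in between.items():
        if len(pair) == 2:  # skip intra-cluster traffic
            c1, c2 = pair
            affinity[pair] = volume / (involving[c1] + involving[c2] - volume)
    return affinity

# Invented example: five users in three clusters, five messages.
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
messages = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]

affinity = cluster_affinity(messages, cluster_of)
intra = sum(1 for a, b in messages if cluster_of[a] == cluster_of[b])
inter = len(messages) - intra
for pair, value in affinity.items():
    print(sorted(pair), round(value, 2))
print("intra:", intra, "inter:", inter)
```

The intra and inter counts at the end give the ratio asked about in the list of checks; how you obtain cluster_of in the first place (the clustering step) is left to you.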
Other interesting datasets
- FAO food trade matrix: This is data of the trade of commodities between different countries.
References on information visualization
Home page of Jeffrey Heer (author of Prefuse and D3)
References on introductory statistics