Diving with a view

Part II of my observations from the World Bank Data Dive on poverty and corruption. It might start with the data, but for me the fun is in the analysis, especially visual. I had in fact joined the group fighting corruption because they seemed the most likely to need data exploration and visualization.

Board approvals per month

Below is the result of a long day's work, more or less. I wish I had a graph shining the light of integrity on collusion, coercion or some other evil, but no. Slowed down by data issues, we did not make it that far. I can't say that I'm satisfied with any graph I've done over the weekend but then again, I've done them.

The first one happened while I was idly playing with the project data. By déformation professionnelle, I looked at the number of projects that the Board of Directors had approved per month. With July at the top, it is clear that there is a rush to approve more projects towards the end of the fiscal year, in June.
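For readers who want to reproduce this kind of count, here is a minimal sketch in Python. It is not the code used at the Data Dive: the function name `approvals_per_month` and the sample dates are made up for illustration, and it assumes you already have one approval date per project.

```python
from collections import Counter
from datetime import date
import calendar


def approvals_per_month(approval_dates):
    """Count approvals by calendar month (1-12), pooling all years together."""
    counts = Counter(d.month for d in approval_dates)
    # Return every month, including those with no approvals at all.
    return {month: counts.get(month, 0) for month in range(1, 13)}


# Made-up dates for illustration only; this is NOT World Bank data.
sample = [
    date(2010, 6, 15),
    date(2010, 6, 29),
    date(2011, 6, 30),
    date(2010, 12, 14),
    date(2011, 3, 8),
]

counts = approvals_per_month(sample)
for month, n in counts.items():
    # A crude text bar chart, one row per month.
    print(f"{calendar.month_abbr[month]:>3} {'#' * n} {n}")
```

Pooling all years hides any change over time, which is exactly the limitation that comes up later in this post.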

Is it possible that more cases of corruption happen in projects approved in May and June because the staff takes less time to conduct the due diligence? This question opened the Pandora's box of linking the debarment data and the project data. If we were to find project characteristics that lead to a higher likelihood of corruption, it could orient the preventive work of the integrity team. It was too much to resist and became our undoing as we spent hours trying to recreate that link, leaving available data sets unused.

Trend share WB board approvals

While the true wizards were working on said link, I continued to explore the project data visually. My original graph showed cumulative approvals for 66 years. What if this bunching is an old problem and the Board now approves a constant number of projects per month? I needed a trend.

I'm afraid this is my best effort of the weekend. About 800 data points are visible, with a clear enough message: the trend has worsened over the decades and the Board approves a growing share of projects towards the end of the fiscal year. The months with a larger share have gotten an increasing share, and vice versa. Since the mid-1980s, the share has regularly reached 30% in June. This is about 3.6 times as much as would be expected from an equal distribution per month (1/12 ≈ 8.33%). This finding confirmed that it was still worth exploring the impact of this share of approvals on the due diligence of individual projects. Unfortunately, the data materialized too late and the link was never explored.
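The arithmetic behind that comparison can be sketched as follows. This is an illustration under stated assumptions, not the analysis done at the event: the helper names `fiscal_year` and `monthly_shares` are hypothetical, the dates are invented, and the fiscal year is assumed to run from July to June and to be labelled by the calendar year in which it ends.

```python
from collections import defaultdict
from datetime import date

# Share each month would get if approvals were spread evenly over the year.
UNIFORM_SHARE = 1 / 12


def fiscal_year(d):
    """Assumed convention: the fiscal year ends on 30 June and is labelled
    by the calendar year in which it ends (July 2010 falls in FY2011)."""
    return d.year + 1 if d.month >= 7 else d.year


def monthly_shares(approval_dates):
    """For each fiscal year, return the share of approvals falling in each month."""
    counts = defaultdict(lambda: defaultdict(int))
    for d in approval_dates:
        counts[fiscal_year(d)][d.month] += 1
    shares = {}
    for fy, by_month in counts.items():
        total = sum(by_month.values())
        shares[fy] = {m: by_month.get(m, 0) / total for m in range(1, 13)}
    return shares


# Made-up dates for illustration only: 10 approvals in FY2011, 3 of them in June.
sample = [
    date(2010, 7, 20), date(2010, 9, 14), date(2010, 11, 9), date(2010, 12, 16),
    date(2011, 1, 25), date(2011, 3, 29), date(2011, 5, 24),
    date(2011, 6, 14), date(2011, 6, 21), date(2011, 6, 28),
]

shares = monthly_shares(sample)
june_share = shares[2011][6]
ratio = june_share / UNIFORM_SHARE
print(f"June share: {june_share:.0%}, uniform share: {UNIFORM_SHARE:.2%}, ratio: {ratio:.1f}x")
```

Computing the shares per fiscal year rather than over all years pooled is what turns the single bar chart into a trend.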

We did get an original data set though: the historical list of firms and individuals debarred by the World Bank. I'm afraid I did nothing worth sharing with it: a few bar graphs showing the number of firms and the average number of days of debarment per country. No corruption-fighting histogram in there, no revolutionary radar graph.

Instead, here are two of the most interesting visualizations I've seen. The first one is a network diagram of the bidders on World Bank contracts built by Nick Violi with data that he scraped himself (wow). It draws no conclusion, but it makes me curious. What are these clusters? I don't even know what the colors mean, but I'd like to know why some clusters are all yellow, some are mostly blue and some are mixed. G11 is an interesting node, as it bids on few things but then bids across two clusters. What kind of company can it be? This is the kind of exploratory visualization that makes me want to dive into the data.

Credit: Nick Violi @nvioli

The second is from a team exploring UNDP's resource allocation. In a scatter plot, it compares the overhead with the expenses of, apparently, hundreds of projects. It might look like a Caribbean hurricane to you, but to me the resulting distribution of the data is surprisingly elegant. The two measures have expenses in common, which accounts for the slope pattern. The horizontal cut-off at 1.0 is due to budget limits (or so one hopes). The color overlay provides a nice analytical tool, suggesting to the reader where to look and how to interpret the data. There are a few startling findings already. A surprising number of projects have spent 2-3 times as much on overhead as on operations. Despite the large number of outliers, there is a strong concentration of projects around the target of spending 100% of budget and keeping the overhead low, which suggests good planning and lean implementation.

World Bank DataDive UNDP Capacity & Performance

This graph would benefit from some graphic design flair. The overlay text should be readable and aligned everywhere. The overlay colors could be more visible and helpful. I'd be curious to experiment with empty circles instead of semi-transparent ones. The vertical text could be made horizontal. The light grey frame could be removed.

Knowing the conditions in which these graphs were produced, I wouldn't take the data for granted, nor draw any hard conclusions. But they might inspire a few in-depth analyses. Have a look at a few more on this Tumblr.

Thank you but mostly congratulations to the organizers at the World Bank and DataKind. For an event so open, it is impressive how purposeful it felt. A special thanks to the two data ambassadors of our group, Sisi Wei and Taimur Sajid. I hope that the World Bank, UNDP and other organizers and participants will benefit from the event. I know I did.

MOOC Weeks 5-6: UK Aid to India

For our last assignment of the MOOC, Alberto Cairo decided to give us enough rope to hang ourselves: "do whatever you want". I proceeded to swiftly spend half the allocated time deciding on a topic. Returning to aid, the subject of week 3, was a natural fit and I knew the data would be available. After considering a few generic variations on the themes "where does aid come from" and "where does aid go", I realized I needed an angle. The recent announcement by the UK that it is cutting its aid to India seemed intriguing enough and called for some data. Then, I set as my goal to create one of these long, vertical infographics, but without resorting to some of the misleading and unhelpful techniques that plague too many of them. Let's recap some of the lessons of the first four weeks.

  1. Look for a story in the data.
  2. Convey a narrative.
  3. Use good copy to draw the reader in.
  4. Combine several graphs.
  5. Present the same data in different ways.
  6. Use the appropriate graph for the data.
  7. Pick the color scheme carefully.
  8. Label and include legends.

Here is the result.

UK Aid to India. Francis Gagnon

The story is that it is a big deal that the UK will cut its aid to India and there are many ways to understand the causes and consequences. It is a delicate topic and I did not want to turn the infographic into an editorial. It is rather designed to help the reader think about the issue and maybe open a few new perspectives, especially since some of the actors have strong opinions about this shift.

It starts by showing the reader how important this decision is: India is a top recipient of UK aid. Then it goes into a comparison of the two countries, to reflect on their relative economic health. This leads into an exploration of poverty in India and finally an opening towards the other potential beneficiaries of this change, placing this policy decision in a larger context. The sources are also an important aspect of an infographic and I wanted to provide them in a clear way to support the credibility of the data above.

This has taken much longer than anticipated. Dataviz nerds, look for a making-of in the coming days.

Week 3: Aid Transparency

The third week's assignment was right up my alley: aid transparency. It is all the more disappointing, then, that I was not able to complete something worthy. The source data comes from the Transparency Index of Publish What You Fund and takes the form of a ranking of aid agencies according to their transparency score. I expected the other students to visualize this ranking, making the comparison between agencies more apparent and highlighting their strengths and weaknesses. I decided to try something different by visualizing the indicators themselves. I thought that it could be a nice way of explaining data transparency by detailing how it is measured. Here is my entry.

Aid Transparency Graph FG

The assignment was for an interactive visualization, so the image below shows some of the interactivity that could be prompted by the users, namely a series of definitions and the capacity to select a subset of indicators.

Aid Transparency Graph FG2

Using Adobe InDesign means drawing every data point, and this took way longer than it should have, mainly because having very precise data does not add much to the graph. Most people just give a quick look and are more interested in the way the data is visualized than in the result of the visualization. This point was driven home by the multiple hand-drawn sketches of fellow students that did not approach accuracy, but that sometimes conveyed their concept clearly enough. The next week, I wouldn't be caught.

Given the call for a narrative, I spent some time finding and writing some analysis. Again, this is not something that I expect anyone to read -- at least, no one has ever commented on the text -- so it does not seem like a good investment of time.

Aid Transparency Graph FG3

In general, I like to use colors to visually group things, but more than once my audience has been more interested to see things grouped by subcategories, so that's what I have done here with the three categories of transparency. See how the colors are grouped. I have to say that it worked better than I had anticipated.

Aid Transparency Graph FG32

This slide shows only the improvement of each indicator. The main message is that all indicators have improved over the last year, although some much more than others.

In the end, I did not get to produce something of the quality I was hoping for. I picked my colors at random, I did not include a legend, I did not push the analysis, etc. But the week was over and another assignment was waiting. The point is not so much to create a perfect infographic as to learn, and this goal was already achieved.