

Showing posts from October, 2011

Scrappy Scapers

In an earlier post I presented some R code for a basic way of collecting text from websites. This is a good place to start for collecting text for use in text analysis. 

However, it clearly has some limitations:

You need to have all of the URLs already stored in a .csv file.
The method of extracting the text from the downloaded HTML code using <gsub> is a bit imprecise. It doesn't remove the text from common links such as "Home" or "About".

Both of these problems can be solved in R with a bit of work. But I think for bigger scraping projects it is probably a good idea to use other languages such as Python or Ruby.
ProPublica has an excellent little series on scraping that covers how to gather data from online databases and PDFs. This is a really good public service and enables something sadly unusual in journalism: reproducibility. Their chapter on using Ruby and Nokogiri for scraping Pfizer's doctor payments disclosure database is particularly hel…

Even More Reason To Pay Attention

A few posts back, I discussed how a number of US financial regulators and the Department of Justice seemed to be credibly committing to bad supervision.

This is especially worrying given this recent summary of how Dodd-Frank limits the powers of the Fed/Treasury/FDIC to respond to a financial crisis. Though the idea may be to limit moral hazard by credibly committing to not give 2008-style bailouts, I have a hard time believing in this credibility. My initial thought is that no democratically elected government would actually fail to respond if its economy was collapsing because of a financial crisis. So, if a major crisis hits, these Dodd-Frank provisions will merely slow down the inevitable bailouts (many of the powers can be enacted with congressional approval). There is still moral hazard feeding potential crises, but crisis responses will be slower.

As the Economist rightly points out, regulators have an even stronger imperative to prevent a crisis. But to do this…


Just researching the policymaking behind the Irish 2008 "Guarantee Everything" policy and found this nugget. In the one page statement announcing the plan they cite the "international market" turmoil twice as the cause of the 2008 crisis in Ireland (US subprime induced credit crunch -> tightening liquidity markets, yada yada yada).

Not once is the massive domestic real estate bubble mentioned! Sure this doesn't reveal policymakers' total knowledge (they could just not mention the problem, while knowing it exists), but still.

Automated Academics

This WSJ piece on the US income gains over the past decade (summary: unless you have a PhD or MD, you didn't have any income gains) got me thinking:

I'm actually pretty cautious about that number. I would be more interested in the range of the distribution; I think the percent change is being pulled up by all of those physics PhDs who went into finance.

Then again, considering that over the past few weeks I've been learning how to automate the collection of data that used to be done by people with master's degrees, maybe PhDs are going to be the ones who automate all of the former undergraduate and master's level work out of existence, keeping the productivity gains for ourselves (conditional on the tax structure). (See also Farhad Manjoo's recent series on this issue in Slate.)

One thing I gleaned from a talk given by the Governor of the California Board of Education last night was that academics largely don't even need PhDs (at least at all levels except t…

Simple Text Web Crawler

I put together a simple web crawler for R. It's useful if you are doing any text analysis and need to make .txt files from webpages. If you have a data frame of URLs it will cycle through them and grab all the websites. It strips out the HTML code. Then it saves each webpage as an individual text file.

Thanks to Rex Douglass, also.

 Enjoy (and please feel free to improve)
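The crawler itself is not reproduced on this page. As a rough illustration of the steps described above (loop over a set of URLs, download each page, strip the HTML, save one text file per page), here is a minimal sketch in Python, one of the languages suggested elsewhere on this page for bigger scraping projects. It is not the R code from the post: the names html_to_text, fetch_html and save_pages, the page_0001.txt naming scheme, and the use of a plain list of URLs in place of a data frame are all assumptions made for the example. Like the approach described in the post, it keeps the text of navigation links such as "Home".

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_html(url):
    """Download a page and decode it (assumes UTF-8 if no charset is given)."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_pages(urls, out_dir, fetch=fetch_html):
    """Save each URL's text as page_0001.txt, page_0002.txt, ... in out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for i, url in enumerate(urls, start=1):
        text = html_to_text(fetch(url))
        target = out_path / f"page_{i:04d}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Calling save_pages(list_of_urls, "pages") would write one .txt file per URL into a pages folder. The fetch argument is there only so the download step can be swapped out, for example when trying the function on HTML strings you already have.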

Recommended -- Mid-October

Here are three articles that I've found pretty interesting over the past few days:


A fairly insightful blog post about the changing view of management, shareholders, and corporate cash.


The Guardian sticks it to Murdoch, again.


This is a great article on symmetry in physics. The highly speculative ending is at the very least fun. I hadn't really known much about symmetry and larger Group Theory until reading Alexander Masters' excellent biography of the eccentric mathematician Simon Norton the other day. Also highly recommended.

Real Inflation? (Part 1)

At a recent lunch the conversation turned to how most Americans' real income hasn't changed since the 1970s when we adjust for inflation (see here for some decent graphs). One of the people at the lunch (a person who has written considerably on monetary policy) contested this. His argument is that we are actually very bad at measuring inflation. Prices may rise, but the quality of the goods that we buy is much better now than it was in the seventies. The iPad I buy now is much better than the 1970s TV or radio or all the other things that it replaced in my life and probably cheaper than all of these things combined. On this line of reasoning, inflation is actually overestimated.

There is one obvious flaw with this argument: it misses much of the point. If we were really terrible at measuring inflation in this way, then yes, maybe most people's income has actually increased. But the bigger issue is that the top sliver of the income distribution has made steady gains since the…