I started a new class from Coursera.org – Introduction to Data Science by Bill Howe from the University of Washington – and I expect the 8-week course to be both challenging and rewarding.
The course covers methods for handling, manipulating, analyzing, and visualizing massive data sets.
The first class assignment was to use a Python script to analyze the “sentiment” of a large number of Twitter posts. The assignment was not intended to be particularly sophisticated; it was just meant to give a taste of what we can do with data science.
I captured about 175,000 tweets over a period of 30 minutes and used the 2,477 words and phrases found in the AFINN-111 list to score the sentiment of each tweet.
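To make the scoring step concrete, here is a minimal sketch of how a tweet can be scored against a word list like AFINN-111. It is not the exact script from the assignment. It assumes the list is a tab-separated file with one term and its integer score per line, and the function names (`load_lexicon`, `score_tweet`) and the four sample entries are my own illustrations.

```python
def load_lexicon(lines):
    """Parse tab-separated 'term<TAB>score' lines into a dict."""
    lexicon = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so multi-word terms stay intact.
        term, score = line.rsplit("\t", 1)
        lexicon[term] = int(score)
    return lexicon


def score_tweet(text, lexicon):
    """Sum the scores of the whitespace-separated words in a tweet.

    Words missing from the lexicon count as 0. Punctuation is not
    stripped, so "good!" will not match "good".
    """
    return sum(lexicon.get(word, 0) for word in text.lower().split())


# Illustrative sample entries only; the real list has 2,477 terms.
SAMPLE_LINES = ["good\t3", "bad\t-3", "love\t3", "hate\t-3"]
lexicon = load_lexicon(SAMPLE_LINES)

print(score_tweet("I love this good day", lexicon))
print(score_tweet("I hate Mondays", lexicon))
```

With the real list saved locally (for example as AFINN-111.txt), the same loader can be fed an open file object instead of the sample lines. Note that this word-by-word lookup never matches the multi-word phrases in the list; handling those would need an extra matching step.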
After removing all of the tweets with a sentiment of zero (i.e., they didn’t contain any words from AFINN-111), I plotted the remaining scores in the following histogram. Positive sentiment tweets are marked in green, and negative in red. The farther a bar is from the middle of the graph, the stronger the sentiment.
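The filtering and counting behind a histogram like this can be sketched in a few lines. This is only an illustration with made-up scores, not my actual data, and the helper names (`nonzero_histogram`, `bar_color`) are hypothetical. It tallies how many tweets fall at each nonzero score and picks the bar color by sign, leaving the actual drawing to whatever plotting tool you prefer.

```python
from collections import Counter


def nonzero_histogram(scores):
    """Count tweets per sentiment score, dropping scores of zero."""
    return Counter(s for s in scores if s != 0)


def bar_color(score):
    """Green for positive sentiment, red for negative."""
    return "green" if score > 0 else "red"


# Made-up scores for illustration; zeros are tweets with no scored words.
sample_scores = [0, 2, -1, 2, 0, -3, 4, 2]
hist = nonzero_histogram(sample_scores)

# Print one text bar per score, from most negative to most positive.
for score in sorted(hist):
    print(f"{score:+d} {bar_color(score):5s} {'#' * hist[score]}")
```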
Again, this is not a sophisticated analysis (for example, I didn’t clean up words by removing punctuation), but you can see that across the roughly 170,000 tweets I analyzed, sentiment skews positive.
The assignment has multiple parts. I’ll continue to post the highlights for this and other assignments from the course.