
13 May 2013 3 Comments

Happiest State?

An assignment for my “Intro to Data Science” course was to estimate which state was “happiest” based on Twitter sentiment analysis. Words such as “adore, admire, fun, love, reassure” score positive, while words such as “despair, harmful, mediocrity, upset” score negative.

I analyzed 300,000 tweets and came up with the following result. (Note: Darker red means happier sentiment.) Hawaii was the clear winner, but UT-WY-CO-KS-OK form a very strong region of positive sentiment tweets.

Happy State

In my analysis, LA and AL were the only two states with a negative Twitter sentiment, meaning that, on average, their tweets contained more negative sentiment words than positive ones.
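
For anyone curious about the aggregation step, here is a minimal sketch of how per-tweet scores can be rolled up by state. The function names and the placeholder state codes are hypothetical, the numbers are made up, and how a tweet gets assigned to a state is outside the sketch.

```python
from collections import defaultdict


def average_sentiment_by_state(scored_tweets):
    """Average the tweet scores for each state.

    `scored_tweets` is an iterable of (state_code, score) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # state -> [score sum, tweet count]
    for state, score in scored_tweets:
        totals[state][0] += score
        totals[state][1] += 1
    return {state: total / count for state, (total, count) in totals.items()}


def happiest_state(averages):
    """Return the state code with the highest average score."""
    return max(averages, key=averages.get)


# Made-up pairs with placeholder state codes; not the data behind the map.
sample = [("S1", 4), ("S1", 2), ("S2", 3), ("S2", 1), ("S3", -1), ("S3", -3)]
averages = average_sentiment_by_state(sample)
print(happiest_state(averages), averages)
```

A state whose average comes out below zero is one where negative words outweighed positive ones.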

6 May 2013 1 Comment

Intro to Data Science – Twitter Sentiment

I started a new class from Coursera.org, Introduction to Data Science by Bill Howe from the University of Washington, and I expect the 8-week course to be both challenging and rewarding.

The course covers methods for handling, manipulating, analyzing, and visualizing massive data sets.

The first class assignment was to use a Python script to analyze the “sentiment” of a large number of Twitter posts. The assignment was not intended to be particularly sophisticated; rather, it was just to get a taste of what we can do with data science.

I captured about 175,000 tweets over a period of 30 minutes and used the 2,477 words and phrases found in the AFINN-111 list to score the sentiment of each tweet.
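
Here is a minimal sketch of what that scoring step can look like. It is not the course's reference code: `load_lexicon` and `score_tweet` are hypothetical names, and the tiny lexicon below is a made-up stand-in for AFINN-111 (which, as far as I know, is distributed as tab-separated term and score lines).

```python
# Illustrative stand-in for the AFINN-111 lexicon; these few entries and
# their scores are placeholders, not values copied from the real file.
SAMPLE_LEXICON = {"love": 3, "fun": 4, "upset": -2, "despair": -3}


def load_lexicon(lines):
    """Parse tab-separated 'term<TAB>score' lines into a dict."""
    lexicon = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        term, score = line.rsplit("\t", 1)
        lexicon[term.strip()] = int(score)
    return lexicon


def score_tweet(text, lexicon):
    """Sum the scores of every whitespace-separated word found in the lexicon.

    No punctuation cleanup is done, and multi-word phrases are ignored.
    """
    return sum(lexicon.get(word, 0) for word in text.lower().split())


print(score_tweet("I love this fun class", SAMPLE_LEXICON))
```

A tweet with no lexicon words scores zero, which is why those tweets can be dropped before plotting.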

After removing all of the tweets with a sentiment of zero (i.e., they didn’t contain any words from AFINN-111), I plotted them in the following histogram. Positive sentiment tweets are marked in green, and negative in red. The further from the middle of the graph, the stronger the sentiment.

Again, this is not a sophisticated analysis (i.e., I didn’t clean up words by removing punctuation, etc.), but you can see that for the 170,000 tweets I analyzed, sentiment is more strongly positive.
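
The counting behind a histogram like this can be sketched in a few lines. The helper names are hypothetical and the scores are made up; the sketch just drops the zeros and tallies how many tweets land on each score.

```python
from collections import Counter


def sentiment_histogram(scores):
    """Count tweets per sentiment score, dropping scores of zero."""
    return Counter(s for s in scores if s != 0)


def render(hist):
    """Return one text line per score, lowest first, with '#' bars."""
    return [f"{score:+3d} {'#' * count}" for score, count in sorted(hist.items())]


# Made-up scores purely for illustration.
example_scores = [0, 3, -2, 3, 1, 0, -2, 3]
for line in render(sentiment_histogram(example_scores)):
    print(line)
```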


The assignment has multiple parts. I’ll continue to post the highlights for this and other assignments from the course.

3 May 2011 0 Comments

Systems Thinking

Here are some basics of Systems Thinking that I pulled together for some colleagues recently.

A process is a series of steps intended to achieve a specified output.

A system is two or more processes where the behavior of each process has an inter-dependent effect on the behavior of the whole. (You can’t change one without affecting the other.)

Processes focus on completing tasks . . . Systems are goal seeking and therefore adapt to changing conditions. (Organizations that focus on completing tasks rather than achieving goals are called bureaucracies.)

A system’s effectiveness depends as much on the health of the individual processes as it does on the interactions between the processes.

You can’t understand a system by taking it apart. (Example: You can’t understand what a human is by studying the heart, lungs, etc., in isolation. You understand the heart by understanding its function, as well as its effect on other functions of the body.)

Systems are only understood by knowing the purpose of the system as well as the functions of the system, i.e., why things operate as they do. You must understand the ‘why’ before you can design an optimal ‘how.’

You can’t optimize a system by optimizing (problem solving) the parts.

The parts must be designed to optimize the whole. You can only do that if you know the target condition of the system.

In order to understand, design, or think about a system, you must consider:

  • What is the objective of the system overall?
  • What is the pathway design of materials, information, and services?
  • What are the connections between process steps?
  • How do people do their work at each process step?
  • Each of the design elements must be self-diagnostic. (Jidoka!) Every time a system has a problem, it is an opportunity to better design the pathways, the connections, or the activities.

You can’t design a perfect system, only discover one.

Systems Thinking – Additional Reading:

I’d love to hear any thoughts or recommended reading you may have on systems thinking!

1 January 2011 0 Comments

Pop Can Image Recognition Using Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA)

In 2002, I took a computational intelligence class from NC State University (ECE 592Z), which I completed as a distance-education course. One of our projects was to train a neural network to recognize different pop can images. The instructor gave us a training set of images, but our grade was based on how many images our neural net could identify from a set of images that it had never seen. It was a fun project, but it was a pain to get the neural network to converge.

Since Thanksgiving, I’ve been spending most of my free time learning everything I can about partial least squares regression. I purchased and read about half of “Multivariate and Megavariate Data Analysis Advanced Applications and Method Extensions (Part II)” by Umetrics Academy. I was intrigued by the chapter on image analysis, and wondered if it could be used to recognize the pop cans I used for my neural network project.

I dug up the set of 189 images. They were all small (90×150 px) gray scale images, a sample of which are shown below.

Notice that each image is in a slightly different position and has a varying amount of noise.

The key to using PLS for image analysis is to represent the image as a single vector of information. This is done using a wavelet transformation. I tried to figure out how to do this on my own in Matlab, but wasn’t able to figure out how to get the output of the wavelet transformation into an appropriate 1-dimensional format. Fortunately, Umetrics has a free utility that does this, called SIMCA Codec.
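
To give a feel for the idea, here is a rough sketch of turning a 2-D grayscale image into a single vector with one level of a Haar wavelet transform (the averaging and differencing form). This is only an assumption about the general approach; I don't know which wavelet or layout SIMCA Codec actually uses, and the function names are hypothetical. It expects an image with an even number of rows and columns.

```python
def haar_1d(values):
    """One Haar level: pairwise averages followed by pairwise differences."""
    pairs = list(zip(values[0::2], values[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages + details


def haar_2d(image):
    """Apply one Haar level to every row, then to every column."""
    rows = [haar_1d(row) for row in image]
    columns = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]  # transpose back to rows


def image_to_vector(image):
    """Flatten the transformed image row by row into a single 1-D list."""
    return [value for row in haar_2d(image) for value in row]


# A made-up 2x2 "image" of pixel intensities.
print(image_to_vector([[1, 3], [5, 7]]))
```

Each image then becomes one row of the data table, with one column per wavelet coefficient.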

Once the data from the 89 images was imported into SIMCA-P+ (v.12.01), the analysis was incredibly easy. I selected the orthogonal partial least squares discriminant analysis option, and the output gave 5 validated latent variables. (Note, I also did the analysis using the non-orthogonal PLS method, but the results were not as crisp.) The images below show the image clusters (one color for each type of pop can) plotted against latent variables 1, 2, and 3, and then again using latent variables 1, 2, and 5.

These graphs show a clear distinction of each of the pop can types in latent variable space. In SIMCA-P+, you are able to rotate these graphs in 3D. It was much easier to see the clean split between each of the groups by doing this. In other words, the analysis can clearly recognize which images are of what type of pop.

I continue to be impressed with the power and flexibility of partial least squares analysis. It’s been well worth the time and effort to learn the concepts and how to put it to use in practice. I’ve got some fun projects in mind as to where to go next!

30 December 2010 3 Comments

100 Days of Sleep – Part 4: Can I Plz Haz Deep Sleepz?

In yesterday’s post, I provided the two key elements for getting a good night’s sleep (for me, at least): 1) total amount of sleep, and 2) amount of deep sleep.

Getting enough total sleep is pretty straightforward to understand (if, perhaps, not always easy to achieve). But the data from yesterday showed I need about 60 minutes of deep sleep per night to feel totally refreshed the next day. The problem is, I only get about half that (as shown in the table in the first post of this series).

I decided to dig a little deeper (ha!) into my sleep data to see if there were any noticeable trends around deep sleep. I used the detailed sleep data that was included in the Zeo data export from the web to build the graph below. It shows the probability I’ll be in any particular sleep phase as a function of hours into sleep. In other words, when during the night am I in deep sleep?
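
The calculation behind a graph like this can be sketched as follows. The data layout is an assumption (one phase label per fixed-length time slot for each night), not the actual format of the Zeo export, and `phase_probabilities` is a hypothetical name. For each slot, it counts the fraction of nights spent in each phase, treating nights that ended earlier as "no data".

```python
from collections import Counter


def phase_probabilities(nights, phases=("wake", "light", "rem", "deep")):
    """For each time slot, return the fraction of nights spent in each phase.

    `nights` is a list of per-night lists of phase labels, one label per
    fixed-length time slot since going to bed. Nights shorter than the
    longest one count as "no data" for the missing slots.
    """
    labels = tuple(phases) + ("no data",)
    n_slots = max(len(night) for night in nights)
    result = []
    for slot in range(n_slots):
        counts = Counter(
            night[slot] if slot < len(night) else "no data" for night in nights
        )
        result.append({label: counts[label] / len(nights) for label in labels})
    return result


# Four made-up, very short "nights" for illustration.
nights = [
    ["wake", "light", "deep"],
    ["light", "deep", "rem"],
    ["light", "deep"],
    ["wake", "light", "rem", "light"],
]
for slot, probs in enumerate(phase_probabilities(nights)):
    print(slot, probs)
```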

The light gray color means there was no data available, e.g., I was already awake or the sensor wasn’t transmitting because I took it off.

The dark gray shows awake time. This shows that, on average, it takes me about 10-30 minutes to fall asleep. This probably isn’t too accurate since I’ll often thumb through an RSS reader on my iPhone before deciding to actually go to sleep. The data also shows there is about a 5% probability at any point through the night that I’ll be in a wake state.

The light (ha!) green shows the probability I’ll be in light sleep. It’s obviously the bulk of my sleep.

The medium-green shows when I’m in REM phase. It’s interesting to see the strong peak about 3 hours into sleep, where there’s a 50% chance I’ll be in REM. It then levels off to about 30% for the rest of the night. But after 6.5 hours, the relative probability of REM increases. I believe this is typical.

Lastly, the dark green shows when deep sleep is most likely to occur. The largest peak is about an hour into sleep, where there is a 25% chance I’ll be in deep sleep phase. There is a second, smaller peak around 3.5 – 4.5 hours.

So, what now? I’m really not sure. I’ve looked around and haven’t found much of anything that suggests a way to increase deep phase sleep. I’ve already looked at a number of indicators associated with the Zeo sleep data I collected along with the sleep journal information, and I see absolutely nothing that suggests how I might get more deep sleep. I’ll probably figure out a way to collect more data going forward to help in maximizing deep sleep.

One possible solution would be a polyphasic sleep pattern. Since a large chunk of deep and REM sleep occurs in the first 3.5 hours of sleep, would I do better to break my sleep up into two blocks of 3.5 hours? I’m not sure this is practical, but I might give it a try sometime.

So, that’s it for this round of sleep analysis. I’ll probably take another stab in 6 months or so. I have to put in a plug for the people over at Zeo. I appreciate how approachable and helpful they’ve been! Thanks!!

29 December 2010 3 Comments

100 Days of Sleep – Part 3: What Makes a Good Night’s Sleep?

When I started analyzing the sleep data, I was very interested to see if particular phases of sleep correlated more closely to how I felt the next day. Yesterday, I showed some ways the data could be simplified. Today, I’ll use the data to make a complete model.

To understand the relationship between each parameter, I performed a partial least squares (PLS) regression using the data. This is a fairly high-powered method of multivariate analysis that is particularly suited for large sets of variables. As an alternative, I could have reduced my variable set to those identified in yesterday’s post and used a simpler analysis method. But PLS is the best way to build a robust model using all of the available data.

For the PLS analysis, I used SIMCA-P+ v.12.

Below is the PLS Loading Plot from my sleep data. The black items are the predictors (i.e., x-variables) while the red items are the predicted (i.e., y-variables). Similar to a PCA plot, the items furthest from the origin of the plot have the most influence. Once again, we see clustering of Total Z, Time in Light, and ZQ.

The analysis clearly demonstrates that Total Z and Time in Deep are the two factors that are most necessary to feel rested the following day.

With this information, I used SIMCA-P+ to construct a model based on the two sleep factors – Total Z and Time in Deep – to predict Day Feel. (I used only Day Feel 1, since it was representative.) The resulting model is shown below.

This plot shows the next day’s feel as a function of the minutes of total sleep I got (ranging from 3-9 hours) and the minutes of time in deep sleep. It clearly shows that Time in Deep is a very important factor. In fact, getting an extra 15 minutes of deep sleep is equivalent to getting an entire extra hour of sleep!
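
As a rough illustration of what a PLS regression does under the hood, here is a single-component PLS1 fit (one y-variable) in plain Python, written from the textbook steps. It is not SIMCA-P+'s implementation, it only mean-centers the data (no variance scaling), the function names are hypothetical, and the numbers are made up so the arithmetic is easy to follow; they are not my sleep data.

```python
def center(values):
    """Subtract the mean from a list of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pls1_one_component(x_columns, y):
    """Fit a single-component PLS1 model on mean-centered data.

    `x_columns` is a list of predictor columns (each a list of numbers) and
    `y` is the response column. Returns one regression coefficient per
    predictor, applying to the centered variables.
    """
    xc = [center(col) for col in x_columns]
    yc = center(y)

    # Weights: cross-product of each predictor with y, scaled to unit length.
    w = [sum(a * b for a, b in zip(col, yc)) for col in xc]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]

    # Scores: projection of every observation onto the weight vector.
    t = [sum(w[j] * xc[j][i] for j in range(len(xc))) for i in range(len(yc))]

    # Regress y on the scores, then express the fit in terms of the predictors.
    q = sum(a * b for a, b in zip(yc, t)) / sum(v * v for v in t)
    return [wj * q for wj in w]


# Made-up minutes of total sleep, minutes of deep sleep, and a 1-5 feel score.
total_z = [360, 420, 360, 420]
time_in_deep = [20, 20, 40, 40]
day_feel = [2, 2, 4, 4]
print(pls1_one_component([total_z, time_in_deep], day_feel))
```

Comparing the size of the coefficients is what lets you say things like "so many minutes of one variable are worth so many minutes of another."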

What do I conclude from this?

  1. I have to get the right amount of Total Sleep AND Deep Sleep to have a good night’s rest.
  2. Getting extra Total Sleep is relatively straightforward (if not always easy). But, I need to figure out how to get adequate Deep Sleep!

Tomorrow, I’ll dive a little deeper (pun) into my Deep Sleep data, and post my next steps and outstanding questions.

28 December 2010 3 Comments

100 Days of Sleep – Part 2: Simplifying the Data

Yesterday, I posted averages for my sleep phases, namely Time to Z, Total Z, Time in Light, Time in REM, Time in Deep, Time in Wake, Awakenings, and the ZQ score. Next I started looking into the data provided by my sleep journal.

The MyZeo web application lets you track a number of things in your sleep journal; the most relevant to this analysis are the Morning and Day Feel scores. These are qualitative scores from 1-5. The Morning Feel is an indicator of how well you feel you slept (Poorly to Very Well), while the 3 Day Feel indicators track how you felt during the day in regard to: 1) Irritable vs Easygoing, 2) Unfocused vs Focused, and 3) Tired vs Energetic.

Since there were so many different factors to consider in both the sleep phase data and the sleep journal information, I performed a principal components analysis (PCA) to see if I could reduce the number of variables needed to represent the data to a few key elements.

Again, I used JMP v.8 to do the PCA. Without going into the details of interpreting the results, the graph below shows the following:

ZQ, Total Z, Time in Light, and Time in REM are all correlated [blue arrow]. In other words, I don’t need to look at all 4 variables to understand my sleep. This suggests that I could use the single variable “Total Z” and keep most of the prediction power, with the added advantage that it is easy to measure and understand.

Another possibility for simplification is around the “Day Feel” indicators [purple arrow], which all track together, suggesting that, e.g., if I’m going to be tired, I’m also going to be unfocused and irritable. (Note: Anecdotal evidence from my family suggests the data may be accurate!)

It’s also worth noting that the analysis indicates “Time in Wake” correlates with “Awakenings,” which seems to make perfect sense. In addition, “Time to Z” (how long it took me to fall asleep) is not at all correlated with the other variables, and has a very weak effect, as noted by the short length of the arrow.
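
A simpler way to check which variables "track together" is to compute pairwise Pearson correlations. This is not PCA itself, just a quick related check; the helper names, the 0.8 cutoff, and the numbers below are all made up for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| at or above threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Made-up values for three variables over four nights.
columns = {
    "total_z": [300, 360, 420, 480],
    "zq": [50, 60, 70, 80],
    "time_to_z": [10, 25, 12, 20],
}
print(correlated_pairs(columns))
```

Variables that show up together in the output are candidates for being represented by a single one of them.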

Summarizing today’s insights:

  • The data suggests that the number of variables used to represent my night’s sleep can be simplified to a few key elements.
  • The same is true for the output of my sleep, in other words, how I feel the following day.

Tomorrow, I’ll explore whether it’s possible to build a predictive model that can tell me exactly what type of night’s sleep I’d need to maximize how I feel the next day.

27 December 2010 0 Comments

100 Days of Sleep – Part 1: Top-Level Results

I’m one of those people who needs more sleep than I get, so I’m always interested in finding ways to improve the quality of my sleep. Earlier this year, I purchased a Zeo, which is a small device that measures and records your phases of sleep. You can upload this data to a website which also allows you to manually log a number of other factors associated with sleep in the form of a sleep journal.

I just completed 100 days of recorded data on the Zeo, and I thought it was time to look for some overall patterns. While the website provides a number of simple ways to do this, I decided to download the raw data and analyze it myself.

Overall Numbers

To calculate top-level results for the 100 nights of data, I took out any nights that were statistical outliers (e.g., the headband fell off); there were only 4 that were obvious outliers.

Here’s what an “average” night looks like, broken down by sleep phases. The majority of the sleep I get is in the light phase, then REM, and deep, with only a small amount of wake time. (Evidently, most people are awake for short periods like this, even if they don’t remember it.)

I also calculated typical ranges for each phase of sleep. The median is the most likely value I would experience for each phase. The range runs from the 10th percentile of the data to the 90th percentile. In other words, 80% of the data fall into this range, so it is a good representation of what “typical” results look like for me.

Measure: 10th percentile – median – 90th percentile

  • ZQ (the Zeo overall score): 51 – 68 – 84
  • Total Sleep [h:mm]: 5:11 – 6:38 – 8:03
  • Time in REM [h:mm]: 1:08 – 1:51 – 2:48
  • Time in Light [h:mm]: 3:05 – 4:04 – 4:57
  • Time in Deep [h:mm]: 0:22 – 0:35 – 0:50
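
For anyone who wants to compute the same kind of range without JMP, here is a minimal sketch using Python's standard library. The helper names are hypothetical, and different tools define percentiles slightly differently, so the numbers may not match JMP's output exactly.

```python
import statistics


def typical_range(values):
    """Return (10th percentile, median, 90th percentile) of the values."""
    deciles = statistics.quantiles(values, n=10)  # nine cut points
    return deciles[0], statistics.median(values), deciles[-1]


def minutes_to_hmm(minutes):
    """Format a number of minutes as h:mm."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours}:{mins:02d}"


# Made-up nightly totals in minutes, for illustration only.
total_sleep = [310, 345, 360, 380, 398, 410, 425, 450, 480]
low, median, high = typical_range(total_sleep)
print(" – ".join(minutes_to_hmm(m) for m in (low, median, high)))
```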

For those interested in such details, the distribution analysis above was performed using JMP v.8, and I’ve provided a screenshot of the output if you’d like to take a look.

There is nothing in these results that is particularly revealing. It’s when I compared the sleep phase data to the sleep journal information that things got interesting! I’ll post those results tomorrow.
