Data Visualization Assignments: 2015

Sunday, November 22, 2015

Week 4- Creating graphs for your data

Reminder from week 1, week 2, and week 3
I have chosen the AddHealth dataset. After looking through its codebook, I became interested in studying the dependencies between teenage drug abuse and the factors that most likely affect it or are affected by it.

Week 4

Primary data analysis through Python/SAS

The chosen language: Python

In this post:

The python code (screenshot)

The formatted output

1- Statistics

2- Univariate charts (samples)

3- Bivariate Charts (sample)

Results summary and description
The python code (text)
The python raw output

The python code (screenshot):

The Formatted output

1- Statistics

2- Sample uni-variate Graphs

3- Sample Bivariate graphs

Summary

The bivariate bar chart above (the full size of each chart is shown in the formatted output) suggests an inverse relationship between how often students pray and the number of cigarettes they smoke per day, i.e., the more frequently a student prays, the fewer cigarettes he/she smokes.
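The bars in such a chart are simply the mean of the quantitative variable within each category (seaborn's bar plots use the mean by default). A minimal sketch of that calculation follows; the values are hypothetical toy numbers, not AddHealth data, and only the column names and the recode labels are borrowed from the code in this post.

```python
import pandas

# Hypothetical toy data, NOT AddHealth values.
# H1RE6: prayer frequency code (1 = daily ... 5 = never)
# H1TO7: cigarettes smoked per day
toy = pandas.DataFrame({
    "H1RE6": [1, 1, 2, 2, 3, 3, 4, 5, 5],
    "H1TO7": [2, 4, 3, 5, 6, 6, 8, 10, 12],
})

# Same labels as the recode used for the bar chart
labels = {1: "Daily", 2: "Weekly", 3: "Monthly", 4: "Yearly", 5: "NoPray"}

# Mean cigarettes per day within each prayer category:
# these are the bar heights of a categorical -> quantitative bar chart
group_means = toy.groupby("H1RE6")["H1TO7"].mean().rename(index=labels)
print(group_means)
```

Printing the group means next to the chart makes it easier to check whether the bars really fall or rise from one category to the next.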

The first univariate graph above shows that most of the sample first tried beer/wine when they were between 13 and 15 years old, which is the early teenage stage. A similar trend is found for cigarette smoking, but with lower counts, i.e., teenagers are more likely to drink beer than to smoke.
The scatter plot above shows that the earlier a person starts to smoke, the more cigarettes they smoke daily: the relationship between the two variables has a negative trend.
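A trend read from a scatter plot can be backed by a number. The sketch below computes the Pearson correlation coefficient and the slope of the least-squares line with numpy; the two arrays are hypothetical toy values chosen for illustration, not AddHealth data.

```python
import numpy

# Hypothetical toy values, NOT AddHealth data
age_first_smoked = numpy.array([10, 11, 12, 13, 14, 15, 16, 17], dtype=float)
cigarettes_per_day = numpy.array([20, 15, 16, 10, 8, 9, 4, 2], dtype=float)

# Pearson correlation coefficient: a negative value means that
# cigarettes per day tend to fall as the starting age rises
r = numpy.corrcoef(age_first_smoked, cigarettes_per_day)[0, 1]

# Least-squares straight line (degree-1 polynomial), the same kind of
# line that a regression plot draws through the points
slope, intercept = numpy.polyfit(age_first_smoked, cigarettes_per_day, 1)

print("correlation:", r)
print("slope:", slope)
```

A negative correlation together with a negative slope is what "negative trend" means here; on real survey data the missing values would have to be dropped first.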

The Python code (text)

# -*- coding: utf-8 -*-

"""

Created on Sun Nov 22 14:30:25 2015

@author: Dr. Mohammad Elnesr

"""

import pandas

import numpy

import seaborn

import matplotlib.pyplot as plt

#import matplotlib.pyplot as plt2

# defining data source...

print ("Welcome")

data = pandas.read_csv('addhealth_pds.csv', low_memory=False)

#Set PANDAS to show all columns in DataFrame

pandas.set_option('display.max_columns', None)

#Set PANDAS to show all rows in DataFrame

pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors

pandas.set_option('display.float_format', lambda x:'%f'%x)

dict={}

#setting variables you will be working with to numeric
def ConvertToNumeric (Variable):
    # values that cannot be parsed as numbers become NaN
    data[Variable] = pandas.to_numeric(data[Variable], errors='coerce')

#Defining a function that converts any number in a variable to NaN
def ConvertToNaN (Variable, Code1, Code2=numpy.nan, Code3=numpy.nan, Code4=numpy.nan):
    data[Variable]=data[Variable].replace(Code1, numpy.nan)
    for CodeX in [Code2, Code3, Code4]:
        # NaN never compares equal to anything, so test for it with isnull
        if not pandas.isnull(CodeX):
            data[Variable]=data[Variable].replace(CodeX, numpy.nan)

# Converting a variable to numeric, removing its invalid codes, and storing its definition
def PrepareVariable(Variable, Definition, Code1, Code2=numpy.nan, Code3=numpy.nan, Code4=numpy.nan):
    ConvertToNumeric (Variable)
    ConvertToNaN (Variable, Code1, Code2, Code3, Code4)
    dict[Variable] = Definition

# Applying the correction to each variable: the listed codes are not valid answers and become NaN.

PrepareVariable("H1GI9", "What is your racial background?", 6,8)

PrepareVariable("H1GI20", "In what grade are you?", 96,97,98,99)

PrepareVariable("H1TO30", "How old were you when you tried marijuana for the first time? ", 96,98,99)

PrepareVariable("H1TO14", "How old were you when you tried beer for the first time? ", 96,97,98)

PrepareVariable("H1TO7", "How many cigarettes did you smoke each day?", 96,97,98)

PrepareVariable("H1TO2", "How old were you when you tried cigarettes for the first time? ", 96,97,98)

PrepareVariable("H1TO5", "How many days did you smoke cigarettes?", 96,97,98)

PrepareVariable("H1ED12", "What is your grade in mathematics?", 96,97,98, 6)

PrepareVariable("H1RE6", "How often do you pray?", 6,7,8)

MyVariables = ['H1GI9','H1GI20','H1TO30','H1TO14','H1TO7','H1TO2','H1TO5', 'H1ED12', 'H1RE6']

#subset data to This week's variables

sub1=data[MyVariables]

#make a copy of my new subsetted data

sub2 = sub1.copy()

sub3 = sub1.copy()

for Variable in MyVariables:
    desc = sub2[Variable].describe()
    print (desc)
    print ("-=-=-=-=-=-=-=-=-=-=-=-")

    # histogram of the variable's valid values
    plt.figure()
    seaborn.distplot(sub2[Variable].dropna(), kde=False)
    plt.xlabel(dict[Variable])
    plt.title(dict[Variable] + " Distribution plot of "+Variable)

    # count plot of the same variable treated as categorical
    sub2[Variable] = sub2[Variable].astype('category')
    plt.figure()
    seaborn.countplot(x=Variable, data=sub2)
    plt.xlabel(dict[Variable])
    plt.title(dict[Variable] + " Count plot of "+Variable)

    # display the two figures of this variable
    plt.show()

# bivariate bar graph C->Q

plt.figure()

#sub2['H1GI9'] = sub2['H1GI9'].convert_objects(convert_numeric=True)

#sub2['H1TO2'] = sub2['H1TO2'].convert_objects(convert_numeric=True)

seaborn.factorplot(x="H1GI9", y="H1TO2", data=sub3, kind="bar", ci=None)

plt.xlabel('Race')

plt.ylabel('Age when smoking for the first time')

plt.figure()

scat2 = seaborn.regplot(x="H1TO2", y="H1TO14", data=data)

plt.xlabel('Age when smoking 1st time')

plt.ylabel('Age when drinking beer for 1st time')

plt.title('Scatterplot for the Association Between age when drinking beer and smoking')

plt.figure()

scat2 = seaborn.regplot(x="H1TO2", y="H1TO7", data=data)

plt.xlabel('Age when smoking 1st time')

plt.ylabel('Number of cigarettes')

plt.title('Scatterplot for the Association Between Smoking age and number of cigarettes')

plt.figure()

scat2 = seaborn.regplot(x="H1TO2", y="H1TO30", data=data)

plt.xlabel('Age when smoking 1st time')

plt.ylabel('Age when taking Marijuana 1st time')

plt.title('Scatterplot for the Association Between Smoking age and taking-marijuana age')

# quartile split (use qcut function & ask for 4 groups - gives you quartile split)

print ('quartiles')

#sub3['H1RE6']=pandas.qcut(sub3.H1RE6, 5, labels=["Daily","Weekly","Monthly","Frequently","Never"])

##sub3['H1ED12']=pandas.qcut(sub3.H1ED12, 3, labels=["1=33rd%tile","2=66th%tile","3=100%tile"])

#c10 = sub3['H1RE6'].value_counts(sort=False, dropna=True)

#print(c10)

# bivariate bar graph C->Q

recode1 = {1: "Daily", 2: "Weekly", 3: "Monthly", 4: "Yearly", 5: "NoPray"}

sub3['H1RE6']= sub3['H1RE6'].map(recode1)

plt.figure()

seaborn.factorplot(x='H1RE6', y='H1TO7', data=sub3, kind="bar", ci=None)

plt.xlabel('How often you pray?')

plt.ylabel('How many cigarettes you take?')

The python text results

runfile('X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Week 4.py', wdir='X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory')

Welcome

count 6498.000000

mean 1.560634

std 1.017517

min 1.000000

25% 1.000000

50% 1.000000

75% 2.000000

max 5.000000

Name: H1GI9, dtype: float64

-=-=-=-=-=-=-=-=-=-=-=-

count 6337.000000

mean 9.539214

std 1.668300

min 7.000000

25% 8.000000

50% 10.000000

75% 11.000000

max 12.000000

Name: H1GI20, dtype: float64

-=-=-=-=-=-=-=-=-=-=-=-

count 6406.000000

mean 3.707774

std 6.318714

min 0.000000

25% 0.000000

50% 0.000000

75% 10.000000

max 18.000000

Name: H1TO30, dtype: float64

-=-=-=-=-=-=-=-=-=-=-=-

count 2537.000000

mean 13.363027

std 2.577861

min 1.000000

25% 12.000000

50% 14.000000

75% 15.000000

max 19.000000

Name: H1TO14, dtype: float64

-=-=-=-=-=-=-=-=-=-=-=-

count 1653.000000

mean 6.921355

std 8.178834

min 0.000000

25% 1.000000

50% 4.000000

75% 10.000000

max 89.000000

Name: H1TO7, dtype: float64

-=-=-=-=-=-=-=-=-=-=-=-

count 3553.000000

mean 9.892767

std 5.877938

min 0.000000

25% 7.000000

50% 12.000000

75% 14.000000

max 20.000000

Name: H1TO2, dtype: float64

-=-=-=-=-=-=-=-=-=-=-=-

count 2728.000000

mean 10.085411

std 12.499521

min 0.000000

25% 0.000000

50% 2.000000

75% 25.000000

max 30.000000

Name: H1TO5, dtype: float64

-=-=-=-=-=-=-=-=-=-=-=-

count 6275.000000

mean 2.466614

std 1.176472

min 1.000000

25% 2.000000

50% 2.000000

75% 3.000000

max 5.000000

Name: H1ED12, dtype: float64

-=-=-=-=-=-=-=-=-=-=-=-

count 5614.000000

mean 2.031350

std 1.283485

min 1.000000

25% 1.000000

50% 2.000000

75% 3.000000

max 5.000000

Name: H1RE6, dtype: float64

X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Week 4.py:31: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.

C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py:424: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).

max_open_warning, RuntimeWarning)

-=-=-=-=-=-=-=-=-=-=-=-

quartiles

<matplotlib.figure.Figure at 0x44e0fc50>

C:\Anaconda3\lib\site-packages\matplotlib\collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

if self._edgecolors == str('face'):

<matplotlib.figure.Figure at 0xb29710>

Sunday, November 15, 2015

Week 3: Making Data Management Decisions

Reminder from week 1 and week 2
I have chosen the AddHealth dataset. After looking through its codebook, I became interested in studying the dependencies between teenage drug abuse and the factors that most likely affect it or are affected by it.

Week 3

Primary data analysis through Python/SAS

The chosen language: Python

In this post:

The python code (screenshot)
The formatted output
Results summary and description
The python code (text)
The python raw output

The python code (screenshot):

The formatted output:

Results summary and description:

The MomCare and DadCare variables show that 84.62% of the sample believe that their mother cares very much about them, while only 57.69% say the same about their father. This suggests that the respondents perceive fathers as caring less about their children than mothers do. However, about 5.75% of the sample gave no valid response about their mother, and this share rises to about 30% for fathers.
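Whether a percentage is taken over the valid answers only or over the whole sample changes the figures noticeably when many answers are missing. The sketch below shows both options with pandas `value_counts`; the answers are hypothetical toy values, not AddHealth data.

```python
import numpy
import pandas

# Hypothetical toy answers, NOT AddHealth data; NaN marks a missing or invalid response
answers = pandas.Series(
    ["Much", "Much", "Much", "Medium", "Little", numpy.nan, numpy.nan, "Much"]
)

# Share of each answer among the VALID responses (NaN is excluded by default)
valid_pct = answers.value_counts(normalize=True)

# Share of each answer in the WHOLE sample, with missing values as their own row
total_pct = answers.value_counts(normalize=True, dropna=False)

# Share of the sample that gave no valid response
missing_share = answers.isnull().mean()

print(valid_pct)
print(total_pct)
print("missing:", missing_share)
```

In this toy sample "Much" is two thirds of the valid answers but only half of the whole sample, so it is worth stating which denominator a reported percentage uses.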

The study showed that 73.75% of the sample sleep a sufficient number of hours (7-9 h/d), 15.7% sleep less than the normal amount (<7 h/d), and about 10% sleep more than enough. In other words, about 25.8% of the sample sleep outside the normal range and may need to adjust their sleeping hours for better health.
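The three sleeping categories can also be produced by binning instead of building a dictionary value by value. The following is a minimal sketch with `pandas.cut`, assuming the same boundaries as described above (up to 6 hours, 7 to 9 hours, 10 hours or more); the hours themselves are hypothetical toy values, not AddHealth data.

```python
import pandas

# Hypothetical sleeping hours, NOT AddHealth data
hours = pandas.Series([5, 6, 7, 8, 9, 10, 12])

# Bin edges are right-inclusive by default: (0, 6], (6, 9], (9, 19]
bins = [0, 6, 9, 19]
labels = ["Below Normal sleeping", "Normal sleeping", "Exceeds normal sleeping"]

# Assign every value to its category; values outside the bins become NaN
sleep_category = pandas.cut(hours, bins=bins, labels=labels)

print(sleep_category.value_counts(sort=False))
```

Because the result is an ordered categorical, the frequency table keeps the categories in their logical order rather than in alphabetical order.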

Finally, the study showed that almost 91% of the sample have never tried any illegal drug, while 1.11% first tried one as children, 6.37% as teenagers, and only 0.2% as adults (age 18). This means that the teenage years are the most important stage for protecting our children from becoming addicted to drugs.
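The age groups above depend on two steps: turning the invalid codes into missing values and then mapping each age to a group. The sketch below illustrates both, assuming the invalid codes 96, 98, and 99 used for H1TO40 in this post; the ages are hypothetical toy values, and `age_group` is a helper name introduced only for this illustration.

```python
import numpy
import pandas

# NaN never equals anything, not even itself, so `x != numpy.nan` is always True
print(numpy.nan != numpy.nan)

# Hypothetical answers, NOT AddHealth data; 96, 98 and 99 stand for invalid responses
ages = pandas.Series([0, 12, 14, 17, 18, 96, 98, 99])

# Replace all invalid codes with NaN in a single call
clean = ages.replace([96, 98, 99], numpy.nan)

def age_group(age):
    """Map an age at first use to a group (illustrative helper)."""
    if pandas.isnull(age):
        return numpy.nan
    if age == 0:
        return "Never tried illegal drugs"
    if age < 13:
        return "Child"
    if age < 18:
        return "Teenager"
    return "Adult"

groups = clean.map(age_group)
print(groups.value_counts(dropna=False))
```

Passing the whole list of codes to `replace` avoids comparing each code with NaN, which is an easy place to make a mistake.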

The python code (text):

# -*- coding: utf-8 -*-

"""
Created on Sun Nov 15 11:09:42 2015
@author: Dr. Mohammad Elnesr
"""

import pandas

import numpy

# defining data source...

print ("Welcome")

data = pandas.read_csv('addhealth_pds.csv', low_memory=False)

# printing number of data rows (observations) and columns (variables)

print ('Number of data rows: ', len(data))

print('Number of data columns: ', len(data.columns))

# creating a loop that takes each variable independently

for variable in ["H1WP10","H1WP14","H1GH51","H1TO40"]:

    # convert to numeric; values that cannot be parsed become NaN
    data[variable]=pandas.to_numeric(data[variable], errors='coerce')

#Defining a function that converts any number in a variable to NaN

def ConvertToNaN (Variable, Code1, Code2=numpy.nan, Code3=numpy.nan, Code4=numpy.nan):

    data[Variable]=data[Variable].replace(Code1, numpy.nan)

    for CodeX in [Code2, Code3, Code4]:

        # NaN never compares equal to anything, so test for it with isnull
        if not pandas.isnull(CodeX):

            data[Variable]=data[Variable].replace(CodeX, numpy.nan)

# Applying the correction to each variable: the listed codes are not valid answers and become NaN.

ConvertToNaN ("H1TO40", 96,98,99)

ConvertToNaN ("H1GH51", 96,98)

ConvertToNaN ("H1WP10", 6,7,8)

ConvertToNaN ("H1WP14", 6,7,8,9)

#subset data to This week's variables

sub1=data[['H1TO40','H1GH51','H1WP10','H1WP14']]

#make a copy of my new subsetted data

sub2 = sub1.copy()

"""
New ParentCare variables
MomCare will replace H1WP10, and DadCare Will replace H1WP14
In the original DB, we have 1-5 for 1-Not at all, 2-very little, 3-somewhat, 4-quite a bit, and 5-very much
We will convert them to 3 categories: 1-Little Care, 2- Medium Care, and 3-Much Care
"""

recode1 = {1: '1-Little Care', 2: '1-Little Care', 3: '2- Medium Care', 4: '2- Medium Care', 5: '3-Much Care'}

sub2['MomCare']= sub2['H1WP10'].map(recode1)

sub2['DadCare']= sub2['H1WP14'].map(recode1)

# Recoding number of sleeping hours

recode2 = {}

for i in range (1,7):

    recode2[i]='Below Normal sleeping'

for i in range (7,10):

    recode2[i]='Normal sleeping'

for i in range (10,20):

    recode2[i]='Exceeds normal sleeping'

#print(recode2)

sub2['AreSleepingHoursNormal']= sub2['H1GH51'].map(recode2)

# Recoding the age at which illegal drugs were first tried

recode3 = {0: 'Never tried illegal drugs', 18:'Adult'}#, range(1,13):'Before teenage', range(13,18):'At teenage'

for i in range (1,13):

    recode3[i]='Child'

for i in range (13,18):

    recode3[i]='Teenager'

sub2['TriedDrugsAtWhatAge']= sub2['H1TO40'].map(recode3)

# defining a python dictionary describing the meaning of each variable

dict={"MomCare":"How much your Mother Cares about you?", \

      "DadCare":"How much your Father Cares about you?", \

      "AreSleepingHoursNormal":"Do you have sufficient sleep daily?", \

      "TriedDrugsAtWhatAge":"Have you ever tried illegal drugs? if so then at what age group?"}

for variable in ["MomCare","DadCare","AreSleepingHoursNormal","TriedDrugsAtWhatAge"]:

    # define the frequency distribution
    ct1 = sub2[variable].value_counts(sort=False)

    # define the frequency distribution percent
    pt1 = sub2[variable].value_counts(sort=False, normalize=True)

    # printing results with definitions
    print ("***********************************************")

    print ("Analyzing variable: ", variable)

    print ("...answers the question: ", dict[variable])

    print (ct1)

    print (pt1)

The python raw output:

runfile('X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Program 2.py', wdir='X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory')
Welcome
Number of data rows: 6504
Number of data columns: 2829
***********************************************
Analyzing variable: MomCare
...answers the question: How much your Mother Cares about you?
3-Much Care 5504
2-Medium Care 127
1-Little Care 54
2- Medium Care 445
Name: MomCare, dtype: int64
3-Much Care 0.846248
2-Medium Care 0.019526
1-Little Care 0.008303
2- Medium Care 0.068419
Name: MomCare, dtype: float64
***********************************************
Analyzing variable: DadCare
...answers the question: How much your Father Cares about you?
3-Much Care 3752
2-Medium Care 180
1-Little Care 80
2- Medium Care 535
Name: DadCare, dtype: int64
3-Much Care 0.576876
2-Medium Care 0.027675
1-Little Care 0.012300
2- Medium Care 0.082257
Name: DadCare, dtype: float64
***********************************************
Analyzing variable: AreSleepingHoursNormal
...answers the question: Do you have sufficient sleep daily?
Below Normal sleeping 1024
Exceeds normal sleeping 655
Normal sleeping 4797
Name: AreSleepingHoursNormal, dtype: int64
Below Normal sleeping 0.157442
Exceeds normal sleeping 0.100707
Normal sleeping 0.737546
Name: AreSleepingHoursNormal, dtype: float64
***********************************************
Analyzing variable: TriedDrugsAtWhatAge
...answers the question: Have you ever tried illegal drugs? if so then at what age group?
Adult 13
Child 72
Never tried illegal drugs 5903
Teenager 414
Name: TriedDrugsAtWhatAge, dtype: int64
Adult 0.001999
Child 0.011070
Never tried illegal drugs 0.907595
Teenager 0.063653
Name: TriedDrugsAtWhatAge, dtype: float64
X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Program 2.py:20: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.

data[variable]=data[variable].convert_objects(convert_numeric = True)

Sunday, November 8, 2015

Week 2 assignment: Running Your First Program

Reminder from week 1
I have chosen the AddHealth dataset. After looking through its codebook, I became interested in studying the dependencies between teenage drug abuse and the factors that most likely affect it or are affected by it.

Week 2

Primary data analysis through Python/SAS

The chosen language: Python

In this post:

The python code
The formatted output
Results summary and description
The python raw output

The python code:

# -*- coding: utf-8 -*-
"""
Created on Sun Nov 8 11:21:26 2015

@author: Dr. Mohammad Elnesr
"""

import pandas
#import numpy [NO NEED for NUMPY RIGHTNOW]
# defining data source...
data = pandas.read_csv('addhealth_pds.csv', low_memory=False)

# printing number of data rows (observations) and columns (variables)
print ('Number of data rows: ', len(data))
print('Number of data columns: ', len(data.columns))

# defining a python dictionary describing the meaning of each variable
dict={"H1WP10":"How much you think your mother cares about you?","H1RE4":"How important the religion is to you?","H1TO40":"How old were you when you tried illegal drugs?"}

# creating a loop that takes each variable independently
for variable in ["H1TO40","H1WP10","H1RE4"]:
    # convert to numeric; values that cannot be parsed become NaN
    data[variable]=pandas.to_numeric(data[variable], errors='coerce')
    # define the frequency distribution
    ct1 = data.groupby(variable).size()
    # define the frequency distribution percent
    pt1 = data.groupby(variable).size()*100/len(data)
    # printing results with definitions
    print ("***********************************************")
    print ("Analyzing variable: ", variable)
    print ("...answers the question: ", dict[variable])
    print (ct1)
    print (pt1)

The formatted output:

Results summary and description:

The H1TO40 variable shows that 90.76% of the studied sample never tried any illegal drugs, which is a fairly good ratio. Of the remaining 9.24%, about 7.7% tried at least one type of illegal drug and about 1.5% gave no valid answer (codes 96, 98, and 99); most of those who tried started this bad experience at the age of 14 to 16 years old. Similarly, 84.62% of the students felt that their mother cares about them very much, as shown in Table H1WP10. The role of religion appears clearly in Table H1RE4, where 77.3% said that it is either very important or fairly important to them.

The relationship between these three variables (and other variables) will be discussed in the next week.
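The counts and percentages of one variable are easier to read side by side in a single table. Here is a minimal sketch that builds such a table with pandas, using the whole sample as the denominator as in the program above; the responses are hypothetical toy values, not AddHealth data.

```python
import pandas

# Hypothetical coded responses, NOT AddHealth data
responses = pandas.Series([1, 1, 2, 2, 2, 3, 4, 1])

# Frequency of each code, ordered by the code itself
counts = responses.value_counts().sort_index()

# Percentage of the whole sample for each code
percent = counts * 100 / len(responses)

# One table with both columns
table = pandas.DataFrame({"count": counts, "percent": percent})
print(table)
```

Keeping both columns in one DataFrame also makes it simple to check that the percentages add up to 100 when no values are missing.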

The python raw output:

runfile('X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/FirstProgram.py', wdir='X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory')

Number of data rows: 6504
Number of data columns: 2829
***********************************************
Analyzing variable: H1TO40
...answers the question: How old were you when you tried illegal drugs?
H1TO40
0 5903
1 15
3 6
6 4
9 2
11 12
12 33
13 61
14 85
15 108
16 96
17 64
18 13
96 60
98 36
99 4
dtype: int64
H1TO40
0 90.759533
1 0.230627
3 0.092251
6 0.061501
9 0.030750
11 0.184502
12 0.507380
13 0.937884
14 1.306888
15 1.660517
16 1.476015
17 0.984010
18 0.199877
96 0.922509
98 0.553506
99 0.061501
dtype: float64
***********************************************
Analyzing variable: H1WP10
...answers the question: How much you think your mother cares about you?
H1WP10
1 15
2 39
3 127
4 445
5 5504
6 1
7 370
8 3
dtype: int64
H1WP10
1 0.230627
2 0.599631
3 1.952645
4 6.841943
5 84.624846
6 0.015375
7 5.688807
8 0.046125
dtype: float64
***********************************************
Analyzing variable: H1RE4
...answers the question: How important the religion is to you?
H1RE4
1 2812
2 2218
3 391
4 193
6 3
7 879
8 8
dtype: int64
H1RE4
1 43.234932
2 34.102091
3 6.011685
4 2.967405
6 0.046125
7 13.514760
8 0.123001
dtype: float64

X:/PythonWorkingDirectory/FirstProgram.py:18: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
ct1 = data.groupby(variable).size()