Sunday, November 22, 2015

Week 4: Creating graphs for your data

Reminder from weeks 1, 2, and 3
           I have chosen the AddHealth dataset. After looking through its codebook, I decided to study the relationships between teenage drug abuse and some factors that most likely affect it or are affected by it.


Week 4

Primary data analysis through Python/SAS

The chosen language: Python



In this post:

  • The python code (screenshot)

  • The formatted output

    • 1- Statistics
    • 2- Univariate charts (samples)
    • 3- Bivariate Charts (sample)

  • Results summary and description
  • The python code (text)
  • The python raw output

The python code (screenshot):


The Formatted output

1- Statistics

2- Sample univariate graphs





3- Sample Bivariate graphs


Summary

The bivariate chart above
(THE FULL SIZE OF EACH CHART IS ABOVE)
shows an inverse relationship between how often a student prays and the number of cigarettes smoked per day, i.e., the more frequently the student prays, the fewer cigarettes he/she smokes.

From the first univariate graph above, we found that most of the sample first tried beer/wine when they were between 13 and 15 years old, which is the early teenage period. A similar trend is found for cigarette smoking, but with lower counts, i.e., the teenagers drink beer more than they smoke.

The scatter plot above showed that the earlier a person starts to smoke, the more cigarettes they smoke daily, as the relationship between the two variables shows a negative trend.
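
The trends read from the charts can also be checked numerically. The following is only a minimal sketch on a small made-up table (the values are hypothetical and are NOT taken from AddHealth); it reuses the variable names of this post and shows how the group means behind the bar chart and the sign of the scatter-plot trend could be computed.

```python
import pandas as pd

# Hypothetical toy data for illustration only; NOT real AddHealth values.
toy = pd.DataFrame({
    "H1RE6": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],             # prayer code (1 = most frequent)
    "H1TO2": [16, 15, 15, 14, 14, 13, 13, 12, 12, 11],   # age at first cigarette
    "H1TO7": [1, 2, 2, 4, 5, 6, 8, 9, 10, 12],           # cigarettes smoked per day
})

# Mean cigarettes per day within each prayer-frequency group
# (this is the quantity a bar chart of H1TO7 by H1RE6 displays).
mean_by_prayer = toy.groupby("H1RE6")["H1TO7"].mean()

# Pearson correlation between starting age and daily cigarettes;
# a negative value corresponds to a downward-sloping regression line.
r = toy["H1TO2"].corr(toy["H1TO7"])

print(mean_by_prayer)
print("Pearson r:", round(r, 3))
```

On the real data, the same two calls would be applied to the cleaned data frame after the missing-value codes have been replaced by NaN.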


The Python code (text)


# -*- coding: utf-8 -*-
"""
Created on Sun Nov 22 14:30:25 2015

@author: Dr. Mohammad Elnesr
"""

import pandas
import numpy
import seaborn
import matplotlib.pyplot as  plt
#import matplotlib.pyplot as  plt2

# defining data source...
print ("Welcome")
data = pandas.read_csv('addhealth_pds.csv', low_memory=False)


#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)

dict={}

# Setting the variables you will be working with to numeric
def ConvertToNumeric (Variable):
    # pandas.to_numeric replaces the deprecated convert_objects; non-numeric entries become NaN
    data[Variable] = pandas.to_numeric(data[Variable], errors='coerce')


# Defining a function that converts up to four codes in a variable to NaN
def ConvertToNaN (Variable, Code1, Code2=numpy.nan, Code3=numpy.nan, Code4=numpy.nan):
    for CodeX in [Code1, Code2, Code3, Code4]:
        # NaN never equals itself, so unused codes are detected with pandas.notnull
        if pandas.notnull(CodeX):
            data[Variable] = data[Variable].replace(CodeX, numpy.nan)


# Converting a variable to numeric, blanking its codes, and storing its definition
def PrepareVariable(Variable, Definition, Code1, Code2=numpy.nan, Code3=numpy.nan, Code4=numpy.nan):
    ConvertToNumeric (Variable)
    ConvertToNaN (Variable, Code1, Code2, Code3, Code4)
    dict[Variable] = Definition


# Applying the correction to each variable, depending on which codes carry no valid answer.
PrepareVariable("H1GI9", "What is your racial background?", 6,8)
PrepareVariable("H1GI20", "In what grade are you?",  96,97,98,99)
PrepareVariable("H1TO30", "How old were you when you tried marijuana for the first time? ", 96,98,99)
PrepareVariable("H1TO14", "How old were you when you tried beer for the first time? ", 96,97,98)
PrepareVariable("H1TO7", "How many cigarettes did you smoke each day?", 96,97,98)
PrepareVariable("H1TO2", "How old were you when you tried cigarettes for the first time? ", 96,97,98)
PrepareVariable("H1TO5", "How many days did you smoke cigarettes?", 96,97,98)
PrepareVariable("H1ED12", "What is your grade in mathematics?", 96,97,98, 6)
PrepareVariable("H1RE6", "How often do you pray?", 6,7,8)

MyVariables = ['H1GI9','H1GI20','H1TO30','H1TO14','H1TO7','H1TO2','H1TO5', 'H1ED12', 'H1RE6']
#subset data to This week's variables
sub1=data[MyVariables]
#make a copy of my new subsetted data
sub2 = sub1.copy()
sub3 = sub1.copy()




for Variable in MyVariables:
    desc = sub2[Variable].describe()
    print (desc)
    print ("-=-=-=-=-=-=-=-=-=-=-=-")
    
    plt.figure()
    seaborn.distplot(sub2[Variable].dropna(), kde=False);
    plt.xlabel(dict[Variable])
    plt.title(dict[Variable] + " Distribution plot of "+Variable)
    
    sub2[Variable] = sub2[Variable].astype('category')
    plt.figure()
    seaborn.countplot(x=Variable, data=sub2);
    plt.xlabel(dict[Variable])
    plt.title(dict[Variable] + " Count plot of "+Variable)

    plt.show()

# bivariate bar graph C->Q
plt.figure()
#sub2['H1GI9'] = sub2['H1GI9'].convert_objects(convert_numeric=True)
#sub2['H1TO2'] = sub2['H1TO2'].convert_objects(convert_numeric=True)

seaborn.factorplot(x="H1GI9", y="H1TO2", data=sub3, kind="bar", ci=None)
plt.xlabel('Race')
plt.ylabel('Age when smoking for the first time')

plt.figure()
scat2 = seaborn.regplot(x="H1TO2", y="H1TO14", data=data)
plt.xlabel('Age when smoking 1st time')
plt.ylabel('Age when drinking beer for 1st time')
plt.title('Scatterplot for the Association Between age when drinking beer and smoking')
plt.figure()
scat2 = seaborn.regplot(x="H1TO2", y="H1TO7", data=data)
plt.xlabel('Age when smoking 1st time')
plt.ylabel('Number of cigarettes')
plt.title('Scatterplot for the Association Between Smoking age and number of cigarettes')
plt.figure()
scat2 = seaborn.regplot(x="H1TO2", y="H1TO30", data=data)
plt.xlabel('Age when smoking 1st time')
plt.ylabel('Age when taking Marijuana 1st time')
plt.title('Scatterplot for the Association Between Smoking age and taking-marijuana age')


# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print ('quartiles')
#sub3['H1RE6']=pandas.qcut(sub3.H1RE6, 5, labels=["Daily","Weekly","Monthly","Frequently","Never"])
##sub3['H1ED12']=pandas.qcut(sub3.H1ED12, 3, labels=["1=33rd%tile","2=66th%tile","3=100%tile"])
#c10 = sub3['H1RE6'].value_counts(sort=False, dropna=True)
#print(c10)
# bivariate bar graph C->Q
recode1 = {1: "Daily", 2: "Weekly", 3: "Monthly", 4: "Yearly", 5: "NoPray"}
sub3['H1RE6']= sub3['H1RE6'].map(recode1)
plt.figure()
seaborn.factorplot(x='H1RE6', y='H1TO7', data=sub3, kind="bar", ci=None)
plt.xlabel('How often do you pray?')
plt.ylabel('How many cigarettes do you smoke each day?')
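
A note on the missing-value handling in the listing above: NaN never compares equal to anything, including itself, so a test such as `CodeX != numpy.nan` is always true and cannot tell whether an optional code was supplied. The sketch below shows the behaviour and one possible alternative that accepts any number of codes. The helper name `replace_codes_with_nan` and the toy values are hypothetical and are not part of the original program.

```python
import numpy as np
import pandas as pd

# NaN is not equal to itself, so "!=" cannot be used to detect it.
print(np.nan != np.nan)    # True
print(pd.isnull(np.nan))   # True

def replace_codes_with_nan(series, *codes):
    """Hypothetical helper: return a copy of `series` in which every
    value listed in `codes` is replaced by NaN."""
    # Series.where keeps values where the condition holds, NaN elsewhere.
    return series.where(~series.isin(list(codes)))

# Toy answers in which 6 and 8 stand for "no valid answer".
toy = pd.Series([1, 2, 6, 8, 3])
cleaned = replace_codes_with_nan(toy, 6, 8)
print(cleaned)
```

Passing the codes as a variable-length argument list avoids the fixed `Code1` to `Code4` parameters, at the cost of a slightly different call signature.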


The python text results

runfile('X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Week 4.py', wdir='X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory')
Welcome
count   6498.000000
mean       1.560634
std        1.017517
min        1.000000
25%        1.000000
50%        1.000000
75%        2.000000
max        5.000000
Name: H1GI9, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   6337.000000
mean       9.539214
std        1.668300
min        7.000000
25%        8.000000
50%       10.000000
75%       11.000000
max       12.000000
Name: H1GI20, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   6406.000000
mean       3.707774
std        6.318714
min        0.000000
25%        0.000000
50%        0.000000
75%       10.000000
max       18.000000
Name: H1TO30, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   2537.000000
mean      13.363027
std        2.577861
min        1.000000
25%       12.000000
50%       14.000000
75%       15.000000
max       19.000000
Name: H1TO14, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   1653.000000
mean       6.921355
std        8.178834
min        0.000000
25%        1.000000
50%        4.000000
75%       10.000000
max       89.000000
Name: H1TO7, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   3553.000000
mean       9.892767
std        5.877938
min        0.000000
25%        7.000000
50%       12.000000
75%       14.000000
max       20.000000
Name: H1TO2, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   2728.000000
mean      10.085411
std       12.499521
min        0.000000
25%        0.000000
50%        2.000000
75%       25.000000
max       30.000000
Name: H1TO5, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   6275.000000
mean       2.466614
std        1.176472
min        1.000000
25%        2.000000
50%        2.000000
75%        3.000000
max        5.000000
Name: H1ED12, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   5614.000000
mean       2.031350
std        1.283485
min        1.000000
25%        1.000000
50%        2.000000
75%        3.000000
max        5.000000
Name: H1RE6, dtype: float64
X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Week 4.py:31: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  
C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py:424: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

-=-=-=-=-=-=-=-=-=-=-=-
quartiles


















<matplotlib.figure.Figure at 0x44e0fc50>

C:\Anaconda3\lib\site-packages\matplotlib\collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):



<matplotlib.figure.Figure at 0xb29710>



Sunday, November 15, 2015

Week 3: Making Data Management Decisions

Reminder from week 1 and week 2
           I have chosen the AddHealth dataset. After looking through its codebook, I decided to study the relationships between teenage drug abuse and some factors that most likely affect it or are affected by it.


Week 3

Primary data analysis through Python/SAS

The chosen language: Python



In this post:

  • The python code (screenshot)
  • The formatted output
  • Results summary and description
  • The python code (text)
  • The python raw output


The python code (screenshot):








The formatted output:

Results summary and description:

     It is noticed from the MomCare and DadCare variables that 84.62% of the sample believe that their mother cares very much about them, while only 57.69% say the same about their father. This suggests that the respondents perceive less care from their fathers than from their mothers. However, about 5.75% of the sample gave no valid response about their mother, and this share rises to about 30% for fathers. Among valid responses only, the "very much" shares are about 89.8% for mothers and 82.5% for fathers, so the gap is much smaller than the raw percentages suggest.

The study showed that 73.75% of the sample sleep a sufficient number of hours (7-9 h/d), 15.74% sleep less than the normal amount (<7 h/d), and about 10% sleep more than enough (>9 h/d). That is, about 25.8% of the sample need to adjust their sleeping hours for better health.


Finally, the study showed that almost 91% of the sample have never tried any illegal drug, while 1.11% first tried one as children (under 13 y), 6.37% as teenagers (13-17 y), and only 0.2% as adults (18 y). This means that the teenage stage is the most important stage in which to take care of our children so that they do not become addicted to any drug.
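
The share of respondents without a valid answer can be obtained directly rather than by subtracting the printed percentages. The sketch below rebuilds the MomCare column from the counts in the raw output of this post (with the two "Medium Care" rows merged, 127 + 445 = 572) and asks pandas to count the missing values as a category of their own. Rebuilding the column from counts is only a device to keep the example self-contained; on the real data the last three statements would be applied to `sub2['MomCare']`.

```python
import numpy as np
import pandas as pd

# Counts copied from the raw output of this post; 6504 is the number of data rows.
total_rows = 6504
counts = {"3-Much Care": 5504, "2-Medium Care": 572, "1-Little Care": 54}
n_missing = total_rows - sum(counts.values())

# Rebuild a column with one entry per respondent, NaN for no valid answer.
labels = []
for label, n in counts.items():
    labels.extend([label] * n)
labels.extend([np.nan] * n_missing)
mom_care = pd.Series(labels, dtype="object")

# dropna=False keeps NaN as its own row, so the shares add up to 1.
shares = mom_care.value_counts(normalize=True, dropna=False)
missing_share = mom_care.isnull().mean()

print((shares * 100).round(2))
print("No valid response (%):", round(missing_share * 100, 2))
```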


The python code (text):


# -*- coding: utf-8 -*-
"""
Created on Sun Nov 15 11:09:42 2015
@author: Dr. Mohammad Elnesr
"""

import pandas
import numpy
# defining data source...
print ("Welcome")
data = pandas.read_csv('addhealth_pds.csv', low_memory=False)
# printing number of data rows (observations) and columns (variables)
print ('Number of data rows: ', len(data))
print('Number of data columns: ', len(data.columns))
# creating a loop that takes each variable independently
for variable in ["H1WP10","H1WP14","H1GH51","H1TO40"]:
    # pandas.to_numeric replaces the deprecated convert_objects; non-numeric entries become NaN
    data[variable]=pandas.to_numeric(data[variable], errors='coerce')
#Defining a function that converts up to four codes in a variable to NaN
def ConvertToNaN (Variable, Code1, Code2=numpy.nan, Code3=numpy.nan, Code4=numpy.nan):
    for CodeX in [Code1, Code2, Code3, Code4]:
        # NaN never equals itself, so unused codes are detected with pandas.notnull
        if pandas.notnull(CodeX):
            data[Variable]=data[Variable].replace(CodeX, numpy.nan)

# Applying the correction to each variable, depending on which codes carry no valid answer.
ConvertToNaN ("H1TO40", 96,98,99)
ConvertToNaN ("H1GH51", 96,98)
ConvertToNaN ("H1WP10", 6,7,8)
ConvertToNaN ("H1WP14", 6,7,8,9)
#subset data to This week's variables
sub1=data[['H1TO40','H1GH51','H1WP10','H1WP14']]
#make a copy of my new subsetted data
sub2 = sub1.copy()
"""
New ParentCare variables
MomCare will replace H1WP10, and DadCare Will replace H1WP14
In the original DB, we have 1-5 for 1-Not at all, 2-very little, 3-somewhat, 4-quite a bit, and 5-very much
We will convert them to 3 categories: 1-Little Care, 2- Medium Care, and 3-Much Care
"""

recode1 = {1: '1-Little Care', 2: '1-Little Care', 3: '2- Medium Care', 4: '2- Medium Care', 5: '3-Much Care'}
sub2['MomCare']= sub2['H1WP10'].map(recode1)
sub2['DadCare']= sub2['H1WP14'].map(recode1)
# Recoding number of sleeping hours
recode2 = {}
for i in range (1,7):
    recode2[i]='Below Normal sleeping'
for i in range (7,10):
    recode2[i]='Normal sleeping'
for i in range (10,20):
    recode2[i]='Exceeds normal sleeping'
#print(recode2)
sub2['AreSleepingHoursNormal']= sub2['H1GH51'].map(recode2)
# Recoding the age at which illegal drugs were first tried
recode3 = {0: 'Never tried illegal drugs', 18:'Adult'}
for i in range (1,13):
    recode3[i]='Child'
for i in range (13,18):
    recode3[i]='Teenager'

# H1TO40 holds the age at first drug use (H1GH51 is the sleeping-hours variable)
sub2['TriedDrugsAtWhatAge']= sub2['H1TO40'].map(recode3)

# defining a python dictionary describing the meaning of each variable
dict={"MomCare":"How much your Mother Cares about you?", \
      "DadCare":"How much your Father Cares about you?", \
      "AreSleepingHoursNormal":"Do you have sufficient sleep daily?", \
      "TriedDrugsAtWhatAge":"Have you ever tried illegal drugs? if so then at what age group?"}

for variable in ["MomCare","DadCare","AreSleepingHoursNormal","TriedDrugsAtWhatAge"]:
    # define the frequency distribution
    ct1 = sub2[variable].value_counts(sort=False)
    # define the frequency distribution percent
    pt1 = sub2[variable].value_counts(sort=False, normalize=True)
    # printing results with definitions
    print ("***********************************************")
    print ("Analyzing variable: ", variable)
    print ("...answers the question: ", dict[variable])
    print (ct1)
    print (pt1)
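
The sleeping-hours recode in the listing above builds a dictionary with three loops. As a possible alternative, `pandas.cut` can assign the same three groups from a list of bin edges. This is only a sketch on hypothetical hours, not the code that produced the output of this post; the bin edges are chosen to match the ranges 1-6, 7-9, and 10-19 used above for whole-hour answers.

```python
import pandas as pd

# Hypothetical sleeping hours for illustration; None stands for a missing answer.
hours = pd.Series([4, 6, 7, 9, 10, 15, None])

# Bins are right-closed by default: (0, 6], (6, 9], (9, 19]
sleep_group = pd.cut(
    hours,
    bins=[0, 6, 9, 19],
    labels=["Below Normal sleeping", "Normal sleeping", "Exceeds normal sleeping"],
)

print(sleep_group.value_counts(sort=False))
```

Missing answers stay missing after the cut, so they do not end up in any of the three groups.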

The python raw output:

runfile('X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Program 2.py', wdir='X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory')
Welcome
Number of data rows:  6504
Number of data columns:  2829
***********************************************
Analyzing variable:  MomCare
...answers the question:  How much your Mother Cares about you?
3-Much Care       5504
2-Medium Care      127
1-Little Care       54
2- Medium Care     445
Name: MomCare, dtype: int64
3-Much Care       0.846248
2-Medium Care     0.019526
1-Little Care     0.008303
2- Medium Care    0.068419
Name: MomCare, dtype: float64
***********************************************
Analyzing variable:  DadCare
...answers the question:  How much your Father Cares about you?
3-Much Care       3752
2-Medium Care      180
1-Little Care       80
2- Medium Care     535
Name: DadCare, dtype: int64
3-Much Care       0.576876
2-Medium Care     0.027675
1-Little Care     0.012300
2- Medium Care    0.082257
Name: DadCare, dtype: float64
***********************************************
Analyzing variable:  AreSleepingHoursNormal
...answers the question:  Do you have sufficient sleep daily?
Below Normal sleeping      1024
Exceeds normal sleeping     655
Normal sleeping            4797
Name: AreSleepingHoursNormal, dtype: int64
Below Normal sleeping      0.157442
Exceeds normal sleeping    0.100707
Normal sleeping            0.737546
Name: AreSleepingHoursNormal, dtype: float64
***********************************************
Analyzing variable:  TriedDrugsAtWhatAge
...answers the question:  Have you ever tried illegal drugs? if so then at what age group?
Adult                          13
Child                          72
Never tried illegal drugs    5903
Teenager                      414
Name: TriedDrugsAtWhatAge, dtype: int64
Adult                        0.001999
Child                        0.011070
Never tried illegal drugs    0.907595
Teenager                     0.063653
Name: TriedDrugsAtWhatAge, dtype: float64
X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Program 2.py:20: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.

  data[variable]=data[variable].convert_objects(convert_numeric = True)



Sunday, November 8, 2015

Week 2 assignment: Running Your First Program

Reminder from week 1
           I have chosen the AddHealth dataset. After looking through its codebook, I decided to study the relationships between teenage drug abuse and some factors that most likely affect it or are affected by it.


Week 2

Primary data analysis through Python/SAS

The chosen language: Python

In this post:

  • The python code
  • The formatted output
  • Results summary and description
  • The python raw output


The python code:

# -*- coding: utf-8 -*-
"""
Created on Sun Nov  8 11:21:26 2015

@author: Dr. Mohammad Elnesr
"""

import pandas
#import numpy [NO NEED for NUMPY RIGHTNOW]
# defining data source...
data = pandas.read_csv('addhealth_pds.csv', low_memory=False)

# printing number of data rows (observations) and columns (variables)
print ('Number of data rows: ', len(data))
print('Number of data columns: ', len(data.columns))

# defining a python dictionary describing the meaning of each variable
dict={"H1WP10":"How much you think your mother cares about you?",
      "H1RE4":"How important the religion is to you?",
      "H1TO40":"How old were you when you tried illegal drugs?"}

# creating a loop that takes each variable independently
for variable in ["H1TO40","H1WP10","H1RE4"]:
    # pandas.to_numeric replaces the deprecated convert_objects; non-numeric entries become NaN
    data[variable]=pandas.to_numeric(data[variable], errors='coerce')
    # define the frequency distribution 
    ct1 = data.groupby(variable).size()
    # define the frequency distribution percent
    pt1 = data.groupby(variable).size()*100/len(data)
    # printing results with definitions
    print ("***********************************************")
    print ("Analyzing variable: ", variable)
    print ("...answers the question: ", dict[variable])
    print (ct1)
    print (pt1)
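
The FutureWarning in the raw output of this post recommends `pd.to_numeric` in place of `convert_objects`. The short sketch below, on hypothetical raw values, shows what that converter does with entries that are not numbers when `errors='coerce'` is passed: they become NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical raw column as it might be read from a CSV file (all strings).
raw = pd.Series(["0", "14", " ", "96", "abc"])

# errors="coerce" turns anything that cannot be parsed into NaN.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric)
print("Unparseable entries:", int(numeric.isnull().sum()))
```

Codes such as 96 are still valid numbers at this stage; they have to be set to missing in a separate step.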

The formatted output:


Results summary and description:

     It is noticed from the H1TO40 variable that 90.76% of the studied sample never tried any illegal drugs, which is a fairly good ratio. However, about 7.67% of the students tried at least one type of illegal drug (a further 1.5% or so answered with a non-response code: 96, 98, or 99), and most of them started this bad experience at the age of 14 to 16 years old. With a close ratio, 84.62% of the students felt that their mother cares about them very much, as shown in Table H1WP10. The role of religion appears clearly in Table H1RE4, where 77.3% said that it is either very important or fairly important to them.

    The relationship between these three variables (and other variables) will be discussed next week.
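
The percentages in the summary can be re-derived from the H1TO40 frequency table in the raw output of this post. The sketch below copies those counts into a dictionary and separates real ages (1 to 18) from the value 0 (never tried) and from the codes 96, 98, and 99. It is plain arithmetic on the printed counts, not a rerun of the program.

```python
# Counts copied from the H1TO40 table in the raw output of this post.
counts = {0: 5903, 1: 15, 3: 6, 6: 4, 9: 2, 11: 12, 12: 33, 13: 61, 14: 85,
          15: 108, 16: 96, 17: 64, 18: 13, 96: 60, 98: 36, 99: 4}
total_rows = 6504  # number of data rows reported by the program

# Respondents who gave a real age, i.e. who tried an illegal drug.
tried = sum(n for age, n in counts.items() if 1 <= age <= 18)
# Of those, the ones who started at 14, 15, or 16.
started_14_to_16 = sum(counts[age] for age in (14, 15, 16))

pct_never = 100 * counts[0] / total_rows
pct_tried = 100 * tried / total_rows
pct_14_to_16_of_tried = 100 * started_14_to_16 / tried

print("Never tried (%):", round(pct_never, 2))
print("Tried (%):", round(pct_tried, 2))
print("Started at 14-16, among those who tried (%):", round(pct_14_to_16_of_tried, 1))
```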

The python raw output:

runfile('X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/FirstProgram.py', wdir='X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory')

Number of data rows:  6504
Number of data columns:  2829
***********************************************
Analyzing variable:  H1TO40
...answers the question:  How old were you when you tried illegal drugs?
H1TO40
0     5903
1       15
3        6
6        4
9        2
11      12
12      33
13      61
14      85
15     108
16      96
17      64
18      13
96      60
98      36
99       4
dtype: int64
H1TO40
0     90.759533
1      0.230627
3      0.092251
6      0.061501
9      0.030750
11     0.184502
12     0.507380
13     0.937884
14     1.306888
15     1.660517
16     1.476015
17     0.984010
18     0.199877
96     0.922509
98     0.553506
99     0.061501
dtype: float64
***********************************************
Analyzing variable:  H1WP10
...answers the question:  How much you think your mother cares about you?
H1WP10
1      15
2      39
3     127
4     445
5    5504
6       1
7     370
8       3
dtype: int64
H1WP10
1     0.230627
2     0.599631
3     1.952645
4     6.841943
5    84.624846
6     0.015375
7     5.688807
8     0.046125
dtype: float64
***********************************************
Analyzing variable:  H1RE4
...answers the question:  How important the religion is to you?
H1RE4
1    2812
2    2218
3     391
4     193
6       3
7     879
8       8
dtype: int64
H1RE4
1    43.234932
2    34.102091
3     6.011685
4     2.967405
6     0.046125
7    13.514760
8     0.123001
dtype: float64

X:/PythonWorkingDirectory/FirstProgram.py:18: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  ct1 = data.groupby(variable).size()