Sunday, November 15, 2015

Week 3: Making Data Management Decisions

Reminder from Week 1 and Week 2
           I have chosen the AddHealth dataset. After looking through its codebook, I decided to study the relationship between drug abuse among teenagers and some factors that most likely affect it or are affected by it.


Week 3

Primary data analysis through Python/SAS

The chosen language: Python



In this post:

  • The python code (screenshot)
  • The formatted output
  • Results summary and description
  • The python code (text)
  • The python raw output


The python code (screenshot):








The formatted output:

Results summary and description:

     The MomCare and DadCare variables show that 84.62% of the sample believe that their mother cares very much about them, while only 57.69% say the same about their father. At first sight this suggests that fathers care less about their children than mothers do. However, according to the counts in the raw output, about 5.75% of the sample (374 of 6,504) gave no valid response about the mother, and this share rises to about 30% (1,957 of 6,504) for the father, so part of the gap comes from missing answers rather than from lower ratings.
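As a cross-check on these percentages, the following minimal sketch recomputes the shares from the counts printed in the raw output of this post. It merges the two differently spelled "Medium Care" labels and reports the "Much Care" share both over the whole sample and over valid responses only. The helper name `summarize` is made up for this illustration.

```python
# Counts copied from the raw output in this post.
# The two spellings of "Medium Care" are added together.
TOTAL_ROWS = 6504
mom_counts = {"3-Much Care": 5504, "2-Medium Care": 127 + 445, "1-Little Care": 54}
dad_counts = {"3-Much Care": 3752, "2-Medium Care": 180 + 535, "1-Little Care": 80}


def summarize(counts, total_rows):
    """Return missing share and 'Much Care' shares (hypothetical helper)."""
    valid = sum(counts.values())          # rows with a usable answer
    missing = total_rows - valid          # rows without a usable answer
    much = counts["3-Much Care"]
    return {
        "valid": valid,
        "missing_pct": round(100 * missing / total_rows, 2),
        "much_care_pct_of_sample": round(100 * much / total_rows, 2),
        "much_care_pct_of_valid": round(100 * much / valid, 2),
    }


mom_summary = summarize(mom_counts, TOTAL_ROWS)
dad_summary = summarize(dad_counts, TOTAL_ROWS)
print("Mother:", mom_summary)
print("Father:", dad_summary)
```

With these counts, the "Much Care" share among valid responses is about 89.8% for mothers and about 82.5% for fathers, a smaller gap than the whole-sample figures show.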

The study showed that 73.75% of the sample sleep a sufficient number of hours (7-9 h/d), 15.74% sleep less than the normal amount (<7 h/d), and about 10% sleep more than the normal amount (>9 h/d). In total, about 25.8% of the sample sleep outside the normal range and may need to adjust their sleeping hours for better health.
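The three sleeping groups can also be built with `pandas.cut` instead of a recoding dictionary. The sketch below is only an illustration on a toy column, not on the AddHealth data; the names `hours` and `binned` are hypothetical, and the bin edges assume the same grouping as in this post (1-6, 7-9, and 10-19 hours).

```python
import pandas

# Toy sleeping hours (not AddHealth data); the last entry is a missing value.
hours = pandas.Series([5, 6, 7, 9, 10, float("nan")])

# Intervals are closed on the right by default: (0, 6], (6, 9], (9, 19]
binned = pandas.cut(
    hours,
    bins=[0, 6, 9, 19],
    labels=["Below Normal sleeping", "Normal sleeping", "Exceeds normal sleeping"],
)

# Missing hours stay missing after binning.
print(binned.value_counts(sort=False))
```

Because the bins are ordered categories, the frequency table comes out in a natural order (below, normal, exceeds) rather than alphabetically.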


Finally, the study showed that almost 91% of the sample have never tried any illegal drug, while 1.11% first tried one as children (under 13), 6.37% as teenagers (13-17), and only 0.2% as adults (18 y). This suggests that the teenage years are the most important stage in which to protect our children from drug use.
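To see how strongly the teenage group dominates, the shares can be recomputed among only those respondents who reported trying a drug. This is a small sketch using the counts from the raw output of this post; the variable names are chosen for the illustration.

```python
# Counts copied from the raw output in this post (TriedDrugsAtWhatAge).
tried_counts = {"Child": 72, "Teenager": 414, "Adult": 13}

# Everyone who reported a first-use age group
total_tried = sum(tried_counts.values())

# Share of each age group among those who tried, in percent
share_of_triers = {
    group: round(100 * count / total_tried, 2)
    for group, count in tried_counts.items()
}

print("Respondents who tried:", total_tried)
for group, share in share_of_triers.items():
    print(group, share, "%")
```

Under these counts, roughly 83% of the respondents who tried an illegal drug did so as teenagers.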


The python code (text):


# -*- coding: utf-8 -*-
"""
Created on Sun Nov 15 11:09:42 2015
@author: Dr. Mohammad Elnesr
"""

import pandas
import numpy
# defining data source...
print ("Welcome")
data = pandas.read_csv('addhealth_pds.csv', low_memory=False)
# printing number of data rows (observations) and columns (variables)
print ('Number of data rows: ', len(data))
print('Number of data columns: ', len(data.columns))
# converting each variable of interest to numeric, one at a time
# (pandas.to_numeric replaces the deprecated convert_objects; unparseable entries become NaN)
for variable in ["H1WP10","H1WP14","H1GH51","H1TO40"]:
    data[variable]=pandas.to_numeric(data[variable], errors='coerce')
# Defining a function that converts the given codes in a variable to NaN
def ConvertToNaN (Variable, *Codes):
    # replace() accepts a list, so every code is converted in one call
    data[Variable]=data[Variable].replace(list(Codes), numpy.nan)
           
# Converting the codes that carry no meaningful value to NaN, variable by variable.
ConvertToNaN ("H1TO40", 96,98,99)
ConvertToNaN ("H1GH51", 96,98)
ConvertToNaN ("H1WP10", 6,7,8)
ConvertToNaN ("H1WP14", 6,7,8,9)
#subset data to This week's variables
sub1=data[['H1TO40','H1GH51','H1WP10','H1WP14']]
#make a copy of my new subsetted data
sub2 = sub1.copy()
"""
New ParentCare variables
MomCare will replace H1WP10, and DadCare Will replace H1WP14
In the original DB, we have 1-5 for 1-Not at all, 2-very little, 3-somewhat, 4-quite a bit, and 5-very much
We will convert them to 3 categories: 1-Little Care, 2-Medium Care, and 3-Much Care
"""

# One spelling per label, so that value_counts() cannot split a category in two
recode1 = {1: '1-Little Care', 2: '1-Little Care', 3: '2-Medium Care', 4: '2-Medium Care', 5: '3-Much Care'}
sub2['MomCare']= sub2['H1WP10'].map(recode1)
sub2['DadCare']= sub2['H1WP14'].map(recode1)
# Recoding number of sleeping hours
recode2 = {}
for i in range (1,7):
    recode2[i]='Below Normal sleeping'
for i in range (7,10):
    recode2[i]='Normal sleeping'
for i in range (10,20):
    recode2[i]='Exceeds normal sleeping'
#print(recode2)
sub2['AreSleepingHoursNormal']= sub2['H1GH51'].map(recode2)
# Recoding the age at which illegal drugs were first tried
recode3 = {0: 'Never tried illegal drugs', 18: 'Adult'}
for i in range (1,13):
    recode3[i]='Child'
for i in range (13,18):
    recode3[i]='Teenager'

# Map the drug-age variable (H1TO40), not the sleeping-hours variable (H1GH51)
sub2['TriedDrugsAtWhatAge']= sub2['H1TO40'].map(recode3)

# defining a Python dictionary describing the meaning of each variable
# (named "descriptions" so that the built-in dict type is not shadowed)
descriptions={"MomCare":"How much your Mother Cares about you?",
      "DadCare":"How much your Father Cares about you?",
      "AreSleepingHoursNormal":"Do you have sufficient sleep daily?",
      "TriedDrugsAtWhatAge":"Have you ever tried illegal drugs? if so then at what age group?"}

for variable in ["MomCare","DadCare","AreSleepingHoursNormal","TriedDrugsAtWhatAge"]:
    # define the frequency distribution
    ct1 = sub2[variable].value_counts(sort=False)
    # define the frequency distribution percent
    pt1 = sub2[variable].value_counts(sort=False, normalize=True)
    # printing results with definitions
    print ("***********************************************")
    print ("Analyzing variable: ", variable)
    print ("...answers the question: ", descriptions[variable])
    print (ct1)
    print (pt1)
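A note on the percentages: in current pandas releases, `value_counts(normalize=True)` leaves missing values out by default, so the shares are taken over valid answers only. Passing `dropna=False` keeps missing values in the table, and the shares are then taken over all rows. The sketch below shows the difference on a toy column (not AddHealth data); the variable names are only illustrative.

```python
import pandas

# Toy answers (not AddHealth data); None stands for a missing response.
answers = pandas.Series(["3-Much Care", "3-Much Care", None, "1-Little Care"])

# Counts including the missing response as its own row
counts_all = answers.value_counts(dropna=False)

# Shares over all rows (missing included in the denominator)
shares_all = answers.value_counts(normalize=True, dropna=False)

# Shares over valid answers only (the default, dropna=True)
shares_valid = answers.value_counts(normalize=True)

# Share of rows without a valid answer
missing_share = answers.isna().mean()

print(counts_all)
print(shares_all)
print(shares_valid)
print("Missing share:", missing_share)
```

The shares in the raw output of this post sum to less than 1 for every variable, which is consistent with percentages taken over all 6,504 rows rather than over valid answers only.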

The python raw output:

runfile('X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Program 2.py', wdir='X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory')
Welcome
Number of data rows:  6504
Number of data columns:  2829
***********************************************
Analyzing variable:  MomCare
...answers the question:  How much your Mother Cares about you?
3-Much Care       5504
2-Medium Care      127
1-Little Care       54
2- Medium Care     445
Name: MomCare, dtype: int64
3-Much Care       0.846248
2-Medium Care     0.019526
1-Little Care     0.008303
2- Medium Care    0.068419
Name: MomCare, dtype: float64
***********************************************
Analyzing variable:  DadCare
...answers the question:  How much your Father Cares about you?
3-Much Care       3752
2-Medium Care      180
1-Little Care       80
2- Medium Care     535
Name: DadCare, dtype: int64
3-Much Care       0.576876
2-Medium Care     0.027675
1-Little Care     0.012300
2- Medium Care    0.082257
Name: DadCare, dtype: float64
***********************************************
Analyzing variable:  AreSleepingHoursNormal
...answers the question:  Do you have sufficient sleep daily?
Below Normal sleeping      1024
Exceeds normal sleeping     655
Normal sleeping            4797
Name: AreSleepingHoursNormal, dtype: int64
Below Normal sleeping      0.157442
Exceeds normal sleeping    0.100707
Normal sleeping            0.737546
Name: AreSleepingHoursNormal, dtype: float64
***********************************************
Analyzing variable:  TriedDrugsAtWhatAge
...answers the question:  Have you ever tried illegal drugs? if so then at what age group?
Adult                          13
Child                          72
Never tried illegal drugs    5903
Teenager                      414
Name: TriedDrugsAtWhatAge, dtype: int64
Adult                        0.001999
Child                        0.011070
Never tried illegal drugs    0.907595
Teenager                     0.063653
Name: TriedDrugsAtWhatAge, dtype: float64
X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Program 2.py:20: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.

  data[variable]=data[variable].convert_objects(convert_numeric = True)
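The FutureWarning at the end of the raw output recommends `pd.to_numeric` in place of `convert_objects`. A minimal sketch of that replacement on a toy column (not AddHealth data) might look like this; `errors="coerce"` turns entries that cannot be parsed into NaN, and the names `raw` and `converted` are only illustrative.

```python
import pandas

# Toy column read as text (not AddHealth data); "abc" cannot be parsed as a number.
raw = pandas.Series(["1", "5", "abc", "98"])

# Convert to numbers; unparseable entries become NaN instead of raising an error.
converted = pandas.to_numeric(raw, errors="coerce")

print(converted)
print("Unparseable entries:", int(converted.isna().sum()))
```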


