Data Visualization Assignments: Week 3: Making Data Management Decisions

Reminder from Week 1 and Week 2
I have chosen the AddHealth dataset. After looking through its codebook, I became interested in studying the relationship between teenage drug abuse and some factors that most likely affect it or are affected by it.

Week 3

Primary data analysis through Python/SAS

The chosen language: Python

In this post:

The python code (screenshot)
The formatted output
Results summary and description
The python code (text)
The python raw output

The python code (screenshot):

The formatted output:

Results summary and description:

The MomCare and DadCare variables show that 84.62% of the sample believe that their mother cares very much about them, while only 57.69% say the same about their father. In other words, respondents perceive less care from fathers than from mothers, although part of the gap comes from missing answers: counting both "Medium Care" rows of the raw output, about 5.8% of the sample gave no valid response about their mother, and this share rises to about 30% for fathers.
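The share of missing responses can be cross-checked against the counts in the raw output at the end of this post. The following is a minimal sketch: the counts are copied from that output, both spellings of "Medium Care" are treated as valid answers, and the helper name missing_share is only illustrative.

```python
# Counts copied from the raw output at the end of the post.
TOTAL_ROWS = 6504

mom_counts = {"3-Much Care": 5504, "2-Medium Care": 127,
              "1-Little Care": 54, "2- Medium Care": 445}
dad_counts = {"3-Much Care": 3752, "2-Medium Care": 180,
              "1-Little Care": 80, "2- Medium Care": 535}


def missing_share(counts, total=TOTAL_ROWS):
    """Percentage of rows without a valid category (hypothetical helper)."""
    valid = sum(counts.values())
    return 100.0 * (total - valid) / total


mom_missing = missing_share(mom_counts)
dad_missing = missing_share(dad_counts)
print(round(mom_missing, 2))  # 5.75
print(round(dad_missing, 2))  # 30.09
```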

The study showed that 73.75% of the sample sleep a sufficient number of hours (7-9 h/d), 15.7% sleep less than the normal amount (<7 h/d), and about 10% sleep more than enough (>9 h/d). In total, about 25.8% of the sample need to adjust their sleeping hours for better health.
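The three sleeping groups can also be built with pandas.cut instead of a hand-written dictionary. This is only a sketch of an alternative, not the code used for the results above; the sample values are made up for illustration, and the bin edges are chosen to reproduce the 1-6, 7-9 and 10-19 hour groups.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of nightly sleeping hours (NaN = no valid answer).
hours = pd.Series([5, 6, 7, 8, 9, 10, 12, np.nan])

# Bins are right-inclusive: (0, 6], (6, 9], (9, 19].
sleep_group = pd.cut(
    hours,
    bins=[0, 6, 9, 19],
    labels=["Below Normal sleeping", "Normal sleeping", "Exceeds normal sleeping"],
)

# Missing answers stay NaN and are left out of the counts by default.
print(sleep_group.value_counts(sort=False))
```

With this approach a change of thresholds only requires editing the bins list.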

Finally, the study showed that almost 91% of the sample have never tried any illegal drug, while 1.11% first tried one as children (before age 13), 6.37% as teenagers (13-17 y), and only 0.2% as adults (18 y). This suggests that the teenage years are the most important stage in which to take care of our children so that they do not become addicted to any drug.

The python code (text):

# -*- coding: utf-8 -*-

"""
Created on Sun Nov 15 11:09:42 2015
@author: Dr. Mohammad Elnesr
"""

import pandas

import numpy

# defining data source...

print ("Welcome")

data = pandas.read_csv('addhealth_pds.csv', low_memory=False)

# printing number of data rows (observations) and columns (variables)

print ('Number of data rows: ', len(data))

print('Number of data columns: ', len(data.columns))

# creating a loop that takes each variable independently

for variable in ["H1WP10","H1WP14","H1GH51","H1TO40"]:

    # to_numeric replaces the deprecated convert_objects; unparseable values become NaN
    data[variable]=pandas.to_numeric(data[variable], errors='coerce')

#Defining a function that converts up to four codes in a variable to NaN

def ConvertToNaN (Variable, Code1, Code2=numpy.nan, Code3=numpy.nan, Code4=numpy.nan):

    data[Variable]=data[Variable].replace(Code1, numpy.nan)

    for CodeX in [Code2, Code3, Code4]:

        # NaN never equals itself, so test with isnull rather than "!="
        if not pandas.isnull(CodeX):

            data[Variable]=data[Variable].replace(CodeX, numpy.nan)

# Applying the correction of each variable depending on the values that make nonsense.

ConvertToNaN ("H1TO40", 96,98,99)

ConvertToNaN ("H1GH51", 96,98)

ConvertToNaN ("H1WP10", 6,7,8)

ConvertToNaN ("H1WP14", 6,7,8,9)

#subset data to This week's variables

sub1=data[['H1TO40','H1GH51','H1WP10','H1WP14']]

#make a copy of my new subsetted data

sub2 = sub1.copy()

"""
New ParentCare variables
MomCare will replace H1WP10, and DadCare Will replace H1WP14
In the original DB, we have 1-5 for 1-Not at all, 2-very little, 3-somewhat, 4-quite a bit, and 5-very much
We will convert them to 3 categories: 1-Little Care, 2- Medium Care, and 3-Much Care
"""

recode1 = {1: '1-Little Care', 2: '1-Little Care', 3: '2- Medium Care', 4: '2- Medium Care', 5: '3-Much Care'}

sub2['MomCare']= sub2['H1WP10'].map(recode1)

sub2['DadCare']= sub2['H1WP14'].map(recode1)

# Recoding number of sleeping hours

recode2 = {}

for i in range (1,7):

    recode2[i]='Below Normal sleeping'

for i in range (7,10):

    recode2[i]='Normal sleeping'

for i in range (10,20):

    recode2[i]='Exceeds normal sleeping'

#print(recode2)

sub2['AreSleepingHoursNormal']= sub2['H1GH51'].map(recode2)

# Recoding the age at which illegal drugs were first tried (only age 18 is mapped to 'Adult')

recode3 = {0: 'Never tried illegal drugs', 18:'Adult'}

for i in range (1,13):

    recode3[i]='Child'

for i in range (13,18):

    recode3[i]='Teenager'

# the age variable is H1TO40 (H1GH51 holds the sleeping hours)
sub2['TriedDrugsAtWhatAge']= sub2['H1TO40'].map(recode3)

# defining a python dictionary describing the meaning of each variable
# (named "descriptions" so that it does not shadow the built-in dict)

descriptions={"MomCare":"How much your Mother Cares about you?",

      "DadCare":"How much your Father Cares about you?",

      "AreSleepingHoursNormal":"Do you have sufficient sleep daily?",

      "TriedDrugsAtWhatAge":"Have you ever tried illegal drugs? if so then at what age group?"}

for variable in ["MomCare","DadCare","AreSleepingHoursNormal","TriedDrugsAtWhatAge"]:

    # define the frequency distribution
    ct1 = sub2[variable].value_counts(sort=False)

    # define the frequency distribution percent
    pt1 = sub2[variable].value_counts(sort=False, normalize=True)

    # printing results with definitions
    print ("***********************************************")

    print ("Analyzing variable: ", variable)

    print ("...answers the question: ", descriptions[variable])

    print (ct1)

    print (pt1)
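A note on checking whether an optional missing-value code was supplied: NaN never compares equal to anything, including itself, so a comparison such as x != numpy.nan is true for every x, and pandas.isnull is the reliable test. The sketch below demonstrates this and shows a shorter variant of the helper that accepts any number of codes. The name convert_to_nan and the toy data are assumptions made for illustration, not part of the assignment code.

```python
import numpy as np
import pandas as pd

# NaN is not equal to itself, so "!=" cannot detect it.
print(np.nan != np.nan)   # True
print(pd.isnull(np.nan))  # True


def convert_to_nan(frame, column, *codes):
    """Return frame[column] with every listed code replaced by NaN
    (hypothetical helper)."""
    return frame[column].replace(list(codes), np.nan)


# Hypothetical toy data, not taken from AddHealth.
toy = pd.DataFrame({"H1TO40": [0, 15, 96, 98, 99, 18]})
toy["H1TO40"] = convert_to_nan(toy, "H1TO40", 96, 98, 99)
print(toy["H1TO40"].isnull().sum())  # 3
```

Passing the codes as one list lets pandas replace them in a single call, so no loop or default NaN arguments are needed.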

The python raw output:

runfile('X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Program 2.py', wdir='X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory')
Welcome
Number of data rows: 6504
Number of data columns: 2829
***********************************************
Analyzing variable: MomCare
...answers the question: How much your Mother Cares about you?
3-Much Care 5504
2-Medium Care 127
1-Little Care 54
2- Medium Care 445
Name: MomCare, dtype: int64
3-Much Care 0.846248
2-Medium Care 0.019526
1-Little Care 0.008303
2- Medium Care 0.068419
Name: MomCare, dtype: float64
***********************************************
Analyzing variable: DadCare
...answers the question: How much your Father Cares about you?
3-Much Care 3752
2-Medium Care 180
1-Little Care 80
2- Medium Care 535
Name: DadCare, dtype: int64
3-Much Care 0.576876
2-Medium Care 0.027675
1-Little Care 0.012300
2- Medium Care 0.082257
Name: DadCare, dtype: float64
***********************************************
Analyzing variable: AreSleepingHoursNormal
...answers the question: Do you have sufficient sleep daily?
Below Normal sleeping 1024
Exceeds normal sleeping 655
Normal sleeping 4797
Name: AreSleepingHoursNormal, dtype: int64
Below Normal sleeping 0.157442
Exceeds normal sleeping 0.100707
Normal sleeping 0.737546
Name: AreSleepingHoursNormal, dtype: float64
***********************************************
Analyzing variable: TriedDrugsAtWhatAge
...answers the question: Have you ever tried illegal drugs? if so then at what age group?
Adult 13
Child 72
Never tried illegal drugs 5903
Teenager 414
Name: TriedDrugsAtWhatAge, dtype: int64
Adult 0.001999
Child 0.011070
Never tried illegal drugs 0.907595
Teenager 0.063653
Name: TriedDrugsAtWhatAge, dtype: float64
X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Program 2.py:20: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.

data[variable]=data[variable].convert_objects(convert_numeric = True)
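The FutureWarning above recommends pd.to_numeric as the replacement for convert_objects. A minimal sketch of that conversion on made-up data follows; the toy column only borrows the name H1GH51 for illustration.

```python
import pandas as pd

# Hypothetical toy column: numbers stored as text, with one blank entry.
toy = pd.DataFrame({"H1GH51": ["8", "7", " ", "10"]})

# errors="coerce" turns anything that cannot be parsed into NaN.
toy["H1GH51"] = pd.to_numeric(toy["H1GH51"], errors="coerce")
print(toy["H1GH51"].tolist())  # [8.0, 7.0, nan, 10.0]
```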

Sunday, November 15, 2015
