I have chosen the AddHealth dataset. After looking through its codebook, I decided to study the relationships between drug abuse among teenagers and some factors that most likely affect it or are affected by it.
Week 3
Primary data analysis through Python/SAS
The chosen language: Python
In this post:
- The python code (screenshot)
- The formatted output
- Results summary and description
- The python code (text)
- The python raw output
The python code (screenshot):
The formatted output:
Results summary and description:
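The MomCare and DadCare variables show that 84.62% of the sample believe that their mother cares very much about them, while only 57.69% say the same about their father. Part of this gap comes from missing answers rather than from lower ratings: according to the counts in the raw output, about 5.75% of the sample (374 of 6,504) gave no valid response about their mother, and this share rises to about 30% (1,957 of 6,504) for fathers. Among valid responses only, 89.8% rate their mother's care as "very much" and 82.5% their father's, so fathers are still perceived as caring somewhat less than mothers, but by a smaller margin than the whole-sample percentages suggest.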
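The percentages in this paragraph can be recomputed from the counts in the raw output, both as a share of all rows and as a share of valid responses. The following is a minimal sketch: the counts are copied from the raw output, while the helper name `shares` and the constant `TOTAL_ROWS` are illustrative assumptions and not part of the original program.

```python
# Counts copied from the raw output of the program (MomCare and DadCare).
TOTAL_ROWS = 6504
counts = {
    "MomCare": {"3-Much Care": 5504, "2-Medium Care": 127,
                "1-Little Care": 54, "2- Medium Care": 445},
    "DadCare": {"3-Much Care": 3752, "2-Medium Care": 180,
                "1-Little Care": 80, "2- Medium Care": 535},
}


def shares(variable_counts, total_rows):
    """Return three percentages: missing share of all rows,
    'Much Care' share of all rows, 'Much Care' share of valid rows."""
    valid = sum(variable_counts.values())
    much = variable_counts["3-Much Care"]
    missing_pct = 100 * (total_rows - valid) / total_rows
    return missing_pct, 100 * much / total_rows, 100 * much / valid


for name, variable_counts in counts.items():
    missing, of_all, of_valid = shares(variable_counts, TOTAL_ROWS)
    print(f"{name}: missing {missing:.2f}%, "
          f"much care {of_all:.2f}% of all rows, {of_valid:.2f}% of valid rows")
```

Reporting both denominators makes it clear how much of the mother/father difference is due to missing answers.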
The data show that 73.75% of the sample sleep a sufficient number of hours (7-9 h/d), 15.74% sleep less than that (<7 h/d), and about 10% sleep more (10 h/d or more). In other words, roughly a quarter of the sample (about 26%) may need to adjust their sleeping hours for better health.
Finally, the data show that almost 91% of the sample have never tried any illegal drug, while 1.11% first tried one as children (before age 13), 6.37% as teenagers (ages 13-17), and only 0.2% as adults (age 18). This suggests that the teenage years are the most important stage for protecting our children from drug addiction.
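The weight of the teenage group is easier to see when the age groups are expressed as shares of those who reported trying an illegal drug, rather than of the whole sample. A small sketch using the counts from the raw output (the variable names are illustrative and not part of the original program):

```python
# Counts copied from the raw output for TriedDrugsAtWhatAge,
# leaving out 'Never tried illegal drugs'.
tried = {"Child": 72, "Teenager": 414, "Adult": 13}

total_tried = sum(tried.values())

# Share of each age group among respondents who reported trying a drug.
share_of_tried = {group: 100 * n / total_tried for group, n in tried.items()}

for group, pct in share_of_tried.items():
    print(f"{group}: {pct:.2f}% of those who tried")
```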
The python code (text):
# -*- coding: utf-8 -*-
"""
Created on Sun Nov 15 11:09:42 2015
@author: Dr. Mohammad Elnesr
"""
import pandas
import numpy

# defining data source...
print("Welcome")
data = pandas.read_csv('addhealth_pds.csv', low_memory=False)
# printing number of data rows (observations) and columns (variables)
print('Number of data rows: ', len(data))
print('Number of data columns: ', len(data.columns))

# converting each variable of interest to numeric; unparseable entries become NaN
# (pandas.to_numeric replaces convert_objects, which the FutureWarning
# in the raw output flags as deprecated)
for variable in ["H1WP10", "H1WP14", "H1GH51", "H1TO40"]:
    data[variable] = pandas.to_numeric(data[variable], errors='coerce')


# defining a function that converts the given codes in a variable to NaN
# (the codes are passed as extra arguments; comparing a default against
# numpy.nan with != is always True, so that check is dropped)
def ConvertToNaN(Variable, *Codes):
    data[Variable] = data[Variable].replace(list(Codes), numpy.nan)


# applying the correction to each variable, depending on which codes are not meaningful values
ConvertToNaN("H1TO40", 96, 98, 99)
ConvertToNaN("H1GH51", 96, 98)
ConvertToNaN("H1WP10", 6, 7, 8)
ConvertToNaN("H1WP14", 6, 7, 8, 9)

# subset data to this week's variables
sub1 = data[['H1TO40', 'H1GH51', 'H1WP10', 'H1WP14']]
# make a copy of my new subsetted data
sub2 = sub1.copy()

"""
New ParentCare variables
MomCare will replace H1WP10, and DadCare will replace H1WP14
In the original DB, we have 1-5 for 1-Not at all, 2-very little, 3-somewhat, 4-quite a bit, and 5-very much
We will convert them to 3 categories: 1-Little Care, 2- Medium Care, and 3-Much Care
"""
recode1 = {1: '1-Little Care', 2: '1-Little Care', 3: '2- Medium Care', 4: '2- Medium Care', 5: '3-Much Care'}
sub2['MomCare'] = sub2['H1WP10'].map(recode1)
sub2['DadCare'] = sub2['H1WP14'].map(recode1)

# recoding number of sleeping hours (H1GH51)
recode2 = {}
for i in range(1, 7):
    recode2[i] = 'Below Normal sleeping'
for i in range(7, 10):
    recode2[i] = 'Normal sleeping'
for i in range(10, 20):
    recode2[i] = 'Exceeds normal sleeping'
sub2['AreSleepingHoursNormal'] = sub2['H1GH51'].map(recode2)

# recoding age at first illegal drug use (H1TO40, not the sleeping-hours variable H1GH51)
recode3 = {0: 'Never tried illegal drugs', 18: 'Adult'}
for i in range(1, 13):
    recode3[i] = 'Child'
for i in range(13, 18):
    recode3[i] = 'Teenager'
sub2['TriedDrugsAtWhatAge'] = sub2['H1TO40'].map(recode3)

# defining a Python dictionary describing the meaning of each variable
# (named descriptions so that it does not shadow the built-in dict)
descriptions = {
    "MomCare": "How much your Mother Cares about you?",
    "DadCare": "How much your Father Cares about you?",
    "AreSleepingHoursNormal": "Do you have sufficient sleep daily?",
    "TriedDrugsAtWhatAge": "Have you ever tried illegal drugs? if so then at what age group?",
}

for variable in ["MomCare", "DadCare", "AreSleepingHoursNormal", "TriedDrugsAtWhatAge"]:
    # define the frequency distribution
    ct1 = sub2[variable].value_counts(sort=False)
    # define the frequency distribution percent as a share of all rows, missing
    # included, which is what the raw output reports; in current pandas,
    # value_counts(normalize=True) divides by the valid responses only
    pt1 = ct1 / len(sub2)
    # printing results with definitions
    print("***********************************************")
    print("Analyzing variable: ", variable)
    print("...answers the question: ", descriptions[variable])
    print(ct1)
    print(pt1)
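The sleeping-hours recode in the program builds a dictionary with one entry per hour value. As an alternative, the same three groups can be produced with `pandas.cut`. The following is only a sketch under stated assumptions: the `hours` values are made up for illustration and are not AddHealth data, and the bin edges are chosen to mirror the ranges 1-6, 7-9 and 10-19 used in the program.

```python
import numpy as np
import pandas as pd

# Hypothetical sleeping hours, for illustration only (not AddHealth data).
hours = pd.Series([5, 7, 9, 10, 6, np.nan, 12])

labels = ['Below Normal sleeping', 'Normal sleeping', 'Exceeds normal sleeping']

# pandas.cut uses right-inclusive bins by default, so these edges give
# (0, 6], (6, 9] and (9, 19], i.e. 1-6, 7-9 and 10-19 for whole hours.
sleep_group = pd.cut(hours, bins=[0, 6, 9, 19], labels=labels)

# Missing values stay missing and are left out of the counts.
print(sleep_group.value_counts(sort=False))
```

With `pandas.cut`, the bin edges sit in one place and changing a threshold does not require rebuilding a dictionary.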
The python raw output:
Welcome
Number of data rows: 6504
Number of data columns: 2829
***********************************************
Analyzing variable: MomCare
...answers the question: How much your Mother Cares about you?
3-Much Care 5504
2-Medium Care 127
1-Little Care 54
2- Medium Care 445
Name: MomCare, dtype: int64
3-Much Care 0.846248
2-Medium Care 0.019526
1-Little Care 0.008303
2- Medium Care 0.068419
Name: MomCare, dtype: float64
***********************************************
Analyzing variable: DadCare
...answers the question: How much your Father Cares about you?
3-Much Care 3752
2-Medium Care 180
1-Little Care 80
2- Medium Care 535
Name: DadCare, dtype: int64
3-Much Care 0.576876
2-Medium Care 0.027675
1-Little Care 0.012300
2- Medium Care 0.082257
Name: DadCare, dtype: float64
***********************************************
Analyzing variable: AreSleepingHoursNormal
...answers the question: Do you have sufficient sleep daily?
Below Normal sleeping 1024
Exceeds normal sleeping 655
Normal sleeping 4797
Name: AreSleepingHoursNormal, dtype: int64
Below Normal sleeping 0.157442
Exceeds normal sleeping 0.100707
Normal sleeping 0.737546
Name: AreSleepingHoursNormal, dtype: float64
***********************************************
Analyzing variable: TriedDrugsAtWhatAge
...answers the question: Have you ever tried illegal drugs? if so then at what age group?
Adult 13
Child 72
Never tried illegal drugs 5903
Teenager 414
Name: TriedDrugsAtWhatAge, dtype: int64
Adult 0.001999
Child 0.011070
Never tried illegal drugs 0.907595
Teenager 0.063653
Name: TriedDrugsAtWhatAge, dtype: float64
X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Program 2.py:20: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
data[variable]=data[variable].convert_objects(convert_numeric = True)
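The FutureWarning at the end of the raw output names `pd.to_numeric` as one of the replacements for `convert_objects`. A minimal sketch of that conversion follows. The `raw` values are hypothetical, and `errors="coerce"` is used on the assumption that unparseable entries should become NaN, which is what `convert_numeric=True` did.

```python
import pandas as pd

# Hypothetical column as it might arrive from a CSV file:
# numbers stored as text, plus one entry that is not a number.
raw = pd.Series(["7", "8", "refused", "96"])

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric)
```

Codes such as 96 are still valid numbers after this step, so they have to be replaced with NaN separately, as the program does in ConvertToNaN.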

