I have chosen the AddHealth dataset. After looking through its codebook, I decided to study the relationships between drug abuse among teenagers and the factors that most likely affect it or are affected by it.
Week 4
Primary data analysis through Python/SAS
The chosen language: Python
In this post:
- The Python code (screenshot)
- The formatted output
  - 1- Statistics
  - 2- Univariate charts (samples)
  - 3- Bivariate charts (sample)
- Results summary and description
- The Python code (text)
- The Python raw output
The python code (screenshot):
The Formatted output
1- Statistics
2- Sample uni-variate Graphs
3- Sample Bivariate graphs
Summary
The bivariate bar chart above (the full size of each chart is shown above) suggests an inverse relationship between how often a student prays (H1RE6) and the number of cigarettes smoked per day (H1TO7); i.e., the more frequently a student prays, the fewer cigarettes he/she smokes.
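One way to check this reading of the bar chart numerically is to compare the mean number of cigarettes within each prayer category. The following is only a minimal sketch: the toy values are made up (they are not AddHealth data), and the helper name mean_by_group is hypothetical; only the column names H1RE6 and H1TO7 come from the script.

```python
import pandas

# Hypothetical toy data; NOT taken from the AddHealth file
toy = pandas.DataFrame({
    "H1RE6": ["Daily", "Daily", "Weekly", "Weekly", "NoPray", "NoPray"],
    "H1TO7": [2, 4, 5, 7, 10, 12],
})


def mean_by_group(frame, group_col, value_col):
    """Return the mean of value_col within each level of group_col."""
    return frame.groupby(group_col)[value_col].mean()


group_means = mean_by_group(toy, "H1RE6", "H1TO7")
print(group_means)
```

On the real data, the same call on the recoded subset would give one mean per prayer category, which is what the height of each bar represents.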
From the first univariate graph above, we found that most of the sample first tried beer/wine when they were between 13 and 15 years old, i.e., in the early teenage years. A similar trend is found for cigarette smoking, but with lower counts; i.e., more of the teenagers drink beer than smoke.
The scatter plot above showed a negative trend between the age at which a person starts smoking and the number of cigarettes smoked daily: the earlier a person starts to smoke, the more cigarettes they smoke per day.
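The direction of a trend like this can also be summarised with a Pearson correlation coefficient, where a negative value corresponds to a downward-sloping regression line. This is a minimal sketch under stated assumptions: the toy values are invented for illustration (not AddHealth data) and the helper pearson_r is hypothetical; only the column names match the script.

```python
import numpy
import pandas

# Hypothetical toy data; NOT taken from the AddHealth file
toy = pandas.DataFrame({
    "H1TO2": [10, 12, 13, 14, 15, 16, numpy.nan],  # age at first cigarette
    "H1TO7": [20, 15, 10, 8, 5, 2, 3],             # cigarettes per day
})


def pearson_r(frame, x_col, y_col):
    """Pearson correlation of two columns, using only rows where both are present."""
    clean = frame[[x_col, y_col]].dropna()
    return clean[x_col].corr(clean[y_col])


r = pearson_r(toy, "H1TO2", "H1TO7")
print("Pearson r = %.3f" % r)
```

A value of r below zero is consistent with "starting earlier goes with smoking more", but it says nothing about causation.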
The Python code (text)
# -*- coding: utf-8 -*-
"""
Created on Sun Nov 22 14:30:25 2015
@author: Dr. Mohammad Elnesr
"""
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
#import matplotlib.pyplot as plt2
# defining data source...
print ("Welcome")
data = pandas.read_csv('addhealth_pds.csv', low_memory=False)
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
dict={}
#setting variables you will be working with to numeric
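def ConvertToNumeric (Variable):
    # convert_objects is deprecated (see the FutureWarning in the raw output);
    # to_numeric with errors='coerce' turns non-numeric entries into NaN
    data[Variable] = pandas.to_numeric(data[Variable], errors='coerce')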
#Defining a function that converts any number in a variable to NaN
def ConvertToNaN (Variable, Code1, Code2=numpy.nan, Code3=numpy.nan, Code4=numpy.nan):
    data[Variable]=data[Variable].replace(Code1, numpy.nan)
    for CodeX in [Code2, Code3, Code4]:
        # NaN never compares equal to anything (not even itself),
        # so unused codes are detected with isnull instead of !=
        if not pandas.isnull(CodeX):
            data[Variable]=data[Variable].replace(CodeX, numpy.nan)
def PrepareVariable(Variable, Definition, Code1, Code2=numpy.nan, Code3=numpy.nan, Code4=numpy.nan):
    ConvertToNumeric (Variable)
    ConvertToNaN (Variable, Code1, Code2, Code3, Code4)
    dict[Variable] = Definition
# Cleaning each variable by recoding its invalid or missing-value codes to NaN.
PrepareVariable("H1GI9", "What is your racial background?", 6,8)
PrepareVariable("H1GI20", "In what grade are you?", 96,97,98,99)
PrepareVariable("H1TO30", "How old were you when you tried marijuana for the first time? ", 96,98,99)
PrepareVariable("H1TO14", "How old were you when you tried beer for the first time? ", 96,97,98)
PrepareVariable("H1TO7", "How many cigarettes did you smoke each day?", 96,97,98)
PrepareVariable("H1TO2", "How old were you when you tried cigarettes for the first time? ", 96,97,98)
PrepareVariable("H1TO5", "How many days did you smoke cigarettes?", 96,97,98)
PrepareVariable("H1ED12", "What is your grade in mathematics?", 96,97,98, 6)
PrepareVariable("H1RE6", "How often do you pray?", 6,7,8)
MyVariables = ['H1GI9','H1GI20','H1TO30','H1TO14','H1TO7','H1TO2','H1TO5', 'H1ED12', 'H1RE6']
#subset data to This week's variables
sub1=data[MyVariables]
#make a copy of my new subsetted data
sub2 = sub1.copy()
sub3 = sub1.copy()
for Variable in MyVariables:
    desc = sub2[Variable].describe()
    print (desc)
    print ("-=-=-=-=-=-=-=-=-=-=-=-")
    plt.figure()
    seaborn.distplot(sub2[Variable].dropna(), kde=False);
    plt.xlabel(dict[Variable])
    plt.title(dict[Variable] + " Distribution plot of "+Variable)
    sub2[Variable] = sub2[Variable].astype('category')
    plt.figure()
    seaborn.countplot(x=Variable, data=sub2);
    plt.xlabel(dict[Variable])
    plt.title(dict[Variable] + " Count plot of "+Variable)
# show must be called with parentheses, otherwise nothing happens
plt.show()
# bivariate bar graph C->Q
# factorplot was renamed catplot in seaborn 0.9; it creates its own figure,
# so no plt.figure() call is needed before it
seaborn.catplot(x="H1GI9", y="H1TO2", data=sub3, kind="bar", ci=None)
plt.xlabel('Race')
plt.ylabel('Age when smoking for the first time')
plt.figure()
scat2 = seaborn.regplot(x="H1TO2", y="H1TO14", data=data)
plt.xlabel('Age when smoking 1st time')
plt.ylabel('Age when drinking beer for 1st time')
plt.title('Scatterplot for the Association Between age when drinking beer and smoking')
plt.figure()
scat2 = seaborn.regplot(x="H1TO2", y="H1TO7", data=data)
plt.xlabel('Age when smoking 1st time')
plt.ylabel('Number of cigarettes')
plt.title('Scatterplot for the Association Between Smoking age and number of cigarettes')
plt.figure()
scat2 = seaborn.regplot(x="H1TO2", y="H1TO30", data=data)
plt.xlabel('Age when smoking 1st time')
plt.ylabel('Age when taking Marijuana 1st time')
plt.title('Scatterplot for the Association Between Smoking age and taking-marijuana age')
# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print ('quartiles')
#sub3['H1RE6']=pandas.qcut(sub3.H1RE6, 5, labels=["Daily","Weekly","Monthly","Frequently","Never"])
##sub3['H1ED12']=pandas.qcut(sub3.H1ED12, 3, labels=["1=33rd%tile","2=66th%tile","3=100%tile"])
#c10 = sub3['H1RE6'].value_counts(sort=False, dropna=True)
#print(c10)
# bivariate bar graph C->Q
recode1 = {1: "Daily", 2: "Weekly", 3: "Monthly", 4: "Yearly", 5: "NoPray"}
sub3['H1RE6']= sub3['H1RE6'].map(recode1)
# catplot (formerly factorplot) creates its own figure
seaborn.catplot(x='H1RE6', y='H1TO7', data=sub3, kind="bar", ci=None)
plt.xlabel('How often do you pray?')
plt.ylabel('How many cigarettes do you smoke each day?')
plt.show()
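A comparison such as `CodeX != numpy.nan` cannot tell whether an optional code was left at its default, because NaN compares unequal to everything, including itself. The sketch below shows that behaviour and then a list-based alternative to the ConvertToNaN helper above. The function codes_to_nan and the toy values are hypothetical and are not part of the original script.

```python
import numpy
import pandas

# NaN is not equal to anything, including itself
print(numpy.nan != numpy.nan)   # True


def codes_to_nan(series, codes):
    """Return a copy of series with every value listed in codes replaced by NaN."""
    return series.replace(list(codes), numpy.nan)


# Hypothetical toy values; 96, 97 and 98 stand for missing-value codes
raw = pandas.Series([14, 96, 13, 98, 15, 97])
cleaned = codes_to_nan(raw, [96, 97, 98])
print(cleaned)
```

Passing the codes as one list avoids the need for a fixed number of optional arguments, so a variable with two codes and one with five are handled the same way.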
The python text results
runfile('X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Week 4.py', wdir='X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory')
Welcome
count 6498.000000
mean 1.560634
std 1.017517
min 1.000000
25% 1.000000
50% 1.000000
75% 2.000000
max 5.000000
Name: H1GI9, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count 6337.000000
mean 9.539214
std 1.668300
min 7.000000
25% 8.000000
50% 10.000000
75% 11.000000
max 12.000000
Name: H1GI20, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count 6406.000000
mean 3.707774
std 6.318714
min 0.000000
25% 0.000000
50% 0.000000
75% 10.000000
max 18.000000
Name: H1TO30, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count 2537.000000
mean 13.363027
std 2.577861
min 1.000000
25% 12.000000
50% 14.000000
75% 15.000000
max 19.000000
Name: H1TO14, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count 1653.000000
mean 6.921355
std 8.178834
min 0.000000
25% 1.000000
50% 4.000000
75% 10.000000
max 89.000000
Name: H1TO7, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count 3553.000000
mean 9.892767
std 5.877938
min 0.000000
25% 7.000000
50% 12.000000
75% 14.000000
max 20.000000
Name: H1TO2, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count 2728.000000
mean 10.085411
std 12.499521
min 0.000000
25% 0.000000
50% 2.000000
75% 25.000000
max 30.000000
Name: H1TO5, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count 6275.000000
mean 2.466614
std 1.176472
min 1.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 5.000000
Name: H1ED12, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count 5614.000000
mean 2.031350
std 1.283485
min 1.000000
25% 1.000000
50% 2.000000
75% 3.000000
max 5.000000
Name: H1RE6, dtype: float64
X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Week 4.py:31: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py:424: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
max_open_warning, RuntimeWarning)
-=-=-=-=-=-=-=-=-=-=-=-
quartiles
<matplotlib.figure.Figure at 0x44e0fc50>

C:\Anaconda3\lib\site-packages\matplotlib\collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
if self._edgecolors == str('face'):
<matplotlib.figure.Figure at 0xb29710>
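The RuntimeWarning in the output above appears because every plt.figure() call leaves a figure open until it is closed explicitly. One possible way to avoid it, sketched here under assumptions, is to close each figure as soon as it has been rendered. The helper render_histogram and the toy values are hypothetical, and the Agg backend is chosen only so that the sketch runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt


def render_histogram(values):
    """Draw a histogram, return it as PNG bytes, and close the figure."""
    fig, ax = plt.subplots()
    ax.hist(values)
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png")
    plt.close(fig)  # release the figure so open figures do not pile up
    return buffer.getvalue()


# 25 plots in a row, more than the default warning threshold of 20 open figures
images = [render_histogram([1, 2, 2, 3, 3, 3]) for _ in range(25)]
print(len(images), "images rendered;", len(plt.get_fignums()), "figures still open")
```

In an interactive session the figures would normally be shown rather than saved, so closing them after viewing (or calling plt.close('all') at the end) serves the same purpose.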