Data Visualization Assignments: Week 4- Creating graphs for your data

Reminder from week 1, week 2, and week 3
I have chosen the AddHealth dataset. After looking through its codebook, I decided to study the dependencies between teenage drug abuse and some of the factors that most likely affect it or are affected by it.

Week 4

Primary data analysis through Python/SAS

The chosen language: Python

In this post:

The Python code (screenshot)

The formatted output

1- Statistics

2- Univariate charts (samples)

3- Bivariate charts (sample)

Results summary and description

The Python code (text)

The Python raw output

The Python code (screenshot):

The Formatted output

1- Statistics

2- Sample uni-variate Graphs

3- Sample Bivariate graphs

Summary

(The full-size version of each chart is shown above.)

The bivariate chart above shows an inverse relationship between how often students pray and the number of cigarettes they smoke per day: the more frequently a student prays, the fewer cigarettes he or she smokes.
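The comparison behind that bar chart can also be read as a small table of group means. The following is a minimal sketch, assuming the bar heights are per-group means (seaborn's default estimator for bar plots). The data frame and its column names (pray_code, cigs_per_day) are hypothetical toy values, not AddHealth records.

```python
import pandas

# Hypothetical toy data, NOT taken from AddHealth:
# pray_code runs from 1 (daily) to 5 (never); cigs_per_day is a count.
toy_pray = pandas.DataFrame({
    "pray_code": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "cigs_per_day": [1, 3, 2, 4, 4, 6, 5, 9, 8, 12],
})

# Mean cigarettes per day within each prayer-frequency group.
group_means = toy_pray.groupby("pray_code")["cigs_per_day"].mean()
print(group_means)
```

In this toy example the group mean rises as prayer becomes less frequent, which is the pattern the chart describes. With the real data, the same groupby call on the H1RE6 and H1TO7 columns would give the numbers behind the bars.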

The first univariate graph above shows that most of the sample first tried beer or wine between the ages of 13 and 15, that is, in the early teenage years. A similar trend appears for cigarette smoking, but with lower counts, which suggests that the teenagers drink beer more than they smoke.

The scatter plot above shows that the earlier a person starts to smoke, the more cigarettes he or she smokes daily: the relationship between the two variables has a negative trend.
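One way to put a number on a negative trend like this is the Pearson correlation coefficient, which is negative when one variable tends to fall as the other rises. This is only a sketch on hypothetical toy values; the column names start_age and cigs_per_day are placeholders, and the result says nothing about the AddHealth data itself.

```python
import pandas

# Hypothetical toy data, NOT taken from AddHealth.
toy_smoke = pandas.DataFrame({
    "start_age": [10, 11, 12, 13, 14, 15, 16, 17],
    "cigs_per_day": [20, 15, 16, 10, 8, 9, 4, 2],
})

# Series.corr computes the Pearson coefficient by default.
r = toy_smoke["start_age"].corr(toy_smoke["cigs_per_day"])
print("Pearson r = %.3f" % r)
```

A coefficient below zero matches a downward-sloping regression line in a scatter plot. On the real data, rows with missing values in either column are skipped by Series.corr.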

The Python code (text)

# -*- coding: utf-8 -*-
"""
Created on Sun Nov 22 14:30:25 2015

@author: Dr. Mohammad Elnesr
"""

import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt

# Define the data source
print("Welcome")
data = pandas.read_csv('addhealth_pds.csv', low_memory=False)

# Set pandas to show all columns in a DataFrame
pandas.set_option('display.max_columns', None)
# Set pandas to show all rows in a DataFrame
pandas.set_option('display.max_rows', None)
# Bug fix for display formats to avoid run-time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# Maps each variable code to its question text
# (note: this name shadows the built-in dict)
dict = {}

# Convert a variable to numeric; entries that cannot be parsed become NaN.
# pandas.to_numeric replaces the deprecated convert_objects call.
def ConvertToNumeric(Variable):
    data[Variable] = pandas.to_numeric(data[Variable], errors='coerce')

# Replace up to four survey codes in a variable with NaN
def ConvertToNaN(Variable, Code1, Code2=numpy.nan, Code3=numpy.nan, Code4=numpy.nan):
    data[Variable] = data[Variable].replace(Code1, numpy.nan)
    for CodeX in [Code2, Code3, Code4]:
        # NaN never equals itself, so test for it with isnull rather than !=
        if not pandas.isnull(CodeX):
            data[Variable] = data[Variable].replace(CodeX, numpy.nan)

# Convert a variable to numeric, blank out its missing-data codes,
# and record its question text
def PrepareVariable(Variable, Definition, Code1, Code2=numpy.nan, Code3=numpy.nan, Code4=numpy.nan):
    ConvertToNumeric(Variable)
    ConvertToNaN(Variable, Code1, Code2, Code3, Code4)
    dict[Variable] = Definition

# Apply the correction to each variable: codes that carry no meaningful value become NaN
PrepareVariable("H1GI9", "What is your racial background?", 6, 8)
PrepareVariable("H1GI20", "In what grade are you?", 96, 97, 98, 99)
PrepareVariable("H1TO30", "How old were you when you tried marijuana for the first time?", 96, 98, 99)
PrepareVariable("H1TO14", "How old were you when you tried beer for the first time?", 96, 97, 98)
PrepareVariable("H1TO7", "How many cigarettes did you smoke each day?", 96, 97, 98)
PrepareVariable("H1TO2", "How old were you when you tried cigarettes for the first time?", 96, 97, 98)
PrepareVariable("H1TO5", "How many days did you smoke cigarettes?", 96, 97, 98)
PrepareVariable("H1ED12", "What is your grade in mathematics?", 96, 97, 98, 6)
PrepareVariable("H1RE6", "How often do you pray?", 6, 7, 8)

MyVariables = ['H1GI9', 'H1GI20', 'H1TO30', 'H1TO14', 'H1TO7', 'H1TO2', 'H1TO5', 'H1ED12', 'H1RE6']

# Subset the data to this week's variables
sub1 = data[MyVariables]

# Make copies of the subset
sub2 = sub1.copy()
sub3 = sub1.copy()

# For each variable: print summary statistics, then draw a histogram and a count plot
for Variable in MyVariables:
    desc = sub2[Variable].describe()
    print(desc)
    print("-=-=-=-=-=-=-=-=-=-=-=-")

    plt.figure()
    # distplot is the call available at the time of writing;
    # newer seaborn releases deprecate it in favor of histplot
    seaborn.distplot(sub2[Variable].dropna(), kde=False)
    plt.xlabel(dict[Variable])
    plt.title(dict[Variable] + " Distribution plot of " + Variable)

    sub2[Variable] = sub2[Variable].astype('category')
    plt.figure()
    seaborn.countplot(x=Variable, data=sub2)
    plt.xlabel(dict[Variable])
    plt.title(dict[Variable] + " Count plot of " + Variable)

# plt.show must be called with parentheses to display the figures
plt.show()

# Bivariate bar graph C->Q: race against age at first cigarette
# (newer seaborn releases renamed factorplot to catplot)
plt.figure()
seaborn.factorplot(x="H1GI9", y="H1TO2", data=sub3, kind="bar", ci=None)
plt.xlabel('Race')
plt.ylabel('Age when smoking for the first time')

# Scatter plot Q->Q: age at first cigarette against age at first beer
plt.figure()
scat1 = seaborn.regplot(x="H1TO2", y="H1TO14", data=data)
plt.xlabel('Age when smoking 1st time')
plt.ylabel('Age when drinking beer for 1st time')
plt.title('Scatterplot for the Association Between age when drinking beer and smoking')

# Scatter plot Q->Q: age at first cigarette against cigarettes per day
plt.figure()
scat2 = seaborn.regplot(x="H1TO2", y="H1TO7", data=data)
plt.xlabel('Age when smoking 1st time')
plt.ylabel('Number of cigarettes')
plt.title('Scatterplot for the Association Between Smoking age and number of cigarettes')

# Scatter plot Q->Q: age at first cigarette against age at first marijuana use
plt.figure()
scat3 = seaborn.regplot(x="H1TO2", y="H1TO30", data=data)
plt.xlabel('Age when smoking 1st time')
plt.ylabel('Age when taking Marijuana 1st time')
plt.title('Scatterplot for the Association Between Smoking age and taking-marijuana age')

# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
# The qcut calls below are disabled; only the label is printed.
print('quartiles')
#sub3['H1RE6']=pandas.qcut(sub3.H1RE6, 5, labels=["Daily","Weekly","Monthly","Frequently","Never"])
##sub3['H1ED12']=pandas.qcut(sub3.H1ED12, 3, labels=["1=33rd%tile","2=66th%tile","3=100%tile"])
#c10 = sub3['H1RE6'].value_counts(sort=False, dropna=True)
#print(c10)

# Bivariate bar graph C->Q: prayer frequency against cigarettes smoked per day
recode1 = {1: "Daily", 2: "Weekly", 3: "Monthly", 4: "Yearly", 5: "NoPray"}
sub3['H1RE6'] = sub3['H1RE6'].map(recode1)

plt.figure()
seaborn.factorplot(x='H1RE6', y='H1TO7', data=sub3, kind="bar", ci=None)
plt.xlabel('How often do you pray?')
plt.ylabel('How many cigarettes do you smoke each day?')

# Display the remaining figures
plt.show()
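The cleaning step in the script turns survey codes such as 96, 97, and 98 into missing values before any statistics are computed. The sketch below shows the same idea on a small hypothetical series, together with one detail that matters when writing such helpers: NaN never compares equal to anything, including itself. The function name clean_codes and the sample values are assumptions made for illustration, not part of the original script.

```python
import numpy
import pandas

# NaN is not equal to itself, so `value != numpy.nan` is always True
# and cannot be used to detect an unused (NaN) argument.
print(numpy.nan != numpy.nan)    # True
print(pandas.isnull(numpy.nan))  # True


def clean_codes(series, missing_codes):
    """Hypothetical helper: coerce to numeric, then set the listed codes to NaN."""
    numeric = pandas.to_numeric(series, errors="coerce")
    return numeric.replace(list(missing_codes), numpy.nan)


# Hypothetical raw answers, stored as text as they might arrive from a CSV file.
raw = pandas.Series(["12", "14", "unknown", "96", "98", "15"])
cleaned = clean_codes(raw, [96, 97, 98])
print(cleaned.tolist())
```

Passing the codes as a list avoids the need for NaN default arguments altogether, which is one possible design; the script above keeps the original four-argument form.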

The Python text results

runfile('X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Week 4.py', wdir='X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory')

Welcome
count   6498.000000
mean       1.560634
std        1.017517
min        1.000000
25%        1.000000
50%        1.000000
75%        2.000000
max        5.000000
Name: H1GI9, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   6337.000000
mean       9.539214
std        1.668300
min        7.000000
25%        8.000000
50%       10.000000
75%       11.000000
max       12.000000
Name: H1GI20, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   6406.000000
mean       3.707774
std        6.318714
min        0.000000
25%        0.000000
50%        0.000000
75%       10.000000
max       18.000000
Name: H1TO30, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   2537.000000
mean      13.363027
std        2.577861
min        1.000000
25%       12.000000
50%       14.000000
75%       15.000000
max       19.000000
Name: H1TO14, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   1653.000000
mean       6.921355
std        8.178834
min        0.000000
25%        1.000000
50%        4.000000
75%       10.000000
max       89.000000
Name: H1TO7, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   3553.000000
mean       9.892767
std        5.877938
min        0.000000
25%        7.000000
50%       12.000000
75%       14.000000
max       20.000000
Name: H1TO2, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   2728.000000
mean      10.085411
std       12.499521
min        0.000000
25%        0.000000
50%        2.000000
75%       25.000000
max       30.000000
Name: H1TO5, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   6275.000000
mean       2.466614
std        1.176472
min        1.000000
25%        2.000000
50%        2.000000
75%        3.000000
max        5.000000
Name: H1ED12, dtype: float64
-=-=-=-=-=-=-=-=-=-=-=-
count   5614.000000
mean       2.031350
std        1.283485
min        1.000000
25%        1.000000
50%        2.000000
75%        3.000000
max        5.000000
Name: H1RE6, dtype: float64
X:/Dropbox/@CurrentWork/Data Analysis/PythonWorkingDirectory/Week 4.py:31: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py:424: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
-=-=-=-=-=-=-=-=-=-=-=-
quartiles
<matplotlib.figure.Figure at 0x44e0fc50>
C:\Anaconda3\lib\site-packages\matplotlib\collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

<matplotlib.figure.Figure at 0xb29710>
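The RuntimeWarning in the output above appears because every figure created through pyplot stays open until it is closed. A minimal sketch of one way to avoid it is to close each figure once it has been drawn. The loop, the plotted values, and the titles below are hypothetical; the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt

# Create more figures than the default warning threshold of 20,
# closing each one as soon as it has been drawn.
for i in range(25):
    fig = plt.figure()
    plt.plot([0, 1], [0, i])
    plt.title("Hypothetical figure %d" % i)
    # In a real script the figure would be shown or saved at this point.
    plt.close(fig)

# No figures remain open, so the warning is never triggered.
print(len(plt.get_fignums()))  # 0
```

In an interactive session, showing the figures before closing them serves the same purpose.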


Sunday, November 22, 2015
