elena1234

Frequency table (cross table), groupby and normalization in Python

May 12th, 2022 (edited)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load the NHANES 2015-2016 data.
da = pd.read_csv("C:/Users/eli/Desktop/YtPruboBEemdqA7UJJ_tgg_63e179e3722f4ef783f58ff6e395feb7_nhanes_2015_2016.csv")

# Replace the numeric education and marital status codes with readable labels.
da["DMDEDUC2x"] = da.DMDEDUC2.replace({1: "<9", 2: "9-11", 3: "HS/GED", 4: "Some college/AA", 5: "College",
                                       7: "Refused", 9: "Don't know"})
da["DMDMARTLx"] = da.DMDMARTL.replace({1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "Never married",
                                       6: "Living w/partner", 77: "Refused"})
# Drop respondents who answered "Don't know" for education or "Refused" for marital status.
db = da.loc[(da.DMDEDUC2x != "Don't know") & (da.DMDMARTLx != "Refused"), :]

# Now we can create a contingency table, counting the number of people in each cell
# defined by a combination of education and marital status.
# Both variables are taken from the filtered frame db.
x = pd.crosstab(db.DMDEDUC2x, db.DMDMARTLx)

# Normalize data
# A contingency table can be normalized in three ways: we can make the rows sum to 1,
# the columns sum to 1, or the whole table sum to 1. Below we normalize within rows.
# This gives us the proportion of people in each educational attainment category
# who fall into each group of the marital status variable.
# Normalizing within the rows.
# apply returns a new table and leaves x unchanged, so store the result before printing it.
x_by_row = x.apply(lambda z: z/z.sum(), axis=1)
print(x_by_row)

# We can also normalize within the columns.
x_by_col = x.apply(lambda z: z/z.sum(), axis=0)
print(x_by_col)

# The following line does these steps, reading the code from left to right:
# 1 Group the data by every combination of gender, education, and marital status
# 2 Count the number of people in each cell using the 'size' method
# 3 Pivot the marital status results into the columns (using unstack)
# 4 Fill any empty cells with 0
# 5 Normalize the data by row
# RIAGENDRx is not created earlier in this script, so build it here.
# In NHANES, RIAGENDR codes 1 as male and 2 as female.
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
b = da.groupby(["RIAGENDRx", "DMDEDUC2x", "DMDMARTLx"]).size().unstack().fillna(0).apply(lambda row: row/row.sum(), axis=1)
print(b.loc[:, ["Married"]].unstack())
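
The comments above describe three ways to normalize a contingency table. The sketch below tries all three on a small made-up data set, so it runs without the NHANES file. The frame `toy` and its column names are hypothetical, not part of the original paste. It also compares the `apply` approach with the `normalize` argument of `pd.crosstab`, which should give the same row proportions.

```python
import pandas as pd

# Hypothetical toy data standing in for the education and marital status columns.
toy = pd.DataFrame({
    "education": ["HS/GED", "HS/GED", "College", "College", "College", "HS/GED"],
    "marital": ["Married", "Divorced", "Married", "Married", "Divorced", "Married"],
})

# Raw counts for each education / marital status combination.
counts = pd.crosstab(toy.education, toy.marital)

# Rows sum to 1: proportion of each marital status within an education level.
by_row = counts.apply(lambda z: z / z.sum(), axis=1)

# Columns sum to 1: proportion of each education level within a marital status.
by_col = counts.apply(lambda z: z / z.sum(), axis=0)

# Whole table sums to 1: proportion of all respondents in each cell.
overall = counts / counts.values.sum()

# The same row normalization using crosstab's normalize argument.
by_row_builtin = pd.crosstab(toy.education, toy.marital, normalize="index")

print(by_row)
print(by_col)
print(overall)
```

Passing `normalize="columns"` or `normalize="all"` to `pd.crosstab` covers the other two cases without a separate `apply` step.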
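
The five-step groupby chain described above can be easier to follow when each step is a separate statement. This is a minimal sketch on made-up data. The frame `toy` and its column names are hypothetical, and `div` is used in place of `apply` for the row normalization, which should produce the same proportions.

```python
import pandas as pd

# Hypothetical toy data; the real script uses the NHANES columns instead.
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "education": ["College", "College", "HS/GED", "College", "HS/GED", "HS/GED"],
    "marital": ["Married", "Divorced", "Married", "Married", "Married", "Married"],
})

# Steps 1 and 2: group by every combination and count the rows in each cell.
cell_counts = toy.groupby(["gender", "education", "marital"]).size()

# Step 3: move the innermost level (marital status) into the columns.
wide = cell_counts.unstack()

# Step 4: combinations that never occur show up as NaN, so fill them with 0.
wide = wide.fillna(0)

# Step 5: divide each row by its total so every row sums to 1.
proportions = wide.div(wide.sum(axis=1), axis=0)

# Keep only the "Married" column and pivot education into the columns,
# leaving one row per gender.
married = proportions.loc[:, ["Married"]].unstack()
print(married)
```

Splitting the chain this way makes it possible to print `wide` before and after `fillna` and see which combinations were missing from the data.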