Exploratory health data analysis using ChatGPT Plus (data file and prompts in the description)

  • Published on 28 Jul 2024
  • Hi there! I am Professor Jay from the Department of Biostatistics and Bioinformatics in the Milken Institute School of Public Health at The George Washington University.
    Learn more about the School at publichealth.gwu.edu/
    One of the courses that I teach at GWU is Biostatistical Applications for Public Health. This is a postgraduate introduction to biostatistics. At the start of the course I cover the important topic of exploratory data analysis (EDA) using summary statistics and data visualization.
    In this short video tutorial I show you how to use the code interpreter in ChatGPT Plus to do your EDA.
    The video tutorial is for students in the School or indeed anyone interested in the use of a large language model such as ChatGPT for EDA.
    The heartLLM.csv file is available at github.com/juanklopper/Tutori...
    The prompts that I used are shown below.
    The CSV file contains 7 columns. Age describes the age of participants in years. BinarySex describes the gender of each participant, with two classes, M for male and F for female. Cholesterol is the serum cholesterol in mg/dL. RestingECG is a multilevel variable with classes Normal for a normal ECG, ST for ST-segment elevation, and LVH for left-ventricular hypertrophy. MaxHR is a continuous variable measured in beats per minute describing the maximum heart rate reached during exercise. ExerciseAngina is a binary variable with two classes, N for no and Y for yes, and describes whether angina was induced by exercise. HeartDisease is a binary response variable with 0 for no heart disease and 1 for the presence of heart disease.
    Provide summary statistics for the continuous numerical variables. Include the number of observations, the number of missing data, the mean, median, standard deviation, variance, minimum, maximum, range, quartiles, and interquartile range. Generate a table of the results.
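The statistics requested in this prompt can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code that ChatGPT generates in the video; the `summarize` helper and the `ages` values are hypothetical and are not taken from heartLLM.csv.

```python
import statistics

def summarize(values):
    """Summary statistics for a list in which None marks a missing value."""
    observed = sorted(v for v in values if v is not None)
    # method="inclusive" interpolates linearly between data points,
    # which matches the default quantile method in NumPy and pandas.
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return {
        "n": len(observed),
        "missing": len(values) - len(observed),
        "mean": statistics.mean(observed),
        "median": statistics.median(observed),
        "std": statistics.stdev(observed),          # sample standard deviation
        "variance": statistics.variance(observed),  # sample variance
        "min": observed[0],
        "max": observed[-1],
        "range": observed[-1] - observed[0],
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

# Hypothetical ages in years (illustrative only, not from heartLLM.csv)
ages = [40, 49, 37, None, 48, 54, 39, 45, 54]
summary = summarize(ages)
for name, value in summary.items():
    print(f"{name:>8}: {value}")
```

Different software uses different quartile conventions, so the quartiles and interquartile range can differ slightly between tools on small samples.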
    Create a table of the frequency and relative frequency of the classes for the categorical variables.
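A frequency table of this kind pairs each class with its count and its share of the total. The sketch below shows the idea with the standard library; the `frequency_table` helper and the `resting_ecg` values are hypothetical and only illustrate the calculation.

```python
from collections import Counter

def frequency_table(values):
    """Map each class to (frequency, relative frequency)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: (count, count / total)
            for label, count in sorted(counts.items())}

# Hypothetical RestingECG classes (illustrative only, not from heartLLM.csv)
resting_ecg = ["Normal", "ST", "Normal", "LVH", "Normal", "ST", "Normal", "LVH"]
table = frequency_table(resting_ecg)

print(f"{'Class':<8}{'Count':>6}{'Relative':>10}")
for label, (count, relative) in table.items():
    print(f"{label:<8}{count:>6}{relative:>10.3f}")
```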
    Create a table of the summary statistics of the numerical variables used before, but filter only for those with a normal resting ECG.
    Use the same summary statistics for the continuous numerical variables as before, but only for the Age column and group the results by the classes of the response variable. Use "No heart disease" for 0 and "Heart disease" for 1 as row values for the two classes of the HeartDisease variable.
    Create a contingency table using the RestingECG and HeartDisease columns. Include row and column totals in the table. Also generate a table of expected values under the null hypothesis that the variables are not associated.
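Under the null hypothesis of no association, the expected count in each cell is the row total multiplied by the column total, divided by the sample size. The following sketch computes both tables with the standard library; the `contingency` helper and the data are hypothetical and do not come from heartLLM.csv.

```python
from collections import Counter

def contingency(rows, cols):
    """Observed counts, marginal totals, and expected counts under independence."""
    observed = Counter(zip(rows, cols))
    row_totals = Counter(rows)
    col_totals = Counter(cols)
    n = len(rows)
    # Expected count for a cell = (row total * column total) / sample size
    expected = {(r, c): row_totals[r] * col_totals[c] / n
                for r in row_totals for c in col_totals}
    return observed, row_totals, col_totals, expected

# Hypothetical paired observations (illustrative only)
ecg = ["Normal", "Normal", "Normal", "Normal", "ST", "ST", "ST", "LVH", "LVH", "LVH"]
disease = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
observed, row_totals, col_totals, expected = contingency(ecg, disease)

print(f"{'':<8}{'0':>12}{'1':>12}{'Total':>8}")
for r in row_totals:
    # Each cell shows the observed count with the expected count in parentheses
    cells = "".join(f"{observed[(r, c)]:>5} ({expected[(r, c)]:.1f})" for c in (0, 1))
    print(f"{r:<8}{cells}{row_totals[r]:>8}")
print(f"{'Total':<8}{col_totals[0]:>12}{col_totals[1]:>12}{len(ecg):>8}")
```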
    Create a histogram of the Age column. Use the title "Distribution of participant age". Also use the horizontal axis label "Age [years]" and the vertical axis title "Count". Use light orange as the bar color. Create bins with a minimum of 20 and a maximum of 80, with a step-size of 10.
    Create a scatter plot of Age and MaxHR. Add the title "Scatter plot of Age and Maximum Heart Rate". Use the horizontal axis title "Age [years]" and the vertical axis title "Maximum heart rate [beats per minute]". Group the markers by the classes in the HeartDisease column. Add a legend named "Heart Disease" with the class names "No heart disease" for 0 and "Heart disease" for 1. Insert grid lines.
    Create separate lists for the Age values for each of the two classes in the HeartDisease column. Determine if the data meet the assumptions for the use of an equal-variance t-test to determine if there is a difference in the mean age values for each list.
    Please perform a Mann-Whitney U test to compare the two lists of age values. The null hypothesis is that there is no difference between the two lists and the alternative hypothesis is that there is a difference between the two lists. Use a 5% level of significance. Write a full comment of the results.
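To make the test in this prompt concrete, here is a minimal sketch of a two-sided Mann-Whitney U test using the normal approximation with a tie correction, written with the standard library only. It is an assumption-laden illustration rather than the code produced in the video: the `mann_whitney_u` helper and both age lists are hypothetical, and the sketch omits the continuity correction that some statistical libraries apply, so its p-value can differ slightly from theirs.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    combined = sorted(x + y)
    # Assign average ranks: tied values at 1-based positions i+1..j share their mean rank.
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2
        i = j
    n1, n2 = len(x), len(y)
    n = n1 + n2
    u1 = sum(ranks[v] for v in x) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    # Tie correction reduces the variance of U when values repeat.
    tie_term = sum(t**3 - t for t in Counter(combined).values())
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:  # all values identical: no evidence of a difference
        return u, 1.0
    z = (u - n1 * n2 / 2) / sigma  # z <= 0 because u is the smaller statistic
    return u, min(2 * NormalDist().cdf(z), 1.0)

# Hypothetical ages in years for the two groups (illustrative only)
no_heart_disease = [37, 39, 40, 45, 48, 49]
heart_disease = [54, 54, 58, 60, 62, 65]

alpha = 0.05
u_stat, p_value = mann_whitney_u(no_heart_disease, heart_disease)
decision = "reject" if p_value < alpha else "fail to reject"
print(f"U = {u_stat}, p = {p_value:.4f}: {decision} the null hypothesis at the 5% level")
```

The normal approximation is only a rough guide for samples this small; an exact method is preferable when the groups are small and there are no ties.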
