Overview

One of the most important functions of the working statistician is to investigate and answer significant research questions by analyzing real-world data, using a variety of elementary and advanced modeling techniques, and to distill the results into reports that are accessible to the non-statistician.

You will work in small groups to research a topic of interest to you, and then summarize your results in a technical report submitted to me and shared with the class.

Project Goals

  • Investigate a real-world data set by performing exploratory data analysis and visualization.

  • Formulate a research question and hypothesis.

  • Create a data biography by exploring its context and source.

  • Perform appropriate statistical inference to answer the research question.

  • Craft a clear, engaging narrative answering your research question in a technical report and short pre-recorded presentation.

Project Timeline

Name | Description | Due Date
Group Formation | Submit the name of 1 other person you’d like to be in a group with. | 5pm Friday, February 10th (Week 3)
Research Proposal | Identify the data set and research question your group will investigate. | 5pm Friday, March 3rd (Week 6)
Assignment 1 | Data exploration via wrangling, summarizing, and visualization. | 5pm Friday, March 17th (Week 8)
Assignment 2 | Data biography to contextualize the data. | 5pm Friday, April 14th (Week 10)
Assignment 3 | Statistical analysis to answer the research question. | 5pm Friday, May 5th (Week 13)
Technical Report Draft | Draft of technical report, outlining results of Assignments 1 - 3. | 5pm Friday, May 12th (Week 14)
Technical Report Final | Final draft of technical report, outlining results of Assignments 1 - 3. | 5pm Friday, May 19th (Last Day of Finals Week)

Group Expectations

As members of both the STA 209 and Grinnell College communities of scholars, all students are expected to engage with this project in a manner that respects the STA 209 Code of Conduct.

In particular, for this group project, each group member should…

  • Respond to messages or other discussion promptly (within 1 business day at the latest).
  • Attend scheduled meetings, and give preemptive notice when attendance isn’t possible.
  • Make a significant contribution to each assignment.
  • Give every other group member the opportunity to make a significant contribution to each assignment.
  • Communicate your personal timeline for finishing tasks.
  • Respect others’ personal timeline for finishing tasks.
  • Provide charitable and constructive feedback on other group members’ work.
  • Incorporate feedback from other group members to improve work.

After each project component, each student will be asked to complete a self-evaluation survey reflecting on their contributions to the project. Individual project grades will reflect both the degree of individual contribution and the overall project quality.


Research Proposal

Goals

  • Determine your research question(s), along with the dataset and variables.

  • Describe the significance of an answer to the question in the context of the data.

Notes

  • The project assignments will be fairly open-ended and much less prescribed than your lab or homework assignments, mimicking a more real-world situation where you are tasked with extracting knowledge from data.
  • Think carefully when selecting your research questions since you will explore these same questions for the whole project.
  • Make sure everyone in the group is interested in the selected research questions.
  • Make sure to read the provided background information about the data.

Tasks

  1. As a group, determine
    • Which dataset you want to investigate for your project.
    • Two potential research questions you want to explore involving this dataset.
      • Each question should relate to at least two of the variables in the dataset.
      • Both questions should share the same general theme but may involve different variables.
      • The questions can (and likely will) relate to subsets of the data. For example, maybe you want to focus on how COVID-related behaviors differ between residents of two states in the US or you want to focus on protests in a given year and region of the world.
  2. In a one-page proposal:
    • Explicitly state your two research questions.
    • Indicate the dataset and variables you will work with.
    • Discuss the utility of an answer to each of your research questions, or describe why an answer would be interesting or relevant to your group (at least 1 paragraph for each question).
  3. Turn in the .pdf of your research proposal on Gradescope by 5pm PST on Friday, March 3rd.

Crafting Research Questions

Usually you should start with a research question and then search for data to help you address the question. For feasibility reasons, we are asking you to work backwards. Here are some tips for generating your research questions:

  • Read over the background information about the dataset that interests you and your group and start considering what relationships you might want to explore.
  • Pick out a few specific variables and (re)frame your question around exploring the relationship between those variables.
  • Make sure your question is focused enough that it can be answered with the data at hand.
  • Here are some generic examples to get your group started:
    • EX: Does country A have a higher rate of X than country B?
    • EX: Is X positively related to Y? (In other words, as X increases, does Y tend to increase?)
    • EX: Is there evidence that trend X is becoming more popular over time?
    • EX: Is there a relationship between X and Y?
    • EX: How well do the following factors, X and Y, predict the variable Z?
    • EX: Are there differences in X by Y?

Rubric

You will be assessed on the following:

  • The degree to which the research question is of appropriate scope for the project, and can be answered by the data at hand.
  • The depth, nuance, or insight that an answer to the research question could provide about the data set or a population.
  • The quality and technical correctness of the writing.
  • Whether the proposal contains all required parts.
  • The originality of work.
  • Each student’s individual contributions to this part of the project.

Assignment 1

Goals

  • Confirm/revise your research question, along with the dataset and variables.
  • Practice inspecting data.
  • Practice visualizing and summarizing data.

Notes

  • If you find after preliminary data exploration and analysis that one of your research questions is not answerable using the data at hand, you are welcome to select a new research question after consulting with your instructor.

Tasks

  1. As a group, determine whether you want to revise either of the research questions from your “Research Proposal”.

  2. Start investigating the two research questions you identified in the “Research Proposal” (or their revised versions) by:

    • Producing useful summaries of the variables and their relationships.
    • Graphing each variable and the relationships between variables.
    • Completing any useful data wrangling.
  3. In an Rmd file, write a 3 - 4 page summary (including figures) that:

    • States each research question and provides some initial answers/findings related to the questions
    • Introduces the data and addresses what/who the data represent (for your variables of interest)
    • Presents at least two summary statistics related to each research question and discusses what they suggest about the data.
    • Presents at least two data visualizations related to each research question and discusses what they suggest about the data.
    • Includes your R code.
  4. Turn in the .pdf of your summary on Gradescope by 5pm PST on Friday, March 17th.
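
The wrangling, summarizing, and graphing steps in Task 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses R's built-in mtcars data as a stand-in for your project dataset, the names cars, transmission, and mpg_summary are hypothetical, and it sticks to base R so that it runs without extra packages. Your own analysis will use different variables and may use other plotting tools.

```r
# Illustration only: mtcars ships with R and stands in for your project data.
cars <- mtcars

# Wrangling: recode the 0/1 transmission indicator as a labeled factor.
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Summary statistics: group size, mean, and standard deviation of mpg.
mpg_summary <- data.frame(
  transmission = levels(cars$transmission),
  n        = as.vector(table(cars$transmission)),
  mean_mpg = as.vector(tapply(cars$mpg, cars$transmission, mean)),
  sd_mpg   = as.vector(tapply(cars$mpg, cars$transmission, sd))
)
print(mpg_summary)

# Visualization: side-by-side boxplots comparing the two groups.
boxplot(mpg ~ transmission, data = cars,
        xlab = "Transmission type", ylab = "Miles per gallon",
        main = "Fuel efficiency by transmission type")
```

In your write-up, each summary table and graph should be followed by a sentence or two on what it suggests about the research question.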

If all members of the group consent, groups may request an extension on the assignment until Monday, April 2nd (after spring break), but this request must be made prior to the assignment due date.

Rubric

You will be assessed on the following:

  • The informativeness of your summary with respect to one or both of your research questions
  • The appropriateness of the chosen graphs and summary statistics
  • The degree to which each graph makes appropriate use of geoms and their aesthetics, scale, and context
  • The degree to which the graphs are clear and engaging
  • The degree to which the graphs, summary statistics, and narrative support each other
  • The degree to which the text and code are well organized and well-written
  • The originality of work
  • Each student’s individual contributions to the project.

Assignment 2

Goals

  • Create a data biography by answering the following key questions about the data:
    • Where did the data come from?
    • When were the data collected?
    • Why were the data collected?
    • How were the data collected?
    • Who are the data supposed to represent?
      • Who is present? Who is absent?
      • What evidence is there that the data are representative? What evidence is there that the data are not representative?
  • Better understand the context of our data to reduce the assumptions and biases we are placing on the data.

Notes

  • Your group should do some investigation here to answer these questions, rather than just relying on the information provided in data codebooks.

  • You should cite your sources at the end of your data biography, using your preferred citation style (but enough information should be included that a reader can track down your source).

Assignment

  1. Write a 2-3 page data biography that attempts to answer the questions provided in the Goals.
    • Your write-up should be presented as a narrative, using complete sentences and paragraphs.
  2. Turn in the .pdf of your biography on Gradescope by 5pm on Friday, April 14th.

Rubric

You will be assessed on the following:

  • The informativeness of your data biography with respect to each of the key questions provided in the Goals section
  • The degree to which the text is supported by references and the appropriateness of the selected references
  • The degree to which the text is well organized and well-written
  • Each student’s individual contributions to the project.

Further Reading

The following articles discuss the importance of data biographies, and outline the process of creating a good data biography:


Assignment 3

Goals

  • Conduct statistical inference on your research questions.

Assignment

  1. Conduct a hypothesis test for your research questions.

    • For the hypothesis test,
      • Explicitly state the hypotheses in both words and symbols.
      • Include the method used, the test statistic, and the p-value.
      • Determine an appropriate significance level based on the consequences of Type I and Type II errors.
      • Check assumptions. (If violated, still finish the test but be cautious in your conclusion.)
      • Interpret the p-value in the context of the problem.
      • Discuss conclusions about the conjecture.
      • Describe whether the observed effect has practical significance, based on your understanding of the data context.
  2. Construct a confidence interval for your research question.

    • For the confidence interval,
      • Include the method used, confidence level, and interval values.
      • Describe why you chose your confidence level, based on the relationship between confidence level and margin of error, as well as the specific data context.
      • Check assumptions. (If violated, still construct the confidence interval but be cautious in your conclusion.)
      • Discuss conclusions about the conjecture.
  3. Write a 1-2 page summary of your findings that includes all the pieces specified in 1. and 2. Include appropriate visualizations for the confidence intervals and hypothesis tests.

  4. Turn in the pdf of your summary on Gradescope by 5pm Friday, May 5th.
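
As a rough sketch of how the pieces in Tasks 1 and 2 fit together, the example below runs a two-sample t-test and builds the matching confidence interval. It assumes a research question that compares the means of two groups, which may not match yours. It again uses the built-in mtcars data as a stand-in, and the names cars, transmission, alpha, and result, as well as the 0.05 significance level, are illustrative assumptions rather than requirements.

```r
# Illustration only: mtcars ships with R and stands in for your project data.
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Assumed significance level; choose and justify your own.
alpha <- 0.05

# H0: mu_automatic - mu_manual = 0   (no difference in mean mpg)
# Ha: mu_automatic - mu_manual != 0  (two-sided alternative)
# t.test() uses the Welch (unequal-variance) procedure by default.
result <- t.test(mpg ~ transmission, data = cars, conf.level = 1 - alpha)

result$statistic  # test statistic
result$p.value    # p-value
result$conf.int   # confidence interval for mu_automatic - mu_manual

# Assumption checks: group sizes and a normal Q-Q plot for each group.
table(cars$transmission)
for (grp in levels(cars$transmission)) {
  x <- cars$mpg[cars$transmission == grp]
  qqnorm(x, main = paste("Normal Q-Q plot:", grp))
  qqline(x)
}
```

The numbers alone are not the deliverable: the summary still needs the hypotheses in words, an interpretation of the p-value and interval in context, and a discussion of practical significance.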

Rubric

You will be assessed on the following:

  • For the hypothesis test,
    • Selecting an appropriate parameter of interest
    • Including the correct method, correct test statistic, and correct p-value.
    • Checking assumptions.
    • Correctly interpreting the p-value in the context of the problem.
    • Accurately discussing conclusions about the conjecture.
  • For the confidence interval:
    • Including the correct method and interval values.
    • Including confidence level.
    • Checking assumptions.
    • Accurately discussing conclusions about the conjecture.
  • The degree to which the text is well organized and well-written
  • Each student’s individual contributions to the project.

Technical Report

Goals

  • Craft a clear, engaging, accurate story about one of your research questions.

Assignment

  1. Create a 3-5 page technical report that addresses the following:
    • Your research question
    • Your data source
    • Exploratory graphs and summary statistics and what they tell you about your research question
    • An inference procedure (and any assumptions) and interpretation of results
    • Conclusions about your research question
  2. The technical report should be uploaded to Gradescope by 5:00pm on Friday, May 19th.

Notes

  • In this project assignment, you likely won’t need to conduct any additional analysis. Instead, you will be summarizing content from the previous project assignments.
    • However, it is okay if you do conduct additional analysis.
  • Reports will be graded both for content and for how well the material is discussed.
  • You should address your topic and statistical content at a level that is appropriate for a STA 209 audience.

Technical Report Details

  • Simply combining your work in Assignments 1, 2, and 3 will produce a document that is much longer than 3 - 5 pages.
    • Instead, think carefully about what the most important details are for your analysis, and curate your previous assignments to highlight and support these.
  • You do not need to include the code used to perform your analysis in the .pdf document itself. You should, however, include the output of the code (summary statistics, visualizations, and the results of any inference where appropriate).
    • To run code when you knit without displaying it, replace the chunk header {r} with {r, echo = FALSE}.
    • If you also don’t want the output of the code to display, use {r, echo = FALSE, include = FALSE} (include = FALSE on its own hides both the code and its output).
  • You can control the size of included graphics by adding {r, fig.width = ..., fig.height = ...} to your chunk options, where ... is replaced with the desired width/height of the graphic in inches.
  • Your report can have a title page listing the project title, the project authors, and date. This page does not count towards the page limit.
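
As an alternative to editing every chunk header, the same options can be set once as defaults for the whole document. The sketch below assumes your report is knitted with knitr (the engine behind R Markdown) and that the code sits in a setup chunk near the top of the .Rmd file. The 6 by 4 inch figure size is an arbitrary example value, not a requirement.

```r
# Place this in a setup chunk near the top of the .Rmd file.
# The width and height below are assumed values (in inches); adjust to taste.
knitr::opts_chunk$set(
  echo       = FALSE,  # run code but do not display it
  fig.width  = 6,
  fig.height = 4
)
```

Options written in an individual chunk header override these defaults for that chunk, so a single figure can still be resized on its own.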

Rubric

You will be assessed on the following:

  • Length: Technical reports that are not 3-5 pages long (single-spaced) will be penalized.
  • Content: Demonstrates a full and accurate understanding of the material presented.
    • The report addresses each item listed under task 1 of the Assignment section.
  • Style: The degree to which the text is well organized and well-written.
  • Sources: At least 1 appropriate reference (other than our textbooks) is included. The references should be on a separate page from the report and are not included in the page count.
  • Each student’s individual contributions to the project.