Overview
One of the most important functions of the working statistician is to
investigate and answer significant research questions by analyzing
real-world data, using a variety of elementary and advanced modeling
techniques, and to distill the results into reports that are accessible
to the non-statistician.
You will work in small groups to research a topic of interest to you,
and then summarize your results in a technical report submitted to me
and shared with the class.
Project Goals
Investigate a real-world data set by performing exploratory data
analysis and visualization.
Formulate a research question and hypothesis.
Create a data biography by exploring the data’s context and
source.
Perform appropriate statistical inference to answer the research
question.
Craft a clear, engaging narrative answering your research
question in a technical report and short pre-recorded
presentation.
Project Timeline
Group Formation | Submit the name of 1 other person you’d like to be in a group with. | 5pm Friday, February 10th (Week 3)
Research Proposal | Identify the data set and research question your group will investigate. | 5pm Friday, March 3rd (Week 6)
Assignment 1 | Data Exploration via wrangling, summarizing, and visualization | 5pm Friday, March 17th (Week 8)
Assignment 2 | Data Biography to contextualize data. | 5pm Friday, April 14th (Week 10)
Assignment 3 | Statistical analysis to answer research question. | 5pm Friday, May 5th (Week 13)
Technical Report Draft | Draft of technical report, outlining results of Assignments 1 - 3 | 5pm Friday, May 12th (Week 14)
Technical Report Final | Final draft of technical report, outlining results of Assignments 1 - 3 | 5pm Friday, May 19th (Last Day of Finals Week)
Group Expectations
As members of both the STA 209 and Grinnell College communities of
scholars, all students are expected to engage with this project in a
manner that respects the STA 209 Code of Conduct.
In particular, for this group project, each group member should…
- Respond to messages or other discussion promptly (within 1 business
day at the latest).
- Attend scheduled meetings, and give preemptive notice when
attendance isn’t possible.
- Make a significant contribution to each assignment.
- Allow every other group member the opportunity to make a significant
contribution to each assignment.
- Communicate your personal timeline for finishing tasks.
- Respect others’ personal timelines for finishing tasks.
- Provide charitable and constructive feedback on other group members’
work.
- Incorporate feedback from other group members to improve work.
After each project component, each student will be asked to complete
a self-evaluation survey reflecting on their contributions to the
project. Individual project grades will reflect both the degree
of individual contribution and the overall project quality.
Research Proposal
Goals
Determine your research question(s), along with the dataset and
variables.
Describe the significance of an answer to the question in the
context of the data.
Notes
- The project assignments will be fairly open-ended and much less
prescribed than your lab or homework assignments, mimicking a more
real-world situation where you are tasked with extracting knowledge from
data.
- Think carefully when selecting your research questions since you
will explore these same questions for the whole project.
- Make sure everyone in the group is interested in the selected
research questions.
- Make sure to read the provided background information about the
data.
Tasks
- As a group, determine:
  - Which dataset you want to investigate for your project.
  - Two potential research questions you want to explore involving this
    dataset.
    - Each question should relate to at least two of the variables in
      the dataset.
    - The questions should both have the same general theme but may
      involve different variables.
    - The questions can (and likely will) relate to subsets of the
      data. For example, maybe you want to focus on how COVID-related
      behaviors differ between residents of two states in the US or you
      want to focus on protests in a given year and region of the world.
- In a one page proposal:
  - Explicitly state your two research questions.
  - Indicate the dataset and variables you will work with.
  - Discuss the utility of an answer to each of your research
    questions, or describe why an answer would be interesting or
    relevant to your group (at least 1 paragraph for each question).
- Turn in the .pdf of your research proposal on Gradescope by 5pm PST
on Friday, March 3rd.
Crafting Research Questions
Usually you should start with a research question and then search for
data to help you address the question. For feasibility reasons, we are
asking you to work backwards. Here are some tips for generating your
research questions:
- Read over the background information about the dataset that
interests you and your group and start considering what relationships
you might want to explore.
- Pick out a few specific variables and (re)frame your question around
exploring the relationship between those variables.
- Make sure your question is focused enough that it can be answered
with the data at hand.
- Here are some generic examples to get your group started:
  - EX: Does country A have a higher rate of X than country B?
  - EX: Is X positively related to Y? (In other words, as X increases,
    does Y tend to increase?)
  - EX: Is there evidence that trend X is becoming more popular over
    time?
  - EX: Is there a relationship between X and Y?
  - EX: How well do the following factors, X and Y, predict the
    variable Z?
  - EX: Are there differences in X by Y?
Rubric
You will be assessed on the following:
- The degree to which the research question is of appropriate scope
for the project, and can be answered by the data at hand.
- The depth, nuance, or insight that an answer to the research
question could provide about the data set or a population.
- The quality and technical correctness of the writing.
- Whether the proposal contains all required parts.
- The originality of work.
- Each student’s individual contributions to this part of the
project.
Assignment 1
Goals
- Confirm/revise your research question, along with the dataset and
variables.
- Practice inspecting data.
- Practice visualizing and summarizing data.
Notes
- If you find after preliminary data exploration and analysis that one
of your research questions is not answerable using the data at hand, you
are welcome to select a new research question after consulting with your
instructor.
Tasks
As a group, determine whether you want to revise either of the
research questions you included in the “Research Proposal”.
Start investigating the two research questions you identified in
the “Research Proposal” by:
- Producing useful summaries of the variables and their
relationships.
- Graphing each variable and the relationships between variables.
- Completing any useful data wrangling.
In an Rmd file, write a 3 - 4 page summary (including figures)
that:
- States each research question and provides some initial
answers/findings related to the questions
- Introduces the data and addresses what/who the data represent (for
your variables of interest)
- Presents at least two summary statistics related to each research
question and discusses what they suggest about the data.
- Presents at least two data visualizations related to each research
question and discusses what they suggest about the data.
- Includes your R code.
Turn in the .pdf of your summary on Gradescope by 5pm PST on
Friday, March 17th.
If all members of the group consent, groups may request an
extension on the assignment until Monday, April 2nd (after spring
break), but this request must be made prior to the assignment due
date.
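To make the summary-statistic and wrangling tasks above concrete, here is a minimal sketch in base R. It uses the built-in mtcars data purely as a stand-in for your own data set, and the question it explores (how gas mileage relates to cylinders and weight) is only an illustration; the object names are made up for this example.

```r
# Built-in example data standing in for a project data set
data(mtcars)

# Summary statistic 1: mean miles per gallon within each cylinder group
mean_mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Summary statistic 2: correlation between weight and miles per gallon
cor_wt_mpg <- cor(mtcars$wt, mtcars$mpg)

# Simple wrangling: keep only 4- and 8-cylinder cars so that a later
# analysis can compare exactly two groups
cars_48 <- subset(mtcars, cyl %in% c(4, 8))

print(round(mean_mpg_by_cyl, 1))
print(round(cor_wt_mpg, 2))
print(table(cars_48$cyl))
```

In your write-up, each number like these should be followed by a sentence or two on what it suggests about your research question.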
Rubric
You will be assessed on the following:
- The informativeness of your summary with respect to one or both of
your research questions
- The appropriateness of the chosen graphs and summary statistics
- The degree to which each graph makes appropriate use of geoms and
their aesthetics, scale, and context
- The degree to which the graphs are clear and engaging
- The degree to which the graphs, summary statistics, and narrative
support each other
- The degree to which the text and code are well organized and
well-written
- The originality of work
- Each student’s individual contributions to the project.
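As one possible illustration of the graph criteria above (geoms, aesthetics, scale, and context), here is a short sketch. It assumes the ggplot2 package is installed and again uses the built-in mtcars data as a stand-in; the title and labels are only examples of the kind of context a reader needs.

```r
library(ggplot2)

# Scatterplot: position encodes weight and mileage, color encodes cylinders
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  labs(
    title = "Heavier cars tend to get lower gas mileage",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  )

print(p)
```

Wrapping cyl in factor() tells ggplot2 to treat it as a category, so the legend shows three distinct colors rather than a continuous gradient.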
Assignment 2
Goals
- Create a data biography by answering the following key
questions about the data:
  - Where did the data come from?
  - When were the data collected?
  - Why were the data collected?
  - How were the data collected?
  - Who are the data supposed to represent?
  - Who is present? Who is absent?
  - What evidence is there that the data are representative? What
    evidence is there that the data are not representative?
- Better understand the context of our data to reduce the assumptions
and biases we are placing on the data.
Notes
Your group should do some investigation here to answer these
questions, rather than just relying on the information provided in data
codebooks.
You should cite your sources at the end of your data biography,
using your preferred citation style (but enough information should be
included that a reader can track down your source).
Assignment
- Write a 2-3 page data biography that attempts to answer the
questions provided in the Goals.
- Your write-up should be presented as a narrative, using complete
sentences and paragraphs.
- Turn in the pdf of your biography on Gradescope by 5pm on Friday,
April 14th.
Rubric
You will be assessed on the following:
- The informativeness of your data biography with respect to
each of the key questions provided in the Goals section
- The degree to which the text is supported by references and the
appropriateness of the selected references
- The degree to which the text is well organized and well-written
- Each student’s individual contributions to the project.
Further Reading
The following articles discuss the importance of data biographies,
and outline the process of creating a good data biography:
Assignment 3
Goals
- Conduct statistical inference on your research questions.
Assignment
Conduct a hypothesis test for your research questions.
- For the hypothesis test,
  - Explicitly state the hypotheses in both words and symbols.
  - Include the method used, the test statistic, and the p-value.
  - Determine an appropriate significance level based on the
    consequences for type I/II errors.
  - Check assumptions. (If violated, still finish the test but be
    cautious in your conclusion.)
  - Interpret the p-value in the context of the problem.
  - Discuss conclusions about the conjecture.
  - Describe whether the observed effect has practical significance,
    based on your understanding of the data context.
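As a rough sketch of how these pieces can fit together in R, the example below runs a two-sample t-test on the built-in mtcars data. The question (do automatic and manual cars differ in mean gas mileage?) and the significance level of 0.05 are placeholders chosen for illustration; your own test, hypotheses, and significance level should follow from your research question and data context.

```r
# H0: mu_automatic - mu_manual = 0  (mean mpg is the same in both groups)
# HA: mu_automatic - mu_manual != 0 (mean mpg differs between the groups)
alpha <- 0.05  # placeholder; justify your own choice via type I/II errors

# Check group sizes as part of assessing the test's assumptions
group_sizes <- table(mtcars$am)  # am: 0 = automatic, 1 = manual

# Welch two-sample t-test (does not assume equal variances)
test <- t.test(mpg ~ am, data = mtcars)

print(group_sizes)
print(test$method)     # the method used
print(test$statistic)  # the test statistic
print(test$p.value)    # the p-value

# Decision at the chosen significance level
reject_null <- test$p.value < alpha
print(reject_null)
```

The printed values give you the method, test statistic, and p-value to report; the interpretation in context and the discussion of practical significance still need to be written in your own words.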
Construct a confidence interval for your research question.
- For the confidence interval,
  - Include the method used, confidence level, and interval values.
  - Describe why you chose the confidence level you did, based on the
    relationship between confidence level and margin of error, as well
    as the specific data context.
  - Check assumptions. (If violated, still construct the confidence
    interval but be cautious in your conclusion.)
  - Discuss conclusions about the conjecture.
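To illustrate the relationship between confidence level and margin of error mentioned above, here is a minimal sketch using the built-in mtcars data. The comparison (mean gas mileage of automatic versus manual cars) and the 95% and 99% levels are assumptions made for the example, not recommended choices.

```r
conf_level <- 0.95  # placeholder; choose and justify your own level

# Confidence interval for the difference in mean mpg (automatic - manual)
ci_95 <- t.test(mpg ~ am, data = mtcars, conf.level = conf_level)$conf.int
moe_95 <- diff(ci_95) / 2  # margin of error is half the interval width

# The same interval at a higher confidence level, for comparison
ci_99 <- t.test(mpg ~ am, data = mtcars, conf.level = 0.99)$conf.int
moe_99 <- diff(ci_99) / 2

print(ci_95)
print(ci_99)
print(c(moe_95 = moe_95, moe_99 = moe_99))
```

Raising the confidence level widens the interval, so the choice is a trade-off between how often the procedure captures the true value and how precise the resulting interval is.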
Write a 1-2 page summary of your findings that includes all the
pieces specified above for the hypothesis test and the confidence
interval. Include appropriate visualizations for the confidence
intervals and hypothesis tests.
Turn in the pdf of your summary on Gradescope by 5pm Friday, May
5th.
Rubric
You will be assessed on the following:
- For the hypothesis test:
  - Selecting an appropriate parameter of interest.
  - Including the correct method, correct test statistic, and correct
    p-value.
  - Checking assumptions.
  - Correctly interpreting the p-value in the context of the
    problem.
  - Accurately discussing conclusions about the conjecture.
- For the confidence interval:
  - Including the correct method and interval values.
  - Including confidence level.
  - Checking assumptions.
  - Accurately discussing conclusions about the conjecture.
- The degree to which the text is well organized and well-written
- Each student’s individual contributions to the project.
Technical Report
Goals
- Craft a clear, engaging, accurate story about one of your research
questions.
Assignment
- Create a 3-5 page technical report that addresses the following:
  - Your research question
  - Your data source
  - Exploratory graphs and summary statistics and what they tell you
    about your research question
  - An inference procedure (and any assumptions) and interpretation of
    results
  - Conclusions about your research question
- The technical reports should be uploaded to Gradescope by 5:00pm on
Friday, May 19th.
Notes
- In this project assignment, you likely won’t need to conduct any
additional analysis. Instead, you will be summarizing content from the
previous project assignments.
- However, it is okay if you do conduct additional analysis.
- Reports will be graded both for content and for how well the
material is discussed
- You should address your topic and statistical content at a level
that is appropriate for a STA 209 audience.
Technical Report Details
- Simply combining your work in Assignments 1, 2, and 3 will produce a
document that is much longer than 3 - 5 pages.
- Instead, think carefully about what the most important details are
for your analysis, and curate your previous assignments to highlight and
support these.
- You do not need to include the code used to perform
your analysis in the .pdf document itself. You should however, include
the output of the code (summary statistics, visualizations, and the
results of any inference where appropriate).
- To have code run when you knit, but not display, replace the chunk
header {r} with {r echo = F}.
- If you also don’t want the output of the code to display, use
{r echo = F, include = F}.
- You can control the size of included graphics by adding
{r fig.width =..., fig.height=...} to your chunk options, where ... is
replaced with the desired width/height of the graphic in inches.
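As an illustration of the chunk options described above, here is how three chunks might look in an Rmd file. The chunk labels (summary-table, load-packages, scatterplot), the figure dimensions, and the code inside each chunk are made up for this sketch; F is shorthand for FALSE, which is spelled out here.

````markdown
```{r summary-table, echo = FALSE}
# The code runs and its output appears, but the code itself is hidden
summary(mtcars$mpg)
```

```{r load-packages, include = FALSE}
# The code runs, but neither the code nor its output appears
library(ggplot2)
```

```{r scatterplot, echo = FALSE, fig.width = 5, fig.height = 3}
# The plot appears at 5 x 3 inches; the code is hidden
plot(mtcars$wt, mtcars$mpg)
```
````

Note that include = FALSE on its own already hides both code and output, so it is a convenient choice for setup chunks that only load packages or data.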
- Your report can have a title page listing the project title, the
project authors, and date. This page does not count towards the page
limit.
Rubric
You will be assessed on the following:
- Length: Technical reports that are not between 3 and 5 pages,
single-spaced, will be penalized.
- Content: Demonstrates a full and accurate
understanding of the material presented.
- The report addresses each item listed under Assignment for the
technical report.
- Style: The degree to which the text is well
organized and well-written.
- Sources: At least 1 appropriate reference (other
than our textbooks) is included. The references should be on a separate
page from the report and are not included in the page count.
- Each student’s individual contributions to the project.