Aggregate Daily Data to Month & Year Intervals in R | Base R & dplyr Example | lubridate Package
ฝัง
- เผยแพร่เมื่อ 15 ต.ค. 2024
- How to summarize and group daily data into monthly intervals in the R programming language. More details: statisticsglob...
R code of this video:
set.seed(8965379) # Create random example data
data <- data.frame(date = sample(seq(as.Date("2020/01/01"),
by = "day",
length.out = 1000),
100, replace = TRUE),
value = round(rnorm(100, 5, 2), 2))
data_new1 <- data # Duplicate data
data_new1$year <- strftime(data_new1$date, "%Y") # Create year column
data_new1$month <- strftime(data_new1$date, "%m") # Create month column
data_aggr1 <- aggregate(value ~ month + year, # Aggregate data
data_new1,
FUN = sum)
install.packages("lubridate") # Install & load lubridate
library("lubridate")
data_new2 <- data # Duplicate data
data_new2$year_month <- floor_date(data_new2$date, # Create year-month column
"month")
install.packages("dplyr") # Install dplyr package
library("dplyr") # Load dplyr
data_aggr2 <- data_new2 %>% # Aggregate data
group_by(year_month) %>%
dplyr::summarize(value = sum(value)) %>%
as.data.frame()
Follow me on Social Media:
Facebook - Statistics Globe Page: / statisticsglobecom
Facebook - Group for Discussions & Questions: / statisticsglobe
LinkedIn - Statistics Globe Page: / statisticsglobe
LinkedIn - Group for Discussions & Questions: / 12555223
Twitter: / joachimschork
Music by bensound.com
Incredible how everytime Icome across something I have to do in R, you have an answer for it! Thanks for all the effort in publishing this amazing content.
Thank you so much for the kind words Mateus, this is great to hear! :)
I've been hitting a wall trying to create birth cohorts from POSixt data, but this video has been so helpful. Thanks so much!
This is great to hear, thanks for the kind comment!
Hi, I want to convert 5 climatic variables which are in hourly intervals to daily. How can I do that please ?
Hello,
Sorry for the late response. Do you still need help?
Regards,
Cansu
Hi Joachim, thanks a lot for your very helpful videos. For me, when it comes to the aggregation, it worked the following code withouth having to replicate the original data frame (df)
df_agg%
group_by(lubridate::floor_date(df$date,'month'))%>%
dplyr::summarize(value = sum(value))%>%
data.frame
colnames(df_agg)
Hey Jon, thanks a lot for the very kind words, and for the alternative R code! :)
You are continuously providing important tutors for us . Please continue in this manner. Personally I got sufficient knowledge . Keep shining .....
If you have time we will discuss about statistics .
Thanks a lot for this wonderful feedback Wakjira! Glad you find my tutorials helpful! :) If you have any more detailed questions you want to discuss, you may have a look at the Statistics Globe Facebook Group: facebook.com/groups/statisticsglobe Regards, Joachim
Dear I have a question for you. How can I correlate more two variable at once.
Have a look here: statisticsglobe.com/correlation-matrix-in-r
@@StatisticsGlobe I got it, thank you bro.
I am just wondering why this channel is underrated! ?.. greatly explained Sir. Much respect. : From a New Subscriber.
Much appreciated Lester! The channel is constantly growing, so I'm happy about it! :)
Joachim, you always have anwers to my R questions. Thank you very much!
Thank you very much John, glad to hear it!
Joachim,
I am usually very pleased with your examples and explanations. However, this time I had several difficulties. First, I found your code very hard to read - even though I made sure I was running HD and magnifying the page. Even though that got better later you might consider enlarging your workspace a little. Second, you did not explain the floor function of the lubridate package. I was at first perplexed but then realized that by "flooring" the month you went back to the first day of the month. Another thing I found interesting is that the tidy method (using lubridate and dplyr) was a lot easier to follow and at the end generated a smaller data set than basic does (hence saving memory and running faster). In spite of these little issues I found this a very educational exercise. Thanks!
Thanks a lot for the honest feedback Haraldur, this definitely helps to improve future videos! I agree, I should have explained the floor function in more detail. Regarding the video quality, have you watched the video on desktop or mobile? On my end, the video quality seems to be fine. Thanks, Joachim
Joachim,
I typically watch on a large screen using an iMac pro. So the screen size is not an issue. I also make sure that the video quality is the highest. You may just have to enlarger your font or zoom in a bit when you show the code. In this instance I struggled making out the ~ sign. It looked very similar to a = at that scale. I knew it was a formula so had to be a ~ which I confirmed by looking at the help file for the aggregate function.
OK, thanks for getting back to me on this. It's sometimes a trade-off between showing all code on the same page vs. enlarging the code. I'll keep your feedback in mind for the future. Thanks again, Joachim
Joachim,
You are welcome. I am a professor at a large American university and believe in treating my students fairly but being critical when that is required so they can learn and grow. I never ask them to do something that I could not do myself (this is a problem with many prof that run labs and have no idea how their equipment works). Many folks don't like to have their mistakes pointed out. But it is human to err and I found that constructive criticism when well intentioned typically helps all of us.
Absolutely, without feedback it's difficult to improve. So please keep on telling me whenever you see something you'd like to be changed! :)
Hi Joachim! Once again a brilliant video. Thank you for such a clear explanation! I have one doubt. Instead to aggregating the data into a YR and Month column, I want to have an aggregation such that month is aggregated into 1st half of the year and 2nd hald (ie Jan-Jun is 1st half, and Jul to Dec is 2nd). And calculate the mean per half for each year. any advise on how I can go about this?
Hello Kivan,
Maybe you take a look at this discussion; it looks helpful for your case.
Regards,
Cansu
Hi. Wondering if you help me with an issue? Every time I go to create a year and month column, I keep receiving an error that reads “error in ‘$
Hey Leah, please have a look here: statisticsglobe.com/error-replacement-has-x-rows-data-has-y-in-r
Hello, You are just amazing. I have a question. I have hourly temperature data for 30 years. I have to convert this hourly data to daily max, daily min, daily mean. I am just curious to know how I can do this using R. Please help me out in this.
Hello Sheetal,
Could you please share with us which format is used for the recorded time?
Regards,
Cansu
Hello Cansu. Thankyou for asking. I put date and time in one column together. Earlier it was in CHR format. Then I convert it into POSIXct, format: "1990-01-01 00:00:00" "1990-01-01 01:00:00" . So now how i convert my hourly data to daily max, daily min and daily mean. and then from daily to monthly min, max, and mean. Thankyou
@@SheetalKumari-gx3tl hello again,
I think you can adapt the code for your use as follows.
set.seed(8965379) # Create random example data
data
what if I want to get not the sum of each interval but the average of each interval?
Thank you for your help!
Hello Nicola!
You can use FUN = mean.
Best,
Cansu
Hi Joachim, thank you very much for a super clear and detailed explanation. I have a question and would be great if you could advise. Supposed there is a long data file where each person in the ID column has a different number of visits on different dates? How do you calculate the cumulative days per ID per visit based on their first visit date? Highly appreciate for any feedback, thank you!!
Hey Sok, thank you so much for the great feedback, glad you find the tutorial helpful! Could you illustrate your data and desired output with an example? I have some difficulties to understand what you want to do. Regards, Joachim
@@StatisticsGlobe Hi Joachim,thank you so much for your kind reply.I found my answer after watching another video from you on the lead and lag function.I was supposed to create a lagged variable for dates that I’m working with.Super thankful to your videos and I have managed to solve my issues already☺️
This is great to hear! I'm glad you found a solution for your problem! :)
Hi Joachim, thanks for the informative video. I'm new in r and self-learning. i want to make time series graphs using ggplot2 and I have weekly data such as 2023-06-20, 2023-06-28 etc. I want to make a plot showing the date as it is with an interval of two weeks. Pls any guide or tips on how to go about it?
Hello!
library(ggplot2)
library(lubridate)
# Creating a sample data frame
date_seq
Hi cansu,
Thank you very much for the support. This has solved my problem.
Hey @@rohdoflactem4692 !
I am glad that it helped. Have a good one!
Best,
Cansu
Hi, I want to calculate mean of values from certain months. my data from 2010 to 2020. I want to calculate average of values from November 2010 to April 2011, and then November 2011 to April 2012, and so on with same pattern until April 2020. Do you have a video about this?
Hello Fitra,
Is it monthly data, and does it always include the months between November and April for each year? If so please see the for loop I have created below to compute the average for the given periods from 2010 to 2013.
set.seed(100)
df
Hey Joachim, quick question: how do you aggregate quarterly data when it starts in q3? I've been using
aggregate( ts(tesla_model$sales, start=c(2017,3,) , frequency=4), FUN=sum, nfrequency=1)
but it doesn't add them correctly and the new function has a start date of 2017.5 and the end date of 2020.5 which is incorrect. The correct end date is 2021. Can you please help? Thank you!
Hey Milka, are you still searching for help? I just came back from holidays and couldn't respond earlier. Regards, Joachim
Hi Joachim, I do have a question actually. Why do you add the final as.data.frame() in the final step? Great video by the way.
Hey John, thanks for the kind words! :) The as.data.frame function converts the tibble output provided by the dplyr package to a data.frame. Personally, I prefer to work with data.frame objects instead of tibbles. However, this is an optional step that you may skip, in case you want to create a tibble output. Regards, Joachim
@@StatisticsGlobe Got it thanks
You are welcome, great to hear! :)
Strange... the strftime() function didn't work for me until I imported lubridate. Isn't that supposed to be part of Base R?
strftime is a Base R function. Are you sure that it's related to this?
@@StatisticsGlobeYeah, obviously something is going wrong on my end. But I imported lubridate and it worked. Strange.
Hmm, yeah that's strange. Unfortunately, I also don't know why this happens, but I'm happy that it seems to work now. :)
Sir, I just learned about the r program 3 days ago. so I'm asking how to visualize sales data only in a particular year. while my data displays month and year dates from 2019 to 2024. what function should I use to display only sales data 2020-01 - 2020-12?
Hi Joachim, I was wondering if you have a video or could tell me how I could aggregate daily data into weekly data that I could group by subject? For example, taking an athletes distance covered in daily practices and converting that to the total distance they've ran throughout the whole week?
Hey David, you can do that by grouping your data based on multiple columns. See here: statisticsglobe.com/group-data-frame-by-multiple-columns-in-r
@@StatisticsGlobe Thanks for the help! would you happen to know how to turn it from daily data into weekly data? It's still considering it as different groups because the dates differ. I've tried using the floor_date to group it by week, but it gave me a "subscript is out of bounds" error. Thanks again!
Glad it is useful! :) You may convert days to week numbers as explained here: statisticsglobe.com/get-week-number-of-date-in-r
god bless you my friend !
Thank you so much, Samuel! :)
Hlo sir can you provide some video on analysing anomalies ( day of week and month of year). In time series analysis
Hello Nidhi,
Noted!
Best,
Cansu
Hi thanks alot.I have a grided data sets which has daily precipitation time series data.I try to apply using package Lubridate.but it says "error in object[[ name,exact=T]]:subscript out of bounds.I think because of gridded data sets.could you give explaination how to aggreagte daily time series data in to monthly for gridded data thanks alot
Hey Daniel, could you explain what you mean with gridded data? Could you share some example data?
Hello sir , in datanew1 after getting months and years , i only want to extract aprils (month4) and make a new data frame that contains Aprils only how do i do ?
Hello,
You can filter your data frame by the filter function of dplyr to filter the month April. See our tutorial: statisticsglobe.com/r-filter-function-dplyr-package
Best,
Cansu
@@cansustatisticsglobe thank you 😍
@@filsvainqueurByiringiro welcome!
@@cansustatisticsglobe thank you
Hii, i have a netcdf file with lon, lat, time dimensions and my variable is precipitation. I want to aggregate the daily data to obtain the monthly average so I tried this video and its not working, do you know if the method changes a little or something?
Great video tho!!
Hey Adrian, thanks for the kind words! The method should still work. Could you share the code you have used? Regards, Joachim
@@StatisticsGlobe Sure! This is what I'm using
library(ncdf4)
setwd("/Users/adriantorres/Downloads/SIEMVRA")
nc
It seems like you are modifying the month and year values in a different data set (i.e. time) as the one you use within the aggregate function (i.e. prcut). All relevant variables would have to be stored in the same data set.
@@StatisticsGlobe but i cannot use the time$date, "%Y" because it says that it is invalid for atomic vectors and i dont know where to modify it since it is a netcdf and I have no idea how that can affect my dimensions
I assume prcut is a data frame? If yes, you would have to add the year and month variables to this data frame and modify it within the data frame accordingly. Afterwards, you can use the aggregate function as you did above.
Sir, i may use your help again , lets say i want to make a dataframe containing values of days between(01 march to 06 may) of every year from 1982 up to 2022) , how do i proceed thank you !
Hello,
Please try this solution.
# Create a list of date sequences for each year from 1982 to 2022
list_of_dates
@@cansustatisticsglobe thank you 🥰 , it works , but i wanted to relate to this video of yours in data_new1, instead of selecting a particular month , i can replace the selected month with days between 1st March up to 6th may
Hello @@filsvainqueurByiringiro ,
Are you trying to filter the date column based on the months? I am not sure if I get your intention well.
Best,
Cansu
@@cansustatisticsglobe , lets say these dates are associated with data and i wanted that this list comes with associated value in the final dataframe ) example i have dairly rainfall from 1982 to 2022 , and i want to extract rainfall associated with dates between 1st march to 06 may of each year and finaly calculate the average dairly rainfall for these 42 years
Hello @@filsvainqueurByiringiro,
I think this script should be doing what you asked for.
# Load required libraries
#install.packages("lubridate")
#install.packages("dplyr")
library(lubridate)
library(dplyr)
# Sample data (you should replace this with your data)
df %
group_by(year = year(date)) %>%
summarise(avg_rainfall = mean(rainfall))
head(results)
# # A tibble: 6 × 2
# year avg_rainfall
#
# 1 1982 -0.0620
# 2 1983 0.103
# 3 1984 -0.223
# 4 1985 -0.0627
# 5 1986 0.114
# 6 1987 -0.157
Please let me know if something is unclear with the implementation or if it doesn't compute what you asked for.
Best,
Cansu
Hey. Do you have a tutorial on how to plot this aggregated data?
Hey, I do not have a specific video on this, but you may use the typical kinds of plots that you would also use for unaggregated data. This strongly depends on the content of your data and the way how you want to visualize it.
Thank you!! This has helped so much!
You're welcome. Glad it helped! 🙂
we could also do # add month variable/factor to data > data$month monthvalue
Hey, thank you for sharing this additional code!
What if you require the month as “Jan, Feb, Mar” ?
Hey S C, you may convert the numeric values to month abbreviations as explained here: statisticsglobe.com/convert-numeric-to-month-names-and-abbreviations-r Regards, Joachim
In the create month strftime function simply change the "%b" to "%m":
data_new1$month
Great help!
Thanks a lot. Glad it helped! 🙂
hi, how to write a xlsx file in R to read the file?
Hello Razia,
Do you like to write the data to an xlsx file or read from an xlsx file?
Regards,
Cansu
@@cansustatisticsglobe hi, I wanted to read from xlsx file and I somehow succeeded but now I stuck in computing the monthly mean of the data. Thanks
@@raziasultanashaky3665 Hello Razia,
Normally, the method given in this tutorial should work. Do you get any errors when you try to do it? If so, can you share it? I need some more details to abe able to help you.
Regards,
Cansu
@@cansustatisticsglobe hi cansu, the format of date has been changed like 45-11-51 instead of 12-03-21. How to share? Should I send it to you by email?
@@raziasultanashaky3665 Did it change after importing the xlsx file to R or after employing the dplyr functions? Unfortunately, I don't share my email address, but you can share the code here.
Best,
Cansu
floor_date - interesting function
Indeed Henrique, it is :)
Thanks for all
You are welcome! Thanks for your kind comment, hope the video was helpful!