Introduction to the dplyr R package

Roger Peng

66,706 views

Published on 30 Dec 2014
Science & Technology

Comments • 46

@ArunRangarajan 7 years ago ⁺³
Short, complete and crystal clear! You absolutely rock, Dr Roger Peng!
@pensivenincompoop2016 7 years ago
I am new to R and I am learning it for my phylogenetics and statistics and I can already tell that this package is very useful. Thanks for the tutorial!
@anthonychariton9952 6 years ago ⁺¹
Brilliant overview, thank you kindly for this
@PandiMengri 4 years ago
This is exactly what I was looking for! Thank you, Roger! :)
@bodobruckner9600 9 years ago
Good, flawless and fast, as we have come to appreciate in Roger Peng's and friends' Coursera courses :-)
@ChristopherSkyi 9 years ago ⁺¹¹
To get chicago.rds, go here: github.com/DataScienceSpecialization/courses/blob/master/03_GettingData/dplyr/chicago.rds
@jyotijain5157 8 years ago
Thank You.
@lalaithan 7 years ago
Can someone explain why it is that I get all "NA"s when I input chicago
@gmshadowtraders 8 years ago
Dude you rock! You look a lot like the other R expert Professor Andrew Ng :)
@kvafsu225 3 years ago
Really nice video. Thanks.
@calefalejandrorodriguezcue3754 7 years ago
Hi Roger. Thanks for this video.
I have a DataFrame in R that has several variables (at least three).
What I would like to do is make a pivot table that shows subtotals for each of the variables. I've achieved this with only 2 variables but, unfortunately, when I add a third or fourth variable, its subtotal is not added under its parent variable.
Do you know how to do this in R?
I've also tried it in pandas pivot_table but I've got the same.
Please help :'(
@michelemelchiori7628 9 years ago
Very nice! Please consider adding an explanation of joins, which are important too
@c.deg.7982 5 years ago
For some reason I cannot get tally() or count() to work inside the summarize() function for a dataset grouped by a categorical variable...
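One possible explanation, offered as an assumption because the comment shows no code: tally() and count() take a whole data frame, so they are not meant to be called inside summarise(). A minimal sketch using the built-in mtcars data, with cyl standing in for the categorical variable:

```r
library(dplyr)

# n() is the per-group row counter intended for use inside summarise()
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n_cars = n(), mean_mpg = mean(mpg))

# count() groups and counts in one step, outside summarise()
cyl_counts <- mtcars %>% count(cyl)
```

Both results hold the same counts; summarise() with n() is the handier form when other summaries are needed alongside the count.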
@kevinmaeir1612 6 years ago
Hey, I have a table with 4 columns. Two of them are lists of different dates and the others hold numbers. I want to compare the date columns and get a new table with just the numbers that share the same date. Can you help me? Thanks
@WahranRai 5 years ago
14:27 Assigning work variables and splitting the code into one instruction per line is useful for debugging and makes the code more readable!
@MrAlivallo 5 years ago
So the hardest part of getting started with 'dplyr' is getting the data wrangled to match for manipulation. How do I do this inside {r}? If I do this in PowerBI it is all Drag/Drop/Click. Why doesn't this exist for RStudio?
@carriballa 9 years ago ⁺¹
Thanks Roger, where can I get the data set from? I tried looking for it.
@claveralvaro6245 5 years ago
You can even do it from Excel, just make sure you have the right kinds of variables to work with. For an xlsx file (Excel format), the package you need to load the data is called "readxl". But if you are, like, too lazy or something, there are some default datasets to work with like "iris" or "crabs": just put one into a variable as a data frame, print it and KAPOO YAH!
@linussunil83 8 years ago ⁺²
Can someone explain to me the step where he mutates the tempcat column in df? I don't understand the arguments used for factor: factor(1*(tmpd
@rohanshingade7228 8 years ago ⁺⁵
It is 1 multiplied by (tmpd < 80). If we simply type (tmpd < 80) we get a logical vector, but if we multiply it by 1 we get a numeric vector.
@linussunil83 8 years ago
Thanks buddy
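The conversion described in this exchange can be sketched in base R. The temperatures and the labels "hot" and "cool" are hypothetical, and the 80-degree threshold follows the reply above rather than the video itself:

```r
# Hypothetical temperatures standing in for the tmpd column
tmpd <- c(75, 82, 79, 91)

is_cool <- tmpd < 80          # logical vector
as_number <- 1 * is_cool      # multiplying by 1 turns TRUE/FALSE into 1/0

# factor() sorts the levels (0, then 1), so the labels are given in that order
tempcat <- factor(as_number, labels = c("hot", "cool"))
```

Writing factor(tmpd < 80, labels = ...) directly would also work, since factor() accepts a logical vector; the multiplication only makes the 0/1 coding explicit.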
@AllenMartin-hp5yf 1 year ago
What/where is the website you downloaded "chicago" from?
@kevintan6484 8 years ago ⁺²
Hello everyone, I am a complete beginner in R. I could not even import the Chicago.rds file properly: I clicked Import Data on the right-hand side and selected the file, and it came out as garbled text.
So I imported my own data set (named data1) from a txt file and tried to follow the steps in the video.
Only a few of them succeeded, please help me out.
I have checked many times that I have downloaded the "dplyr" package, and I even tried reinstalling R and RStudio. My R version is 3.2.4.
data 1 looks like this:
V1 V2 V3 V4
Product Names Qty Numeric No.1 Numeric No.2
1. head(select(data1, V1:V3)) returns:
Error in head(select(data1, V1)) : could not find function "select"
2. data1.f = filter(data1, V4 > 50) returns:
Error in filter(data1, V4 > 50) : object 'V4' not found
Then I tried: data1.f = filter(data1, "V4" > 50)
It ran, but when I View data1.f, there are still numbers smaller than 50 in V4
Then I tried: data1.f = filter(data1, data1$V4 > 50)
When I View it, every cell in the frame shows "N/A"
3. Rename
data.1 = rename(data.1, V1 = Productnames, V2 = Qty) returns:
Error in rename(data.1, V1 = Productnames, V2 = Qty) : unused arguments (V1 = Productnames, V2 = Qty)
4. Group_by:
goodbad = group_by(data1, tempcat) returns:
Error: could not find function "group
I would really appreciate your help getting me out of the woods!!
@lobbielobbie1766 8 years ago ⁺¹
Hey Kelvin,
It is quite difficult to diagnose from the error messages alone, without the dataset and reproducible examples.
Here's a code sample which you can try. I am using RStudio and you can find a good dplyr cheat sheet at www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf. If you are worried or confused by the %>% pipe in the code, it just means 'passing the result of one statement to the next' in layman's terms. In addition, downloading a package only makes it available on your machine. To use any package in your code, you need to load it using library() as shown.
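# import libraries
library(dplyr)
# create a data frame with named columns
# (the original definition did not survive; the values below are an assumed stand-in)
set.seed(888)
MyDF <- data.frame(SalesID = sample(1:5, 100, replace = TRUE),
                   LocationID = sample(c("A", "B", "C"), 100, replace = TRUE),
                   SalesAmount = round(runif(100, min = 1, max = 100), 2))
# keep only rows where SalesAmount > 50
MyFilter <- MyDF %>%
  filter(SalesAmount > 50)
View(MyFilter)
# create a new sales commission variable using 1% of SalesAmount
MySales <- MyDF %>%
  mutate(MyCommission = 0.01 * SalesAmount)
View(MySales)
# sum totals by SalesID
MySummary <- MySales %>%
  group_by(SalesID) %>%
  summarise(NumbOfSales = n(), TotalSales = sum(SalesAmount),
            TotalCommission = sum(MyCommission))
View(MySummary)
# sum sales amount by LocationID
MyLocationSales <- MyDF %>%
  group_by(LocationID) %>%
  summarise(LocationSalesTotal = sum(SalesAmount))
View(MyLocationSales)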
HTH,
Lobbie
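For the specific errors listed in the question, here is a hedged sketch of likely causes. The data frame is a hypothetical stand-in for data1, and the diagnosis is an assumption since the real file is not shown: "could not find function" usually means library(dplyr) was never run, and "object 'V4' not found" is what another function named filter() (for example stats::filter()) would report.

```r
library(dplyr)

# Hypothetical stand-in for data1; the real file is not shown in the thread
data1 <- data.frame(V1 = c("a", "b", "c"),
                    V2 = c(10, 20, 30),
                    V4 = c(40, 60, 80))

# Comparing the string "V4" with 50 gives one single answer for the whole
# table, so it cannot pick rows by the values stored in the column
string_test <- "V4" > 50

# The dplyr:: prefix guards against picking up another filter() by mistake
data1.f <- dplyr::filter(data1, V4 > 50)

# rename() takes new_name = old_name
data1.r <- dplyr::rename(data1, Productnames = V1, Qty = V2)
```

If the column names of the txt file were read in as the first data row, V4 would hold text rather than numbers, which may explain the "N/A" values; reading the file with header = TRUE would be worth trying in that case.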
@claudiuskerth9497 9 years ago ⁺¹
Where can chicago.rds be downloaded from?
It isn't the same dataset as in the gamair package
Many thanks
@michelemelchiori7628 9 years ago ⁺²
github.com/DataScienceSpecialization/courses/blob/master/03_GettingData/dplyr/chicago.rds
then click on the "Raw" button
@ghtyu99 6 years ago
I have tried several times to download this dataset from GitHub using the link above and I receive an error message (see below) whether or not I use the "View Raw" button. I am running R for Mac OS R 3.3.3 GUI 1.69 Mavericks build (7328). Does anyone have a workaround or correction? Thanks.
"Error: bad restore file magic number (file may be corrupted) -- no data loaded
In addition: Warning message:
file ‘chicago.rds’ has magic number 'X'
Use of save versions prior to 2 is deprecated".
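This message looks like what load() reports when it is given an .rds file, although the comment does not say which function was called, so treat that as an assumption. An .rds file holds a single object and is read with readRDS(). A minimal sketch using a temporary file and a hypothetical two-row data frame in place of chicago.rds:

```r
# Hypothetical small data frame standing in for the Chicago data
example_df <- data.frame(city = "chic", tmpd = c(31.5, 33.0))

# saveRDS() writes one object; readRDS() reads it back and returns it
rds_path <- tempfile(fileext = ".rds")
saveRDS(example_df, rds_path)
restored <- readRDS(rds_path)

# load() expects a file written by save(); on an .rds file it stops with an error
load_result <- tryCatch(suppressWarnings(load(rds_path)),
                        error = function(e) "load() failed")
```

With the downloaded file, the equivalent call would be chicago <- readRDS("chicago.rds"), assuming the file sits in the working directory.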
@MultiHunter36 5 years ago
Why am I not able to use the select function?
Error in select(chicago, city:dptp) : could not find function "select"
>
@rrmaximiliano 5 years ago
Maybe you didn't load the dplyr package. Use library(dplyr)
@tuanlong9238 6 years ago ⁺¹
My god, looks like he uses the original R version, super =)))
@yousfoss4367 4 years ago
Thanks, great prof
@mikebosko9077 9 years ago
I'm new to R, what is meant by 'making sure all the factors are annotated'? I understand factors, but annotated how? Thanks much! -Mike
@mdev1187 9 years ago
@3:14 it's the *levels* of any factors present (there aren't any in the chicago data.frame), so you can control if and when levels are kept or dropped.
Usually I'd want to retain the levels of an *ordered* factor (like a Year), but not those of unordered ones (like City). If data is missing for a Year (derived from the date variable) in one City, I wouldn't want to lose that Year as a level, so I'd make Year an ordered factor before filtering. If City were a factor I probably wouldn't want to retain every level after filtering, so it's best left as a character variable so the issue doesn't arise.
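The level behaviour described here can be seen with a small base R sketch. The year values are hypothetical, and plain subsetting is used in place of a dplyr filter:

```r
# Hypothetical year factor; 2003 is declared as a level but has no rows
year <- factor(c(2001, 2001, 2002),
               levels = c(2001, 2002, 2003), ordered = TRUE)

# Subsetting keeps every declared level, including ones with no rows left
kept <- year[year == "2001"]

# droplevels() discards the levels that are no longer used
dropped <- droplevels(kept)
```

Here levels(kept) still lists all three years, while levels(dropped) lists only the one that remains, so the choice between the two decides whether empty years show up in later tables.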
@jdlopez131 5 years ago
Isn't the sqldf package a lot better than dplyr? I mean SQL commands :) need I say more?
@kunalbali810 9 years ago
Suppose I have two data frames like
latitude longitude values
20 11 3.5
20 12 1.5
20 13 4.5
20 14 4
21 11 1.2
21 12 1.4
21 13 1.4
21 14 1.8
and
latitude longitude values
20 11 3
20 12 1
20 13 4
20 14 4
21 11 1
21 12 1
21 13 1.4
21 14 1.2
now i need to get the result like
20 11 3.32
20 12 1.25
20 13 4.25
20 14 4
21 11 1.1
21 12 1.2
21 13 1.4
21 14 1.5
You see, I just took the mean of the third column row by row. How can I do that? I am dealing with atmospheric data, so I need to do this. Please tell me how.
@sushantchoudhary6393 9 years ago
You could just say dataframe3$values = dataframe1$values + dataframe2$values.
How you got 3.32 there in the third table though is ... it's not the mean of 3 and 3.5, just so we're on the same page.
@sushantchoudhary6393 9 years ago
Sorry, forgot to divide by 2.
dataframe3$values = dataframe3$values/2
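The element-wise approach above relies on both data frames listing the same locations in the same order. A sketch that matches rows by latitude and longitude first, using three rows from each table in the question (the names df1, df2 and both are illustrative):

```r
# Three rows from each table in the question
df1 <- data.frame(latitude = c(20, 20, 21),
                  longitude = c(11, 12, 11),
                  values = c(3.5, 1.5, 1.2))
df2 <- data.frame(latitude = c(20, 20, 21),
                  longitude = c(11, 12, 11),
                  values = c(3, 1, 1))

# merge() pairs rows by the key columns and suffixes the clashing "values" columns
both <- merge(df1, df2, by = c("latitude", "longitude"),
              suffixes = c(".a", ".b"))

# Row-wise mean of the two matched value columns
both$mean_value <- (both$values.a + both$values.b) / 2
```

Matching on the coordinates keeps the result correct even if one file is sorted differently or is missing a grid cell.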
@kunalbali810 9 years ago
Sushant Choudhary Do you know how to plot standard errors or error bars on a time-series graph?
@sushantchoudhary6393 9 years ago
Yes, I do. To say any more than that, I would need a more precise question, though.
@kunalbali810 9 years ago ⁺¹
Sushant Choudhary Great. I have 13 years of monthly data, from 2002-09-01 to 2014-12-01. Now I need to plot the annual mean with standard deviation, and the monthly mean (2002-2014) with standard deviation. The data is below. I hope you get my point.
africa_co china_co SM_CO
2002-09-01 2.05 2.11 2.09
2002-10-01 2.125 2.095 2.21
2002-11-01 2.035 2.175 2.095
2002-12-01 2.095 2.175 1.905
2003-01-01 2.15 2.29 1.815
2003-02-01 2.12 2.33 1.775
2003-03-01 2.025 2.475 1.875
2003-04-01 1.92 2.415 1.765
2003-05-01 1.885 2.335 1.585
2003-06-01 1.775 2.35 1.56
2003-07-01 1.87 1.91 1.59
2003-08-01 2.035 1.945 1.755
2003-09-01 2.145 1.95 2.125
2003-10-01 2.12 2.025 1.98
2003-11-01 2 2.12 1.89
2003-12-01 2.04 2.195 1.85
2004-01-01 2.105 2.285 1.72
2004-02-01 2.14 2.335 1.81
2004-03-01 2.07 2.52 1.75
2004-04-01 1.915 2.45 1.68
2004-05-01 1.82 2.185 1.57
2004-06-01 1.775 2.085 1.545
2004-07-01 1.88 1.91 1.62
2004-08-01 1.965 1.97 1.755
2004-09-01 2.09 2.035 2.33
2004-10-01 2.095 2.075 2.17
2004-11-01 1.98 2.075 2.02
2004-12-01 2.13 2.145 1.89
2005-01-01 2.185 2.34 1.78
2005-02-01 2.11 2.365 1.7
2005-03-01 2.005 2.535 1.725
2005-04-01 1.91 2.505 1.655
2005-05-01 1.805 2.26 1.585
2005-06-01 1.77 2.065 1.495
2005-07-01 1.85 1.87 1.59
2005-08-01 2.025 1.885 1.95
2005-09-01 2.19 1.955 2.365
2005-10-01 2.18 2.035 2.455
2005-11-01 2.09 2.065 2.08
2005-12-01 2.165 2.275 1.845
2006-01-01 2.115 2.265 1.72
2006-02-01 2.06 2.25 1.685
2006-03-01 1.905 2.38 1.69
2006-04-01 1.8 2.31 1.645
2006-05-01 1.74 2.135 1.545
2006-06-01 1.73 1.955 1.5
2006-07-01 1.795 1.885 1.515
2006-08-01 1.995 1.99 1.775
2006-09-01 2.09 2.1 2.205
2006-10-01 2.01 2.17 2.03
2006-11-01 2.005 2.165 1.9
2006-12-01 2.125 2.195 1.885
2007-01-01 2.215 2.315 1.8
2007-02-01 2.2 2.42 1.865
2007-03-01 2.17 2.535 1.825
2007-04-01 1.955 2.57 1.715
2007-05-01 1.81 2.225 1.585
2007-06-01 1.72 2.13 1.51
2007-07-01 1.84 1.87 1.53
2007-08-01 1.98 1.945 1.815
2007-09-01 2.115 2.05 2.54
2007-10-01 2.14 2.065 2.52
2007-11-01 2.005 2.07 2.03
2007-12-01 2.12 2.15 1.75
2008-01-01 2.115 2.25 1.71
2008-02-01 2.2 2.355 1.765
2008-03-01 2.09 2.45 1.815
2008-04-01 1.84 2.36 1.725
2008-05-01 1.75 2.265 1.545
2008-06-01 1.74 2.055 1.485
2008-07-01 1.85 1.855 1.525
2008-08-01 1.99 1.88 1.7
2008-09-01 2.095 1.885 1.995
2008-10-01 2.01 1.865 2.08
2008-11-01 1.98 1.865 1.915
2008-12-01 2.07 2.005 1.755
2009-01-01 2.125 2.18 1.695
2009-02-01 1.975 2.155 1.665
2009-03-01 1.945 2.375 1.635
2009-04-01 1.84 2.37 1.655
2009-05-01 1.73 2.17 1.565
2009-06-01 1.73 1.975 1.49
2009-07-01 1.83 1.83 1.48
2009-08-01 1.925 1.91 1.635
2009-09-01 2.04 1.91 1.82
2009-10-01 1.985 1.97 1.895
2009-11-01 1.925 1.95 1.89
2009-12-01 2.055 2.105 1.87
2010-01-01 2.09 2.125 1.74
2010-02-01 2.02 2.225 1.705
2010-03-01 1.95 2.415 1.7
2010-04-01 1.92 2.395 1.67
2010-05-01 1.775 2.16 1.555
2010-06-01 1.735 2.01 1.53
2010-07-01 1.835 1.83 1.55
2010-08-01 1.995 1.865 1.91
2010-09-01 2.16 1.91 2.38
2010-10-01 2.275 1.885 2.47
2010-11-01 2.045 1.97 1.91
2010-12-01 2.01 2.045 1.75
2011-01-01 2.12 2.245 1.675
2011-02-01 2.115 2.265 1.685
2011-03-01 2.06 2.35 1.685
2011-04-01 1.865 2.355 1.635
2011-05-01 1.755 2.075 1.54
2011-06-01 1.72 1.93 1.475
2011-07-01 1.84 1.89 1.51
2011-08-01 2.025 1.87 1.64
2011-09-01 2.175 1.92 2.04
2011-10-01 1.94 1.925 1.9
2011-11-01 1.895 1.88 1.735
2011-12-01 2.045 2.095 1.77
2012-01-01 2.155 2.215 1.705
2012-02-01 2.15 2.28 1.7
2012-03-01 2.065 2.385 1.685
2012-04-01 1.965 2.34 1.625
2012-05-01 1.765 2.2 1.535
2012-06-01 1.78 2.045 1.465
2012-07-01 1.82 1.93 1.5
2012-08-01 2.025 1.935 1.685
2012-09-01 2.11 1.955 2.07
2012-10-01 2.005 1.995 2.005
2012-11-01 1.94 1.925 1.9
2012-12-01 1.965 2.065 1.755
2013-01-01 2.065 2.17 1.64
2013-02-01 2.085 2.205 1.715
2013-03-01 1.975 2.305 1.7
2013-04-01 1.86 2.355 1.6
2013-05-01 1.8 2.1 1.54
2013-06-01 1.8 1.855 1.505
2013-07-01 1.9 1.775 1.52
2013-08-01 2.115 1.795 1.64
2013-09-01 2.085 1.865 1.825
2013-10-01 1.905 1.895 1.85
2013-11-01 1.895 1.895 1.685
2013-12-01 1.915 2.04 1.68
2014-01-01 2.07 2.115 1.645
2014-02-01 2.075 2.175 1.69
2014-03-01 2.035 2.34 1.73
2014-04-01 1.855 2.435 1.635
2014-05-01 1.725 2.09 1.545
2014-06-01 1.745 1.99 1.465
2014-07-01 1.8 1.775 1.48
2014-08-01 1.95 1.875 1.675
2014-09-01 2.005 1.835 1.915
2014-10-01 1.99 1.89 1.92
2014-11-01 1.975 1.92 1.79
2014-12-01 1.985 2.07 1.73
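The summaries asked for here fit the group_by() and summarise() pattern from the video. A minimal sketch using four africa_co rows taken from the table above; the column name date and the result names are assumptions, and plotting is left out:

```r
library(dplyr)

# Four rows of the africa_co column from the table above, as a small stand-in
co <- data.frame(
  date = as.Date(c("2002-09-01", "2002-10-01", "2003-09-01", "2003-10-01")),
  africa_co = c(2.05, 2.125, 2.145, 2.12)
)

# Annual mean and standard deviation
annual <- co %>%
  mutate(year = format(date, "%Y")) %>%
  group_by(year) %>%
  summarise(mean_co = mean(africa_co), sd_co = sd(africa_co))

# Mean and standard deviation for each calendar month across all years
monthly <- co %>%
  mutate(month = format(date, "%m")) %>%
  group_by(month) %>%
  summarise(mean_co = mean(africa_co), sd_co = sd(africa_co))
```

The mean_co and sd_co columns of either result can then be handed to a plotting function to draw the means with error bars.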
@Dwright3316 9 years ago
What version of R is Dr. Peng using here?
I have downloaded R version 3.2.1 (2015-06-18). But, unfortunately, I cannot use the "chicago.rds" package; the error message says it is not available (for R version 3.2.1).
Are there any workarounds for this? Or would I need to uninstall my current version of R and find the older version in order to install/load this package?
Thank you! I'm new to programming in R, so any help would be greatly appreciated!
@lalaithan 7 years ago
It's a dataset, not a package.
