Introduction to the dplyr R package


Comments • 46

  • @ArunRangarajan
    @ArunRangarajan 7 years ago +3

    Short, complete and crystal clear! You absolutely rock, Dr Roger Peng!

  • @pensivenincompoop2016
    @pensivenincompoop2016 7 years ago

    I am new to R and I am learning it for my phylogenetics and statistics and I can already tell that this package is very useful. Thanks for the tutorial!

  • @anthonychariton9952
    @anthonychariton9952 6 years ago +1

    Brilliant overview, thank you kindly for this

  • @PandiMengri
    @PandiMengri 4 years ago

    This is exactly what I was looking for! Thank you, Roger! :)

  • @bodobruckner9600
    @bodobruckner9600 9 years ago

    Good, flawless and fast, as we have come to appreciate in Roger Peng's and friends' Coursera courses :-)

  • @ChristopherSkyi
    @ChristopherSkyi 9 years ago +11

    To get chicago.rds, go here:

  • @lalaithan
    @lalaithan 7 years ago

    Can someone explain why I get all "NA"s when I input chicago

  • @gmshadowtraders
    @gmshadowtraders 8 years ago

    Dude you rock! You look a lot like the other R expert Professor Andrew Ng :)

  • @kvafsu225
    @kvafsu225 3 years ago

    Really nice video. Thanks.

  • @calefalejandrorodriguezcue3754
    @calefalejandrorodriguezcue3754 7 years ago

    Hi Roger. Thanks for this video.
    I have a DataFrame in R that has several variables (at least three).
    What I would like to do is make a pivot table that shows subtotals for each of the variables. I've achieved this with only 2 variables but, unfortunately, when I add a third or fourth variable, its subtotal is not added under its parent variable.
    Do you know how to do this in R?
    I've also tried it with pandas pivot_table, but I got the same result.
    Please help :'(

  • @michelemelchiori7628
    @michelemelchiori7628 9 years ago

    Very nice! Please consider adding an explanation of joins, which are important too.

  • @c.deg.7982
    @c.deg.7982 5 years ago

    For some reason I cannot get tally() or count() to work inside the summarize() function for a dataset grouped by a categorical variable...

  • @kevinmaeir1612
    @kevinmaeir1612 6 years ago

    Hey, I have a table with 4 columns. Two of them are lists of different dates and the other two are numbers. I want to compare the date columns and get a new table with just the numbers for the matching dates. Can you help me? Thanks.

  • @WahranRai
    @WahranRai 5 years ago

    14:27 Assigning work variables and splitting the code into one instruction per line is useful for debugging and makes the code more readable!!!

  • @MrAlivallo
    @MrAlivallo 5 years ago

    So the hardest part of getting started with 'dplyr' is getting the data wrangled into shape for manipulation. How do I do this inside {r}? If I do this in PowerBI it is all drag/drop/click. Why doesn't this exist for RStudio?

  • @carriballa
    @carriballa 9 years ago +1

    Thanks Roger, where can I get the data set from? I tried looking for it.

    • @claveralvaro6245
      @claveralvaro6245 5 years ago

      You can even do it from Excel; just make sure you have the right kinds of variables to work with. Also look for the package you need to load the data: for the xlsx format (an Excel file), the package is called "readxl". But if you are, like, too lazy or something, there are some built-in datasets to work with, like "iris" or "crabs" (the latter ships with the MASS package); just put one into a variable as a data frame, print it, and KAPOO YAH!

  • @linussunil83
    @linussunil83 8 years ago +2

    Can someone explain the step where he mutates the tempcat column in the df? I don't understand the arguments used for factor: factor(1*(tmpd

    • @rohanshingade7228
      @rohanshingade7228 8 years ago +5

      It is 1 multiplied by (tmpd < 80). If we simply type (tmpd < 80) we get a logical vector, but if we multiply it by 1 we get a numeric vector.

    • @linussunil83
      @linussunil83 8 years ago

      Thanks buddy
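
To illustrate the reply above, here is a minimal base-R sketch of the logical-to-numeric trick. The temperature values and the labels "hot" and "cold" are assumptions made up for this example; they are not taken from the video.

```r
# Hypothetical temperatures (illustrative values only)
tmpd <- c(75, 82, 80, 64)

is_below_80 <- tmpd < 80      # logical vector: TRUE FALSE FALSE TRUE
as_number <- 1 * is_below_80  # numeric vector: 1 0 0 1

# factor() sorts the levels (0, then 1), so the first label goes to 0
# (not below 80) and the second label goes to 1 (below 80).
tempcat <- factor(as_number, labels = c("hot", "cold"))
print(tempcat)
```

Multiplying by 1 is only one way to do the coercion; as.numeric(tmpd < 80) produces the same numeric vector.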

  • @AllenMartin-hp5yf
    @AllenMartin-hp5yf 1 year ago

    What/where is the website you downloaded "chicago" from?

  • @kevintan6484
    @kevintan6484 8 years ago +2

    Hello everyone, I am a complete beginner in R. I could not even import the Chicago.rds file correctly: I clicked the import data button on the right-hand side, selected the file, and it turned out to be garbled.
    So I imported my own data set (named data1) from a txt file and tried to follow the steps in the video.
    Only a few of them succeeded; please help me out.
    I have checked many times that I have downloaded the "dplyr" package, and I even tried reinstalling R and RStudio; my R version is 3.2.4.
    data1 looks like this:
    V1 V2 V3 V4
    Product Names Qty Numeric No.1 Numeric No.2
    1. head(select(data1, V1:V3)) returns:
    Error in head(select(data1, V1)) : could not find function "select"
    2. data1.f = filter(data1, V4 > 50) returns:
    Error in filter(data1, V4 > 50) : object 'V4' not found
    Then I tried: data1.f = filter(data1, "V4" > 50)
    It ran, but when I View data1.f, there are still numbers both bigger and smaller than 50 in V4
    Then I tried: data1.f = filter(data1, data1$V4 > 50)
    When I View it, the whole frame shows "N/A"
    3. Rename
    data.1 = rename(data.1, V1 = Productnames, V2 = Qty) returns:
    Error in rename(data.1, V1 = Productnames, V2 = Qty) : unused arguments (V1 = Productnames, V2 = Qty)
    4. Group_by:
    goodbad = group_by(data1, tempcat) returns:
    Error: could not find function "group
    I would really appreciate your help getting me out of the woods!!

    • @lobbielobbie1766
      @lobbielobbie1766 8 years ago +1

      Hey Kelvin,
      It is quite difficult to diagnose this from the error messages alone, without the dataset and a reproducible example.
      Here's a code sample you can try. I am using RStudio, and you can find a good dplyr cheat sheet at [link missing]. If you are worried or confused by the %>% pipe in the code, in layman's terms it just means 'pass the result of one statement to the next'. In addition, downloading a package only makes it available on your machine. To use a package in your code, you need to load it with library() as shown.
      # import libraries
      library(dplyr)
      # create a data frame with named columns
      # (the original values were lost from this comment; these rows are illustrative)
      MyDF <- data.frame(SalesID = c(1, 1, 2, 2),
                         LocationID = c("A", "B", "A", "B"),
                         SalesAmount = c(40, 75, 120, 30))
      # keep only rows with a sales amount > 50
      MyFilter <- MyDF %>%
        filter(SalesAmount > 50)
      # create a new sales commission variable using 1% of SalesAmount
      MySales <- MyDF %>%
        mutate(MyCommission = 0.01 * SalesAmount)
      # sum totals by SalesID
      MySummary <- MySales %>%
        group_by(SalesID) %>%
        summarise(NumbOfSales = n(), TotalSales = sum(SalesAmount),
                  TotalCommission = sum(MyCommission))
      # sum sales amount by LocationID
      MyLocationSales <- MyDF %>%
        group_by(LocationID) %>%
        summarise(LocationSalesTotal = sum(SalesAmount))

  • @claudiuskerth9497
    @claudiuskerth9497 9 years ago +1

    Where can chicago.rds be downloaded from?
    It isn't the same dataset as in the gamair package.
    Many thanks

    • @michelemelchiori7628
      @michelemelchiori7628 9 years ago +2
      then click on "Raw" button

    • @ghtyu99
      @ghtyu99 6 years ago

      I have tried several times to download this dataset from GitHub using the link above, and I receive an error message (see below) whether or not I use the "View Raw" button. I am running R for Mac OS: R 3.3.3 GUI 1.69 Mavericks build (7328). Does anyone have a workaround or correction? Thanks.
      "Error: bad restore file magic number (file may be corrupted) -- no data loaded
      In addition: Warning message:
      file ‘chicago.rds’ has magic number 'X'
      Use of save versions prior to 2 is deprecated".
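
A likely cause of the error quoted above is reading an .rds file with load(), which expects a file written by save(); files written by saveRDS() are read with readRDS(). The sketch below reproduces both cases with a small stand-in file, since the real chicago.rds is not available here. The file contents and variable names are assumptions for illustration only.

```r
# Write a small stand-in .rds file (hypothetical data, not the real chicago.rds)
path <- tempfile(fileext = ".rds")
saveRDS(data.frame(city = "chic", tmpd = c(31.5, 33.0)), path)

# readRDS() returns the stored object, so assign it to a name
chicago_like <- readRDS(path)
str(chicago_like)

# load() is meant for files written by save() and fails on an .rds file;
# the error is caught here so the script keeps running
load_result <- tryCatch(
  suppressWarnings(load(path)),
  error = function(e) conditionMessage(e)
)
print(load_result)
```

If readRDS() also fails on a downloaded file, the download itself may be incomplete or may be a web page rather than the raw file.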

  • @MultiHunter36
    @MultiHunter36 5 years ago

    Why am I not able to use the select function?
    Error in select(chicago, city:dptp) : could not find function "select"

    • @rrmaximiliano
      @rrmaximiliano 5 years ago

      Maybe you didn't load the dplyr package. Use library(dplyr)

  • @tuanlong9238
    @tuanlong9238 6 years ago +1

    My god, looks like he uses the original R version, super =)))

  • @yousfoss4367
    @yousfoss4367 4 years ago

    thks grand prof

  • @mikebosko9077
    @mikebosko9077 9 years ago

    I'm new to R, what is meant by 'making sure all the factors are annotated'? I understand factors, but annotated how? Thanks much! -Mike

    • @mdev1187
      @mdev1187 9 years ago

      @3:14 it's the *levels* of any factors present (there aren't any in the chicago data.frame), so you can control if and when levels are kept or dropped.
      Usually I'd want to retain the levels of an *ordered* factor (like a Year), but not of unordered ones (like City). If data is missing for a Year (derived from a date variable) in one City, I wouldn't want to lose that Year as a level, so I'd make Year an ordered factor before filtering. If City were a factor I probably wouldn't want to retain every level after filtering, so it's best left as a character variable so the issue doesn't arise.
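
As a small base-R sketch of the point about levels, the example below shows that subsetting a factor keeps all of its levels unless they are dropped explicitly. The year values are made up for illustration.

```r
# Hypothetical ordered factor of years (illustrative values only)
year <- factor(c("2001", "2002", "2003", "2003"), ordered = TRUE)

# Subsetting removes the 2002 observation but keeps "2002" as a level
kept <- year[as.character(year) != "2002"]
print(levels(kept))

# droplevels() removes levels that no longer occur in the data
dropped <- droplevels(kept)
print(levels(dropped))
```

Whether to keep or drop unused levels depends on the analysis; keeping them makes the gap for a missing year visible in tables and plots.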

  • @jdlopez131
    @jdlopez131 5 years ago

    Isn't the sqldf package a lot better than dplyr? I mean, SQL commands :) need I say more?

  • @kunalbali810
    @kunalbali810 9 years ago

    I have two data frames, suppose like this:
    latitude longitude values
    20 11 3.5
    20 12 1.5
    20 13 4.5
    20 14 4
    21 11 1.2
    21 12 1.4
    21 13 1.4
    21 14 1.8
    latitude longitude values
    20 11 3
    20 12 1
    20 13 4
    20 14 4
    21 11 1
    21 12 1
    21 13 1.4
    21 14 1.2
    Now I need to get a result like this:
    20 11 3.32
    20 12 1.25
    20 13 4.25
    20 14 4
    21 11 1.1
    21 12 1.2
    21 13 1.4
    21 14 1.5
    You see, I just took the mean of the 3rd column for each row. How can I do that? I am dealing with atmospheric data, so I need to do this. Please tell me how.

    • @sushantchoudhary6393
      @sushantchoudhary6393 9 years ago

      you could just say dataframe3$values = dataframe1$values + dataframe2$values.
      How you got 3.32 there in the third table though is ... it's not the mean of 3 and 3.5, just so we're on the same page.

    • @sushantchoudhary6393
      @sushantchoudhary6393 9 years ago

      Sorry forgot to divide by 2.
      dataframe3$values = dataframe3$values/2
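
The two-step suggestion above assumes both data frames list their rows in the same order. A variant that matches rows on latitude and longitude first can be sketched in base R as follows. The three rows are copied from the question; the object names df1, df2 and both are hypothetical.

```r
# First three rows of each data frame from the question
df1 <- data.frame(latitude = c(20, 20, 21),
                  longitude = c(11, 12, 11),
                  values = c(3.5, 1.5, 1.2))
df2 <- data.frame(latitude = c(20, 20, 21),
                  longitude = c(11, 12, 11),
                  values = c(3, 1, 1))

# Match rows on the coordinates, so row order does not matter
both <- merge(df1, df2, by = c("latitude", "longitude"),
              suffixes = c("_1", "_2"))

# Row-wise mean of the two value columns
both$mean_values <- (both$values_1 + both$values_2) / 2
print(both)
```

With dplyr, the same idea could be written with inner_join() followed by mutate().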

    • @kunalbali810
      @kunalbali810 9 years ago

      Sushant Choudhary Do you know how to plot standard errors or error bars in a time series graph??

    • @sushantchoudhary6393
      @sushantchoudhary6393 9 years ago

      Yes, I do. To say any more than that, I would need a more precise question, though.

    • @kunalbali810
      @kunalbali810 9 years ago +1

      Sushant Choudhary Great. I have 13 years of monthly data, from 2002-09-01 to 2014-12-01. Now I need to plot the annual mean with standard deviation, and the monthly mean (2002-2014) with standard deviation. The data is below. Hope you get my point.
      africa_co china_co SM_CO
      2002-09-01 2.05 2.11 2.09
      2002-10-01 2.125 2.095 2.21
      2002-11-01 2.035 2.175 2.095
      2002-12-01 2.095 2.175 1.905
      2003-01-01 2.15 2.29 1.815
      2003-02-01 2.12 2.33 1.775
      2003-03-01 2.025 2.475 1.875
      2003-04-01 1.92 2.415 1.765
      2003-05-01 1.885 2.335 1.585
      2003-06-01 1.775 2.35 1.56
      2003-07-01 1.87 1.91 1.59
      2003-08-01 2.035 1.945 1.755
      2003-09-01 2.145 1.95 2.125
      2003-10-01 2.12 2.025 1.98
      2003-11-01 2 2.12 1.89
      2003-12-01 2.04 2.195 1.85
      2004-01-01 2.105 2.285 1.72
      2004-02-01 2.14 2.335 1.81
      2004-03-01 2.07 2.52 1.75
      2004-04-01 1.915 2.45 1.68
      2004-05-01 1.82 2.185 1.57
      2004-06-01 1.775 2.085 1.545
      2004-07-01 1.88 1.91 1.62
      2004-08-01 1.965 1.97 1.755
      2004-09-01 2.09 2.035 2.33
      2004-10-01 2.095 2.075 2.17
      2004-11-01 1.98 2.075 2.02
      2004-12-01 2.13 2.145 1.89
      2005-01-01 2.185 2.34 1.78
      2005-02-01 2.11 2.365 1.7
      2005-03-01 2.005 2.535 1.725
      2005-04-01 1.91 2.505 1.655
      2005-05-01 1.805 2.26 1.585
      2005-06-01 1.77 2.065 1.495
      2005-07-01 1.85 1.87 1.59
      2005-08-01 2.025 1.885 1.95
      2005-09-01 2.19 1.955 2.365
      2005-10-01 2.18 2.035 2.455
      2005-11-01 2.09 2.065 2.08
      2005-12-01 2.165 2.275 1.845
      2006-01-01 2.115 2.265 1.72
      2006-02-01 2.06 2.25 1.685
      2006-03-01 1.905 2.38 1.69
      2006-04-01 1.8 2.31 1.645
      2006-05-01 1.74 2.135 1.545
      2006-06-01 1.73 1.955 1.5
      2006-07-01 1.795 1.885 1.515
      2006-08-01 1.995 1.99 1.775
      2006-09-01 2.09 2.1 2.205
      2006-10-01 2.01 2.17 2.03
      2006-11-01 2.005 2.165 1.9
      2006-12-01 2.125 2.195 1.885
      2007-01-01 2.215 2.315 1.8
      2007-02-01 2.2 2.42 1.865
      2007-03-01 2.17 2.535 1.825
      2007-04-01 1.955 2.57 1.715
      2007-05-01 1.81 2.225 1.585
      2007-06-01 1.72 2.13 1.51
      2007-07-01 1.84 1.87 1.53
      2007-08-01 1.98 1.945 1.815
      2007-09-01 2.115 2.05 2.54
      2007-10-01 2.14 2.065 2.52
      2007-11-01 2.005 2.07 2.03
      2007-12-01 2.12 2.15 1.75
      2008-01-01 2.115 2.25 1.71
      2008-02-01 2.2 2.355 1.765
      2008-03-01 2.09 2.45 1.815
      2008-04-01 1.84 2.36 1.725
      2008-05-01 1.75 2.265 1.545
      2008-06-01 1.74 2.055 1.485
      2008-07-01 1.85 1.855 1.525
      2008-08-01 1.99 1.88 1.7
      2008-09-01 2.095 1.885 1.995
      2008-10-01 2.01 1.865 2.08
      2008-11-01 1.98 1.865 1.915
      2008-12-01 2.07 2.005 1.755
      2009-01-01 2.125 2.18 1.695
      2009-02-01 1.975 2.155 1.665
      2009-03-01 1.945 2.375 1.635
      2009-04-01 1.84 2.37 1.655
      2009-05-01 1.73 2.17 1.565
      2009-06-01 1.73 1.975 1.49
      2009-07-01 1.83 1.83 1.48
      2009-08-01 1.925 1.91 1.635
      2009-09-01 2.04 1.91 1.82
      2009-10-01 1.985 1.97 1.895
      2009-11-01 1.925 1.95 1.89
      2009-12-01 2.055 2.105 1.87
      2010-01-01 2.09 2.125 1.74
      2010-02-01 2.02 2.225 1.705
      2010-03-01 1.95 2.415 1.7
      2010-04-01 1.92 2.395 1.67
      2010-05-01 1.775 2.16 1.555
      2010-06-01 1.735 2.01 1.53
      2010-07-01 1.835 1.83 1.55
      2010-08-01 1.995 1.865 1.91
      2010-09-01 2.16 1.91 2.38
      2010-10-01 2.275 1.885 2.47
      2010-11-01 2.045 1.97 1.91
      2010-12-01 2.01 2.045 1.75
      2011-01-01 2.12 2.245 1.675
      2011-02-01 2.115 2.265 1.685
      2011-03-01 2.06 2.35 1.685
      2011-04-01 1.865 2.355 1.635
      2011-05-01 1.755 2.075 1.54
      2011-06-01 1.72 1.93 1.475
      2011-07-01 1.84 1.89 1.51
      2011-08-01 2.025 1.87 1.64
      2011-09-01 2.175 1.92 2.04
      2011-10-01 1.94 1.925 1.9
      2011-11-01 1.895 1.88 1.735
      2011-12-01 2.045 2.095 1.77
      2012-01-01 2.155 2.215 1.705
      2012-02-01 2.15 2.28 1.7
      2012-03-01 2.065 2.385 1.685
      2012-04-01 1.965 2.34 1.625
      2012-05-01 1.765 2.2 1.535
      2012-06-01 1.78 2.045 1.465
      2012-07-01 1.82 1.93 1.5
      2012-08-01 2.025 1.935 1.685
      2012-09-01 2.11 1.955 2.07
      2012-10-01 2.005 1.995 2.005
      2012-11-01 1.94 1.925 1.9
      2012-12-01 1.965 2.065 1.755
      2013-01-01 2.065 2.17 1.64
      2013-02-01 2.085 2.205 1.715
      2013-03-01 1.975 2.305 1.7
      2013-04-01 1.86 2.355 1.6
      2013-05-01 1.8 2.1 1.54
      2013-06-01 1.8 1.855 1.505
      2013-07-01 1.9 1.775 1.52
      2013-08-01 2.115 1.795 1.64
      2013-09-01 2.085 1.865 1.825
      2013-10-01 1.905 1.895 1.85
      2013-11-01 1.895 1.895 1.685
      2013-12-01 1.915 2.04 1.68
      2014-01-01 2.07 2.115 1.645
      2014-02-01 2.075 2.175 1.69
      2014-03-01 2.035 2.34 1.73
      2014-04-01 1.855 2.435 1.635
      2014-05-01 1.725 2.09 1.545
      2014-06-01 1.745 1.99 1.465
      2014-07-01 1.8 1.775 1.48
      2014-08-01 1.95 1.875 1.675
      2014-09-01 2.005 1.835 1.915
      2014-10-01 1.99 1.89 1.92
      2014-11-01 1.975 1.92 1.79
      2014-12-01 1.985 2.07 1.73
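
One possible way to get the annual mean and standard deviation asked about above is sketched below in base R, using only the first six africa_co values from the posted table. The column name date and the object names co and annual are assumptions, since the posted table has no header for its first column.

```r
# First six rows of the posted data (africa_co column only)
co <- data.frame(
  date = as.Date(c("2002-09-01", "2002-10-01", "2002-11-01",
                   "2002-12-01", "2003-01-01", "2003-02-01")),
  africa_co = c(2.05, 2.125, 2.035, 2.095, 2.15, 2.12)
)

# Derive the year from the date, then summarise per year
co$year <- format(co$date, "%Y")
annual_mean <- tapply(co$africa_co, co$year, mean)
annual_sd <- tapply(co$africa_co, co$year, sd)

annual <- data.frame(year = names(annual_mean),
                     mean = as.vector(annual_mean),
                     sd = as.vector(annual_sd))
print(annual)
```

Using format(co$date, "%m") instead of "%Y" would give the per-month summary across years. The resulting table can then be plotted, for example with plot() for the means and arrows() for the error bars.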

  • @Dwright3316
    @Dwright3316 9 years ago

    What version of R is Dr. Peng using here?
    I have downloaded R version 3.2.1 (2015-06-18). But, unfortunately, I cannot use the "chicago.rds" package; the error message says it is not available (for R version 3.2.1).
    Are there any workarounds for this? Or would I need to uninstall my current version of R and find an older version in order to install/load this package?
    Thank you! I'm new to programming in R, so any help would be greatly appreciated!

    • @lalaithan
      @lalaithan 7 years ago

      It's a dataset, not a package.