Intro to toRvik

Hey, everyone, I’m Andrew Weatherman, creator of toRvik and lover of college basketball analytics. The goal of toRvik is to expand access to reliable, high-quality college basketball (CBB) statistics. While analogous packages exist to pull data, like Saiem Gilani’s brilliant hoopR, toRvik requires no paid subscription or set-up and can be used immediately by anyone with just a few lines of code.

Install toRvik

# You can install using {pacman} with the following code:
if (!requireNamespace('pacman', quietly = TRUE)){
  install.packages('pacman')
}
pacman::p_load_current_gh("andreweatherman/toRvik", dependencies = TRUE, update = TRUE)

Overview of Barttorvik and toRvik

toRvik is a package of scrapers that pull data from Barttorvik, a popular college basketball analytics website, and return it in tidy format. Barttorvik splits its data on a number of variables and hosts detailed player and game statistics, while also serving as a reputable, industry-recognized metric rating system. Generally speaking, all data is available back to the 2007-08 season. More information about Barttorvik, its data, and its metric rating system can be found here.

Package functions are syntactically structured to point to their data source (e.g. by ‘player,’ ‘game,’ etc.) and should be considered get functions by nature. As of toRvik version 1.0.3, the package exports nearly 30 functions covering the website and its data. Some highlights include:

Quick start with ratings

toRvik requires no set-up and can be run immediately in any session. The T-Rank functions, which pull and split Barttorvik’s metric rating system, are an excellent place to start. Let’s take a glance at the top teams in T-Rank using toRvik:

library(magrittr)  # provides the %>% pipe used in the examples below
tictoc::tic()
toRvik::bart_ratings(year=2022) %>% 
  utils::head(10)
#> ── Team Ratings: 2022 ────────────────────────────────────────── toRvik 1.1.0 ──
#> ℹ Data updated: 2022-09-09 08:24:24 EDT
#> # A tibble: 10 × 19
#>    team    conf  barthag barth…¹ adj_o adj_o…² adj_d adj_d…³ adj_t adj_t…⁴   wab
#>    <chr>   <chr>   <dbl>   <int> <dbl>   <int> <dbl>   <int> <dbl>   <int> <dbl>
#>  1 Gonzaga WCC     0.966       1  120.       4  89.9       9  72.6       5  6.71
#>  2 Houston Amer    0.959       2  117.      10  88.5       6  63.7     336  6.15
#>  3 Kansas  B12     0.958       3  120.       5  91.3      13  69.1      71 10.4 
#>  4 Texas … B12     0.951       4  111.      41  85.4       1  66.3     223  6.57
#>  5 Baylor  B12     0.949       5  118.       8  91.3      14  67.6     149  8.91
#>  6 Duke    ACC     0.944       6  123.       1  96.0      53  67.4     161  7.19
#>  7 Tennes… SEC     0.944       7  111.      34  87.1       3  67.4     164  7.96
#>  8 Villan… BE      0.935       8  117.       9  93.0      26  62.2     350  7.29
#>  9 Arizona P12     0.934       9  118.       7  93.7      35  72.3       9  8.76
#> 10 UCLA    P12     0.932      10  116.      12  92.2      20  65.4     274  5.06
#> # … with 8 more variables: nc_elite_sos <int>, nc_fut_sos <dbl>,
#> #   nc_cur_sos <dbl>, ov_elite_sos <int>, ov_fut_sos <dbl>, ov_cur_sos <dbl>,
#> #   seed <dbl>, year <int>, and abbreviated variable names ¹​barthag_rk,
#> #   ²​adj_o_rk, ³​adj_d_rk, ⁴​adj_t_rk
tictoc::toc()
#> 0.37 sec elapsed

Here, the bart_ratings function returned the top ten teams in T-Rank for the 2021-22 season (year = 2022). We are also presented with each team’s adjusted efficiencies, adjusted tempo, and two forms of strength of schedule (documented in bart_ratings). But what if we want these same measures in home games only? We would use bart_factors and input ‘H’ as location:

tictoc::tic()
toRvik::bart_factors(location='H') %>%
  utils::head(10)
#> ── Team Factors ──────────────────────────────────────────────── toRvik 1.1.0 ──
#> ℹ Data updated: 2022-09-09 08:24:25 EDT
#> # A tibble: 10 × 22
#>    team     conf  rating  rank adj_o adj_o…¹ adj_d adj_d…² tempo off_ppp off_efg
#>    <chr>    <chr>  <dbl> <dbl> <dbl>   <dbl> <dbl>   <dbl> <dbl>   <dbl>   <dbl>
#>  1 Houston  Amer    32.7     1  116.      15  83.0       1  66.1    117.    54.2
#>  2 Gonzaga  WCC     29.7     2  120.       5  89.9      18  72.7    123.    60.1
#>  3 Baylor   B12     28.8     3  116.       9  87.6       9  69.3    116.    55.0
#>  4 Villano… BE      28.8     4  123.       2  94.0      50  63.2    122.    57.6
#>  5 Purdue   B10     28.4     5  124.       1  96.0      81  67.9    125.    58.2
#>  6 Auburn   SEC     27.6     6  115.      17  87.5       8  72.9    113.    53.1
#>  7 Texas T… B12     27.5     7  116.      14  88.2      11  68.6    116.    57.3
#>  8 Tenness… SEC     27.5     8  113.      28  86.0       6  69.5    112.    53.1
#>  9 UCLA     P12     26.5     9  116.      11  89.7      16  69.6    115.    54.3
#> 10 Texas    B12     25.3    10  109.      65  83.4       2  63.8    109.    51.6
#> # … with 11 more variables: off_to <dbl>, off_or <dbl>, off_ftr <dbl>,
#> #   def_ppp <dbl>, def_efg <dbl>, def_to <dbl>, def_or <dbl>, def_ftr <dbl>,
#> #   wins <int>, losses <int>, games <int>, and abbreviated variable names
#> #   ¹​adj_o_rank, ²​adj_d_rank
tictoc::toc()
#> 0.43 sec elapsed
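
A quick sanity check on the printout above: the rating column looks like an adjusted efficiency margin, that is, adj_o minus adj_d. The sketch below hand-copies five fully printed rows and compares the two. Treat it as an observation rather than a documented definition: the printed adj_o values are rounded, so the match can only be approximate, and the home data frame and the margin and gap columns are illustrative names that are not part of toRvik.

```r
# Five rows hand-copied from the bart_factors(location = 'H') printout above.
# adj_o is printed to three significant digits, so these values are rounded.
home <- data.frame(
  team   = c("Houston", "Gonzaga", "Baylor", "Purdue", "Auburn"),
  rating = c(32.7, 29.7, 28.8, 28.4, 27.6),
  adj_o  = c(116, 120, 116, 124, 115),
  adj_d  = c(83.0, 89.9, 87.6, 96.0, 87.5)
)

# Assumption: rating is roughly adjusted offense minus adjusted defense.
home$margin <- home$adj_o - home$adj_d
home$gap    <- abs(home$margin - home$rating)

print(home)
```

Every gap stays within the rounding error of the printed adj_o, which is consistent with the assumption but does not prove it; the bart_factors documentation is the place to confirm how rating is defined.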

And now, we have four-factor data and metric ratings for home games only. The bart_factors function and the analogous bart_conf_factors both take venue, game type, date range, and opponent strength as additional splits. Great, but what if we want to explore rating trends over time? toRvik gives us that ability with bart_archive, a function that pulls adjusted ratings and projected records from the morning of a desired date:

tictoc::tic()
toRvik::bart_archive('20220113') %>%
  utils::head(10)
#> ── Archive Ratings ───────────────────────────────────────────── toRvik 1.1.0 ──
#> ℹ Data updated: 2022-09-09 08:24:25 EDT
#> # A tibble: 10 × 16
#>     rank team   conf  record barthag adj_o adj_o…¹ adj_d adj_d…² adj_t…³ adj_t…⁴
#>    <int> <chr>  <chr> <chr>    <dbl> <dbl>   <int> <dbl>   <int>   <dbl>   <int>
#>  1   125 Abile… WAC   11-5    0.612  100.      239  96.2      46    71.3      58
#>  2   260 Air F… MWC   8-5     0.321   93.7     331 100.      114    63.9     342
#>  3   151 Akron  MAC   9-4     0.535  105.      130 104.      188    64.9     327
#>  4    21 Alaba… SEC   11-5    0.880  117.       10  98.7      89    72.3      28
#>  5   330 Alaba… SWAC  4-10    0.155   88.5     356 103.      156    67.6     234
#>  6   305 Alaba… SWAC  4-13    0.199   95.8     309 108.      285    72.0      37
#>  7   303 Albany AE    5-10    0.205   91.3     351 103.      158    66.2     305
#>  8   262 Alcor… SWAC  4-11    0.316   98.5     266 105.      224    68.0     207
#>  9   347 Ameri… Pat   4-10    0.0910  93.9     329 115.      347    67.2     262
#> 10   188 Appal… SB    9-9     0.451  100.      235 102.      144    64.3     338
#> # … with 5 more variables: proj_record <chr>, proj_conf_record <chr>,
#> #   wab <dbl>, wab_rk <int>, date <chr>, and abbreviated variable names
#> #   ¹​adj_o_rk, ²​adj_d_rk, ³​adj_tempo, ⁴​adj_tempo_rk
tictoc::toc()
#> 0.33 sec elapsed
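
Since bart_archive returns one snapshot per call, a trend over time means calling it once per date and stacking the results. Here is a minimal sketch of that idea. The team_trend helper and its fetch argument are hypothetical names introduced for illustration; only the bart_archive call pattern and the column names come from the output above.

```r
# Weekly snapshot dates in the 'YYYYMMDD' form passed to bart_archive() above.
snapshot_dates <- format(
  seq(as.Date("2022-01-13"), by = "week", length.out = 4),
  "%Y%m%d"
)

# Hypothetical helper (not part of toRvik): fetch each snapshot and keep one
# team's row. `fetch` defaults to toRvik::bart_archive, but any function that
# maps a date string to a data frame with these columns will work.
team_trend <- function(team_name, dates, fetch = toRvik::bart_archive) {
  keep <- c("team", "barthag", "adj_o", "adj_d", "date")
  snapshots <- lapply(dates, function(d) {
    snap <- fetch(d)
    snap[snap$team == team_name, keep]
  })
  do.call(rbind, snapshots)
}

# Not run here because it makes one web request per date:
# team_trend("Duke", snapshot_dates)
```

Passing the fetcher in as an argument keeps the helper easy to try offline with a stand-in function before pointing it at the live site.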

Exploring player and game data

Perhaps the most valuable functions in toRvik concern granular analysis. The package gives us the ability to explore advanced statistics at a game-by-game level for every Division I player since the 2007-08 season using bart_player_game.

tictoc::tic()
toRvik::bart_player_game(year=2022, stat='advanced', team='Duke') %>%
  dplyr::arrange(desc(net)) %>%
  utils::head(10)
#> ── Player Game Stats ─────────────────────────────────────────── toRvik 1.1.0 ──
#> ℹ Data updated: 2022-09-09 08:24:26 EDT
#> # A tibble: 10 × 25
#>    date       year player exp   team  conf  opp   result   min   pts   usg  ortg
#>    <chr>     <dbl> <chr>  <chr> <chr> <chr> <chr> <chr>  <dbl> <dbl> <dbl> <dbl>
#>  1 2021-12-…  2022 AJ Gr… Fr    Duke  ACC   Sout… W         22    19  16.7  214.
#>  2 2021-11-…  2022 Wende… Jr    Duke  ACC   Army  W         35    19  22.9  142.
#>  3 2021-11-…  2022 Wende… Jr    Duke  ACC   Lafa… W         29    23  25.2  159.
#>  4 2022-01-…  2022 Mark … So    Duke  ACC   Nort… W         27    19  25    144.
#>  5 2022-03-…  2022 Mark … So    Duke  ACC   Cal … W         32    15  19.5  156.
#>  6 2021-11-…  2022 Paolo… Fr    Duke  ACC   The … W         31    28  29.3  157.
#>  7 2022-03-…  2022 Paolo… Fr    Duke  ACC   Texa… W         37    22  23.6  146.
#>  8 2022-01-…  2022 AJ Gr… Fr    Duke  ACC   Loui… W         34    22  17.2  163.
#>  9 2022-03-…  2022 Trevo… Fr    Duke  ACC   Pitt… W         34    27  25.9  175.
#> 10 2021-11-…  2022 AJ Gr… Fr    Duke  ACC   Lafa… W         21    18  16.4  188.
#> # … with 13 more variables: or_pct <dbl>, dr_pct <dbl>, ast_pct <dbl>,
#> #   to_pct <dbl>, stl_pct <dbl>, blk_pct <dbl>, bpm <dbl>, obpm <dbl>,
#> #   dbpm <dbl>, net <dbl>, poss <dbl>, id <dbl>, game_id <chr>
tictoc::toc()
#> 0.59 sec elapsed

Here, bart_player_game returned the 10 highest individual net BPMs by a Duke player during the 2021-22 season. The function takes ‘box,’ ‘shooting,’ and ‘advanced’ as stat inputs, and I encourage you to explore each one in your own session. But what if we want to investigate similar performance at a season level? Well, bart_player_season gives us that option – also taking ‘box,’ ‘shooting,’ and ‘advanced’ as stat inputs.

tictoc::tic()
toRvik::bart_player_season(year=2022, stat='shooting', team='Duke') %>%
  dplyr::arrange(desc(mid_a)) %>%
  utils::head(5)
#> ── Player Season Stats ───────────────────────────────────────── toRvik 1.1.0 ──
#> ℹ Data updated: 2022-09-09 08:24:26 EDT
#> # A tibble: 5 × 33
#>   player pos   exp   team  conf      g   mpg   ppg p_per   usg  ortg   efg    ts
#>   <chr>  <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Paolo… Wing… Fr    Duke  ACC      39  33.0 17.2   20.9  27.2  111.  52    55.7
#> 2 Wende… Comb… Jr    Duke  ACC      39  33.9 13.4   15.8  20.3  121.  56.9  60.5
#> 3 AJ Gr… Wing… Fr    Duke  ACC      39  24.3 10.4   17.1  16.9  127.  61.3  63.0
#> 4 Jerem… Comb… So    Duke  ACC      39  29    8.62  11.9  17.7  105.  47.7  51.5
#> 5 Trevo… Comb… Fr    Duke  ACC      36  30.2 11.5   15.2  20.1  110.  49.6  52.0
#> # … with 20 more variables: ftm <int>, fta <int>, ft_pct <dbl>, two_m <int>,
#> #   two_a <int>, two_pct <dbl>, three_m <int>, three_a <int>, three_pct <dbl>,
#> #   dunk_m <dbl>, dunk_a <dbl>, dunk_pct <dbl>, rim_m <dbl>, rim_a <dbl>,
#> #   rim_pct <dbl>, mid_m <dbl>, mid_a <dbl>, mid_pct <dbl>, year <int>,
#> #   id <int>
tictoc::toc()
#> 0.38 sec elapsed

And now, we have a tibble of season-long shooting data for Duke players, sorted by number of mid-range attempts. Advanced metric data can be pulled by team on a per-game basis using bart_team_schedule, and total team shooting splits can be accessed using bart_team_shooting. Game box data can be pulled with bart_game_total.
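
Sorting by raw mid-range attempts favors high-volume shooters, so it can help to look at mid-range attempts as a share of all field-goal attempts instead. The sketch below shows the arithmetic on made-up numbers. The column names two_a, three_a, and mid_a match the shooting output above, but the values and player labels are hypothetical, and the calculation assumes those columns hold season totals.

```r
# Hypothetical numbers for illustration only; these are not real Duke totals.
shooting <- data.frame(
  player  = c("Player A", "Player B", "Player C"),
  two_a   = c(300, 180, 120),
  three_a = c(100, 160, 40),
  mid_a   = c(150, 60, 30)
)

# Total field-goal attempts are two-point plus three-point attempts.
shooting$fga <- shooting$two_a + shooting$three_a

# Share of all field-goal attempts taken from mid-range, as a percentage.
shooting$mid_share <- round(100 * shooting$mid_a / shooting$fga, 1)

# Sort from the most to the least mid-range-heavy shot profile.
shooting <- shooting[order(-shooting$mid_share), ]
print(shooting)
```

The same two lines of arithmetic should apply to the tibble returned by bart_player_season once it is in your session.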

Investigating the NCAA tournament

Lastly for this introductory vignette, we will explore toRvik functions for scraping tournament data. Spend any time on social media in college basketball circles in March, and you will undoubtedly hear about ‘team sheets,’ detailed repositories of strength and quality metrics used by the seeding and selection committee. With bart_tourney_sheets, you can pull ‘quick-hit’ team sheets in tidy format with just a single line of code:

tictoc::tic()
toRvik::bart_tourney_sheets(year=2022) %>%
  utils::head(10)
#> # A tibble: 10 × 16
#>    team     seed   net   kpi   sor res_avg   bpi    kp   sag qual_…¹ q1a   q1   
#>    <chr>   <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl> <chr> <chr>
#>  1 Gonzaga     1     1     5     7     6       1     1     1     1   5-2   10-3 
#>  2 Arizona     1     2     3     2     2.5     3     2     2     2.3 4-2   6-3  
#>  3 Houston     5     3    13    14    13.5     2     4     5     3.7 0-3   1-4  
#>  4 Baylor      1     4     2     4     3       6     5     4     5   4-4   10-5 
#>  5 Kentuc…     2     5     9     5     7       4     3     6     4.3 3-6   9-7  
#>  6 Kansas      1     6     1     1     1       8     6     3     5.7 4-4   12-5 
#>  7 Tennes…     3     7     4     3     3.5     5     7     7     6.3 4-7   11-7 
#>  8 Villan…     2     8     7     8     7.5     7    11     9     9   5-4   7-6  
#>  9 Texas …     3     9    17    12    14.5    13     9    14    12   5-5   8-9  
#> 10 UCLA        4    10    11    15    13       9     8    10     9   2-4   5-4  
#> # … with 4 more variables: q2 <chr>, q1_2 <chr>, q3 <chr>, q4 <chr>, and
#> #   abbreviated variable name ¹​qual_avg
tictoc::toc()
#> 0.97 sec elapsed
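
Two columns in the printout above are averages of their neighbors, and the arithmetic is easy to confirm. Judging from the printed values, res_avg is the mean of the two results-based rankings (kpi and sor) and qual_avg is the mean of the three predictive rankings (bpi, kp, and sag), rounded to one decimal. The sketch below checks that on six fully printed rows; the sheets data frame and the res_check and qual_check columns are illustrative names, not toRvik output.

```r
# Six rows hand-copied from the bart_tourney_sheets(year = 2022) printout above.
sheets <- data.frame(
  team     = c("Gonzaga", "Arizona", "Houston", "Baylor", "Kansas", "UCLA"),
  kpi      = c(5, 3, 13, 2, 1, 11),
  sor      = c(7, 2, 14, 4, 1, 15),
  res_avg  = c(6, 2.5, 13.5, 3, 1, 13),
  bpi      = c(1, 3, 2, 6, 8, 9),
  kp       = c(1, 2, 4, 5, 6, 8),
  sag      = c(1, 2, 5, 4, 3, 10),
  qual_avg = c(1, 2.3, 3.7, 5, 5.7, 9)
)

# Assumed definitions, inferred from the printed values:
# res_avg  = mean of the results-based rankings (kpi, sor)
# qual_avg = mean of the predictive rankings (bpi, kp, sag), to one decimal
sheets$res_check  <- (sheets$kpi + sheets$sor) / 2
sheets$qual_check <- round((sheets$bpi + sheets$kp + sheets$sag) / 3, 1)

print(sheets[, c("team", "res_avg", "res_check", "qual_avg", "qual_check")])
```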

Returned are sheets of top teams sorted by their NCAA NET ranking. Because this function relies on NET data, it is only available back to the 2018-19 season. In-season performance is valuable, but what if you want to investigate just tournament data? Well, toRvik gives you two options to do so: bart_tourney_odds and bart_tourney_results. The former returns metric-adjusted round probabilities by split. Let’s explore round odds for the 2022 NCAA Tournament:

tictoc::tic()
toRvik::bart_tourney_odds(year=2022, odds='pre') %>%
  dplyr::arrange(desc(s16)) %>%
  utils::head(10)
#> # A tibble: 10 × 11
#>     seed region  team       conf    r64   r32   s16    e8    f4    f2 champ
#>    <dbl> <chr>   <chr>      <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1 West    Gonzaga    WCC     100  96.6  81.9  69.6  52    38.5  27.5
#>  2     1 Midwest Kansas     B12     100  96.3  73.7  48.7  32.5  17.7   8.5
#>  3     1 South   Arizona    P12     100  94.8  72.7  37.3  21.2  12     5.4
#>  4     1 East    Baylor     B12     100  94.9  72.5  42.9  25.2  11.1   5.8
#>  5     2 Midwest Auburn     SEC     100  91.5  70    48.4  24.8  11.7   4.8
#>  6     2 West    Duke       ACC     100  94.1  69.8  38.9  15.5   8.2   4  
#>  7     3 West    Texas Tech B12     100  92.6  68.4  40.9  17.1   9.5   5  
#>  8     3 South   Tennessee  SEC     100  92.3  67.5  41    20.8  11.6   5.2
#>  9     5 Midwest Iowa       B10     100  84.3  64.5  32.2  19.3   9.2   3.7
#> 10     2 South   Villanova  BE      100  90.8  63.6  34.6  16.1   8.4   3.5
tictoc::toc()
#> 0.24 sec elapsed
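
Because r64 is 100 for every team in the field, the round columns read as cumulative chances of reaching each round. Under that assumption, dividing one column by the previous one gives a conditional probability, for example the chance of making the Elite Eight given that a team has already reached the Sweet 16. A small sketch using four rows hand-copied from the printout above (the odds data frame and the e8_given_s16 column are illustrative names):

```r
# Four rows hand-copied from the bart_tourney_odds(year = 2022) printout above.
odds <- data.frame(
  team = c("Gonzaga", "Kansas", "Arizona", "Baylor"),
  s16  = c(81.9, 73.7, 72.7, 72.5),
  e8   = c(69.6, 48.7, 37.3, 42.9)
)

# Assumption: s16 and e8 are cumulative percentages of reaching each round, so
# P(reach E8 | reached S16) = P(reach E8) / P(reach S16).
odds$e8_given_s16 <- round(100 * odds$e8 / odds$s16, 1)

# Rank teams by how likely they are to win their Sweet 16 game.
odds <- odds[order(-odds$e8_given_s16), ]
print(odds)
```

This separates how hard a team's Sweet 16 matchup looks from how easy its path to that game was, which the raw columns blend together.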

With the ‘odds’ argument set to ‘pre,’ we returned pre-tournament odds and sorted by likelihood to reach the second weekend (Sweet 16). bart_tourney_odds also takes current odds (‘current’), odds based on recent performance (‘recent’), and odds based on games against strong opponents (‘t100’). This data is similarly available starting with the 2019 tournament. Now, what if we want to explore tournament results?

tictoc::tic()
toRvik::bart_tourney_results(min_year=2011, max_year=2021, type='conf') %>%
  utils::head(5)
#> # A tibble: 5 × 18
#>   conf   pake  pase  wins  loss w_percent   r64   r32   s16    e8    f4    f2
#>   <chr> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 P12    11.2  11.4    55    38     0.591    38    27    18     8     2     0
#> 2 SEC    10.9  15.5    78    48     0.619    49    33    21    14     7     2
#> 3 MVC     4.1   6.1    19    15     0.559    15    11     4     2     2     0
#> 4 ACC     3.6  -0.3   102    61     0.626    64    44    31    15     5     4
#> 5 Horz    2.6   3       5    10     0.333    10     1     1     1     1     1
#> # … with 6 more variables: champ <dbl>, top2 <dbl>, f4_percent <dbl>,
#> #   champ_percent <dbl>, from <dbl>, to <dbl>
tictoc::toc()
#> 0.5 sec elapsed

With bart_tourney_results, we can return raw and adjusted outcomes by split. Here, we returned aggregate conference results from 2011 to 2021, sorted by PAKE – the number of wins attained above or below KenPom expectation. The function splits by team (‘team’), conference (‘conf’), coach (‘coach’), and seed (‘seed’) and includes data starting in 2000.
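
To make the raw columns concrete, the sketch below rebuilds w_percent from wins and loss for the five conferences printed above, then reorders the rows by pase. Two assumptions to flag: w_percent is taken to be wins divided by total games, which the printed values bear out, and pase is treated as a seed-based counterpart to PAKE, which the text above does not define. The results data frame and the w_check column are illustrative names.

```r
# Five rows hand-copied from the bart_tourney_results() printout above.
results <- data.frame(
  conf      = c("P12", "SEC", "MVC", "ACC", "Horz"),
  pake      = c(11.2, 10.9, 4.1, 3.6, 2.6),
  pase      = c(11.4, 15.5, 6.1, -0.3, 3.0),
  wins      = c(55, 78, 19, 102, 5),
  loss      = c(38, 48, 15, 61, 10),
  w_percent = c(0.591, 0.619, 0.559, 0.626, 0.333)
)

# Assumed definition: winning percentage is wins over total games played.
results$w_check <- round(results$wins / (results$wins + results$loss), 3)

# Reorder by pase to see how the ranking shifts under a different baseline.
by_pase <- results[order(-results$pase), ]
print(by_pase[, c("conf", "pake", "pase", "wins", "loss", "w_check")])
```

Note how the ACC has the most wins and the best winning percentage of the five yet falls to the bottom when sorted by pase, which is the kind of gap between raw and adjusted outcomes these splits are meant to expose.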

And you’re off!

toRvik includes several additional functions and capabilities that I did not describe here; take time to explore them and those detailed in this introduction. If you have any questions, feel free to message me on Twitter. If you run into any bugs, please open an issue on GitHub. Happy exploring!