Using tabulizer
we’re able to extract information from PDFs so it comes in really handy when people publish data as a PDF! This post takes you through using tabulizer
and tidyverse
packages to scrape and clean up some budget data from PASS, an association for the Microsoft Data Platform community. The goal is to mainly show some of the tricks of the data wrangling trade that you may need to utilise when you scrape data from PDFs.
library(tabulizer)
library(tidyverse)
library(zoo)
library(tidytext)
Reading the PDF
With tabulizer
, if the data is relatively well formatted in a PDF you can use tabulizer::extract_tables()
. This gives you a bunch of data.frames which you can process. Unfortunately, in the case of the PASS budget with 22 pages of tables, including tables that span multiple pages, we’re not so lucky!
We need to fall back to tabulizer::extract_text()
and do a lot of wrangling to reconstruct the tables.
"http://www.pass.org/Portals/0/Governance%202016/Financials/pass-budget-2017.pdf?ver=2017-01-25-235556-197" %>%
tabulizer::extract_text() ->
rawtxt
str_trunc(rawtxt, 1000)
## [1] "Department Summary Budget 2016 Budget 2017\r\nREVENUE\r\nCorporate Administration - 110 11,000.00$ 11,000.00$ \r\nInformation Technology - 111 74,000.00$ 106,480.00$ \r\nBoard Support - 112 -$ -$ \r\nMember Services - 114 6,000.00$ 22,500.00$ \r\nMarketing - 115 -$ -$ \r\nChapters - 120 -$ -$ \r\nVolunteer Programs & Engagement - 130 -$ -$ \r\nSpecial Projects - 140 -$ -$ \r\nSQL Saturday - 150 -$ 38,400.00$ \r\nVirtual Events - 160 51,662.50$ 68,800.00$ \r\nGlobal Growth - 170 -$ ..."
Converting the results to tabular data
The PDF contents are a continuous string so we need to split this up. Each line seems to be seperated by \r\n
and we can use the tidytext
package to easily split these lines out into seperate elements.
rawtxt %>%
tokenize(tokenizer = tokenizer_line()) %>%
head()
## [[1]]
## [1] "Department Summary Budget 2016 Budget 2017"
##
## [[2]]
## [1] "REVENUE"
##
## [[3]]
## [1] "Corporate Administration - 110 11,000.00$ 11,000.00$ "
##
## [[4]]
## [1] "Information Technology - 111 74,000.00$ 106,480.00$ "
##
## [[5]]
## [1] "Board Support - 112 -$ -$ "
##
## [[6]]
## [1] "Member Services - 114 6,000.00$ 22,500.00$ "
There’s now a load of spaces between the budget item, the 2016 amount, and the 2017 amount. We need to remove excess spaces and transform each of these lines into individual elements using str_split()
.
rawtxt %>%
tokenize(tokenizer = tokenizer_line()) %>%
str_replace_all("\\s+", " ") %>%
str_trim(side = "both") %>%
str_split(" ") %>%
head()
## [[1]]
## [1] "Department" "Summary" "Budget" "2016" "Budget"
## [6] "2017"
##
## [[2]]
## [1] "REVENUE"
##
## [[3]]
## [1] "Corporate" "Administration" "-" "110"
## [5] "11,000.00$" "11,000.00$"
##
## [[4]]
## [1] "Information" "Technology" "-" "111" "74,000.00$"
## [6] "106,480.00$"
##
## [[5]]
## [1] "Board" "Support" "-" "112" "-$" "-$"
##
## [[6]]
## [1] "Member" "Services" "-" "114" "6,000.00$"
## [6] "22,500.00$"
Looking at the splits, we have too many elements from the budget line item names. We need to consolidate the left-hand side contents.
This function looks at each set of split elements and processes them. If there are enough elements and the last one ends with a $
we can combine everything bar the last two elements into the line item. If there is no $
line or it’s too short, we can put everything into the line item.
combineLHS <- function(x) {
n <- length(x)
if (str_detect(x[n],"\\$$")&n>=3) {
data_frame(lineItem=paste(x[1:(n - 2)], collapse = " "),
b2016=x[n - 1],
b2017=x[n])
} else {
data_frame(lineItem=paste(x, collapse = " "),
b2016="",
b2017="")
}
}
This is a non vectorised function which means it can’t be applied to every line at once. To get around this we can leverage purrr::map_df()
to apply the function to every element and combine into one big data.frame.
rawtxt %>%
tokenize(tokenizer = tokenizer_line()) %>%
str_replace_all("\\s+", " ") %>%
str_trim(side = "both") %>%
str_split(" ") %>%
map_df(combineLHS) ->
rawdata
rawdata %>%
head()
## # A tibble: 6 x 3
## lineItem b2016 b2017
## <chr> <chr> <chr>
## 1 Department Summary Budget 2016 Budget 2017 "" ""
## 2 REVENUE "" ""
## 3 Corporate Administration - 110 11,000.00$ 11,000.00$
## 4 Information Technology - 111 74,000.00$ 106,480.00$
## 5 Board Support - 112 -$ -$
## 6 Member Services - 114 6,000.00$ 22,500.00$
Cleaning up
Now that we have a tabular data set, it now needs to be made useful. We now need to do stuff like like translating the character formatting of the money into a numeric value and tracking whether a row is a title, revenue, or expense item.
I wrote a function to clean up the monetary amounts.
moneycleaner <- function(x) {
x %>%
str_replace_all("[$,]|[[:space:]]", "") %>%
# Handle $0 amounts
str_replace_all("^-$", "0") %>%
# Handle negative amounts
ifelse(str_detect(., "\\("),
paste0("-", str_replace_all(., "[()]", "")),
.) %>%
as.numeric()
}
I’m going to write inline comments about what each bit is doing to avoid a lot of repetition here as I clean up the data.
rawdata %>%
mutate(
## Count the switches between table types
TblChange = cumsum(lineItem %in%
c("REVENUE", "EXPENSE")),
## Check if row is a total line for a table
Total = str_detect(lineItem, "^TOTAL"),
## Check if row is a total based on two tables
Net = str_detect(lineItem, "^NET"),
## Work out if the row contains line items or titles
Title = b2016 == "" & b2017 == "" ,
## Use MOD2 to determine which table type a row belonged to
Type = ifelse(TblChange %% 2 == 0, "Expense", "Revenue"),
## Identify the department if the row has budget 2017
Dept = ifelse(str_detect(lineItem,"Budget 2017"),
str_replace(str_extract(lineItem, "^.+-")," -",""),
NA),
## We're only interested in actual line items
IgnoreRow=Total|Net|Title
) %>%
mutate(
## Use `zoo`s rolling functions to carry the last non-NA value forward. Essentially doign a fill-down between values
Dept = zoo::na.locf(Dept, na.rm = FALSE),
b2016=moneycleaner(b2016),
b2017=moneycleaner(b2017)) %>%
mutate(
## Invert the sign of expenses
b2016=ifelse(Type=="Expense",-1,1)*b2016,
b2017=ifelse(Type=="Expense",-1,1)*b2017
) ->
alldata
## Warning in function_list[[k]](value): NAs introduced by coercion
## # A tibble: 6 x 10
## lineItem b2016 b2017 TblCh~ Total Net Title Type Dept Ignor~
## <chr> <dbl> <dbl> <int> <lgl> <lgl> <lgl> <chr> <chr> <lgl>
## 1 Department Sum~ NA NA 0 F F T Expe~ <NA> T
## 2 REVENUE NA NA 1 F F T Reve~ <NA> T
## 3 Corporate Admi~ 11000 11000 1 F F F Reve~ <NA> F
## 4 Information Te~ 74000 106480 1 F F F Reve~ <NA> F
## 5 Board Support ~ 0 0 1 F F F Reve~ <NA> F
## 6 Member Service~ 6000 22500 1 F F F Reve~ <NA> F
Now we can filter to only rows we need.
alldata %>%
filter(!IgnoreRow)->
flaggeddata
We can further split this into summary tables (the first two) and the detail tables (everything after).
flaggeddata %>%
filter(TblChange <= 2) %>%
select(
## Remove extraneous rows
-(TblChange:Title ), -IgnoreRow
)->
summarydata
flaggeddata %>%
filter(TblChange > 2) %>%
select(
## Remove extraneous rows
-(TblChange:Title ), -IgnoreRow
) ->
detaildata
## # A tibble: 6 x 5
## lineItem b2016 b2017 Type Dept
## <chr> <dbl> <dbl> <chr> <chr>
## 1 Gain/Loss on Revaluation of Euro 0 0 Revenue Corporate~
## 2 Interest - Miscellaneous 11000 11000 Revenue Corporate~
## 3 Misc FY Adjustments 0 0 Revenue Corporate~
## 4 Unrealized Gain/Loss on Investments 0 0 Revenue Corporate~
## 5 Accounting - General - 3500 - 5000 Expense Corporate~
## 6 Annual Report - 37.0 - 37.0 Expense Corporate~
What’s next?
Now that the data is in a consumable format, we can now analyse the budget. That’ll be a future blog post and in the interim, you can download the budget as a CSV to give it a go yourself or improve the code above.