I wrote a little while back about using Microsoft Cognitive Services APIs with R to first detect the language of pieces of text and then run sentiment analysis on them. I wasn’t too happy with some of the code, as it was very inelegant. I knew I could code better than I had, especially as I’ve been doing a lot more work with purrr recently. However, the rewrite had sat in drafts for a while. Then David Smith kindly posted about the process I used, which meant I had to get this nicer version of my code out ASAP!
Get the complete code in this gist.
Prerequisites
Setup
library(httr)
library(jsonlite)
library(dplyr)
library(purrr)
cogapikey<-"XXX"
text=c("is this english?"
,"tak er der mere kage"
,"merci beaucoup"
,"guten morgen"
,"bonjour"
,"merde"
,"That's terrible"
,"R is awesome")
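# Put data in an object that converts to the expected schema for the API
# (tibble() replaces data_frame(), which is deprecated in current dplyr)
tibble(text) %>%
  mutate(id=row_number()) ->
  textdf
# Wrap the data frame in a named list so it serialises as {"documents": [...]}
textdf %>%
  list(documents=.) ->
  mydata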
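To see what the API will actually receive, it can help to look at the JSON before sending it. This is only a small sketch: `demo_df` and `demo_payload` are names I made up for illustration, and the two-row tibble stands in for the full `textdf`.

```r
library(dplyr)
library(jsonlite)

# Illustrative stand-in for textdf (not the full dataset)
demo_df <- tibble(text = c("is this english?", "merci beaucoup"),
                  id   = 1:2)

# A data frame inside a named list serialises as {"documents": [ {...}, {...} ]}
demo_payload <- toJSON(list(documents = demo_df), pretty = TRUE)
demo_payload
```

Each row becomes one JSON object with a `text` and an `id` field, which is the per-document layout the request body is built around.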
Language detection
The sentiment analysis API gives much better results when it is told which language each piece of text is in, so we first need to identify the most likely language for each one.
cogapi<-"https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/languages?numberOfLanguagesToDetect=1"
cogapi %>%
POST(add_headers(`Ocp-Apim-Subscription-Key`=cogapikey),
body=toJSON(mydata)) ->
response
# Process response
response %>%
content() %>%
flatten_df() %>%
select(detectedLanguages) %>%
flatten_df()->
respframe
textdf %>%
mutate(language= respframe$iso6391Name) ->
textdf
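If the chain of `flatten_df()` calls feels a bit magic, the same extraction can be spelled out with `map_chr()`. The sketch below runs against a hand-built list rather than a live response: `demo_parsed` is my assumption of roughly what `content(response)` looks like, based on the `detectedLanguages` and `iso6391Name` fields used above, and is not captured API output.

```r
library(purrr)

# Hand-built stand-in for content(response); the shape is an assumption
demo_parsed <- list(
  documents = list(
    list(id = "1",
         detectedLanguages = list(
           list(name = "English", iso6391Name = "en", score = 1))),
    list(id = "2",
         detectedLanguages = list(
           list(name = "French", iso6391Name = "fr", score = 1)))
  ),
  errors = list()
)

# For each document, take the first detected language and pull its ISO code
demo_codes <- map_chr(demo_parsed$documents,
                      ~ .x$detectedLanguages[[1]]$iso6391Name)
demo_codes
```

Because only one language per document is requested, taking the first element of `detectedLanguages` is enough here; asking for more would need a different extraction.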
Sentiment analysis
With an ID, text, and a language code, we can now request the sentiment of our text be analysed.
# New info
mydata<-list(documents = textdf)
# New endpoint
cogapi<-"https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment"
# Construct a request
cogapi %>%
POST(add_headers(`Ocp-Apim-Subscription-Key`=cogapikey),
body=toJSON(mydata)) ->
response
# Process response
response %>%
content() %>%
flatten_df() %>%
mutate(id=as.numeric(id))->
respframe
# Combine; join explicitly on id rather than letting dplyr guess the key
textdf %>%
  left_join(respframe, by="id") ->
  textdf
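A left join is a deliberate choice here: every piece of text is kept even if no score comes back for it. A tiny sketch with made-up values (the `demo_` objects and their scores are invented for illustration, not API results):

```r
library(dplyr)

demo_text <- tibble(id   = 1:3,
                    text = c("bonjour", "guten morgen", "R is awesome"))

# Pretend only ids 1 and 3 were scored (values are made up)
demo_scores <- tibble(id = c(1, 3), score = c(0.8, 0.9))

# Rows without a matching score are kept and get NA
demo_joined <- left_join(demo_text, demo_scores, by = "id")
demo_joined
```

An inner join would silently drop the unscored row instead, which makes gaps in the results harder to spot.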
And… et voilà! A multi-language dataset with the language identified and the sentiment scored, using purrr for easier-to-read code.
Using purrr with APIs makes code nicer and more elegant as it really helps interact with hierarchies from JSON objects. I feel much better about this code now!
Original
id | language | text | score |
---|---|---|---|
1 | en | is this english? | 0.2852910 |
2 | da | tak er der mere kage | NA |
3 | fr | merci beaucoup | 0.8121097 |
4 | de | guten morgen | NA |
5 | fr | bonjour | 0.8118965 |
6 | fr | merde | 0.0515683 |
7 | en | That’s terrible | 0.1738841 |
8 | en | R is awesome | 0.9546152 |
Revised
text | id | language | score |
---|---|---|---|
is this english? | 1 | en | 0.2265771 |
tak er der mere kage | 2 | da | 0.7455934 |
merci beaucoup | 3 | fr | 0.8121097 |
guten morgen | 4 | de | 0.8581840 |
bonjour | 5 | fr | 0.8118965 |
merde | 6 | fr | 0.0515683 |
That’s terrible | 7 | en | 0.0068665 |
R is awesome | 8 | en | 0.9973871 |
Interestingly, the scores for English have not stayed the same: for instance, Microsoft now sees “R is awesome” in a much more positive light. It’s also great to see that German and Danish now get sentiment scores!