Due to delays with my stock market payment, if this post is useful to you, I kindly ask for a minimal donation on Buy Me a Coffee. It will be used to continue my open source efforts. The complete explanation is here: A personal message from an Open Source employee.
Motivation
My friend Nicolas Didier asked me about scraping Empleos Publicos (public job postings) with R or Python. Here is a short example for him and anyone else who can benefit from it.
The following steps were adapted from a tutorial that I gave in 2023 at the University of Michigan (Go Blue!).
Required R packages
- RSelenium: R-Selenium integration
- rvest: HTML processing
- dplyr: to load the pipe operator (can also be used later for data cleaning)
- purrr: iteration (i.e., repeated operations)
I installed RSelenium from the R console:
if (!require(RSelenium)) install.packages("RSelenium")
# or
remotes::install_github("ropensci/RSelenium")
For the rest of the packages:
if (!require(rvest)) install.packages("rvest")
if (!require(dplyr)) install.packages("dplyr")
if (!require(purrr)) install.packages("purrr")
Installing Selenium and Chrome/Chromium
Note for Ubuntu/Debian users: we have to check that Chrome or Chromium is installed on the system. One of the many options is to install it from the Bash console:
sudo add-apt-repository ppa:savoury1/chromium
sudo apt update
sudo apt install chromium-browser
sudo apt install chromium-chromedriver
Without the PPA, apt installs the Snap version of Chromium, which is not compatible with Selenium.
I tried to start Selenium as described in the official guide, and it didn't work.
I had to install Chrome. I am on Manjaro, so I ran sudo pacman -S chromium. Windows/Mac users can use Google Chrome.
An extra requirement was to download the Selenium server. Based on this, I started by creating a folder to store the data for this post, typing the following in the VS Code terminal:
mkdir -p /tmp/didier-example
cd /tmp/didier-example
Then I opened R and downloaded the JAR file:
url_jar <- "https://github.com/SeleniumHQ/selenium/releases/download/selenium-3.9.1/selenium-server-standalone-3.9.1.jar"
sel_jar <- "selenium-server-standalone-3.9.1.jar"
if (!file.exists(sel_jar)) {
  # mode = "wb" keeps the binary JAR intact on Windows
  download.file(url_jar, sel_jar, mode = "wb")
}
I had to start Selenium from a new terminal:
cd /tmp/didier-example
java -jar selenium-server-standalone-3.9.1.jar
Back in the R terminal, I was finally able to run:
library(RSelenium)
library(rvest)
library(dplyr)
library(purrr)

# Connect to the Selenium server started above (default port 4444)
rmDr <- remoteDriver(port = 4444L, browserName = "chrome")
rmDr$open(silent = TRUE)

url <- "https://www.empleospublicos.cl"
rmDr$navigate(url)
This should open a new Chrome/Chromium window that says: “Chrome is being controlled by automated test software”.
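If the window does not appear, one possible first check is whether anything is listening on the Selenium port at all. The sketch below uses only base R; port_is_open is a hypothetical helper name of mine, not part of RSelenium, and the 4444 default simply mirrors the port used above.

```r
# Hypothetical helper (not part of RSelenium): returns TRUE if something
# accepts TCP connections on the given host and port, FALSE otherwise.
port_is_open <- function(host = "localhost", port = 4444L, timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(
        host = host, port = port, open = "r+",
        blocking = TRUE, timeout = timeout
      )
    ),
    error = function(e) NULL # connection refused or timed out
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

# Check the port used in this post before calling rmDr$open()
port_is_open(port = 4444L)
```

A FALSE here suggests the java -jar command is not running (or is using a different port). A TRUE only says that something is listening, not that it is a working Selenium server.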
Scrape the data
With the help of the browser inspector (Ctrl + Shift + I), I inspected the page and found that the search bar is the element with the id buscadorprincipal.
For example, I can search for “Ministerio de Salud” because there were many postings from that organization on the landing page:
search_box <- rmDr$findElement(using = "id", value = "buscadorprincipal")
search_box$sendKeysToElement(list("Ministerio de Salud", key = "enter"))
That typed “Ministerio de Salud” and pressed Enter on my behalf to run the search. Inspecting the results, I saw that every job offer sits inside a div element with the class items.
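Search results may take a moment to load after Enter is sent, so reading the page source straight away can return an incomplete list. A minimal sketch of a polling helper follows; wait_until is a hypothetical name of mine, not an RSelenium function, and the timeout and interval values are arbitrary assumptions.

```r
# Hypothetical helper: call `condition` repeatedly until it returns TRUE
# or `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(condition())) {
      return(TRUE)
    }
    if (Sys.time() >= deadline) {
      return(FALSE)
    }
    Sys.sleep(interval)
  }
}

# Stand-alone demonstration: the condition becomes TRUE on the third call
attempts <- 0
ready <- wait_until(function() {
  attempts <<- attempts + 1
  attempts >= 3
}, timeout = 5, interval = 0.1)

# With the session from this post, the condition could be (untested sketch):
# wait_until(function() {
#   length(rmDr$findElements(using = "css selector", value = "div.items")) > 0
# })
```

Polling on a condition tends to be more robust than a fixed Sys.sleep(), because it returns as soon as the results are there and fails explicitly when they never show up.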
The first offer listed includes, among other fields, the texts “Ministerio de Salud”, “Constitución”, and “No pide experiencia” (no experience required).
html <- read_html(rmDr$getPageSource()[[1]])

offers <- html %>%
  html_nodes("div.items")

offers_tbl <- map_df(offers, function(offer) {
  # Extract position (job title)
  position <- offer %>%
    html_node("h3 a") %>%
    html_text(trim = TRUE)

  # Extract organization (usually the first <p> inside .top)
  organization <- offer %>%
    html_node(".top p") %>%
    html_text(trim = TRUE)

  # Extract city (the second <p> inside .cnt)
  city <- offer %>%
    html_nodes(".cnt p") %>%
    .[2] %>%
    html_text(trim = TRUE)

  tibble(
    position = position,
    organization = organization,
    city = city
  )
})
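To try the extraction logic without a running browser, the same selectors can be applied to a small inline HTML fragment. The markup below is a made-up approximation: it only reproduces the tags and class names that the selectors above rely on, the organization and ministry names are invented, and the real page certainly has more structure.

```r
library(rvest)
library(dplyr)
library(purrr)

# Hypothetical markup, invented for illustration only
sample_html <- '
<div class="items">
  <div class="top">
    <h3><a href="#">Arquitecto de Software</a></h3>
    <p>Organismo A</p>
  </div>
  <div class="cnt"><p>Ministerio A</p><p>Recoleta</p></div>
</div>
<div class="items">
  <div class="top">
    <h3><a href="#">Enfermero(a)</a></h3>
    <p>Organismo B</p>
  </div>
  <div class="cnt"><p>Ministerio B</p><p>Talca</p></div>
</div>'

demo_offers <- read_html(sample_html) %>%
  html_nodes("div.items")

# Same per-offer extraction as in the post, one row per offer
demo_tbl <- map_df(demo_offers, function(offer) {
  tibble(
    position = offer %>% html_node("h3 a") %>% html_text(trim = TRUE),
    organization = offer %>% html_node(".top p") %>% html_text(trim = TRUE),
    city = offer %>% html_nodes(".cnt p") %>% .[2] %>% html_text(trim = TRUE)
  )
})

demo_tbl
```

Testing selectors on a saved or hand-written fragment like this makes it easier to adjust them when the site layout changes, without restarting Selenium each time.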
The result has the following structure:
offers_tbl
# A tibble: 552 × 3
   position                                                   organization city
   <chr>                                                      <chr>        <chr>
 1 Medico (a) especialista en Anestesiología 44 horas         Servicio de… Cons…
 2 Titulares de la Planta Profesional Ley 18.834              Servicio de… Valp…
 3 ENFERMERA-O, JORNADA DIURNA, GRADO 12, PARA SERVICIO CLÍN… Servicio de… Reco…
 4 Psiquiatra infanto-juvenil sistema de atención intersecto… Servicio de… La P…
 5 Neurólogo(a) adulto GES Alzheimer y otras demencias        Servicio de… Puen…
 6 Médico(a) especialista en Neurología Infantil Hospital de… Servicio de… Cast…
 7 Arquitecto de Software                                     Central de … Ñuñoa
 8 TENS OPERADOR DE EQUIPOS DE ESTERILIZACIÓN                 Servicio de… Peña…
 9 (850-2892) Médico Especialista Broncopulmonar o Internist… Servicio de… Talc…
10 Enfermero(a) Clínico(a) Atención Abierta y Cerrada         Servicio de… Huas…

glimpse(offers_tbl)
Rows: 552
Columns: 3
$ position     <chr> "Medico (a) especialista en Anestesiología 44 horas", "Ti…
$ organization <chr> "Servicio de Salud Maule / Hospital de Constitución", "Se…
$ city         <chr> "Constitución", "Valparaíso", "Recoleta", "La Pintana", "…
I know this is a simple example, but it should allow for different kinds of exploration and data extraction. I hope it helps.


