Step-by-step manual to use R and Selenium to scrape Empleos Publicos | R-Bloggers


[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers.]



Due to delays with my stock market payment, if this post is useful for you, I kindly request a minimal donation: Buy me a coffee. It will be used to continue my open source efforts. The complete explanation is here: A personal message from an Open Source employee.

Motivation

My friend Nicolas Didier asked me about scraping Empleos Públicos (public job postings) with R or Python. Here is a short example for him and for everyone else who can benefit from it.

The following steps were adjusted from a tutorial that I had given in 2023 at the University of Michigan (Go Blue!).

Required R packages

  • RSelenium: R–Selenium integration
  • rvest: HTML processing
  • dplyr: to load the pipe operator (can be used later for data cleaning)
  • purrr: iteration (i.e., repeated operations)

I installed RSelenium from the R console:

if (!require(RSelenium)) install.packages("RSelenium")

# or

remotes::install_github("ropensci/RSelenium")

For the rest of the packages:

if (!require(rvest)) install.packages("rvest")
if (!require(dplyr)) install.packages("dplyr")
if (!require(purrr)) install.packages("purrr")

Installation of Selenium and Chrome/Chromium

Note for Ubuntu/Debian users: check that Chrome or Chromium is installed on your system. One of the many options is to use the Bash console:

sudo add-apt-repository ppa:savoury1/chromium
sudo apt update
sudo apt install chromium-browser
sudo apt install chromium-chromedriver

Not using the PPA installs the Snap version of Chromium, which is not compatible with Selenium.

I tried to start Selenium as described in the official guide and it didn’t work.

I had to install Chrome or Chromium. I am on Manjaro, so I ran sudo pacman -S chromium. Windows/Mac users can use Google Chrome.

An extra requirement was to download the Selenium server. Based on this, I created a folder to store the data for this post by typing in the VS Code terminal:

mkdir -p /tmp/didier-example
cd /tmp/didier-example

Then I opened R and downloaded the JAR file:

url_jar <- "https://github.com/SeleniumHQ/selenium/releases/download/selenium-3.9.1/selenium-server-standalone-3.9.1.jar"
sel_jar <- "selenium-server-standalone-3.9.1.jar"

if (!file.exists(sel_jar)) {
  download.file(url_jar, sel_jar)
}

I had to launch Selenium from a new terminal:

cd /tmp/didier-example
java -jar selenium-server-standalone-3.9.1.jar
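If you prefer to keep everything in one R session, the server can also be launched from R with base R's system2() in non-blocking mode. This is a sketch under the assumption that Java is on the PATH and the JAR was downloaded as above; the log file name is my own choice:

```r
# Launch the Selenium server from R without blocking the console.
sel_jar <- "/tmp/didier-example/selenium-server-standalone-3.9.1.jar"

system2(
  "java",
  args = c("-jar", shQuote(sel_jar)),
  wait = FALSE,            # return immediately; the server keeps running
  stdout = "selenium.log", # capture server output for debugging
  stderr = "selenium.log"
)

# Give the server a moment to bind to port 4444 before connecting
Sys.sleep(3)
```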

Back in the R terminal, I was finally able to run:

library(RSelenium)
library(rvest)
library(dplyr)
library(purrr)

rmDr <- remoteDriver(port = 4444L, browserName = "chrome")

rmDr$open(silent = TRUE)

url <- "https://www.empleospublicos.cl"

rmDr$navigate(url)
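As an alternative to downloading the JAR manually, RSelenium's rsDriver() can manage the server for you (it wraps the wdman package, which fetches the binaries). A sketch, assuming a chromedriver compatible with your browser can be downloaded:

```r
library(RSelenium)

# rsDriver() downloads and starts a Selenium server, then opens a
# browser client, all in one call.
driver <- rsDriver(browser = "chrome", port = 4445L, verbose = FALSE)
rmDr <- driver$client

rmDr$navigate("https://www.empleospublicos.cl")

# When finished:
# rmDr$close()
# driver$server$stop()
```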

This should display a new Chrome/Chromium window that says: “Chrome is being controlled by automated test software”.

Scrape the data

With the help of the browser inspector (Ctrl + Shift + I) I examined the page to find the element corresponding to the search bar.

For example, I can search for “Ministerio de Salud”, because there were many postings from that organization on the landing page:

search_box <- rmDr$findElement(using = "id", value = "buscadorprincipal")
search_box$sendKeysToElement(list("Ministerio de Salud", key = "enter"))
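Right after submitting a search, the results may not have finished rendering. A small polling helper (my own sketch, not part of RSelenium's API) avoids brittle fixed-length Sys.sleep() calls:

```r
# Retry an element lookup until it succeeds or the timeout expires.
# findElement() errors while the page is still loading, so we poll.
wait_for_element <- function(driver, using, value,
                             timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    el <- tryCatch(
      driver$findElement(using = using, value = value),
      error = function(e) NULL
    )
    if (!is.null(el)) return(el)
    if (Sys.time() > deadline) stop("Timed out waiting for: ", value)
    Sys.sleep(interval)
  }
}

# Example (assumes the rmDr session from above):
# results <- wait_for_element(rmDr, "css selector", "div.items")
```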

That typed “Ministerio de Salud” and submitted the search on my behalf. Inspecting the results, I saw that every job offer sits inside a div with class “items”.

The first offer listed is this:

Ministerio de Salud
Constitución
No pide experiencia (“no experience required”)

html <- read_html(rmDr$getPageSource()[[1]])

offers <- html %>% html_nodes("div.items")

offers_tbl <- map_df(offers, function(offer) {
  # Extract position (job title)
  position <- offer %>%
    html_node("h3 a") %>%
    html_text(trim = TRUE)

  # Extract organization (usually the first <p> inside .top)
  organization <- offer %>%
    html_node(".top p") %>%
    html_text(trim = TRUE)

  # Extract city (the second <p> inside .cnt)
  city <- offer %>%
    html_nodes(".cnt p") %>%
    .[2] %>%
    html_text(trim = TRUE)

  tibble(
    position = position,
    organization = organization,
    city = city
  )
})

The result has the following structure:

offers_tbl
# A tibble: 552 × 3
   position                                                   organization city 
   <chr>                                                      <chr>        <chr>
 1 Medico (a) especialista en Anestesiología 44 horas         Servicio de… Cons…
 2 Titulares de la Planta Profesional Ley 18.834              Servicio de… Valp…
 3 ENFERMERA-O, JORNADA DIURNA, GRADO 12, PARA SERVICIO CLÍN… Servicio de… Reco…
 4 Psiquiatra infanto-juvenil sistema de atención intersecto… Servicio de… La P…
 5 Neurólogo(a) adulto GES Alzheimer y otras demencias        Servicio de… Puen…
 6 Médico(a) especialista en Neurología Infantil Hospital de… Servicio de… Cast…
 7 Arquitecto de Software                                     Central de … Ñuñoa
 8 TENS OPERADOR DE EQUIPOS DE ESTERILIZACIÓN                 Servicio de… Peña…
 9 (850-2892) Médico Especialista Broncopulmonar o Internist… Servicio de… Talc…
10 Enfermero(a) Clínico(a) Atención Abierta y Cerrada         Servicio de… Huas…
glimpse(offers_tbl)
Rows: 552
Columns: 3
$ position      <chr> "Medico (a) especialista en Anestesiología 44 horas", "Ti…
$ organization  <chr> "Servicio de Salud Maule / Hospital de Constitución", "Se…
$ city          <chr> "Constitución", "Valparaíso", "Recoleta", "La Pintana", "…
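As the package list hinted, dplyr can take over from here for data cleaning. For instance, counting offers per organization; shown below with a tiny stand-in tibble so the snippet runs on its own, since the real offers_tbl comes from the scrape above:

```r
library(dplyr)

# Stand-in for offers_tbl; the real one is built by the scraping code above
offers_tbl <- tibble(
  position     = c("Enfermera/o", "Medico/a", "Arquitecto de Software"),
  organization = c("Servicio de Salud Maule", "Servicio de Salud Maule",
                   "Central de Abastecimiento"),
  city         = c("Constitucion", "Talca", "Nunoa")
)

# Which organizations post the most offers?
offers_tbl %>%
  count(organization, sort = TRUE)
```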

I know this is a simple example, but it should enable different kinds of exploration and data extraction. I hope it helps.


