
VEUCTOR

Training, Selecting, and Aligning Word Embeddings from European Online Job Advertisements


Embedding selection is not neutral. VEUCTOR provides a fully reproducible, taxonomy-driven framework for building robust multilingual labor market intelligence systems grounded in European statistical infrastructure.


What is VEUCTOR?

VEUCTOR is a reproducible methodological framework for training, selecting, and aligning word embedding models built from European Online Job Advertisements (OJAs).

Unlike standard embedding repositories, VEUCTOR does not treat word embeddings as neutral preprocessing tools. It demonstrates that embedding choice is a methodological decision with measurable empirical consequences for downstream labor market tasks.


Dataset

The VEUCTOR embeddings and associated data are publicly available through the DASSI Archive:

   
Repository: DASSI Archive (VEUCTOR Dataset)
DOI: 10.71732/4KFDPX

Methodological Workflow

The entire framework is summarized in the following diagram:

[Figure: VEUCTOR Pipeline]

Figure 1 — End-to-end VEUCTOR pipeline: from OJA data collection to embedding generation, selection via HSS, multilingual alignment with SeNSe, and downstream validation.

The workflow is structured into four main phases:

  1. Data Collection & Pre-processing
  2. Embedding Training & Selection
  3. Multilingual Alignment
  4. Intrinsic & Extrinsic Evaluation

Data Sources

Online Job Advertisements (OJAs)

The corpus comes from the Web Intelligence Hub (WIH) initiative developed by Eurostat and Cedefop under the Trusted Smart Statistics framework.

| Detail | Value |
| --- | --- |
| Sample | WIH-OJA-NLPv1 representative sample |
| Release | r20221217 |
| Advertisements | 4,610,821 |
| Countries | 28 European countries |
| Stratification | Occupation (ISCO-08), contract type, salary, education, working time, economic activity, experience |

ESCO Taxonomy

Evaluation is grounded in ESCO (European Skills, Competences, Qualifications and Occupations), which provides a multilingual occupation hierarchy, skill-occupation relationships, and ISCO alignment. ESCO is used as a semantic benchmark, not as a normative ground truth.


Methodology

1 — Embedding Pool Generation

For each country, FastText models (Bojanowski et al., 2017) are trained on preprocessed OJA corpora via grid search over:

| Hyperparameter | Values |
| --- | --- |
| Vector size | 50, 100, 300 |
| Epochs | 10, 50, 100 |
| Algorithm | SkipGram, CBOW |
| Hierarchical Softmax | 0, 1 |
| Learning rate | 0.01, 0.05, 0.1 |

This yields 108 models per country. Preprocessing includes HTML stripping, token normalization, language-specific stopword removal, and n-gram detection (Gensim Phrases).
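The count of 108 follows directly from the grid: 3 × 3 × 2 × 2 × 3 combinations. A minimal sketch of enumerating the configurations is shown below; the dictionary keys and the `enumerate_configs` helper are illustrative names, not taken from the VEUCTOR code base, and no training is performed here.

```python
from itertools import product

# Hyperparameter grid as listed above. Key names are illustrative assumptions.
GRID = {
    "vector_size": [50, 100, 300],
    "epochs": [10, 50, 100],
    "algorithm": ["skipgram", "cbow"],
    "hierarchical_softmax": [0, 1],
    "learning_rate": [0.01, 0.05, 0.1],
}


def enumerate_configs(grid):
    """Return one dict per combination of hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


configs = enumerate_configs(GRID)
print(len(configs))  # 3 * 3 * 2 * 2 * 3 = 108
```

Each resulting dictionary could then be passed to whichever FastText trainer is used, once per country.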

2 — Intrinsic Evaluation: Hierarchical Semantic Similarity (HSS)

Embedding quality is assessed using HSS (Giabelli et al., 2020), based on Information Content (Resnik, 1999), Lowest Common Ancestor computation, and taxonomic probability estimation.
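To make the ingredients concrete, the sketch below computes taxonomic probabilities, Information Content, and the Lowest Common Ancestor on a toy hierarchy, then combines them in a Resnik-style similarity. It is not the exact HSS formula from Giabelli et al. (2020): the taxonomy, the node labels, and the counts are made up for illustration, and the real ESCO hierarchy is far larger.

```python
import math

# Hypothetical toy taxonomy (child -> parent); labels are not real ESCO codes.
PARENT = {
    "ict": "root",
    "health": "root",
    "developer": "ict",
    "data_analyst": "ict",
    "nurse": "health",
}


def ancestors(node):
    """Path from a node up to the root, the node itself included."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(a, b):
    """First node on a's path to the root that also lies on b's path."""
    on_b_path = set(ancestors(b))
    return next(n for n in ancestors(a) if n in on_b_path)


def probabilities(leaf_counts):
    """p(c): share of observations under concept c or any of its descendants."""
    total = sum(leaf_counts.values())
    mass = {}
    for leaf, count in leaf_counts.items():
        for node in ancestors(leaf):
            mass[node] = mass.get(node, 0) + count
    return {node: m / total for node, m in mass.items()}


def information_content(p):
    """IC(c) = -log p(c), written as log(1/p) so the root yields 0.0."""
    return math.log(1.0 / p)


# Made-up observation counts per leaf occupation.
p = probabilities({"developer": 2, "data_analyst": 2, "nurse": 4})


def resnik_similarity(a, b):
    """Information Content of the lowest common ancestor of a and b."""
    return information_content(p[lowest_common_ancestor(a, b)])


print(lowest_common_ancestor("developer", "data_analyst"))
print(resnik_similarity("developer", "data_analyst"))
print(resnik_similarity("developer", "nurse"))
```

Occupations that share a specific ancestor score higher than occupations that only meet at the root, which is the property a hierarchy-aware similarity is meant to capture.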

For each occupation pair, the framework computes the cosine similarity between the two occupation vectors and the corresponding HSS score. It then derives the Spearman rank correlation (ρ) between the two sets of scores across all pairs. The best model is the one that maximizes ρ(cosine similarity, HSS), so that its embedding geometry agrees as closely as possible with the ESCO hierarchy.
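The selection rule can be sketched in plain Python as follows. The model names, vectors, and HSS scores are invented for illustration, and `select_best` is a hypothetical helper rather than a function from `HSS_eval.py`; a production version would more likely rely on an established statistics library for the rank correlation.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_best(models, pairs, hss):
    """Return the model whose cosine similarities correlate best with HSS."""
    scores = {
        name: spearman([cosine(vecs[a], vecs[b]) for a, b in pairs], hss)
        for name, vecs in models.items()
    }
    return max(scores, key=scores.get), scores


# Toy inputs: two candidate models, three occupation pairs, made-up HSS scores.
pairs = [("dev", "analyst"), ("dev", "nurse"), ("analyst", "nurse")]
hss = [0.9, 0.1, 0.2]
models = {
    "model_a": {"dev": [1.0, 0.0], "analyst": [0.9, 0.1], "nurse": [0.0, 1.0]},
    "model_b": {"dev": [1.0, 0.0], "analyst": [0.0, 1.0], "nurse": [0.9, 0.1]},
}
best, scores = select_best(models, pairs, hss)
print(best, scores)
```

In this toy case `model_a` places the two ICT occupations close together, so its similarity ranking matches the HSS ranking and it is selected.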

3 — Multilingual Alignment: SeNSe

Country-specific embeddings are independently trained and therefore not directly comparable. We apply SeNSe Alignment (Malandri et al., 2024), which includes hyperparameter standardization, anchor selection via NDCG thresholds, orthogonal Procrustes transformation, and pairwise alignment to a UK reference space.
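The orthogonal Procrustes step on its own can be illustrated with NumPy. This is a sketch on synthetic data, not the SeNSe implementation: it omits hyperparameter standardization and NDCG-based anchor selection, and it assumes the anchor vectors are already paired row by row between the two spaces.

```python
import numpy as np


def orthogonal_procrustes(source, target):
    """Orthogonal W minimizing ||source @ W - target||_F (rows are anchors)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


rng = np.random.default_rng(0)

# Synthetic stand-in for anchor vectors in the reference space.
target = rng.normal(size=(20, 5))

# Hidden orthogonal map, simulating an independently trained space.
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
source = target @ rotation.T

w = orthogonal_procrustes(source, target)
aligned = source @ w

print(np.allclose(aligned, target))
```

Because `w` is orthogonal, distances and cosine similarities within the source space are preserved; only its orientation changes to match the reference space.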

Alignment quality is evaluated via the Cross-Lingual Semantic Fitting Score (CLS).

4 — Extrinsic Evaluation

We validate embedding impact on 4-digit ESCO occupation classification, wage prediction, education prediction, experience modeling, and contract-type classification, using Accuracy, RMSE, and R² as metrics.
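For reference, the three metrics can be written out in a few lines. The labels and values below are made up and only show how each metric is computed; they are not VEUCTOR results.

```python
import math


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error of numeric predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up classification labels and regression targets.
acc = accuracy(["2512", "2511", "2221", "2512"], ["2512", "2511", "2221", "2221"])
err = rmse([10.0, 20.0, 30.0], [12.0, 18.0, 30.0])
r2 = r_squared([10.0, 20.0, 30.0], [12.0, 18.0, 30.0])
print(acc, err, r2)
```

Accuracy applies to the classification tasks, while RMSE and R² apply to the regression-style tasks such as wage prediction.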

Best-performing embeddings consistently outperform worst-performing configurations, confirming that embedding selection materially affects empirical outcomes.


Repository Structure

veuctor/
│
├── data/
│   ├── esco/
│   └── supplementary/
│
├── models/
│   ├── fasttext/
│   └── aligned/
│
├── HSS_eval.py
├── alignment_eval.py
├── demo.py
├── requirements.txt
└── README.md

Installation

git clone https://github.com/Crisp-Unimib/veuctor.git
cd veuctor
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Requires Python ≥ 3.12


Citation

If you use VEUCTOR in your research, please cite:

Emilio Colombo, Simone D’Amico, Fabio Mercorio, Mario Mezzanzanica. Training, Selecting, and Aligning Word Embeddings from European Online Job Advertisements. Information Sciences, Volume 741, 2026. https://doi.org/10.1016/j.ins.2026.123274

BibTeX

```bibtex
@article{COLOMBO2026123274,
  title = {VEUCTOR: Training and selecting best vector space models from online job ads for European countries},
  journal = {Information Sciences},
  volume = {741},
  pages = {123274},
  year = {2026},
  issn = {0020-0255},
  doi = {https://doi.org/10.1016/j.ins.2026.123274},
  url = {https://www.sciencedirect.com/science/article/pii/S0020025526002057},
  author = {Emilio Colombo and Simone D'Amico and Fabio Mercorio and Mario Mezzanzanica},
  keywords = {Word embedding, Machine learning, Labor market, NLP}
}
```

References


Acknowledgements

VEUCTOR is partially supported within the research activity of the grant “PILLARS — Pathways to Inclusive Labour Markets” under the call H-2020 TRANSFORMATIONS 18-2020 “Technological transformations, skills, and globalization — future challenges for shared prosperity”, grant agreement number 101004703 — PILLARS. See h2020-pillars.eu.


Maintained by CRISP Research Centre — University of Milano-Bicocca