Izvestiya of Saratov University. Mathematics. Mechanics. Informatics

ISSN 1816-9791 (Print)
ISSN 2541-9005 (Online)


For citation:

Ternikov A. A. Skill-based clustering algorithm for online job advertisements. Izvestiya of Saratov University. Mathematics. Mechanics. Informatics, 2022, vol. 22, iss. 2, pp. 250-265. DOI: 10.18500/1816-9791-2022-22-2-250-265, EDN: QOCNLY

This is an open access article distributed under the terms of Creative Commons Attribution 4.0 International License (CC-BY 4.0).
Published online: 31.05.2022
Language: English
Article type: Article
UDC: 51-77
EDN: QOCNLY

Skill-based clustering algorithm for online job advertisements

Authors: Ternikov Andrei Alexandrovich, Higher School of Economics – National Research University
Abstract: 

Clustering categorical data is a challenging problem in data mining. This paper presents a clustering algorithm for job vacancies that uses information about the skills required. As a first step, a procedure for standardizing unstructured textual information is proposed. It includes stages that identify synonyms and general terms by combining TF-IDF and $n$-gram approaches for translated and transliterated terms. The algorithm is then validated on data obtained from a cross-regional hiring platform. It includes validation of cluster extraction using hierarchical cluster analysis and Girvan–Newman community search. The resulting number of clusters is verified with internal validity scores and yields disjoint sets of terms that describe particular occupational groups in the IT sector. Within the obtained clusters, well-matched and mismatched terms are identified using Silhouette scores. These procedures minimize human involvement in the clustering itself and produce reasonable clusters for subsequent interpretation and analysis. Overall, an approach to identifying clusters in categorical data is presented and tested on a sample of online job advertisements. It has high potential for feature engineering tasks in machine learning research and for applied labor market research in economics.
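
The Silhouette-based identification of well-matched and mismatched terms can be illustrated with a minimal sketch. It is not the paper's implementation: the toy vacancies, the two-cluster partition, and the choice of Jaccard distance between the vacancy sets of two terms are all assumptions made for illustration.

```python
# Hypothetical toy data: each vacancy is the set of skill terms it lists.
vacancies = [
    {"python", "sql", "pandas"},
    {"python", "pandas"},
    {"sql", "python"},
    {"javascript", "css", "html"},
    {"javascript", "html"},
    {"css", "html", "sql"},
]

def occurrences(term):
    """Indices of the vacancies that mention the term."""
    return {i for i, skills in enumerate(vacancies) if term in skills}

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the vacancy sets of two terms (assumed measure)."""
    sa, sb = occurrences(a), occurrences(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

def silhouette(term, clusters):
    """Silhouette score of one term for a given partition of the terms."""
    own = next(c for c in clusters if term in c)
    if len(own) == 1:
        return 0.0  # usual convention for singleton clusters
    # a: mean distance to the other terms of the same cluster
    a = sum(jaccard_distance(term, t) for t in own if t != term) / (len(own) - 1)
    # b: smallest mean distance to the terms of any other cluster
    b = min(
        sum(jaccard_distance(term, t) for t in other) / len(other)
        for other in clusters
        if other is not own
    )
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

# Hypothetical partition of terms into two occupation-like groups.
clusters = [{"python", "sql", "pandas"}, {"javascript", "css", "html"}]
scores = {t: silhouette(t, clusters) for c in clusters for t in c}

# Terms listed from worst-matched to best-matched.
for term, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{term:12s} {score:+.2f}")
```

In this toy sample, "sql" also appears in a front-end vacancy, so it receives the lowest score of the six terms. A score near 1 marks a term that fits its cluster well, while a score near 0 or below marks a candidate mismatch.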

Received: 07.08.2021
Accepted: 08.02.2022
Published: 31.05.2022