For citation:
Ternikov A. A. Skill-based clustering algorithm for online job advertisements. Izvestiya of Saratov University. Mathematics. Mechanics. Informatics, 2022, vol. 22, iss. 2, pp. 250-265. DOI: 10.18500/1816-9791-2022-22-2-250-265, EDN: QOCNLY
Skill-based clustering algorithm for online job advertisements
Clustering categorical data is one of the harder problems in data mining. This paper presents an algorithm for clustering job advertisements using information about the skills they require. The first stage is a procedure for standardizing unstructured text. It includes steps for identifying synonyms and common terms by combining TF-IDF and $n$-gram approaches for translated and transliterated terms. The proposed algorithm is then tested on data obtained from an interregional online recruitment platform. The algorithm cross-checks the number of extracted clusters using hierarchical cluster analysis and Girvan-Newman community detection. The resulting number of clusters is validated with internal validity measures and yields non-overlapping sets of terms that describe particular groups of occupations in the information technology sector. Within the resulting clusters, well-matched and mismatched terms are identified using the Silhouette Index. The procedures described in the paper minimize human involvement in the clustering process and produce interpretable clusters for subsequent analysis. Overall, an approach to identifying clusters from categorical data is presented and tested on a sample of online job advertisements. It has considerable potential for feature construction in machine learning research and for applied labor market research in economics.
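Two of the steps named in the abstract can be illustrated with a minimal sketch: character $n$-gram overlap as a signal that two skill strings are variants of one term, and per-term Silhouette scores that flag well-matched and mismatched terms within clusters. This is not the paper's implementation. The function names, the toy advertisements, the trigram size, and the choice of Jaccard distance over advertisement co-occurrence are all assumptions made for illustration.

```python
from statistics import mean


def char_ngrams(term, n=3):
    """Set of character n-grams of a term, padded with spaces (n=3 is an assumption)."""
    padded = f" {term.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def skill_distance(ads):
    """Build a distance function between skills from the ads that mention them.

    Assumption: distance = 1 - Jaccard similarity of the sets of ad indices.
    """
    occurrences = {}
    for idx, skills in enumerate(ads):
        for skill in skills:
            occurrences.setdefault(skill, set()).add(idx)

    def dist(x, y):
        return 1.0 - jaccard(occurrences[x], occurrences[y])

    return dist


def silhouette_scores(labels, dist):
    """Silhouette value per term: (b - a) / max(a, b).

    a is the mean distance to the other terms in the same cluster,
    b is the smallest mean distance to the terms of any other cluster.
    Terms alone in their cluster get 0, following the usual convention.
    """
    clusters = {}
    for term, label in labels.items():
        clusters.setdefault(label, []).append(term)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = {}
    for term, label in labels.items():
        own = [t for t in clusters[label] if t != term]
        if not own:
            scores[term] = 0.0
            continue
        a = mean(dist(term, t) for t in own)
        b = min(
            mean(dist(term, t) for t in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores[term] = (b - a) / denom if denom > 0 else 0.0
    return scores


# Hypothetical toy data: each ad is the set of skills it lists.
ADS = [
    {"python", "pandas", "sql"},
    {"python", "pandas"},
    {"python", "sql"},
    {"javascript", "react", "css"},
    {"javascript", "react"},
    {"javascript", "css"},
]
LABELS = {
    "python": "data", "pandas": "data", "sql": "data",
    "javascript": "frontend", "react": "frontend", "css": "frontend",
}

if __name__ == "__main__":
    # Spelling variants share most trigrams; unrelated terms share none.
    print(jaccard(char_ngrams("javascript"), char_ngrams("java script")))
    print(jaccard(char_ngrams("javascript"), char_ngrams("python")))
    # Terms with positive scores fit their cluster; negative scores flag misfits.
    print(silhouette_scores(LABELS, skill_distance(ADS)))
```

In this toy setting every term scores above zero under the given labels, while moving `css` into the `data` cluster makes its score negative, which is the kind of mismatch the Silhouette Index is meant to expose.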