Preprint / Version 1

Evaluating Semantic Search Versus Keyword Search For Educational Video Retrieval in Technical Subjects

Authors

  • Aiden Christian, Polygence

DOI:

https://doi.org/10.58445/rars.3816

Keywords:

Semantic Search, Keyword Search, Vector Database, Artificial Intelligence

Abstract

The rapid expansion of online educational video platforms has transformed how learners engage with technical subjects, yet the search systems that govern content discovery have received comparatively little scrutiny. Traditional keyword-based retrieval depends on exact lexical overlap, systematically failing learners who lack the precise domain vocabulary needed to articulate effective queries—a barrier that falls disproportionately on novices. This paper evaluates whether semantic search, powered by transformer-based language models and vector similarity search, offers a more learner-centered alternative for educational video retrieval in technical subjects. Two retrieval systems—a keyword-based baseline and a semantic search system using dense embeddings stored via pgvector—were implemented over the same dataset of publicly available technical educational videos and evaluated across thirty queries spanning three categories: exact technical terms, natural language questions, and paraphrased conceptual expressions. Retrieval alignment was assessed qualitatively, with illustrative frequency data reported to characterize observed patterns. Results indicate that semantic search demonstrated substantially higher intent alignment for natural language (~85% vs. ~40%) and paraphrased queries (~80% vs. ~35%), while both systems performed comparably on exact technical terms (~92% vs. ~90%). These findings suggest that the choice of retrieval architecture in educational platforms is not merely a technical decision but a pedagogical one, with direct implications for learner equity, content accessibility, and the design of AI-driven educational systems. Key methodological limitations include single-researcher relevance judgment without inter-rater reliability checks, the absence of an annotated gold standard for optimal retrieval, and a non-parallel query design that may conflate query construction effects with retrieval approach effects.
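The contrast between the two retrieval approaches described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the video titles, the three-dimensional vectors, and the function names below are hypothetical, and the hand-assigned vectors merely stand in for the dense embeddings that a transformer-based model would produce and that pgvector would store. The sketch only shows why a paraphrased query with no lexical overlap can score zero under keyword matching while still ranking the intended video first under cosine similarity.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_score(query, title):
    """Fraction of query tokens that occur verbatim in the title."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokenize(title)) / len(query_tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical corpus: titles mapped to hand-assigned toy vectors.
# A real system would obtain these vectors from an embedding model.
VIDEOS = {
    "Dijkstra's algorithm explained": [1.0, 0.2, 0.0],
    "Introduction to recursion": [0.1, 1.0, 0.1],
    "Binary search trees": [0.2, 0.3, 1.0],
}


def keyword_scores(query):
    """Keyword score of the query against every title in the corpus."""
    return {title: keyword_score(query, title) for title in VIDEOS}


def semantic_scores(query_vector):
    """Cosine similarity of the query vector against every stored vector."""
    return {
        title: cosine_similarity(query_vector, vector)
        for title, vector in VIDEOS.items()
    }


# A paraphrased query that shares no tokens with any title.
paraphrased_query = "how do computers find the shortest route"
# Hypothetical embedding of that query, placed near the Dijkstra video.
paraphrased_vector = [0.9, 0.1, 0.0]

lexical = keyword_scores(paraphrased_query)
semantic = semantic_scores(paraphrased_vector)
best_semantic_match = max(semantic, key=semantic.get)

print(lexical)
print(best_semantic_match)
```

With an exact technical term such as "Dijkstra's algorithm", the keyword score for the matching title is 1.0, which is consistent with the abstract's observation that both systems perform comparably on exact-term queries.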

References

Kane, A. (2024). pgvector (Version 0.8.0) [PostgreSQL extension]. GitHub. https://github.com/pgvector/pgvector

Çakir, H., Acartürk, C., Alaşehir, O., & Çilingir, C. (2018). Improving educational web search for question-like queries through subject classification. Information Processing & Management, 54(6), 1123–1138. https://doi.org/10.1016/j.ipm.2018.06.005

Chawan, P. M., & Malve, A. (2015). A comparative study of keyword-based and semantic-based search engines. International Journal of Innovative Research in Science, Engineering and Technology, 4(6), 4219–4225. https://www.researchgate.net/publication/316514673

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423

Dong, J., Li, X., Xu, Y., Ji, S., He, Y., & Yang, Y. (2018). Dual encoding for zero-example video retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 9346–9355). IEEE. https://arxiv.org/abs/1809.06181

Eller, D. W. (2022). Transparency and the future of semantic searching. Information Services & Use, 42(4), 389–401. https://doi.org/10.3233/ISU-220175

Ethayarajh, K. (2019). How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 55–65). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1006

Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 535–547. https://doi.org/10.1109/TBDATA.2019.2921572

Jurafsky, D., & Martin, J. H. (2023). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition (3rd ed. draft). Stanford University. https://web.stanford.edu/~jurafsky/slp3/

Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 6769–6781). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.550

Khan, S. (2024). Khan Academy [Video platform]. https://www.khanacademy.org

Li, B., Zhou, H., He, J., Wang, M., Yang, Y., & Li, L. (2020). On the sentence embeddings from pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 9119–9130). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.733

Lin, D., Fidler, S., Kong, C., & Urtasun, R. (2014). Visual semantic search: Retrieving videos via complex textual queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1–8). IEEE. https://www.cs.toronto.edu/~fidler/papers/lin_et_al_cvpr14.pdf

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press. https://nlp.stanford.edu/IR-book/

Massachusetts Institute of Technology. (2024). MIT OpenCourseWare [Open educational resource]. https://ocw.mit.edu

Merzougui, G., Djoudi, M., & Behaz, A. (2012). Conception and use of ontologies for indexing and searching by semantic contents of video courses. arXiv preprint. https://arxiv.org/abs/1201.5102

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (pp. 3982–3992). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410

Stoica, A. S., Barbu, T., & Breaban, M. (2021). Classification of educational videos using a semi-supervised method. Neural Networks, 144, 487–498. https://doi.org/10.1016/j.neunet.2021.09.019

Toriah, S., Ghalwash, A., & Youssif, A. (2018). Semantic-based video retrieval: A survey. Journal of Computer and Communications, 6(8), 1–15. https://www.scirp.org/journal/paperinformation.aspx?paperid=86488

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008).

Veluru, S. R., Marella, V. C., & Erukude, S. T. (2025). The evolution of search engines: From keyword matching to AI-powered understanding. SSRN Electronic Journal. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5403467

Wu, Y., & Ngo, C. W. (2024). Interpretable embedding for ad-hoc video search. arXiv preprint. https://arxiv.org/abs/2402.11812

Posted

2026-05-28