The Data Science Lab
since 2005
Wenfeng Hou. Semantic Enhancement for Text Representation, Master’s thesis, 2020

Thanks to recent developments in web technology, a wide variety of textual information can now be found online, including social media posts, news, product reviews, and instant messages. How to automatically classify and organize such texts is a topic of great current interest. In Natural Language Processing (NLP), text classification is a long-standing task, and text representation is its foundation. To represent a text, we first need representations of its words. Existing language representation models, including Word2vec, ELMo, GPT, and BERT, are widely used for word representation and are highly successful at processing natural language. However, they mainly capture implicit representations. Models that analyze a text's wider context can potentially capture richer information, helping deep neural networks gain a better understanding of the text. Incorporating semantic information into a text's representation is valuable and important because the rich semantics associated with word representations can supplement text representation. New approaches are therefore necessary to represent semantics in combination with existing text representations.

To address the above-mentioned research needs, the models presented in this study improve text representation and term weighting by utilizing external knowledge. In contrast to previous work, the models proposed here exploit multi-level knowledge, involving external semantic information to achieve the semantic enhancement of text representation.

In Chapter 3, we propose an Entity-based Concept Knowledge-Aware (ECKA) representation model that incorporates semantic information into short text representations. ECKA is a multi-level semantic enhancement model for short text representation. It extracts semantic features at the word, entity, concept and knowledge levels using CNNs. Since the words, entities, concepts and knowledge entities in the same short text differ in how informative they are for short text classification, attention networks are applied to capture aspect-oriented attentive representations from the text's multi-level features. The final multi-level semantic representation is formed by concatenating these individual-level representations and is then used for text classification.
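The attentive pooling and concatenation across levels can be sketched as follows. This is a minimal illustration only, not the thesis's implementation: the learned attention queries, the feature dimensions, and the random toy inputs are all hypothetical stand-ins for the CNN-extracted features at the word, entity, concept and knowledge levels.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attentive_level_repr(features, query):
    """Attention-pooled representation of one level.

    features: (n_items, dim) features for one level
              (e.g. words, entities, concepts, or knowledge entities).
    query:    (dim,) attention query vector (hypothetical learned parameter).
    Returns one (dim,) vector that weights informative items higher.
    """
    scores = features @ query            # (n_items,) relevance scores
    weights = softmax(scores)            # attention distribution, sums to 1
    return weights @ features            # weighted sum -> (dim,)

def multi_level_representation(level_features, queries):
    """Concatenate the attentive representations of all levels."""
    parts = [attentive_level_repr(f, q)
             for f, q in zip(level_features, queries)]
    return np.concatenate(parts)         # (n_levels * dim,)

rng = np.random.default_rng(0)
dim = 8
# Toy features for four levels: word, entity, concept, knowledge.
levels = [rng.normal(size=(n, dim)) for n in (12, 3, 4, 5)]
queries = [rng.normal(size=dim) for _ in levels]
rep = multi_level_representation(levels, queries)
print(rep.shape)  # (32,)
```

The concatenated vector `rep` would then feed a downstream classifier, so each level contributes a fixed-size, attention-weighted summary regardless of how many items that level contains.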

In Chapter 4, we propose a hybrid term-weighting method that combines term frequency with semantic similarity. When analyzing a term, we first use Term Frequency-Inverse Document Frequency (TF-IDF) to compute an initial weight. Next, we apply a named entity-based concept-sense disambiguation process to obtain the term's concepts. We then calculate the term's semantic similarity to the document. Finally, the TF-IDF weights are revised according to these semantic similarities, so that the resulting weights reflect both the frequency and the semantics of the various terms in the text.

All of these models are applied to text classification tasks. The performance of the proposed semantically enhanced models is compared against a range of existing methods, and the results demonstrate the effectiveness of the proposed models.

About us
School of Computing, Faculty of Science and Engineering, Macquarie University, Australia
Level 3, 4 Research Park Drive, Macquarie University, NSW 2109, Australia
Tel: +61-2-9850 9583
Staff: firstname.surname(a)mq.edu.au
Students: firstname.surname(a)student.mq.edu.au
Contacts@datasciences.org