The course "Text mining and sentiment analysis", in line with the skills that the degree programme aims to develop, provides students with knowledge of the quantitative and economic statistics tools needed to carry out rigorous empirical analysis based on unstructured data. Students will acquire a solid background in text analytics techniques from both theoretical and practical perspectives. They will also become familiar with the different types of data sources, with particular reference to unstructured and big data.
At the end of the course, students will be able to process the unstructured information contained in text data so as to make text as informative as standard structured data, and to investigate relationships and patterns that would otherwise be extremely difficult, if not impossible, to discover. Further, students will be able to categorize and cluster text to produce economic statistical information and to assess the quality of the data sources from a total quality perspective.
They will be able to address specific problems in the area of text mining and sentiment analysis. In particular, they will know the main notions needed to understand text processing, foundations of natural language processing, text classification, and topic modeling.
The course has the following specific objectives:
- Students will understand what text analytics is and will learn natural language processing techniques, such as sentiment analysis and topic modeling.
- They will learn how to convert unstructured text-based character data into structured numeric data.
- They will become aware of the pros and cons in the use of unstructured data, also in terms of quality.
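As a rough illustration of the second objective, the sketch below turns two short texts into a document-term matrix of word counts. It is only a minimal example under stated assumptions: the course labs use R, Python is used here purely for illustration, and the function names `tokenize` and `document_term_matrix` as well as the sample sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical sample documents, invented for this illustration.
docs = ["The service was good", "Good food, bad service"]
vocab, dtm = document_term_matrix(docs)
print(vocab)
for row in dtm:
    print(row)
```

Each row is one document and each column one term, so the text can then be handled with the same tools as any other numeric table.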
The organization of the course and its assignments will allow students to develop communication skills, to work both in groups and independently, and to present the results of their research work effectively and on time.
Students learn to process and analyze unstructured data using R, one of the most widely used statistical software environments.
The course is fully consistent with the educational aims of the EMOS (European Master in Official Statistics) label and of the Master course in Economic and Data Analysis.
The course offers a broad overview of text analytics and language processing techniques. Particular attention is devoted to the quality of the data sources, from a total quality perspective. An example framework of quality criteria for social data is introduced.
• Unstructured data and Big data: what they are and how to use them; characteristics of different data sources; Big data and unstructured data as sources for economic analysis in a context of integrated data sources.
• Working with strings: basic tools to deal with character strings (e.g. length computation, pattern recognition, regular expressions)
• Natural languages: classification techniques and preliminary data processing (pre-processing, tokenization, stemming, lemmatization, and named entity recognition)
• Text mining: introduction and different approaches; document representation; text categorization and clustering (identifying the clustering structure of a corpus of text documents and assigning documents to the identified clusters; typical clustering algorithms, such as centroid-based clustering, e.g. k-means); document summarization; string distances and text similarity detection.
• Sentiment analysis: designing and developing methods for sentiment classification and polarity detection. Dictionary-based and machine learning approaches. Text visualization. The differences between sentiment analysis and emotion detection.
• How to build socio-economic indicators using sentiment analysis.
• Fairness and Errors.
• Quality framework for Twitter Data.
• Empirical Applications.
• Case study: investigation of the communication of Corporate Social Responsibility through Twitter.
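To make the dictionary approach to polarity detection listed above more concrete, here is a minimal sketch under stated assumptions. The lexicon is a tiny hypothetical word list invented for this example, not a published sentiment dictionary, and the function names are illustrative. The course labs use R; Python is used here only for illustration.

```python
import re

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "poor": -1, "sad": -1}


def sentiment_score(text, lexicon=LEXICON):
    """Sum the lexicon scores of all word tokens; unknown words count as 0."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def polarity(text, lexicon=LEXICON):
    """Map the summed score to a polarity label."""
    score = sentiment_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("Great product, very happy"))
print(polarity("Poor service and bad food"))
print(polarity("It arrived on Monday"))
```

A plain word count like this ignores negation ("not good") and intensity, which is one reason the syllabus pairs the dictionary approach with a machine learning approach.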
The course illustrates several application areas of these techniques: economic, social, and business decision making.
• Lab sessions for applications using statistical software: R.
Lectures and lab sessions (students will be encouraged, through active discussion and participation, to create their own case study).
Individual cases or personal projects developed by students according to the themes proposed by the teachers.
Evaluation will be based on:
• A written final exam entailing theoretical questions and exercises to be solved with the R software.
• Assignments set by the professor, which may include case studies, reports and (possibly) PowerPoint presentations, and which will be considered as part of the final evaluation.
• An oral discussion of case studies and research results and/or a more in-depth discussion of the course topics.
Because Eurostat provides new, innovative teaching material for EMOS-labelled masters every year, the teaching activity will be continually updated in order to maintain high-quality, innovative educational standards that are recognized at the international level.
Important note: Should the competent authority issue provisions on the containment and management of an epidemiological emergency, the teaching may differ from what is stated in this syllabus so that the course and examinations comply with the regulations.