Text mining extracts valuable insights from unstructured data. It tackles challenges like high dimensionality, language ambiguity, and noisy content. Preprocessing techniques like tokenization and stemming clean and structure text for analysis.
Natural language processing enables sentiment analysis and topic modeling, revealing opinions and themes in text. Web mining concepts like crawling and page ranking help navigate and prioritize online content, uncovering relationships between websites and identifying influential sources.
Text Mining
Challenges of unstructured text mining
- High dimensionality and sparsity of text data
  - Large vocabulary size leads to a high-dimensional feature space
  - Most documents contain only a small subset of words, resulting in sparse bag-of-words representations (see the sketch after this list)
- Ambiguity and variability in natural language
  - Words can have multiple meanings (polysemy) depending on context
  - Different words can have similar meanings (synonymy)
  - Variations in spelling, abbreviations, and colloquialisms
- Handling noise and irrelevant information
  - Text data often contains irrelevant or noisy content (ads, headers, footers)
  - Misspellings, grammatical errors, and informal language
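To make the sparsity point concrete, here is a minimal bag-of-words sketch; the use of scikit-learn's CountVectorizer and the three toy documents are illustrative assumptions, not part of the notes.

```python
# Minimal sketch of bag-of-words sparsity (library and documents are illustrative).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Text mining extracts insights from unstructured data",
    "Web mining analyzes hyperlink structure",
    "Sentiment analysis reveals opinions in text",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)            # sparse document-term matrix

print(X.shape)                                # (3 documents, |vocabulary| features)
print(vectorizer.get_feature_names_out())     # vocabulary spans all documents
print(X.toarray())                            # mostly zeros: each document uses few words
```

Even on three short sentences most matrix entries are zero; with a realistic vocabulary of tens of thousands of terms, the sparsity is far more extreme.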
Text preprocessing techniques
- Tokenization
  - Splits text into individual words or tokens
  - Handles punctuation, special characters, and case sensitivity
- Stemming and lemmatization
  - Stemming reduces words to their base or root form by removing suffixes (running → run)
  - Common algorithms: Porter stemmer, Snowball stemmer
  - Lemmatization reduces words to their dictionary form (lemma), considering context and part of speech (better → good)
- Stop word removal
  - Eliminates common words that carry little meaning (the, and, is)
  - Uses predefined stop word lists or frequency-based filtering
  - Improves efficiency and reduces noise in text analysis
- Handling case sensitivity, numbers, and special characters
  - Converting text to lowercase for consistency
  - Removing or normalizing numbers and special characters ($, %, @)
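The steps above can be chained into a small pipeline. The sketch below is one possible version using NLTK (an assumed library choice); it presumes the relevant NLTK resources (e.g. punkt, stopwords, wordnet) have already been downloaded with nltk.download, and the example sentence is made up.

```python
# Minimal preprocessing sketch: tokenize, lowercase, strip non-alphabetic tokens,
# remove stop words, then compare stemming and lemmatization.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The runners were running faster than expected, despite 3 delays!"

# Tokenization and case normalization
tokens = nltk.word_tokenize(text.lower())

# Remove numbers, punctuation, and special characters
tokens = [t for t in tokens if re.fullmatch(r"[a-z]+", t)]

# Stop word removal with NLTK's predefined English list
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# Stemming (suffix stripping) vs. lemmatization (dictionary form)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])        # e.g. 'running' -> 'run'
print([lemmatizer.lemmatize(t) for t in tokens])
```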
NLP for sentiment and topics
- Sentiment analysis
  - Determines the sentiment or opinion expressed in text (positive, negative, neutral)
  - Lexicon-based approaches use predefined sentiment lexicons (VADER, TextBlob); see the sketch after this list
  - Machine learning-based approaches train classifiers on labeled sentiment data (Naive Bayes, SVM)
- Topic modeling
  - Discovers latent topics in a collection of documents
  - Latent Dirichlet Allocation (LDA) is a probabilistic model that assumes documents are mixtures of topics and topics are mixtures of words; see the sketch after this list
  - Non-negative Matrix Factorization (NMF) factorizes the document-term matrix into topic-term and document-topic matrices
  - Applications include document clustering, summarization, trend analysis, and emerging topic detection
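As referenced in the sentiment item above, a lexicon-based pass can be sketched with NLTK's VADER analyzer; the library choice, the ±0.05 threshold convention, and the example sentences are assumptions for illustration, and the 'vader_lexicon' resource must be downloaded first.

```python
# Minimal lexicon-based sentiment sketch with VADER.
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

sentences = [
    "The product is absolutely great!",
    "This was a terrible, disappointing experience.",
    "The package arrived on Tuesday.",
]

for sentence in sentences:
    scores = sia.polarity_scores(sentence)    # dict with neg, neu, pos, compound
    # Common convention: compound >= 0.05 positive, <= -0.05 negative, else neutral
    print(sentence, scores["compound"])
```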
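For topic modeling, here is a minimal LDA sketch; scikit-learn, the four-document toy corpus, and the choice of two topics are illustrative assumptions.

```python
# Minimal LDA sketch: fit two topics on a toy corpus and print top terms per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stock market prices fell sharply today",
    "investors worry about market volatility",
    "the team won the championship game",
    "players celebrated the game with fans",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)              # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)               # document-topic mixtures

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):   # topic-term weights
    top = weights.argsort()[-3:][::-1]
    print(f"topic {k}:", [terms[i] for i in top])
```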
Web mining concepts
- Web crawling
  - Systematically browses and indexes web pages
  - Follows hyperlinks using breadth-first or depth-first traversal strategies; see the crawler sketch after this list
  - Handles dynamic content, redirects, and robot exclusion protocols (robots.txt)
- Page ranking
  - Assigns importance scores to web pages based on link structure
  - The PageRank algorithm considers the number and quality of incoming links, iteratively calculating scores until convergence; see the power-iteration sketch after this list
  - HITS (Hyperlink-Induced Topic Search) identifies hub and authority pages based on mutual reinforcement
- Link analysis
  - Analyzes the hyperlink structure of web pages to infer relationships and similarities
  - Co-citation analysis measures similarity between pages based on shared incoming links
  - Bibliographic coupling measures similarity between pages based on shared outgoing links; see the matrix sketch after this list
  - Applications include web search ranking, identifying influential websites and communities, and detecting spam or link farms
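As referenced in the crawling item, a breadth-first crawler that respects robots.txt can be sketched as below; requests, BeautifulSoup, the seed URL, and the user-agent string are all illustrative assumptions, and a real crawler would add politeness delays, retries, and content deduplication.

```python
# Minimal breadth-first crawler sketch restricted to the seed's domain.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=20, agent="example-crawler"):
    robots = RobotFileParser()
    robots.set_url(urljoin(seed, "/robots.txt"))
    robots.read()

    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()                      # FIFO queue -> breadth-first order
        if not robots.can_fetch(agent, url):
            continue                               # respect the robot exclusion protocol
        resp = requests.get(url, timeout=5, headers={"User-Agent": agent})
        soup = BeautifulSoup(resp.text, "html.parser")
        print(url, soup.title.string if soup.title else "")
        for link in soup.find_all("a", href=True):
            nxt = urljoin(url, link["href"])       # resolve relative links
            if urlparse(nxt).netloc == urlparse(seed).netloc and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

# crawl("https://example.com")   # hypothetical seed URL
```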
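For page ranking, a power-iteration sketch of PageRank on a toy link graph; the damping factor 0.85, the tolerance, and the graph itself are illustrative assumptions.

```python
# Minimal PageRank power iteration: repeatedly redistribute rank along links
# until the scores stop changing.
def pagerank(links, damping=0.85, tol=1e-8, max_iter=100):
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}            # start from a uniform distribution
    for _ in range(max_iter):
        new = {}
        for p in pages:
            # Sum of rank contributions from pages that link to p
            incoming = sum(ranks[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * incoming
        converged = sum(abs(new[p] - ranks[p]) for p in pages) < tol
        ranks = new
        if converged:
            break
    return ranks

# Toy graph: each page maps to the set of pages it links to
graph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}, "D": {"C"}}
print(pagerank(graph))   # C ranks highest: it receives the most incoming links
```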
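Finally, co-citation and bibliographic coupling both reduce to products of the link adjacency matrix; the NumPy sketch below uses a made-up four-page matrix purely for illustration.

```python
# Co-citation and bibliographic coupling from a link adjacency matrix.
import numpy as np

# A[i, j] = 1 if page i links to page j
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]])

# Bibliographic coupling: pages are similar if they link to the same pages
coupling = A @ A.T        # entry (i, j) counts shared outgoing links
# Co-citation: pages are similar if the same pages link to them
cocitation = A.T @ A      # entry (i, j) counts shared incoming links

print("coupling:\n", coupling)
print("co-citation:\n", cocitation)
```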