Semantic Web Mining Thesis Statements

CSE 591: Semantic Web Mining -- Spring 2012

Assoc. Prof. Hasan Davulcu
Meeting Times: Mon, Wed 2:00 - 3:15 pm     Location: ECGG 347
Office Hours: Mon, Thurs 3:30 - 4:30pm     Location: BY 564


According to a Nature article, the World Wide Web doubles in size approximately every 8 months.
There are approximately 20 million content areas on the Web. According to L. Giles and S. Lawrence:

"85% of users use search engines to find information. Consumers use search engines to
locate and buy goods or to research many decisions (such as choosing a vacation
destination, medical treatment or election vote). However, the search engines are currently
lacking in comprehensiveness and timeliness, and do not index sites equally. The current state
of search engines can be compared to a phone book which is updated irregularly, is biased
toward listing more popular information, and has most of the pages ripped out."

Though the Web is rich with information, gathering and making sense of this data is difficult because
the documents of the Web are largely unorganized. The biggest challenge in the next several decades
is how to effectively and efficiently dig out a machine-understandable and queryable information and
knowledge layer, called the Semantic Web, from unorganized, human-readable Web data.

What will you learn in this course?

  • What are Syntax, Semantics and Structure in HTML, text documents and data
  • What are the computational aspects of Information Extraction (IE)
  • Information Integration with unstructured and semi-structured sources
  • Regular expressions, Regular tree expressions, XPath, XSLT, XQuery
  • Horn Rules, Description Logic, Frame Logic, Topic Maps, Inductive Logic Programming
  • What are Meta-Data, Ontology, XML, RDF, RDFS, DAML+OIL
  • Text classification techniques using statistical and knowledge based techniques
  • Text clustering and Document Summarization
  • Topic Detection, Sentiment Analysis
  • How to build Web Agents and Web Crawlers
  • How to mine ontologies from the Web, and build ontology-directed applications
  • How to build domain-specific Semantic Search Engines to improve Web Search
  • Trend detection in streaming data (such as Twitter)
  • Recommendation Systems and Algorithms
  • Applications in E-Commerce and Bio-Informatics
  • How to do research in Semantic Web Mining


This is an advanced course intended for graduate students with some background in databases,
compilers and automata theory. Some exposure to HTML and XML is also desirable. Very good
programming skills in Java, C++ and a scripting language (such as Perl) are necessary to
complete the course projects. Students with a special interest and background in AI,
databases, data mining, information retrieval, machine learning or NLP are encouraged to join.

Text Book

Other Recommended Books

  • Mining the Web: Analysis of Hypertext and Semi Structured Data by Soumen Chakrabarti. Morgan Kaufmann Publishers; ISBN: 1558607544; 1st edition (August 15, 2002)
  • Pattern Classification (2nd Edition) by Richard O. Duda, Peter E. Hart, David G. Stork. Wiley-Interscience; ISBN: 0471056693; 2nd edition (October 2000)
  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer Series in Statistics) by T. Hastie, R. Tibshirani, J. H. Friedman


Each student or group of students must complete a project of their choice. I will propose some interesting
projects on MyASU, but students are encouraged to come up with relevant project proposals and discuss them
with me. The project should yield a working prototype implementation. The project will be graded in two
parts. First, students must submit a written proposal for their project detailing the problem formulation
and solution requirements. Next, students must submit a final project report detailing the proposed
solution (an algorithm) and a system architecture. We will also allocate slots for students to present
their projects and demonstrate their solutions.


Project proposal
Final Project Report
Homework and Quiz
Final Exam (Open Book)

Course Materials

The course will be conducted as a series of lectures by the instructor covering the background material
for the following reading list, student presentations from that list, invited speakers and
problem-solving sessions on the students' projects. Students are encouraged to meet with me as often
as they need to make progress on their projects.

An Extended Reading List and Software Packages

Semantic Web and RDF
  • Sean Palmer, "Semantic Web: An Introduction", 2001
  • Tim Berners-Lee, "Semantic Web Road Map", 1998
  • Pat Hayes, "Catching the Dreams", IEEE Intelligent Systems, 2002
  • Resource Description Framework (RDF) and RDFS Model and Syntax Specification
  • The Role of Frame-Based Representation on the Semantic Web
Web Agents - Regular Expressions
  • Generating Finite-State Transducers For Semi-Structured Data Extraction From The Web (1998)
  • A Scalable Comparison-Shopping Agent for the World-Wide Web
  • Wrapper Induction for Information Extraction (1997)
  • Wrapper Generation for Semi-structured Internet Sources - Ashish, Knoblock
  • Learning Information Extraction Rules for Semi-structured and Free Text (1999)
  • Mastering Regular Expressions, Powerful Techniques for Perl and Other Tools
  • Regular Expressions for Java
Information Extraction - Shallow and Deep Parsing, NLP
  • Building Domain-Specific Search Engines with Machine Learning Techniques (1999) - Andrew McCallum, Kamal Nigam, Jason Rennie, Kristie Seymore
  • Information Extraction Using Hidden Markov Models - Timothy Robert Leek (1997)
  • Learning Hidden Markov Model Structure for Information Extraction
  • Toward General-Purpose Learning for Information Extraction - Freitag (1998)
  • Multistrategy Learning for Information Extraction - Freitag (1998)
  • Introduction to Information Extraction Technology
  • SPIRIT: Sequential Pattern Mining with Regular Expression Constraints (1999)
Relational Rule Mining, ILP, F-Logic, Description Logic
Information Integration
  • Managing Web Data -- Tutorial by Dan Suciu
  • Linked Data
  • A Technical Introduction to XML
  • What is XSLT and XPATH?
  • Extended Path Expressions for XML
  • Efficiently Mining Frequent Trees in a Forest
  • Evaluating Structural Similarity in XML Documents
  • Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems, SIAM Journal on Computing, 18(6):1245-1262, Dec. 1989
  • Xerces/Xalan
  • JTidy
  • The HTTP Client Component
  • XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources (2000)
  • The Semantic Web: The roles of XML and RDF
  • DAML+OIL Ontology Markup, 2001
  • DAML+OIL: a Description Logic for the Semantic Web, download/2002/ieeede2002.pdf
  • Sean Bechhofer, Carole Goble, Ian Horrocks, "DAML+OIL is not Enough"
Ontologies - Creation, Mapping, Merging
  • Ontology Development 101
  • Ontology Learning for the Semantic Web
  • The Usable Ontology: An Environment for Building and Assessing a Domain Ontology, International Semantic Web Conference (ISWC), Sardinia, Italy, Lecture Notes in Computer Science, Springer-Verlag (2002)
  • Learning to Construct Knowledge Bases from the World Wide Web
  • Ontology-Based Extraction and Structuring of Information from Data-Rich Unstructured Documents
  • MindNet: Acquiring and Structuring Semantic Information from Text, Proceedings of ACL-Coling 1998, pp. 1098-1102
  • Automatic Acquisition of Domain Knowledge for Information Extraction
  • Automatic Segmentation of Text into Structured Records
Text Classification
Text Clustering and Summarization
Topic Detection
Sentiment Analysis
Streaming Data - Trend Detection
Recommendation Systems
Applications: BioInformatics
Applications: E-Commerce

List of Projects

Please refer to the Course HomePage at MyASU for a list of projects and relevant pointers for reading.

Related Links
