Instructor: Dr. Doina Caragea (dcaragea@ksu.edu)
Meeting Time: TU 11:30am-12:45pm in Nichols 236
Prerequisites: Basic knowledge of probability and statistics, data structures, and algorithms. Prior knowledge of either Java or C++.
Target audience: Graduate students in Computer Science and related areas.
Course Description:
Information Retrieval (IR) refers to the processing, indexing, and querying of unstructured or loosely structured information. This course will focus on the theory and practice of search engines for retrieving textual information (including web documents). Basic and advanced topics in IR will be covered, with emphasis on newer technologies that go beyond simple keyword search. Programming assignments will provide hands-on experience with retrieval systems. A class project will encourage more advanced research in IR.
Course Objectives:
Topics:
Topics covered in this course include efficient text indexing, latent semantic indexing, Boolean and vector space retrieval models, probabilistic retrieval models (binary independence models and language models), evaluation, web spiders, link analysis (PageRank and HITS), relevance feedback and query expansion, text mining and categorization, topic detection and clustering, and query answering.
Textbook:
Introduction to Information Retrieval
Part of the course will draw on material from other books and recent research papers.
Other Relevant Books:
Information Retrieval: Algorithms and Heuristics
There will be several individual assignments, several reading assignments and paper presentations, and a project (individual or group). There are no exams. Students will be required to write summaries of the assigned readings and to present some of the application papers, in particular papers related to their project topic. Individual assignments and the project will be evaluated on design and analysis, writing, and critical thinking. Projects will be evaluated by peers and by the instructor, based on a final project presentation and a project report. Students are encouraged to attend every lecture and participate in class discussion.
Grading:
Assignments: 30% (individual)
Presentations: 20% (individual or team)
Project: 40% (individual or team)
Participation: 10%
Text Mining, Web Mining, Information Retrieval and Extraction from the WWW References
Information Retrieval and Text Mining
Text Analysis, Text Mining, and Information Retrieval Software
Video lectures on machine learning
WWW
IJCAI
AAAI
AAAI05 Semantic Web, Information Retrieval, and Extraction
AAAI06 Special Track on Artificial Intelligence and the Web
AAAI07 Special Track on Artificial Intelligence and the Web
AAAI08 Special Track on Artificial Intelligence and the Web
ISWC
Some Workshops: SIGIR Workshop, NIPS Workshop
V. Bush, As we may think, Atlantic Monthly, 176 (1945), pp.101-108.
“Recommended reading for IR research students” by Moffat, Zobel, and Hawking. Appeared in the SIGIR Forum, 39(2), 2005. A list of readings collected by attendees at the SWIRL 2004 workshop held in Lorne, Australia. Available on-line in PDF.
Rose, D. E. and Levinson, D. 2004. Understanding user goals in web search. In Proceedings of the 13th International Conference on World Wide Web (New York, NY, USA, May 17-20, 2004). WWW '04. ACM, New York, NY, 13-19.
Monika R. Henzinger, Hyperlink Analysis for the Web, IEEE Internet Computing, vol. 5, no. 1, pp. 45-50, Jan/Feb 2001.
BrowseRank: Letting Web Users Vote for Page Importance by Yuting Liu, Bin Gao, Tie-Yan Liu, et al., SIGIR 2008. (SIGIR Best Student Paper Award).
Using latent semantic analysis to find different names for the same entity in free text by Tim Oates, Vinay Bhat, Vishal Shanbhag, Charles Nicholas.
Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis by Evgeniy Gabrilovich and Shaul Markovitch, Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.
Ziv Bar-Yossef, Maxim Gurevich, Random Sampling from a Search Engine's Index, Best Paper Award, World Wide Web Conference (WWW) 2007.
S. Pandit, C. Olston, Navigation-Aided Retrieval, World Wide Web Conference (WWW) 2007.
Z. Bar-Yossef, M. Gurevich, Efficient Search Engine Measurements, World Wide Web Conference (WWW) 2007.
Adomavicius G. and Tuzhilin A., Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, IEEE Trans. on Knowledge and Data Engineering, Vol. 17, No. 6, 2005.
Das A., Datar M., Garg A., Rajaram S., Google News Personalization: Scalable Online Collaborative Filtering, World Wide Web Conference (WWW) 2007.
Krallinger M, Valencia A: Text-mining and information-retrieval services for molecular biology. Genome Biol 2005, 6:224.
Yamamoto Y, Takagi T: Biomedical knowledge navigation by literature clustering. J Biomed Inform 2007, 40:114-130.
Mary Shultz (2007) Comparing test searches in PubMed and Google Scholar. J Med Libr Assoc. 2007 October; 95(4): 442–445.
Hunter L, Cohen KB. Biomedical Language Processing: What’s Beyond PubMed? Molecular Cell 21(5):589-594, March 3, 2006.
Martin Krallinger, Alfonso Valencia and Lynette Hirschman (2008) Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biology 2008, 9(Suppl 2):S8.
Building Domain-Specific Search Engines with Machine Learning Techniques. Andrew McCallum, Kamal Nigam, Jason Rennie and Kristie Seymore. AAAI-99 Spring Symposium.
Shilpy Sharma (2008) Information Retrieval in Domain Specific Search Engine with Machine Learning Approaches
Mark D. Smucker and James Allan. "Find-Similar: Similarity Browsing as a Search Tool," SIGIR, 2006.
Jimmy Lin and Mark D. Smucker. "How Do Users Find Things with PubMed? Towards Automatic Utility Evaluation with User Simulations," Technical Report LAMP-TR-148/HCIL-2008-07, University of Maryland, College Park, February 2008.
Christopher J. C. Burges, Learning to Rank for Web Search: Some New Directions. Keynote talk.
Xing Wei, Bruce Croft and Andrew McCallum (2006) Table extraction for answer retrieval. Information Retrieval Journal (IRJ), volume 9, issue 5, pages 589-611, November 2006.
K Punera, S Rajan and J Ghosh, "Automatic Construction of N-ary Tree Based Taxonomies", MSD Workshop at ICDM '06.
Suju Rajan, Kunal Punera and Joydeep Ghosh, "A Maximum Likelihood Framework for Integrating Taxonomies", AAAI-05, pp. 856-861.
Aarti Gupta, Tim Oates: Using Ontologies and the Web to Learn Lexical Semantics. IJCAI 2007: 1618-1623.
Silva, R., Heller, K.A., and Ghahramani, Z. (2007) Analogical Reasoning with Relational Bayesian Sets. In the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS-2007). San Juan, Puerto Rico.
Ghahramani, Z. and Heller, K.A. (2006) Bayesian Sets. In Advances in Neural Information Processing Systems 18 (NIPS-2005).
Heller, K.A. and Ghahramani, Z. (2006) A Simple Bayesian Framework for Content-Based Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR-2006).
Pranam Kolari et al., Detecting Spam Blogs: A Machine Learning Approach, Proceedings of the 21st National Conference on Artificial Intelligence (AAAI 2006), July 2006.
Zareen Syed et al., Wikipedia as an Ontology for Describing Documents, Proceedings of the Second International Conference on Weblogs and Social Media, March 2008.
Haizheng Zhang, W. Bruce Croft, Brian Levine, and Victor Lesser; A Multi-Agent Approach for Peer-to-Peer based Information Retrieval System, Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2004)
A. Sieg, B. Mobasher, R. Burke. Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search. In IEEE Intelligent Informatics Bulletin, Vol.8, No. 1, November 2007.
A. Sieg, B. Mobasher, R. Burke. Web Search Personalization with Ontological User Profiles. Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM 2007) Lisboa, Portugal, November 2007.
Xuerui Wang, Andrew McCallum and Xing Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007.
Lightly-Supervised Attribute Extraction for Web Search. Kedar Bellare, Partha Pratim Talukdar, Giridhar Kumaran, Fernando Pereira, Mark Liberman, Andrew McCallum and Mark Dredze. NIPS Workshop on Machine Learning for Web Search, (NIPS WS), 2007.
Yi, X. and Allan, J. (2008) "Evaluating Topic Models for Information Retrieval," to appear in the Proceedings of CIKM 2008, Napa Valley, CA, October 26-30, 2008.
Kumaran, G. and Allan, J. (2008) "Effective and Efficient User Interaction for Long Queries," Proceedings of the 31st Annual International ACM SIGIR Conference, pp. 11-18.
R. Nallapati, A. Ahmed, E. P. Xing, and W. Cohen, Sparse Feature Joint Latent Topic Models for Text and Citations, Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008).
Community-based Link Prediction with Text. David Mimno, Hanna M. Wallach and Andrew McCallum. In Proceedings of the NIPS 2007 Workshop on Statistical Network Modeling (NIPS WS), 2007.
Zhenyu Liu and Wesley W. Chu "Knowledge-Based Query Expansion to Support Scenario-Specific Retrieval of Medical Free Text", Journal of Information Retrieval, special issue on Contextual Information Retrieval, 2007.
Kleinberg J. (1998). Authoritative Sources in a Hyperlinked Environment. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. Also appears as IBM Research Report RJ 10076, May 1997.
Brin S. and Page L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the Seventh World Wide Web Conference (WWW7), Brisbane. Also in a special issue of the journal Computer Networks and ISDN Systems, Volume 30, issues 1-7.
Bharat K., & Henzinger M.R. (1998). Improved Algorithms for Topic Distillation in a Hyperlinked Environment. Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia.
Chakrabarti S., van den Berg M., Dom B. (1999). Focused crawling: a new approach to topic-specific Web resource discovery. In Proceedings of the 8th International World Wide Web Conference (WWW8). Also in Computer Networks & ISDN Systems 31(11-16):1623-1640.