osact2

The 2nd Workshop on Arabic Corpora and Processing Tools

2016 Theme: Social Media

LREC2016

------------------------------------------------

Download Workshop Slides

Download Workshop Proceedings

Workshop description

Following the success of the first Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools at LREC 2014, where three of the presented papers have received 15 citations to date, the second workshop aims to encourage researchers and practitioners of Arabic computational linguistics (CL) and Arabic natural language processing (NLP) to share and discuss their research efforts, corpora, and tools. In addition to general topics of CL and NLP, the workshop will place special emphasis on Arabic social media text processing and applications.

Motivation

In the NLP and CL communities, Arabic is considered to be relatively resource-poor compared to English. This situation was thought to be the reason for the limited number of corpus-based studies in Arabic. However, recent years have witnessed the emergence of new free Modern Standard Arabic (MSA) corpora and, to a lesser extent, Arabic processing tools. Over the past few years, the use of Arabic in social media has increased dramatically, leading to an abundance of Arabic content that is formal or informal, MSA or dialectal, and written in Arabic script or Arabizi. Other phenomena include the use of emoticons, abbreviated words, decorations, etc. Despite the abundance of such content, there is a severe shortage of annotated corpora and processing tools that are tailored to it.

Available Arabic corpora can be divided into two groups. The first group contains large Arabic texts, which are designed and constructed primarily for Arabic linguistic and NLP research and can be useful for a variety of tasks such as language modeling. These corpora are diverse in the genres they cover, and their sizes range from one million words to billions of words. The second group contains corpora that were designed primarily for specific Arabic NLP tasks such as text classification, clustering, POS tagging, etc., and they typically contain annotations at the clitic, word, sentence, paragraph, or document level. Most of the currently available corpora in this group are composed of newspaper articles and range in size from tens of thousands of words to millions of words.

Annotated corpora that are derived from social media continue to be limited, and corpus processing tools for such corpora are lacking. Some of the required tools include corpus exploration tools that provide word/stem frequencies, concordances, collocations, etc., and processing tools for tasks such as tokenization, normalization, word segmentation, morphological analysis, and part-of-speech tagging. Having proper exploration and processing tools can open the door to a variety of applications such as machine translation, opinion mining, text classification, and a variety of social applications.

Topics of interest

This half-day workshop aims to encourage researchers and developers to foster the use of freely available Arabic corpora, including social media corpora, and open-source Arabic language processing tools, to help highlight the drawbacks of these resources, and to discuss techniques and approaches for improving them. The workshop topics include, but are not limited to:

Corpora

Surveying and critiquing the design of freely available Arabic corpora, their associated tools, and standalone Arabic corpora processing tools.
Making available new annotated corpora for NLP applications such as named entity recognition, machine translation, part-of-speech tagging, sentiment analysis, text classification, and language learning.
Evaluating the use of crowdsourcing platforms (e.g., Mechanical Turk, CrowdFlower) for Arabic data annotation.

Tools and Technologies

Language education, e.g., L1 and L2.
Language modeling and word embeddings.
Tokenization, normalization, word segmentation, morphological analysis, part-of-speech tagging, parsing, and diacritization.
Sentiment analysis, dialect identification, and text classification.
Dialect translation.

Social Applications

Trend analysis and opinion mining.
Measuring polarization and opinion shift.
Religious and ideological discourse.

Important Dates

Submission deadline: ~~10 February 2016~~ 17 February 2016
Notification of acceptance: 10 March 2016
Final submission of manuscripts: ~~21 March 2016~~ 30 March 2016
Workshop date: Tuesday, 24 May 2016 (Morning session)

Submissions

The language of the workshop is English, and submissions should follow the LREC 2016 paper submission instructions. All papers will be peer reviewed, possibly by three independent referees. Papers must be submitted electronically in PDF format through the START system. When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e., also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or that are a new result of their research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments, including evaluation ones.

The workshop is running a blind review process. In preparing your manuscript, do not include any information which could reveal your identity, or that of your co-authors. The title section of your manuscript should not contain any author names, email addresses, or affiliation status. If you do include any author names on the title page, your submission will be automatically rejected. In the body of your submission, you should eliminate all direct references to your own previous work. That is, avoid phrases such as "this contribution generalizes our results for XYZ". Also, please do not disproportionately cite your own previous work. In other words, make your submission as anonymous as possible. We need your cooperation in our effort to maintain a fair, blind reviewing process - and to consider all submissions equally.

Please use the following URL for submission

https://www.softconf.com/lrec2016/OSACT2/

Please use the following URL for the Authors' Kit

http://lrec2016.lrec-conf.org/en/submission/authors-kit/

Identify, Describe and Share your LRs!

Describing your LRs in the LRE Map is now a normal practice in the submission procedure of LREC (introduced in 2010 and adopted by other conferences). To continue the efforts initiated at LREC 2014 about "Sharing LRs" (data, tools, web-services, etc.), authors will have the possibility, when submitting a paper, to upload LRs in a special LREC repository. This effort of sharing LRs, linked to the LRE Map for their description, may become a new "regular" feature for conferences in our field, thus contributing to creating a common repository where everyone can deposit and share data.

As scientific work requires accurate citations of referenced work so as to allow the community to understand the whole context and also replicate the experiments conducted by other researchers, LREC 2016 endorses the need to uniquely Identify LRs through the use of the International Standard Language Resource Number (ISLRN, www.islrn.org), a Persistent Unique Identifier to be assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC papers will be offered at submission time.

Organizing Committee

Hend Al-Khalifa, King Saud University, KSA
Abdulmohsen Al-Thubaity, King Abdul Aziz City for Science and Technology, KSA
Walid Magdy, Qatar Computing Research Institute, Qatar
Kareem Darwish, Qatar Computing Research Institute, Qatar

Program Committee

Abdulrhman Almuhareb, KACST, KSA
Abdullah Alfaifi, Imam University, KSA
Abeer ALDayel, King Saud University, KSA
Areeb AlOwisheq, Imam University, KSA
Auhood Alfaries, King Saud University, KSA
Hamdy Mubarak, Qatar Computing Research Institute, Qatar
Hazem Hajj, American University of Beirut, Lebanon
Hind Al-Otaibi, King Saud University, KSA
Houda Bouamor, Carnegie Mellon University, Qatar
Kemal Oflazer, Carnegie Mellon University, Qatar
Khurshid Ahmad, Trinity College Dublin, Ireland
Maha Alrabiah, Imam University, KSA
Mohammad Alkanhal, KACST, KSA
Mohsen Rashwan, Cairo University, Egypt
Mona Diab, George Washington University, US
Muhammad M. Abdul-Mageed, Indiana University, US
Nizar Habash, New York University Abu Dhabi, UAE
Nora Al-Twairesh, King Saud University, KSA
Nouf Al-Shenaifi, King Saud University, KSA
Stephan Vogel, Qatar Computing Research Institute, Qatar
Tamer Elsayed, Qatar University, Qatar
Wajdi Zaghouani, Carnegie Mellon University in Qatar, Qatar

Workshop Programme

09:00 – 09:20 – Welcome and Introduction by Workshop Chairs

09:20 – 10:30 – Session 1 (Keynote speech)

Nizar Habash, Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

10:30 – 11:00 – Coffee break

11:00 – 13:00 – Session 2

Soumia Bougrine, Hadda Cherroun, Djelloul Ziadi, Abdallah Lakhdari and Aicha Chorana, Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects
Maha Alamri and William John Teahan, Towards a New Arabic Corpus of Dyslexic Texts
Ossama Obeid, Houda Bouamor, Wajdi Zaghouani, Mahmoud Ghoneim, Abdelati Hawwari, Mona Diab and Kemal Oflazer, MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
Wajdi Zaghouani and Dana Awad, Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
Muhammad Abdul-Mageed, Hassan Alhuzali, Dua'a Abu-Elhij'a and Mona Diab, DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis
Nora Al-Twairesh, Mawaheb Al-Tuwaijri, Afnan Al-Moammar and Sarah Al-Humoud, Arabic Spam Detection in Twitter

Workshop Venue

The workshop will take place at the Grand Hotel Bernardin Conference Center. Kindly check the workshop schedule.

Useful Resources

#	Free Arabic Corpora	Size
1	King Abdulaziz City for Science and Technology Arabic Corpus	700MB
2	Leeds Internet Arabic Corpora	317MB
3	arabiCorpus	173MB
4	King Saud University Corpus of Classical Arabic	50MB
5	KALIMAT Corpus	18MB
6	A Corpus of Arabic newspapers	2.5MB
7	The corpus of contemporary Arabic	1MB
8	The Open Source Arabic Corpus (OSAC)	18MB
9	KACST text classification corpus	11MB
10	Alwatan-2004	10MB
11	Akhbar Al Khaleej 2004	3MB

#	Arabic Corpora Processing Tools
1	aConCorde
2	Khawas

Contact Us

For further information, please contact us at osact@kacst.edu.sa
