
The 2nd Workshop on Arabic Corpora and Processing Tools

2016 Theme: Social Media

LREC2016

------------------------------------------------

Download Workshop Slides

Download Workshop Proceedings

 


Workshop description

Following the success of the first Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools at LREC 2014, where three of the presented papers have received 15 citations to date, the second workshop aims to encourage researchers and practitioners of Arabic computational linguistics (CL) and Arabic natural language processing (NLP) to share and discuss their research efforts, corpora, and tools. In addition to general topics in CL and NLP, the workshop will place special emphasis on Arabic social media text processing and applications.

Motivation

In the NLP and CL communities, Arabic is considered relatively resource-poor compared to English. This was thought to be the reason for the limited number of corpus-based studies of Arabic. However, recent years have witnessed the emergence of a considerable number of new, free Modern Standard Arabic (MSA) corpora and, to a lesser extent, Arabic processing tools.

Over the past few years, the use of Arabic in social media has increased dramatically, leading to an abundance of Arabic content that may be formal or informal, MSA or dialectal, and written in Arabic script or Arabizi. Other phenomena include the use of emoticons, abbreviated words, decorations, etc. Despite the abundance of such content, there is a severe shortage of annotated corpora and processing tools tailored to it.

Available Arabic corpora can be divided into two groups. The first group contains large Arabic texts that are designed and constructed mainly for Arabic linguistic and NLP research and are useful for a variety of tasks such as language modeling. These corpora are diverse in the genres they cover, and their sizes range from one million words to billions of words. The second group contains corpora designed mainly for specific Arabic NLP tasks such as text classification, clustering, and POS tagging; they typically contain annotations at the clitic, word, sentence, paragraph, or document level. Most of the currently available corpora in this group are composed of newspaper articles and range in size from tens of thousands of words to millions of words.

Annotated corpora derived from social media continue to be limited, and processing tools for such corpora are lacking. The required tools include corpus exploration tools that provide word/stem frequencies, concordances, collocations, etc., and processing tools for tokenization, normalization, word segmentation, morphological analysis, and part-of-speech tagging. Having proper exploration and processing tools can open the door to a variety of applications such as machine translation, opinion mining, text classification, and social applications.

Topics of interest

This half-day workshop aims to encourage researchers and developers to foster the use of freely available Arabic corpora, including social media corpora, and open-source Arabic language processing tools, to help highlight the drawbacks of these resources, and to discuss techniques and approaches for improving them. The workshop topics include, but are not limited to:

Corpora

  • Surveying and critiquing the design of freely available Arabic corpora, their associated tools, and stand-alone Arabic corpus processing tools.
  • Making available new annotated corpora for NLP applications such as named entity recognition, machine translation, part-of-speech tagging, sentiment analysis, text classification, and language learning.

  • Evaluating the use of crowdsourcing platforms (e.g., Mechanical Turk, CrowdFlower) for Arabic data annotation.

Tools and Technologies

  • Language education, e.g., L1 and L2.

  • Language modeling and word embeddings.

  • Tokenization, normalization, word segmentation, morphological analysis, part-of-speech tagging, parsing, diacritization

  • Sentiment analysis, dialect identification, and text classification

  • Dialect translation

Social Applications

  • Trend analysis and opinion mining.

  • Measuring polarization and opinion shift.

  • Religious and ideological discourse


Important Dates

  • Submission deadline: 17 February 2016 (extended from 10 February 2016)

  • Notification of acceptance: 10 March 2016

  • Final submission of manuscripts: 30 March 2016 (extended from 21 March 2016)

  • Workshop date: Tuesday, 24 May 2016 (Morning session)


Submissions

The language of the workshop is English, and submissions should follow the LREC 2016 paper submission instructions. All papers will be peer reviewed, possibly by three independent referees. Papers must be submitted electronically in PDF format through the START system. When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments, including evaluation experiments.

The workshop is running a blind review process. In preparing your manuscript, do not include any information which could reveal your identity, or that of your co-authors. The title section of your manuscript should not contain any author names, email addresses, or affiliation status. If you do include any author names on the title page, your submission will be automatically rejected. In the body of your submission, you should eliminate all direct references to your own previous work. That is, avoid phrases such as "this contribution generalizes our results for XYZ". Also, please do not disproportionately cite your own previous work. In other words, make your submission as anonymous as possible. We need your cooperation in our effort to maintain a fair, blind reviewing process - and to consider all submissions equally.  

Please use the following URL for submission:

https://www.softconf.com/lrec2016/OSACT2/

Please use the following URL for the Authors' Kit:

http://lrec2016.lrec-conf.org/en/submission/authors-kit/

 

Identify, Describe and Share your LRs!

Describing your LRs in the LRE Map is now a normal practice in the submission procedure of LREC (introduced in 2010 and adopted by other conferences). To continue the efforts initiated at LREC 2014 about "Sharing LRs" (data, tools, web-services, etc.), authors will have the possibility, when submitting a paper, to upload LRs in a special LREC repository. This effort of sharing LRs, linked to the LRE Map for their description, may become a new "regular" feature for conferences in our field, thus contributing to creating a common repository where everyone can deposit and share data.

As scientific work requires accurate citations of referenced work so as to allow the community to understand the whole context and also replicate the experiments conducted by other researchers, LREC 2016 endorses the need to uniquely identify LRs through the use of the International Standard Language Resource Number (ISLRN, www.islrn.org), a Persistent Unique Identifier to be assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC papers will be offered at submission time.

 


Organizing Committee

  • Hend Al-Khalifa, King Saud University, KSA
  • Abdulmohsen Al-Thubaity, King Abdulaziz City for Science and Technology, KSA
  • Walid Magdy, Qatar Computing Research Institute, Qatar
  • Kareem Darwish, Qatar Computing Research Institute, Qatar

Program Committee

  • Abdulrhman Almuhareb, KACST, KSA
  • Abdullah Alfaifi, Imam University, KSA
  • Abeer ALDayel, King Saud University, KSA
  • Areeb AlOwisheq, Imam University, KSA
  • Auhood Alfaries, King Saud University, KSA
  • Hamdy Mubarak, Qatar Computing Research Institute, Qatar
  • Hazem Hajj, American University of Beirut, Lebanon
  • Hind Al-Otaibi, King Saud University, KSA
  • Houda Bouamor, Carnegie Mellon University, Qatar
  • Kemal Oflazer, Carnegie Mellon University, Qatar
  • Khurshid Ahmad, Trinity College Dublin, Ireland
  • Maha Alrabiah, Imam University, KSA
  • Mohammad Alkanhal, KACST, KSA
  • Mohsen Rashwan, Cairo University, Egypt
  • Mona Diab, George Washington University, US
  • Muhammad M. Abdul-Mageed, Indiana University, US
  • Nizar Habash, New York University Abu Dhabi, UAE
  • Nora Al-Twairesh, King Saud University, KSA
  • Nouf Al-Shenaifi, King Saud University, KSA
  • Stephan Vogel, Qatar Computing Research Institute, Qatar
  • Tamer Elsayed, Qatar University, Qatar
  • Wajdi Zaghouani, Carnegie Mellon University in Qatar, Qatar

Workshop Programme

09:00 – 09:20 – Welcome and Introduction by Workshop Chairs

09:20 – 10:30 – Session 1 (Keynote speech)

  • Nizar Habash, Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions

10:30 – 11:00 – Coffee break

11:00 – 13:00 – Session 2

  • Soumia Bougrine, Hadda Cherroun, Djelloul Ziadi, Abdallah Lakhdari and Aicha Chorana, Toward a rich Arabic Speech Parallel Corpus for Algerian sub-Dialects
  • Maha Alamri and William John Teahan, Towards a New Arabic Corpus of Dyslexic Texts
  • Ossama Obeid, Houda Bouamor, Wajdi Zaghouani, Mahmoud Ghoneim, Abdelati Hawwari, Mona Diab and Kemal Oflazer, MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
  • Wajdi Zaghouani and Dana Awad, Toward an Arabic Punctuated Corpus: Annotation Guidelines and Evaluation
  • Muhammad Abdul-Mageed, Hassan Alhuzali, Dua'a Abu-Elhij'a and Mona Diab, DINA: A Multi-Dialect Dataset for Arabic Emotion Analysis
  • Nora Al-Twairesh, Mawaheb Al-Tuwaijri, Afnan Al-Moammar and Sarah Al-Humoud, Arabic Spam Detection in Twitter

 


Workshop Venue

The workshop will take place at the Grand Hotel Bernardin Conference Center. Please check the workshop schedule.


Useful Resources

 

Free Arabic Corpora

  No.  Corpus                                                        Size
  1    King Abdulaziz City for Science and Technology Arabic Corpus  700MB
  2    Leeds Internet Arabic Corpora                                 317MB
  3    arabiCorpus                                                   173MB
  4    King Saud University Corpus of Classical Arabic               50MB
  5    KALIMAT Corpus                                                18MB
  6    A Corpus of Arabic newspapers                                 2.5MB
  7    The corpus of contemporary Arabic                             1MB
  8    The Open Source Arabic Corpus (OSAC)                          18MB
  9    KACST text classification corpus                              11MB
  10   Alwatan-2004                                                  10MB
  11   Akhbar Al Khaleej 2004                                        3MB

Arabic Corpora Processing Tools

  No.  Tool
  1    aConCorde
  2    Khawas


Contact Us

For further information, please contact us at osact@kacst.edu.sa