clarinsi/ParlaMint-Annotation-with-CAP-Topics

Annotation of the ParlaMint corpora with the CAP classifier

This repository provides code for automatically annotating corpora in the ParlaMint TXT format (e.g., the corpora from ParlaMint 4.1) with CAP policy agenda labels. It uses an openly available classifier, the ParlaCAP classifier.

For more information on the development of the ParlaCAP classifier, its evaluation and downstream application on the ParlaMint corpora, see the paper here.

This repository was developed as part of the ParlaCAP project.

Authors of this work: Taja Kuzman, Nikola Ljubešić, and Daniela Širinić.

Setting up the environment

  • Python version: Python 3.9.19
  • Requirements: see requirements.txt
conda create --name parlacap_annotation python=3.9.19
conda activate parlacap_annotation
pip install -r requirements.txt

# Install the torch version that is compatible with your GPU system
pip install torch==1.12.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html

# Install ipykernel for Jupyter Notebook
conda install -n parlacap_annotation ipykernel --update-deps --force-reinstall

For a decent annotation speed, the annotation needs to be run on a GPU machine (e.g., we used 1 NVIDIA A100 40GB, 128 cores, 2TB RAM).

Step-by-step Instructions

  1. Create the following directories with subdirectories:

    • datasets
      • annotated
      • final-formats
      • initial-datasets
      • initial-prepared
      • zipped
    • logs
  2. Save the datasets to the datasets/initial-datasets directory (we use ParlaMint 4.1 from the CLARIN.SI repository) - follow the ParlaMint-BA.txt example.

    • Unzip all datasets: for file in *.tgz; do tar -xzf "$file"; done (on the command line, from inside the datasets/initial-datasets directory)
    • Delete all files and directories that do not end with .txt: rm -rf *.tgz, rm -rf *.TEI, rm -rf *.md, rm -rf ParlaMint-4.1 (on the command line, from inside the datasets/initial-datasets directory)
  3. Prepare the datasets - transform them into JSONL format with speech ID, text and file path as attributes - see the code in 1-prepare-initial-datasets.ipynb. The prepared datasets are saved to datasets/initial-prepared

  4. Predict CAP topics:

    • CUDA_VISIBLE_DEVICES=0 nohup python -u 2-apply-classifier.py "BA" > logs/cap-prediction-BA.log & (replace BA with the code of another ParlaMint dataset to annotate that dataset)
      • as a batch job: nohup bash annotate_pipeline.sh > logs/annotate_pipeline.log &
  5. Analyse the annotated dataset and prepare the final format - see the code in 3-prepare-final-format.py

    • run the batch script: nohup bash prepare-final-datasets.sh > logs/preparing-final-datasets-pipeline.log & (the language codes, e.g., "BA", need to be defined inside the script)
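
Steps 1 and 3 above can also be scripted. The following is a minimal sketch, not the code from 1-prepare-initial-datasets.ipynb: it creates the directory layout from step 1 and converts one TXT file into JSONL. It assumes that each non-empty line of a ParlaMint TXT file holds a speech ID and the speech text separated by a tab. The function names and the JSONL keys (speech_id, text, file_path) are hypothetical.

```python
import json
from pathlib import Path

# Directory layout from step 1 of the instructions.
SUBDIRS = ["annotated", "final-formats", "initial-datasets", "initial-prepared", "zipped"]


def create_layout(root="."):
    """Create datasets/<subdir> and logs under root (no error if they exist)."""
    root = Path(root)
    for name in SUBDIRS:
        (root / "datasets" / name).mkdir(parents=True, exist_ok=True)
    (root / "logs").mkdir(parents=True, exist_ok=True)


def txt_to_jsonl(txt_path, jsonl_path):
    """Write one JSON object per speech and return the number of speeches written.

    Assumption: each non-empty line is <speech ID><TAB><text>.
    """
    count = 0
    with open(txt_path, encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            speech_id, _, text = line.partition("\t")
            record = {"speech_id": speech_id, "text": text, "file_path": str(txt_path)}
            # ensure_ascii=False keeps non-ASCII characters readable in the output
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The notebook in this repository may use different attribute names or handle additional metadata, so check it before relying on this layout downstream.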

Labels

We use 21 CAP major topics plus the categories "Other" and "Mix", for a total of 23 labels.

We added the label "Other" ("0") to the list of labels. The label descriptions are based on the list of subtopics and their descriptions in the CAP mastercode, extended using the Croatian CAP guidelines and expert support.

During annotation, an instance predicted with a model confidence lower than 0.60 is annotated as "Mix".
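
The thresholding rule can be sketched as follows. This is an illustration only, not the code from 2-apply-classifier.py: it assumes that the confidence is the highest softmax probability over the classifier's output scores, and the function and variable names are hypothetical.

```python
import math

MIX_THRESHOLD = 0.60  # threshold stated in this README


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def assign_label(logits, label_names, threshold=MIX_THRESHOLD):
    """Return (label, confidence); the label is "Mix" when confidence < threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    label = label_names[best] if confidence >= threshold else "Mix"
    return label, confidence


labels = ["Health", "Education", "Other"]
print(assign_label([5.0, 0.0, 0.0], labels))  # one clearly dominant score keeps its label
print(assign_label([1.0, 1.0, 1.0], labels))  # evenly spread scores fall back to "Mix"
```

Under this rule, a prediction with a confidence of exactly 0.60 keeps its predicted label, since only values lower than the threshold are replaced.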

Labels:

majortopics = {
 1: 'Macroeconomics',
 2: 'Civil Rights',
 3: 'Health',
 4: 'Agriculture',
 5: 'Labor',
 6: 'Education',
 7: 'Environment',
 8: 'Energy',
 9: 'Immigration',
 10: 'Transportation',
 12: 'Law and Crime',
 13: 'Social Welfare',
 14: 'Housing',
 15: 'Domestic Commerce',
 16: 'Defense',
 17: 'Technology',
 18: 'Foreign Trade',
 19: 'International Affairs',
 20: 'Government Operations',
 21: 'Public Lands',
 23: 'Culture',
 0: 'Other',
 }

Label description:

majortopics_description = {
 'Macroeconomics - issues related to domestic macroeconomic policy, such as the state and prospect of the national economy, economic policy, inflation, interest rates, monetary policy, cost of living, unemployment rate, national budget, public debt, price control, tax enforcement, industrial revitalization and growth.': 1,
 'Civil Rights - issues related to civil rights and minority rights, discrimination towards races, gender, sexual orientation, handicap, and other minorities, voting rights, freedom of speech, religious freedoms, privacy rights, protection of personal data, abortion rights, anti-government activity groups (e.g., local insurgency groups), religion and the Church.': 2,
 'Health - issues related to health care, health care reforms, health insurance, drug industry, medical facilities, medical workers, disease prevention, treatment, and health promotion, drug and alcohol abuse, mental health, research in medicine, medical liability and unfair medical practices.': 3,
 'Agriculture - issues related to agriculture policy, fishing, agricultural foreign trade, food marketing, subsidies to farmers, food inspection and safety, animal and crop disease, pest control and pesticide regulation, welfare for animals in farms, pets, veterinary medicine, agricultural research.': 4,
 'Labor - issues related to labor, employment, employment programs, employee benefits, pensions and retirement accounts, minimum wage, labor law, job training, labor unions, worker safety and protection, youth employment and seasonal workers.': 5,
 'Education - issues related to educational policies, primary and secondary schools, student loans and education finance, the regulation of colleges and universities, school reforms, teachers, vocational training, evening schools, safety in schools, efforts to improve educational standards, and issues related to libraries, dictionaries, teaching material, research in education.': 6,
 'Environment - issues related to environmental policy, drinking water safety, all kinds of pollution (air, noise, soil), waste disposal, recycling, climate change, outdoor environmental hazards (e.g., asbestos), species and forest protection, marine and freshwater environment, hunting, regulation of laboratory or performance animals, land and water resource conservation, research in environmental technology.': 7,
 'Energy - issues related to energy policy, electricity, regulation of electrical utilities, nuclear energy and disposal of nuclear waste, natural gas and oil, drilling, oil spills, oil and gas prices, heat supply, shortages and gasoline regulation, coal production, alternative and renewable energy, energy conservation and energy efficiency, energy research.': 8,
 'Immigration - issues related to immigration, refugees, and citizenship, integration issues, regulation of residence permits, asylum applications; criminal offences and diseases caused by immigration.': 9,
 'Transportation - issues related to mass transportation construction and regulation, bus transport, regulation related to motor vehicles, road construction, maintenance and safety, parking facilities, traffic accidents statistics, air travel, rail travel, rail freight, maritime transportation, inland waterways and channels, transportation research and development.': 10,
 'Law and Crime - issues related to the control, prevention, and impact of crime; all law enforcement agencies, including border and customs, police, court system, prison system; terrorism, white collar crime, counterfeiting and fraud, cyber-crime, drug trafficking, domestic violence, child welfare, family law, juvenile crime.': 12,
 'Social Welfare - issues related to social welfare policy, the Ministry of Social Affairs, social services, poverty assistance for low-income families and for the elderly, parental leave and child care, assistance for people with physical or mental disabilities, including early retirement pension, discounts on public services, volunteer associations (e.g., Red Cross), charities, and youth organizations.': 13,
 'Housing - issues related to housing, urban affairs and community development, housing market, property tax, spatial planning, rural development, location permits, construction inspection, illegal construction, industrial and commercial building issues, national housing policy, housing for low-income individuals, rental housing, housing for the elderly, e.g., nursing homes, housing for the homeless and efforts to reduce homelessness, research related to housing.': 14,
 'Domestic Commerce - issues related to banking, finance and internal commerce, including stock exchange, investments, consumer finance, mortgages, credit cards, insurance availability and cost, accounting regulation, personal, commercial, and municipal bankruptcies, programs to promote small businesses, copyrights and patents, intellectual property, natural disaster preparedness and relief, consumer safety; regulation and promotion of tourism, sports, gambling, and personal fitness; domestic commerce research.': 15,
 'Defense - issues related to defense policy, military intelligence, espionage, weapons, military personnel, reserve forces, military buildings, military courts, nuclear weapons, civil defense, including firefighters and mountain rescue services, homeland security, military aid or arms sales to other countries, prisoners of war and collateral damage to civilian populations, military nuclear and hazardous waste disposal and military environmental compliance, defense alliances and agreements, direct foreign military operations, claims against military, defense research.': 16,
 'Technology - issues related to science and technology transfer and international science cooperation, research policy, government space programs and space exploration, telephones and telecommunication regulation, broadcast media (television, radio, newspapers, films), weather forecasting, geological surveys, computer industry, cyber security.': 17,
 'Foreign Trade - issues related to foreign trade, trade negotiations, free trade agreements, import regulation, export promotion and regulation, subsidies, private business investment and corporate development, competitiveness, exchange rates, the strength of national currency in comparison to other currencies, foreign investment and sales of companies abroad.': 18,
 'International Affairs - issues related to international affairs, foreign policy and relations to other countries, issues related to the Ministry of Foreign Affairs, foreign aid, international agreements (such as Kyoto agreement on the environment, the Schengen agreement), international organizations (including United Nations, UNESCO, International Olympic Committee, International Criminal Court), NGOs, issues related to diplomacy, embassies, citizens abroad; issues related to border control; issues related to international finance, including the World Bank and International Monetary Fund, the financial situation of the EU; issues related to a foreign country that do not impact the home country; issues related to human rights in other countries, international terrorism.': 19,
 'Government Operations - issues related to general government operations, the work of multiple departments, public employees, postal services, nominations and appointments, national mints, medals, and commemorative coins, management of government property, government procurement and contractors, public scandal and impeachment, claims against the government, the state inspectorate and audit, anti-corruption policies, regulation of political campaigns, political advertising and voter registration, census and statistics collection by government; issues related to local government, capital city and municipalities, including decentralization; issues related to national holidays.': 20,
 'Public Lands - issues related to national parks, memorials, historic sites, and protected areas, including the management and staffing of cultural sites; museums; use of public lands and forests, establishment and management of harbors and marinas; issues related to flood control, forest fires, livestock grazing.': 21,
 'Culture - issues related to cultural policies, Ministry of Culture, public spending on culture, cultural employees, issues related to support of theatres and artists; allocation of funds from the national lottery, issues related to cultural heritage': 23,
 'Other - other topics not mentioning policy agendas, including the procedures of parliamentary meetings, e.g., points of order, voting procedures, meeting logistics; interpersonal speech, e.g., greetings, personal stories, tributes, interjections, arguments between the members; rhetorical speech, e.g., jokes, literary references.': 0
 }
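
Since majortopics maps each code to a label name while majortopics_description maps each description to a code, the two dictionaries can be cross-checked. The sketch below uses a two-entry excerpt of each dictionary, with descriptions truncated for brevity, and a hypothetical helper name. It relies on the observation that every description above starts with the label name followed by " - ".

```python
# Shortened excerpts of the two dictionaries above (descriptions truncated for brevity).
majortopics_excerpt = {3: "Health", 0: "Other"}
description_excerpt = {
    "Health - issues related to health care, health care reforms, health insurance.": 3,
    "Other - other topics not mentioning policy agendas.": 0,
}


def descriptions_by_code(descriptions):
    """Invert {description: code} into {code: (label name, description)}."""
    inverted = {}
    for description, code in descriptions.items():
        # The label name is the text before the first " - " separator.
        name, _, _ = description.partition(" - ")
        inverted[code] = (name, description)
    return inverted


by_code = descriptions_by_code(description_excerpt)
# Collect codes whose label name differs between the two dictionaries.
mismatches = [
    code for code, (name, _) in by_code.items()
    if majortopics_excerpt.get(code) != name
]
print(mismatches)
```

Running the same check on the full dictionaries is a quick way to catch a code or label name that was changed in one dictionary but not the other.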
