Commit d8fffa8

Merge pull request #11 from mdredze/jack

2 parents: d369fb9 + 3445d2d
22 files changed: +39 −442747 lines
README.md

Lines changed: 39 additions & 40 deletions
@@ -18,43 +18,42 @@ To run the Carmen frontend, see:

    $ python -m carmen.cli --help

Removed:

### Geonames Mapping

Alternatively, `locations.json` can be swapped out to use Geonames IDs
instead of the arbitrary IDs used in the original version of Carmen. This
JSON file can be found in `carmen/data/new.json`.

Below are instructions on how the mappings can be generated.

First, we need to get the data, which can be found at
http://download.geonames.org/export/dump/. The required files are
`countryInfo.txt`, `admin1CodesASCII.txt`, `admin2Codes.txt`, and
`cities1000.txt`. Download these files and move them into
`carmen/data/dump/`.

Next, we need to format our data. Simply delete the comments in
`countryInfo.txt`, then run the following:

    $ python3 format_admin1_codes.py
    $ python3 format_admin2_codes.py
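The "delete the comments" step can be done in an editor or with a few lines of code. Below is a minimal, self-contained sketch; the sample rows are hypothetical stand-ins, and it assumes only that GeoNames' `countryInfo.txt` prefixes its header comments with `#`:

```python
# Drop the '#'-prefixed comment lines that precede the data rows
# in GeoNames' countryInfo.txt.
def strip_comments(lines):
    return [line for line in lines if not line.startswith("#")]

# Hypothetical sample: two header comments followed by two data rows.
sample = [
    "# GeoNames country info\n",
    "# ISO\tISO3\tCountry\n",
    "US\tUSA\tUnited States\n",
    "FR\tFRA\tFrance\n",
]
cleaned = strip_comments(sample)
print("".join(cleaned), end="")
```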
Then, we need to set up a PostgreSQL database, as this makes finding
relations between the original Carmen IDs and Geonames IDs significantly
easier. To set up the database, create a PostgreSQL database named `carmen`
and run the following SQL script:

    $ psql -f carmen/sql/populate_db.sql carmen

Now we can begin constructing the mappings from Carmen IDs to
Geonames IDs. Run the following scripts:

    $ python3 map_cities.py > ../mappings/cities.txt
    $ python3 map_regions.py > ../mappings/regions.txt

With the mappings constructed, we can finally convert the
`locations.json` file into one that uses Geonames IDs. To do this, run
the following:

    $ python3 rewrite_json.py
Added:

### Carmen 2.0 Improvements

We are excited to release the improved Carmen Twitter geotagger, Carmen 2.0! We have implemented the following improvements:

- A new location database derived from the open-source [GeoNames](https://www.geonames.org/) geographical database. This multilingual database improves the coverage and robustness of Carmen, as shown in our analysis paper "[Changes in Tweet Geolocation over Time: A Study with Carmen 2.0](https://aclanthology.org/2022.wnut-1.1/)".
- Compatibility with Twitter API V2.
- An up to 10x faster geocode resolver.
### GeoNames Mapping

We provide two different location databases.

- `carmen/data/geonames_locations_combined.json` is the new GeoNames database introduced in Carmen 2.0. It is derived by swapping out the arbitrary IDs used in the original version of Carmen for GeoNames IDs. This database is used by default.
- `carmen/data/locations.json` is the default database of the original Carmen. It is faster but less powerful than our new database. You can use the `--locations` flag to switch to this version of the database for backward compatibility.

We refer readers to the Carmen 2.0 paper repo for more details on the GeoNames mapping: https://github.com/AADeLucia/carmen-wnut22-submission
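The ID swap behind the new database can be illustrated with a minimal, self-contained sketch. All IDs and field names below are hypothetical stand-ins; the real conversion is performed by the scripts in the paper repo:

```python
import json

# Hypothetical mapping from Carmen's arbitrary location IDs to
# GeoNames IDs, as produced by the mapping scripts.
carmen_to_geonames = {1: 4930956, 2: 5128581}

# A tiny stand-in for locations.json: one record per location.
locations = [
    {"id": 1, "city": "boston", "countrycode": "us"},
    {"id": 2, "city": "new york", "countrycode": "us"},
    {"id": 3, "city": "atlantis", "countrycode": "xx"},
]

# Swap each arbitrary ID for its GeoNames ID, dropping records
# with no known GeoNames equivalent.
rewritten = [
    dict(loc, id=carmen_to_geonames[loc["id"]])
    for loc in locations
    if loc["id"] in carmen_to_geonames
]

print(json.dumps(rewritten, indent=2))
```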
### Building for Release

1. In the repo root folder, run `python setup.py sdist bdist_wheel` to create the wheels in the `dist/` directory.
2. Run `python -m twine upload --repository testpypi dist/*` to upload to TestPyPI.
3. **Create a brand new environment**, and run `pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple carmen` to make sure the package installs correctly from TestPyPI.
4. After checking correctness, run `python -m twine upload dist/*` to publish on the actual PyPI.
### Reference

If you use the Carmen 2.0 package, please cite the following work:

```
@inproceedings{zhang-etal-2022-changes,
    title = "Changes in Tweet Geolocation over Time: A Study with Carmen 2.0",
    author = "Zhang, Jingyu  and
      DeLucia, Alexandra  and
      Dredze, Mark",
    booktitle = "Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.wnut-1.1",
    pages = "1--14",
    abstract = "Researchers across disciplines use Twitter geolocation tools to filter data for desired locations. These tools have largely been trained and tested on English tweets, often originating in the United States from almost a decade ago. Despite the importance of these tools for data curation, the impact of tweet language, country of origin, and creation date on tool performance remains largely unknown. We explore these issues with Carmen, a popular tool for Twitter geolocation. To support this study we introduce Carmen 2.0, a major update which includes the incorporation of GeoNames, a gazetteer that provides much broader coverage of locations. We evaluate using two new Twitter datasets, one for multilingual, multiyear geolocation evaluation, and another for usage trends over time. We found that language, country origin, and time does impact geolocation tool performance.",
}
```
