---
title: "Introduction to {photon}"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to {photon}}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, include=FALSE}
library(photon)
options(photon_movers = FALSE)
```

This vignette is an introduction to the `{photon}` package, an interface to the [photon](https://photon.komoot.io/) geocoder developed by [komoot](https://www.komoot.com/). Photon is open-source, based on [OpenStreetMap](https://www.openstreetmap.org/) data, and powered by the [ElasticSearch](https://www.elastic.co/elasticsearch) search engine. It is -- according to komoot -- fast, scalable, multilingual, typo-tolerant, and up-to-date. Photon can do unstructured geocoding, reverse geocoding, and (under special circumstances) structured geocoding.

Komoot offers a public photon API (https://photon.komoot.io/), but you can also set up a photon instance on a local machine. `{photon}` supports both online and offline geocoding.

Online geocoding through komoot's public API is appealing because it is convenient and offers up-to-date global coverage. Accordingly, it is easy to use in `{photon}`. First, you need to tell R that you want to use the public API. This is done with the workhorse function `new_photon()`. To set up online geocoding, simply call it without arguments:

```{r public_api, eval=FALSE}
new_photon()
#>
#> Type : remote
#> Server : https://photon.komoot.io/
```

The created `photon` object is attached to the session and does not have to be stored manually. Now you can geocode.

```{r geocode, eval=FALSE}
cities1 <- geocode(c("Sanaa", "Caracas"), osm_tag = ":city")
cities1
#> Simple feature collection with 2 features and 12 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -66.9146 ymin: 10.50609 xmax: 44.20588 ymax: 15.35386
#> Geodetic CRS: WGS 84
#> # A tibble: 2 × 13
#>     idx osm_type    osm_id osm_key osm_value type   countrycode name  country state county extent geometry
#>
#> 1     1 N        258013552 place   city      distr… YE          Sana… Yemen   Aman… At Ta…        (44.20588 15.35386)
#> 2     2 R         11219583 place   city      city   VE          Cara… Venezu… Capi… Munic…        (-66.9146 10.50609)
```

Similarly, you can also reverse geocode. `{photon}` fully supports `sf` objects: all geocoding functions return `sf` data frames, and `reverse()` accepts `sf` and `sfc` objects as input.

```{r reverse, eval=FALSE}
cities2 <- reverse(cities1, osm_tag = ":city")
cities2
#> Simple feature collection with 2 features and 12 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -66.9146 ymin: 10.50609 xmax: 44.20588 ymax: 15.35386
#> Geodetic CRS: WGS 84
#> # A tibble: 2 × 13
#>     idx osm_type    osm_id osm_key osm_value type   countrycode name  country state county extent geometry
#>
#> 1     1 N        258013552 place   city      distr… YE          Sana… Yemen   Aman… At Ta…        (44.20588 15.35386)
#> 2     2 R         11219583 place   city      city   VE          Cara… Venezu… Capi… Munic…        (-66.9146 10.50609)
```

```{r compare, eval=FALSE}
all.equal(cities1, cities2)
#> [1] TRUE
```
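In practice, you often geocode a column of a larger data frame rather than a bare character vector. As the output above shows, the `idx` column identifies the position of each result in the input, so it can be used to carry other columns along. The following is a minimal sketch: the `places` data frame is made up for illustration, and only `geocode()` itself comes from `{photon}`.

```{r join_idx, eval=FALSE}
# Hypothetical input data with an ID column and a column of place names
places <- data.frame(id = c("a", "b"), place = c("Sanaa", "Caracas"))

# Geocode the place names; `idx` refers to the position in the input vector
results <- geocode(places$place, osm_tag = ":city")

# Carry the original IDs over to the geocoded results
results$id <- places$id[results$idx]
```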
Online geocoding is nice, and it is most likely all you need for basic tasks. But what if online geocoding is not enough? What if you need to geocode a dataset of 200,000 places? What if you need to geocode sensitive information from survey respondents? And what about structured geocoding?

# Offline geocoding

The photon backend is freely available on the [photon GitHub repository](https://github.com/komoot/photon). With it, you can set up a local instance of photon. Offline geocoding is attractive because it is extremely fast, versatile, and keeps potentially sensitive data on your own machine instead of sending it around the internet. In many cases, offline geocoding is downright imperative, yet setting up an offline geocoder is usually quite cumbersome. `{photon}` takes over this task!

To run photon, you need a Java Development Kit (JDK), version 11 or higher. To install one, you can use the [`{rJavaEnv}`](https://cran.r-project.org/package=rJavaEnv) package by Egor Kotov. It allows you to easily set up a JDK environment for a local project.

```{r java, eval=FALSE}
# pak::pkg_install("rJavaEnv")
library(rJavaEnv)

# Consent to downloading Java
rje_consent()
#> Consent for using rJavaEnv has already been provided.

# Install and use Corretto JDK 24
use_java(24)

# Check if the installation was successful
java_check_version_cmd()
#> JAVA_HOME: /root/.cache/R/rJavaEnv/installed/linux/x64/24
#> Java path: /root/.cache/R/rJavaEnv/installed/linux/x64/24/bin/java
#> Java version: "openjdk version \"24.0.1\" 2025-04-15 OpenJDK Runtime Environment Corretto-24.0.1.9.1 (build
#> 24.0.1+9-FR) OpenJDK 64-Bit Server VM Corretto-24.0.1.9.1 (build 24.0.1+9-FR, mixed mode, sharing)"
#> [1] "24"
```

Setting up local photon also works through `new_photon()`. This time, we pass a path where the necessary files should be stored and a country for which a search index should be downloaded. Global coverage is also possible, but the global search index is extremely large (around 80 GB). By default, `new_photon()` downloads a search index tagged as `latest`, but it is also possible to use a search index created on a specific date (see the sketch below).

```{r local_photon, eval=FALSE}
path <- file.path(tempdir(), "photon")
photon <- new_photon(path, country = "Monaco")
#> ℹ openjdk version "24.0.1" 2025-04-15
#> ℹ OpenJDK Runtime Environment Corretto-24.0.1.9.1 (build 24.0.1+9-FR)
#> ℹ OpenJDK 64-Bit Server VM Corretto-24.0.1.9.1 (build 24.0.1+9-FR, mixed mode, sharing)
#> ✔ Successfully downloaded OpenSearch photon 0.7.0. [17.9s]
#> ✔ Successfully downloaded search index. [854ms]
#> • Version: 0.7.4
#> • Coverage: Monaco
#> • Time: 2025-09-17
```
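If you want your results to be tied to a fixed state of the data, you can request an index created on a specific date instead of `latest`. The following is only a sketch: the `date` argument name is an assumption, and the date simply mirrors the index time shown above; check `?new_photon` for the exact interface.

```{r dated_index, eval=FALSE}
# Sketch: request an index created on a specific date instead of "latest".
# The `date` argument name is an assumption; see ?new_photon for details.
photon_pinned <- new_photon(path, country = "Monaco", date = "2025-09-17")
```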
The resulting object is an R6 class with a few methods to control the instance. To start photon, run `$start()`. This starts an external Java process which can be accessed using the `$proc` attribute.

```{r start, eval=FALSE}
photon$start()
#> Cluster Name: photon
#> Base Path: /tmp/RtmpLQIqmy/photon/./photon_data
#> Num Of Node: 1
#> Node Name: Node 1
#> HTTP Port: 9201
#> Data Directory: /tmp/RtmpLQIqmy/photon/./photon_data/node_1/data
#> Log Directory: /tmp/RtmpLQIqmy/photon/./photon_data/node_1/logs
#> [2025-06-09T23:54:26,879][INFO ][o.o.n.Node ] version[2.19.1], pid[1896], build[unknown/unknown/unknown], OS[Linux/6.6.87.1-microsoft-standard-WSL2/amd64], JVM[Amazon.com Inc./OpenJDK 64-Bit Server VM/24.0.1/24.0.1+9-FR]
#> [2025-06-09T23:54:26,887][INFO ][o.o.n.Node ] JVM home [/root/.cache/R/rJavaEnv/installed/linux/x64/24]
#> [2025-06-09T23:54:26,899][INFO ][o.o.n.Node ] JVM arguments []
#> [2025-06-09T23:54:26,932][INFO ][o.o.p.PluginsService ] no modules loaded
#> [2025-06-09T23:54:26,934][INFO ][o.o.p.PluginsService ] loaded plugin [org.opensearch.analysis.common.CommonAnalysisPlugin]
#> [2025-06-09T23:54:26,934][INFO ][o.o.p.PluginsService ] loaded plugin [org.opensearch.geo.GeoModulePlugin]
#> [2025-06-09T23:54:26,934][INFO ][o.o.p.PluginsService ] loaded plugin [org.opensearch.transport.Netty4Plugin]
#> [2025-06-09T23:54:27,174][INFO ][o.a.l.s.MemorySegmentIndexInputProvider] Using MemorySegmentIndexInput and native madvise support with Java 21 or later; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
#> [2025-06-09T23:54:27,184][INFO ][o.o.e.NodeEnvironment ] using [1] data paths, mounts [[/ (/dev/sdd)]], net usable_space [222.3gb], net total_space [250.9gb], types [ext4]
#> [2025-06-09T23:54:27,190][INFO ][o.o.e.NodeEnvironment ] heap size [2.4gb], compressed ordinary object pointers [true]
#> [2025-06-09T23:54:27,384][WARN ][o.a.l.i.v.VectorizationProvider] You are running with Java 23 or later. To make full use of the Vector API, please update Apache Lucene.
#> [2025-06-09T23:54:27,603][INFO ][o.o.n.Node ] node name [Node 1], node ID [IFRXmNbwTYi4SLrEQXOn3A], cluster name [photon], roles [data, cluster_manager]
#> [2025-06-09T23:54:27,708][INFO ][o.o.e.ExtensionsManager ] ExtensionsManager initialized
#> [2025-06-09T23:54:29,302][INFO ][o.o.t.NettyAllocator ] creating NettyAllocator with the following configs: [name=opensearch_configured, chunk_size=512kb, suggested_max_allocation_size=512kb, factors={opensearch.unsafe.use_netty_default_chunk_and_page_size=false, g1gc_enabled=true, g1gc_region_size=2mb}]
#> [2025-06-09T23:54:29,499][INFO ][o.o.d.DiscoveryModule ] using discovery type [single-node] and seed hosts providers [settings]
#> [2025-06-09T23:54:30,082][WARN ][o.o.g.DanglingIndicesState] gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually
#> [2025-06-09T23:54:30,572][INFO ][o.o.n.Node ] initialized
#> [2025-06-09T23:54:30,572][INFO ][o.o.n.Node ] starting ...
#> [2025-06-09T23:54:30,723][INFO ][o.o.t.TransportService ] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
#> [2025-06-09T23:54:31,016][INFO ][o.o.c.c.Coordinator ] cluster UUID [wbcIe8ZCQw-oX0uvwZgysw]
#> [2025-06-09T23:54:31,199][INFO ][o.o.c.s.MasterService ] Tasks batched with key: org.opensearch.cluster.coordination.JoinHelper, count:3 and sample tasks: elected-as-cluster-manager ([1] nodes joined)[{Node 1}{IFRXmNbwTYi4SLrEQXOn3A}{RKCnk_jPQpmMLN2B0oUoaA}{127.0.0.1}{127.0.0.1:9300}{dm}{shard_indexing_pressure_enabled=true} elect leader, _BECOME_CLUSTER_MANAGER_TASK_, _FINISH_ELECTION_], term: 2, version: 11, delta: cluster-manager node changed {previous [], current [{Node 1}{IFRXmNbwTYi4SLrEQXOn3A}{RKCnk_jPQpmMLN2B0oUoaA}{127.0.0.1}{127.0.0.1:9300}{dm}{shard_indexing_pressure_enabled=true}]}
#> [2025-06-09T23:54:31,289][INFO ][o.o.c.s.ClusterApplierService] cluster-manager node changed {previous [], current [{Node 1}{IFRXmNbwTYi4SLrEQXOn3A}{RKCnk_jPQpmMLN2B0oUoaA}{127.0.0.1}{127.0.0.1:9300}{dm}{shard_indexing_pressure_enabled=true}]}, term: 2, version: 11, reason: Publication{term=2, version=11}
#> [2025-06-09T23:54:31,316][INFO ][o.o.d.PeerFinder ] setting findPeersInterval to [1s] as node commission status = [true] for local node [{Node 1}{IFRXmNbwTYi4SLrEQXOn3A}{RKCnk_jPQpmMLN2B0oUoaA}{127.0.0.1}{127.0.0.1:9300}{dm}{shard_indexing_pressure_enabled=true}]
#> [2025-06-09T23:54:31,321][WARN ][o.o.c.r.a.AllocationService] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
#> [2025-06-09T23:54:31,329][INFO ][o.o.h.AbstractHttpServerTransport] publish_address {127.0.0.1:9201}, bound_addresses {[::1]:9201}, {127.0.0.1:9201}
#> [2025-06-09T23:54:31,330][INFO ][o.o.n.Node ] started
#> [2025-06-09T23:54:31,345][WARN ][o.o.c.r.a.AllocationService] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
#> [2025-06-09T23:54:31,402][INFO ][o.o.g.GatewayService ] recovered [1] indices into cluster_state
#> [2025-06-09T23:54:31,403][WARN ][o.o.c.r.a.AllocationService] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
#> [2025-06-09T23:54:31,599][INFO ][o.o.p.PluginsService ] PluginService:onIndexModule index:[photon/3te8gZq4Q-u1LwWbZGtxPA]
#> [2025-06-09T23:54:32,346][WARN ][o.o.c.r.a.AllocationService] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
#> [2025-06-09T23:54:32,519][INFO ][o.o.c.r.a.AllocationService] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[photon][2]]]).
#> [2025-06-09T23:54:32,623][WARN ][o.o.c.r.a.AllocationService] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
#> [2025-06-09T23:54:33,179][INFO ][d.k.p.App ] Make sure that the ES cluster is ready, this might take some time.
#> [2025-06-09T23:54:33,447][INFO ][d.k.p.App ] ES cluster is now ready.
#> [2025-06-09T23:54:33,637][INFO ][o.o.c.m.MetadataIndexStateService] closing indices [photon/3te8gZq4Q-u1LwWbZGtxPA]
#> [2025-06-09T23:54:33,732][INFO ][o.o.c.m.MetadataIndexStateService] completed closing of indices [photon]
#> [2025-06-09T23:54:33,733][WARN ][o.o.c.r.a.AllocationService] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
#> [2025-06-09T23:54:33,837][WARN ][o.o.c.r.a.AllocationService] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
#> [2025-06-09T23:54:33,936][INFO ][o.o.p.PluginsService ] PluginService:onIndexModule index:[photon/3te8gZq4Q-u1LwWbZGtxPA]
#> [2025-06-09T23:54:34,245][INFO ][o.o.c.m.MetadataIndexStateService] opening indices [[photon/3te8gZq4Q-u1LwWbZGtxPA]]
#> [2025-06-09T23:54:34,252][INFO ][o.o.p.PluginsService ] PluginService:onIndexModule index:[photon/3te8gZq4Q-u1LwWbZGtxPA]
#> [2025-06-09T23:54:34,281][WARN ][o.o.c.r.a.AllocationService] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
#> [2025-06-09T23:54:34,369][WARN ][o.o.c.r.a.AllocationService] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
#> [2025-06-09T23:54:34,452][INFO ][o.o.p.PluginsService ] PluginService:onIndexModule index:[photon/3te8gZq4Q-u1LwWbZGtxPA]
#> [2025-06-09T23:54:34,741][WARN ][o.o.c.r.a.AllocationService] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
#> [2025-06-09T23:54:34,880][INFO ][o.o.c.r.a.AllocationService] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[photon][2]]]).
#> [2025-06-09T23:54:34,970][WARN ][o.o.c.r.a.AllocationService] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
#> [2025-06-09T23:54:35,038][INFO ][o.o.p.PluginsService ] PluginService:onIndexModule index:[photon/3te8gZq4Q-u1LwWbZGtxPA]
#> [2025-06-09T23:54:35,165][INFO ][d.k.p.App ] Starting API with the following settings:
#> Languages: [en, de, fr]
#> Import Date: 2025-06-01T03:00:01.000+0200
#> Support Structured Queries: false
#> Support Geometries: false
#> ✔ Photon is now running. [12.8s]

photon$proc
#> PROCESS 'java', running, pid 1896.
```

To check if the service is up and running, you can use `$is_ready()`.

```{r is_ready, eval=FALSE}
photon$is_ready()
#> [1] TRUE
```

Finally, to properly stop photon after you are done with it, you can run `$stop()`. You do not actually _need_ to run it manually, because it is (implicitly) executed on two occasions: (1) when the object is garbage collected and (2) when the R session ends and external processes are killed.

```{r stop, eval=FALSE}
photon$stop()
```

To compare offline and online geocoding, let's benchmark them by geocoding Monte Carlo, which is covered by the Monaco search index we just downloaded. In both cases, `with_photon()` is used to evaluate the geocoding call against a specific photon instance:

```{r benchmark1, eval=FALSE}
# offline geocoding
bench::mark(with_photon(photon, geocode("Monte Carlo")), iterations = 25)$median
#> [1] 17.1ms
```

```{r benchmark2, eval=FALSE}
# online geocoding
photon_pub <- new_photon(mount = FALSE)
bench::mark(with_photon(photon_pub, geocode("Monte Carlo")), iterations = 25)$median
#> [1] 1.04s
```

That is a speed increase by a factor of about 60 (and possibly more on faster machines)!

Finally, to clean up photon, i.e. stop the instance and delete the photon directory, run `$purge()`.

```{r purge, eval=FALSE}
photon$purge()
#> ℹ Purging an instance kills the photon process and removes the photon directory.
#> Continue? (y/N/Cancel) y
```
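After purging, the local instance and its search index are gone. If you still want to geocode afterwards, you can simply mount the public API again, exactly as at the beginning of this vignette, and continue online:

```{r fallback, eval=FALSE}
# Re-attach komoot's public API and keep geocoding online
new_photon()
geocode("Monte Carlo")
```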