Find publicly exposed source code repositories
Repo Lookout is a large-scale security scanner, with a single purpose: Find source code repositories that have been inadvertently exposed to the public and report them to the domain’s technical contact.
Accidentally exposed source code repositories often contain highly sensitive information that can be used for downstream attacks, such as data leakage and ransomware extortion. While the problem has been known and extensively documented for years,[1][2][3][4][5] our findings show that it is still prevalent.
Our goal is to combat this vulnerability by automatically detecting and reporting instances.
Further details and a bit of history can be found in this interview on our web host’s blog.
URL index
The URL index for the scanning process is obtained from several sources:
- CommonCrawl builds and maintains an open repository of web crawling data.
- Tranco List is a research-oriented top site ranking hardened against manipulation.
- Chrome UX Report provides metrics for how real-world Chrome users experience the web.
- Certificate Transparency is an Internet security standard for monitoring and auditing the issuance of digital certificates.
Statistics
Including our most recent security scan on November 20, 2024, we have scanned 15,497,283,148 URLs on 1,918,713,372 domains. A total of 1,327,240 publicly exposed source code repositories have been found to date.
As a result, Repo Lookout sent 747,231 emails to the technical contact, of which 520,915 emails were successfully delivered.
Frequently Asked Questions
How to prevent exposed repositories?
It’s recommended to configure the web server to deny access to all “dot folders” (i.e., folders starting with a “.”).[6] However, to prevent Git repositories from being exposed, it’s sufficient to deny access to “.git” folders.
Exactly how to do this depends on the server software used, but here are some configuration examples for nginx, Apache, and Caddy.
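Independent of the server software, the rule itself can be sketched in a few lines of Python. This is only an illustration of the matching logic, not a replacement for a proper web server configuration; the function name `touches_dot_folder` is hypothetical, and the sketch assumes that the “.well-known” exception applies only at the root of the site.

```python
from urllib.parse import unquote

# Assumption: only a top-level ".well-known" folder is exempt (RFC 8615).
ALLOWED_ROOT_SEGMENT = ".well-known"


def touches_dot_folder(request_path: str) -> bool:
    """Return True if any segment of the request path starts with a dot."""
    # Drop the query string and decode percent-escapes such as "%2Egit".
    path = unquote(request_path.split("?", 1)[0])
    segments = [segment for segment in path.split("/") if segment]
    for index, segment in enumerate(segments):
        if not segment.startswith("."):
            continue
        if index == 0 and segment == ALLOWED_ROOT_SEGMENT:
            continue
        return True
    return False


if __name__ == "__main__":
    for candidate in ("/path/.git/config", "/.well-known/security.txt", "/index.html"):
        verdict = "deny" if touches_dot_folder(candidate) else "allow"
        print(f"{candidate}: {verdict}")
```

Note that the check covers every path segment, so files inside a dot folder are denied along with the folder itself.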
How to opt out of being scanned?
The security scanner is designed to be a good network citizen, so requests are throttled and bandwidth usage is minimal. However, we do understand that not every website may welcome the scanning process.[7]
There are two ways to opt out of the scanning process:
- Send us an opt-out email with the domain name, IP, IP range(s), or ASN. In the case of IP range(s) or ASN, we will request log entries from a previous scan as a means of authentication.
- Deny all requests with an HTTP User-Agent prefix of “RepoLookoutBot”.
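For sites served by a Python application, the second option could look roughly like the following WSGI middleware. This is a minimal sketch: the wrapper name `deny_repo_lookout` is hypothetical, and in most setups the same prefix match would be configured in the web server or firewall instead.

```python
# The prefix is taken from the text above; everything else is illustrative.
BLOCKED_UA_PREFIX = "RepoLookoutBot"


def deny_repo_lookout(app):
    """Wrap a WSGI app and answer 403 to requests from the scanner."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if user_agent.startswith(BLOCKED_UA_PREFIX):
            body = b"Forbidden\n"
            headers = [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ]
            start_response("403 Forbidden", headers)
            return [body]
        # Any other client is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware
```

Wrapping an existing application is then a single call, e.g. `application = deny_repo_lookout(application)`.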
How did you get my personal email address?
Most likely from the inadvertently exposed source code repository itself!
If Repo Lookout is unable to extract relevant emails from the domain’s main page, and if there is no technical contact email in the WHOIS database (e.g. if it is protected by a CAPTCHA or requires other manual action that is not feasible for our automated setup), the system will use the Git author email of the most recent commit in the exposed source code repository.
Why are commits readable when the .git URL returns 403?
A widespread misconfiguration, unfortunately described in many blog posts, prohibits access to the Git folder itself, but not to the files inside the folder. In fact, this problem is so common that it was the primary reason why Repo Lookout was created in the first place.
So to be absolutely sure that the Git repository is no longer exposed, try adding /config (as in http://domain.tld/path/.git/config) or /logs/HEAD (as in http://domain.tld/path/.git/logs/HEAD) and see if those URLs are still world-readable.
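This manual check can also be scripted. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes that an HTTP 200 response means the file is readable. Some servers return 200 for custom error pages, so a positive result should still be confirmed by looking at the response body.

```python
import urllib.error
import urllib.request

# The two files suggested above as probes inside the .git folder.
PROBE_SUFFIXES = ("/config", "/logs/HEAD")


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request, or 0 on network failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except (urllib.error.URLError, OSError):
        return 0


def exposed_git_files(git_url: str, fetch_status=http_status) -> list[str]:
    """Return the probe URLs below git_url that answer with HTTP 200."""
    base = git_url.rstrip("/")
    return [
        base + suffix
        for suffix in PROBE_SUFFIXES
        if fetch_status(base + suffix) == 200
    ]


if __name__ == "__main__":
    # Placeholder URL from the text above; replace it with your own site.
    for url in exposed_git_files("http://domain.tld/path/.git"):
        print("still world-readable:", url)
```

An empty result for your own `.git` URL suggests that the files inside the folder are no longer served.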
Sponsoring
To support this project, consider becoming a sponsor on Ko-fi. All funds will be used for the crawling and email infrastructure.
Thank you very much!
[1] How unprotected .git repositories compromise website security (German, 2015)
[2] Source code disclosure via exposed .git folder (English, 2018)
[3] Pwning eBay - How I dumped eBay Japan’s website source code (English, 2018)
[4] Open .git global scan (English, 2018)
[5] Finding exposed .git repositories (English, 2020)
[6] With the exception of the “.well-known” folder, which is defined in RFC 8615.
[7] E.g., Fail2Ban is sometimes configured to trigger alerts on scanning .git folders.