Monday, November 20, 2023

Thousands of Australian UFO newsclippings - downloaded, organised and assimilated into wider collection using AI-generated code [AI assisted ufology]

Last year, I uploaded a few hundred scanned Australian newspaper clippings about UFOs. In the last few weeks, I have explored the (impressive) ability of Artificial Intelligence tools to generate code to find further UFO newspaper clippings from various Australian sources and organise them so that they can be assimilated into my wider collection of UFO material. As a result, I added over 3,500 Australian newspaper clippings to my online UFO collection this weekend. More significantly, now that relevant tools have been developed, it should be relatively easy to add thousands more from Australia (or apply the same technique to combine similar UFO newspaper clipping collections from dozens of other countries with previous scans and share them online).

The current Australian collection is at:
https://files.afu.se/Downloads/?dir=.%2FTopics%2F0%20-%20Countries%2FAustralia%2FCuttings

I've previously done various projects in relation to UFO newspaper clippings. For example, a few years ago I shared a basic list of over 60,000 scanned UFO newspaper articles that were then in my collection (and shared a sample of that collection, from the 1980s). More recently, I uploaded - with permission from Rod Dyke - scans of the "UFO Newsclipping Service" (1969-2011) and - with permission from Ron and Richard Smotek - scans of a similar service offered by the Aerial Phenomenon Clipping Information Center ("APCIC") (1970s-1990).

Artificial Intelligence tools now allow very rapid reorganisation of UFO material from numerous sources, so it is relatively simple to assimilate UFO newspaper clippings from online databases, scrapbooks of UFO researchers, official documents (e.g. the Australian files that I've uploaded as PDFs over the years, working with Keith Basterfield and Paul Dean - see HERE), offline UFO databases, archives of UFO groups/researchers and digitised UFO material.

These UFO newspaper clippings from various sources can all be organised so that they can, in turn, be assimilated into a wider collection of UFO case files, official UFO documents, UFO magazines, PhD dissertations regarding UFOs, UFO databases, discussion forum posts and related emails/correspondence.

As a further little case study, this weekend I uploaded a few thousand further Australian newspaper clippings to my free online UFO archive (kindly hosted by the AFU in Sweden). These are being combined with material from the AFU's offline archive in Sweden, plus collections of newspaper clippings from various Australian researchers (including collections put together by Anthony Clarke and Judith Houston McGinness - with their kind permission).

I picked Australia for this little case study due to the existence of a huge, free online database of Australian newspaper stories: Trove.   

Various Australian UFO researchers have previously highlighted the existence of Trove in blog posts, including posts by Keith Basterfield and Paul Dean. The Trove newspaper archive includes a huge number of Australian newspaper stories. Unfortunately, it is not easy to find a comprehensive online collection of the UFO newspaper clippings available on Trove (or any collection supplemented by further UFO newspaper clippings from other sources, such as those mentioned above).

Searching Trove can be slightly frustrating. For example, a search of the content of articles on Trove for "UFO" finds _many_ articles from long before 1947 (i.e. before the modern UFO era, and before the term "UFO" was coined). Some of those early newspaper articles have been scanned poorly, so the text as recognised by the Trove software is basically a collection of random letters. By chance, those random letters include the letters "UFO" in hundreds of articles (e.g. in a line of text which is recognised by the Trove system [wrongly] as being "adfr AWTA hAWrhyu UFO akaRF jsD AlE").
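
As a rough illustration of why this happens so often, here is a back-of-envelope sketch of my own (the article length and the number of poor scans are made-up figures for illustration, not actual Trove statistics):

# Back-of-envelope estimate: how often would the three consecutive letters
# "UFO" appear purely by chance in badly-OCR'd text? This assumes, purely for
# illustration, that the recognised text behaves like uniformly random
# letters drawn from a 26-letter alphabet.
ALPHABET_SIZE = 26
ARTICLE_LENGTH = 3000     # characters of gibberish in one poor scan (assumed)
POOR_SCANS = 50000        # number of poorly scanned early articles (assumed)

p_per_position = (1 / ALPHABET_SIZE) ** 3          # chance of "UFO" starting at a given position
expected_per_article = (ARTICLE_LENGTH - 2) * p_per_position
expected_total = expected_per_article * POOR_SCANS

print(f"Expected chance hits per gibberish article: {expected_per_article:.2f}")
print(f"Expected chance hits across {POOR_SCANS} poor scans: {expected_total:.0f}")
# With these made-up numbers: roughly 0.17 chance hits per article and a few
# thousand hits overall - easily enough to produce hundreds of false positives.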

Very brief search terms such as "UFO" therefore generate hundreds (if not thousands) of false positive results which would have to be weeded out if the collection is to be limited to just articles relating to UFOs.  

At the other extreme, it is possible to search Trove for articles which readers have tagged with the label "UFO". The material found using this search includes far less irrelevant material (relevant articles make up almost 100% of the results), but only a small fraction of the relevant material is found (say 1%). At present, most of the articles in Trove that may be of interest to UFO researchers have not been tagged.

So, the challenge is either:
(1) to find the time to weed out irrelevant results from wider search terms, or
(2) to devise search terms/restrictions which return only (or at least almost entirely) material which is relevant to UFO research.

I don't have the time for (1), so, working on my own, the only real option is (2).

Fortunately, it is possible to come up with search terms and restrictions which greatly reduce the amount of irrelevant material while finding far more relevant material than just the articles currently tagged with a label such as "UFO".  

To help find useful search terms (and to archive material which is found), I found it useful to download all the articles found as a result of a search and then to glance through the articles offline (which is much faster than reviewing them online). In particular, I used AI software to generate code to download all the search results, then manually reviewed the folder of PDFs produced for each search, setting the view in Windows Explorer to include a preview pane on the right-hand side of the screen - allowing relatively rapid review of the PDFs to determine whether the results were largely relevant or included a lot of irrelevant material.

I think it would be useful to have a discussion of the pros and cons of different potential search terms, with at least a qualitative account of actual experiments with those search terms. I don't recall seeing this done within ufology so far. For example:

The term "UFO", as indicated above, generated far too many irrelevant results due to random letters produced by the OCR software - particularly in poor scans of earlier newspaper articles (e.g. articles from the 1800s).

The term "UFO" could be combined with another search term or a collection of alternative search terms, e.g. searching for "UFO" AND (light OR mysterious OR unidentified OR flying OR sighting OR sighted). Unfortunately, pre-1952, most of the results were irrelevant (with, say, less than 10% being relevant). Most were hits for the word "unidentified" in poorly scanned articles from the 1800s with lots of random characters that happen to include the three consecutive letters "UFO". Post-1952, the percentage rises to, say, about 50% relevant - with fewer poor scans full of random characters, but quite a lot of the hits for the keyword "UFO" are in reviews of science fiction books and movies.
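
For anyone wanting to experiment with combinations of this kind, the search can be expressed as a URL in the same way as in the download script below. Here is a minimal sketch: the keyword.phrase parameter is the one used in that script, while keyword.any is my assumption as to the parameter behind the advanced search form's "any of these words" box, so the parameter names are worth double-checking against the form itself:

from urllib.parse import urlencode

# Minimal sketch of building a combined Trove search URL.
# "keyword.phrase" is the parameter used in the download script below;
# "keyword.any" is assumed (not verified) to correspond to the advanced
# search form's "any of these words" box.
base_url = "https://trove.nla.gov.au/search/advanced/category/newspapers"
params = {
    "keyword.phrase": "UFO",
    "keyword.any": "light mysterious unidentified flying sighting sighted",
}
search_results_url = f"{base_url}?{urlencode(params)}"
print(search_results_url)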

Better than "UFO" was a search for "Unidentified" AND "Flying" AND "Object". I'd estimate that about 95% of results after 1947 were relevant. However, there were surprisingly many results prior to 1947, only about 10% of which were relevant. One possibility would be devising searches that use terms combined with date restrictions, e.g. "Unidentified" AND "Flying" AND "Object" but only in relation to articles dating from, say, 1947 or later.
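
I have not tried building the date restriction into the Trove search itself, but because the download script below names each PDF so that it begins with the article's date ("YYYY MM DD_..."), the downloaded results can be split by date afterwards. A minimal sketch, using the same example folder as in the script:

import os
import shutil

# Minimal sketch: move downloaded PDFs dated before 1947 into a separate
# folder for manual review, relying on the "YYYY MM DD_..." filename format
# produced by construct_filename_dynamic() in the script below.
save_dir = "e:/temp/Trove/testdown"     # example folder, as used in the script
pre_1947_dir = os.path.join(save_dir, "pre-1947")
os.makedirs(pre_1947_dir, exist_ok=True)

for filename in os.listdir(save_dir):
    if not filename.lower().endswith(".pdf"):
        continue
    year_text = filename[:4]
    if not year_text.isdigit() or year_text == "0000":
        continue  # skip files where the date could not be extracted
    if int(year_text) < 1947:
        shutil.move(os.path.join(save_dir, filename), os.path.join(pre_1947_dir, filename))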

Turning from "UFO", some other terms resulted in some interesting articles being found, but too many irrelevant articles for the search to be relied upon alone (without manual intervention). For example, Trove had about 200 hits for "strange lights in the sky" (i.e. a much smaller number than for UFO or flying saucer). This found articles containing both "lights" in the plural ("strange lights in the sky") and "light" in the singular ("strange light in the sky"). However, a relatively high percentage were irrelevant or uninteresting.

By a significant margin, the most productive search term was "flying saucer" (the results for which included the plural, "flying saucers", in addition to the singular). This resulted in about 2,000 hits, mainly from the 1940s-1950s, almost all of which were relevant. Due to the high rate of relevancy, the numerous results of this particular search probably do not require much (if any) manual intervention, and this is the most promising search term to be rolled out and applied to databases of newspaper articles from other English-speaking countries.

I'll set out below the code I generated using AI tools to download material from Trove. The script includes a line stating the URL of the relevant search results to be downloaded, which obviously has to be changed for each different search term used:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import requests
import os
import time
import re

MAX_RETRIES = 100  # Number of retries for downloading an article and for loading search result pages

def safe_click(driver, element):
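    # Scroll the element into view, then click it - falling back to a
    # JavaScript click if the normal Selenium click fails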
    driver.execute_script("arguments[0].scrollIntoView();", element)
    time.sleep(1)
    try:
        element.click()
    except Exception:
        driver.execute_script("arguments[0].click();", element)

def sanitize_filename(filename):
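    # Replace characters that are not allowed in Windows filenames with underscores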
    invalid_chars = ['<', '>', ':', '"', '/', '\\', '|', '?', '*']
    for char in invalid_chars:
        filename = filename.replace(char, '_')
    return filename

def download_pdf_from_search_result(driver, search_result_url, save_dir, max_retries=MAX_RETRIES):
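    # Open a single Trove article page, click through the download dialog to
    # request the largest-size PDF and save it to save_dir - retrying on
    # timeouts and connection errors up to max_retries times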
    retries = 0
    success = False

    while retries < max_retries and not success:
        try:
            driver.get(search_result_url)
            WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, '//*[@id="grp2Download"]/span[1]')))
                
            # Construct the filename dynamically
            filename = construct_filename_dynamic(driver)
            
            # The rest of the download steps
            download_icon = driver.find_element(By.XPATH, '//*[@id="grp2Download"]/span[1]')
            safe_click(driver, download_icon)
            
            pdf_option = driver.find_element(By.XPATH, '//*[@id="articlePdfLink"]')
            safe_click(driver, pdf_option)
            
            change_size_btn = driver.find_element(By.XPATH, '//*[@id="articleImageQualityShow"]')
            safe_click(driver, change_size_btn)
            
            largest_checkbox = driver.find_element(By.XPATH, '//*[@id="inlineRadio5"]')
            safe_click(driver, largest_checkbox)
            
            create_pdf_btn = driver.find_element(By.XPATH, '//*[@id="downloadModal"]/div/div/div[3]/a[10]')
            safe_click(driver, create_pdf_btn)
            
            time.sleep(10)  # Wait for the "View PDF" button to appear/change after the "Create PDF" button is clicked
            
            view_pdf_btn = driver.find_element(By.XPATH, '//*[@id="downloadModal"]/div/div/div[3]/a[10]')
            pdf_url = view_pdf_btn.get_attribute('href')
            
            # Fetch the PDF content
            response = requests.get(pdf_url, stream=True, timeout=60)  # Add a timeout for the request
       
            # Sanitize the filename
            safe_filename = sanitize_filename(filename)
            
            # Download the PDF
            with open(os.path.join(save_dir, safe_filename), 'wb') as pdf_file:
                for chunk in response.iter_content(chunk_size=1024):
                    if chunk:
                        pdf_file.write(chunk)

            success = True
        except (TimeoutException, requests.exceptions.ConnectionError) as e:  # Handle both timeout and connection errors
            retries += 1
            print(f"Attempt {retries} failed due to {str(e)}. Retrying...")
            wait_time = min(5 + retries * 2, 600)
            time.sleep(wait_time)
    
    if not success:
        print(f"Failed to download article from {search_result_url} after {max_retries} attempts.")

def construct_filename_dynamic(driver):
    # Build a filename of the form "YYYY MM DD_Source - Title.pdf" from the
    # article page's breadcrumb and the date shown in the page source
    page_source = driver.page_source

    # find_element() raises an exception if the element is missing, so use
    # find_elements() and fall back to a default value when nothing is found
    source_elements = driver.find_elements(By.XPATH, '//*[@id="breadcrumb-c"]/ul/li[4]/a')
    if source_elements:
        source = source_elements[0].text.strip().split('(')[0].strip() + " (Australia)"
    else:
        source = "Browse (Australia)"

    title_elements = driver.find_elements(By.XPATH, '//*[@id="breadcrumb-c"]/ul/li[7]/a')
    title = title_elements[0].text.strip().lower().title() if title_elements else "UnknownTitle"

    date_pattern = re.compile(r'(\w+ \d{1,2} \w+ \d{4})')
    date_match = date_pattern.search(page_source)

    if date_match:
        date_info = date_match.group(0).split()
        day, month, year = date_info[1].zfill(2), date_info[2], date_info[3]
    else:
        day, month, year = "00", "UnknownMonth", "0000"

    month_mapping = {
        'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': '06',
        'Jul': '07', 'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'
    }

    date_string = f"{year} {month_mapping.get(month, '00')} {day}"

    filename = f"{date_string}_{source} - {title}.pdf"
    return filename

def process_search_results_page(driver, search_results_url, save_dir, start_article=1, end_article=397, articles_per_page=20):
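    # Walk through the search results pages (via the startPos URL parameter),
    # collect the article links on each page and download each article as a PDF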
    # Calculate starting page and ending page
    start_page = (start_article - 1) // articles_per_page + 1
    end_page = (end_article - 1) // articles_per_page + 1

    # Generate the URL for the starting page
    start_pos = (start_page - 1) * articles_per_page
    initial_url = f"{search_results_url}&startPos={start_pos}"
    driver.get(initial_url)

    # Add a short delay to let the page load and to see if the popup appears
    time.sleep(5)

    # Try to close the popup
    try:
        close_popup = driver.find_element(By.XPATH, '//*[@id="culturalModal___BV_modal_footer_"]/div/div/div[2]/button/span')
        close_popup.click()
        time.sleep(2)  # Give it a moment to close
    except Exception as e:
        print(f"Error closing popup: {e}")

    article_counter = start_article - 1
    page_count = start_page

    while page_count <= end_page:
        # Extract and process articles on the current page
        print(f"Attempting to extract articles from page {page_count}...")
        article_links_elements = driver.find_elements(By.XPATH, "//a[contains(@href, '/newspaper/article/')]")
        article_links = list(set([link_elem.get_attribute('href') for link_elem in article_links_elements]))

        print(f"Found {len(article_links)} articles on the current page.")

        article_num = 0  # Counter to keep track of which article is being processed on the page

        for link in article_links:
            article_num += 1
            print(f"Processing article number {article_num} with URL {link}...")

            try:
                download_pdf_from_search_result(driver, link, save_dir)
            except Exception as e:
                print(f"Error processing article number {article_num} with URL {link}. Error: {e}")

            article_counter += 1
            print(f"Processed {article_counter} articles.")

        print(f"Current article count after processing page: {article_counter}")

        # Check if we need to navigate to the next page
        if article_num % articles_per_page == 0:
            print("Verifying if we need to navigate to the next page...")
            print(f"article_num: {article_num}, articles_per_page: {articles_per_page}, modulus result: {article_num % articles_per_page}")
            
            for i in range(MAX_RETRIES):
                next_page_url = f"{search_results_url}&startPos={article_counter}"
                driver.get(next_page_url)
                time.sleep(min(5 + i*2, 600))  # Wait time increases with each retry, maxing out at 600 seconds (10 minutes)

                # If articles are found on the page, break out of the retry loop
                article_links_elements = driver.find_elements(By.XPATH, "//a[contains(@href, '/newspaper/article/')]")
                if article_links_elements:
                    break
                else:
                    print(f"Retry {i+1}: No articles found on the new page. Retrying...")

            page_count += 1
        else:
            print(f"No more articles to process or reached the limit.")
            print(f"Processed {page_count - start_page + 1} pages.")
            break



# Script execution starts here
if __name__ == '__main__':
    driver = webdriver.Chrome()
    save_directory = 'e:/temp/Trove/testdown'

    # URL of the search results page
    search_results_url = "https://trove.nla.gov.au/search/advanced/category/newspapers?keyword.phrase=unidentified%20flying%20object"

    # For processing articles 
    process_search_results_page(driver, search_results_url, save_directory, start_article=1, end_article=422)

    time.sleep(20)
    driver.quit()


Thursday, November 16, 2023

ATS (AboveTopSecret.com) - 411 selected UFO threads archived as PDFs - some thousands of pages long

I've now archived over 400 selected UFO threads from ATS as searchable PDFs.

I have not checked the total number of pages in the 411 PDFs uploaded so far, but one thread alone is over 3,000 pages long in PDF format.

ATS (AboveTopSecret.com) was a popular forum until the last few years. Recently, all users were unable to log in until some volunteers (particularly "Djarums") worked to re-admit at least some ATS members (including me). The new owner of ATS did not appear to be able to [or even attempt to] solve the problem, which seems to cast considerable doubt on the future viability of ATS.

Several fellow members of ATS are working on archiving at least the text of numerous threads on ATS. I don't plan on uploading such an archive unless and until ATS does go offline - but at least the material should be preserved. ATS had some of the most extensive online discussions regarding UFOs and conspiracy theories in the period from, oh, about 2000-2015 (prior to the current popularity of Facebook and Twitter).

(I'd estimate that the material from ATS archived so far, if converted entirely to PDFs, would run to at least several million pages.)


Saturday, November 4, 2023

ATS ("AboveTopSecret") dying? Threads by Karl12 - PDFs added to online archive

ATS ("AboveTopSecret") was a very popular UFO / conspiracy discussion forum until the rise of Facebook, Twitter and other modern social media. ATS has been on a downward trajectory for a few years due to that competition. During the last week or two, ATS has appeared to teeter on the brink of collapse. All users were locked out of their accounts. The current owner appears to have disappeared and failed to sort things out. It seems that it's only due to the work of a few active volunteers on ATS, particularly "Djarums", that access has been restored for some members.

It's rather unclear if ATS will survive for much longer.

I've archived about 100 of my more substantial UFO threads from ATS as searchable PDF documents.  I've also archived over 60 UFO threads by Karl12 after he gave me permission to do the same with his UFO threads.  

I'm tempted to widen the archiving of UFO threads from ATS, given the recent failures in relation to ATS and the risk that it may go down permanently, possibly soon. Ideally, I'd want permission from the new owner (and I've requested permission on ATS several times during the last few years, without any objection but no clear consent - although some of the moderators of ATS have helped me develop code to archive ATS threads as searchable PDFs...).

The archiving code that I've developed (with help from ChatGPT and some other members of ATS, particularly "Drewlander") iterates through a list saved in a file called "thread_details.csv" and creates a PDF of each thread in that list (as in the samples at the link above). That file stores the list of relevant threads in the following format (a minimal sketch of reading entries in this format appears after the examples below):

THREADNUMBER,NUMBER OF PAGES,AUTHOR - BRIEF TITLE  [with no spaces after each comma]

e.g. :

1308154,14,Karl12 - New And Revised UFO Quote Directory

1278525,3,Karl12 - Highly Dubious USAF UFO Explanations

1231535,4,Karl12 - Early UFO Saucer Reports

841422,5,Karl12 - UFOs and falling leaf or pendulum motion

1171896,3,Karl12 - UFOs and Colour Change

878723,7,Karl12 - Electromagnetic Effects Associated with UFOS

460705,4,Karl12 - UFO OVNI Shapes

1233389,6,Karl12 - UFO Time Anomaly Research

505080,9,Karl12 - Unusual reports of UFOs taking on water

1261532,7,Karl12 - UFO Faerie Lore Connection

1273353,3,Karl12 - UFOs And Stopped Clocks

1271101,2,Karl12 - UFO Animal Reaction Research

1263051,3,Karl12 - UFO  Cryptid Research

898220,6,Karl12 - UFO Light Beam Cases

1286201,3,Karl12 - UFO Pilot Under-Reporting Bias

900175,5,Karl12 - Missing Gun Camera UFO Footage

513308,10,Karl12 - UFO Government Documentary Evidence - Greenewald
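
As a minimal sketch of how entries in that format can be read (this covers just the parsing of the list, not the archiving itself - the archiving code is in the blog post linked below):

# Minimal sketch: read thread_details.csv in the format
# THREADNUMBER,NUMBER OF PAGES,AUTHOR - BRIEF TITLE
# Only the first two commas are treated as separators, so titles containing
# commas are preserved intact.
with open("thread_details.csv", newline="", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        thread_number, page_count, title = line.split(",", 2)
        print(f"Thread {thread_number}: {page_count} pages - {title}")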

I posted the relevant code and a brief bit of background in a blog post a while ago (before the most recent technical/hacking problems with ATS): 

https://isaackoiup.blogspot.com/2023/04/some-abovetopsecret-ufo-threads.html

If other members of ATS want (and, ideally, give their permission), I can expand this archive.  If anyone wants particular threads added, it would be helpful if they provided their list of requests in the same format as that used to list threads above (since I can then quickly paste that list into a file to be iterated by the code I've developed).