- How does SCAN help me? Why should I use SCAN?
- Who are the SCAN users?
- What are the categories of software related to SCAN?
- How does it differ? Why is it “smart”?
- Can SCAN replace a file manager?
- Can SCAN replace a syndication feed reader?
- Where can I read more about SCAN?
- Which document formats are supported by SCAN?
- Where can SCAN retrieve the documents from?
- What is the documents repository?
- Where is the SCAN repository located? How to back it up?
- How to set another location for the repository?
- Is it possible to have multiple repositories?
- What are tags?
- What is autotagging?
- Autotagging results are odd
- What is “tags specificity”?
- What is “tags novelty”?
- How to make autotagging work with a controlled vocabulary of tags?
- Some documents are not autotagged
- What is “tag auto-population”?
- Why should I use SCAN and not a desktop search tool?
- What is “associative search”? How does it work?
- How to disable an installed plugin temporarily?
- Why are there no plugins for images or multimedia files?
- SCAN hangs or crashes with “OutOfMemory exception” error
- “Tagger: Cannot connect to database” error on start
- I want SCAN to write its log to a file
- What does the “Index Maintenance” button do?
- Running an upgraded SCAN version for the first time (with an existing repository), I got a bunch of error messages about missing plugins. After I installed all the required plugins and restarted SCAN, my repository was corrupted (some parts of it were lost).
- Where and how has SCAN been tested?
- How can I contribute to SCAN?
- What if I want to run an independent project to develop SCAN plugins?
- Why don’t you provide the sourcecode in the releases?
- What development tools do I need to work with the SCAN source code?
How does SCAN help me? Why should I use SCAN?
- A single place to access information from different sources (local documents, email archives, web bookmarks, news feeds …)
- Organize your document collections.
- Quickly find information you need.
- Discover valuable information in heaps of content.
- Analyze and explore your document archives, converting them into usable electronic libraries.
Who are the SCAN users?
SCAN is intended for desktop users in general. However, it is especially helpful for so-called “knowledge workers” who spend most of their time processing information: office workers, researchers, librarians, journalists, etc.
What are the categories of software related to SCAN?
- File managers, as long as they are used for document organization and navigation
- Desktop search engines
- Metadata managers and collection organizers (e.g. for multimedia libraries or photo albums)
- Syndication feed aggregators
- Web bookmarking and annotation services
How does it differ? Why is it “smart”?
SCAN brings the power of text mining and analysis to help you organize and explore your document collections. This approach shows in a number of automated features, e.g. tag suggestion, automated tagging, guided associative search and finding documents by similarity. Unlike other tools, SCAN cares about document semantics and helps you understand “what the document is about” and “how it relates to others”.
Can SCAN replace a file manager?
No. It has read-only access to the file system, so it cannot help you manage your files and directories. However, if your file manager (like Windows Explorer) is the main interface to your document collections, SCAN may offer richer capabilities for document organization, navigation and search. Think of it as an interface layer built on top of the file system (or another physical document source).
Can SCAN replace a syndication feed reader?
Yes! With the Syndication Feed plugin, you can add web syndication feeds (RSS or Atom) as SCAN locations, thus importing the feed items into the repository along with local documents and other content. New feed items are fetched at every location update (triggered manually or by a timer).
Where can I read more about SCAN?
Which document formats are supported by SCAN?
Plain text, HTML and XML are supported out of the box; PDF, OpenDocument and MS Office formats are available with additional plugins.
Where can SCAN retrieve the documents from?
What is the documents repository?
You can think of the repository as a card index in a library. Every library card keeps information about a book (metadata) and refers to the physical location of the book in the library. Likewise, every record in the SCAN repository contains document metadata (title, description, author, creation date, etc.) and refers to the original document location (URL). The repository is integrated with the full-text search index and the tags database.
Where is the SCAN repository located? How to back it up?
By default, SCAN keeps the repository in the ‘.scan/repository’ subdirectory of your home directory. On *NIX systems the home directory path is usually ‘/home/username’, and on Windows it is ‘C:/Documents and Settings/username’. To be sure, you can see the actual SCAN repository path on the “General” tab of the configuration dialog (“Tools → Configure”). To make a backup of the repository, simply copy this directory.
How to set another location for the repository?
You can change it in the SCAN configuration dialog (note that you must specify an existing directory there). The new repository path will be used after you restart the application.
You can also set the system property “scan.dir.repository” on application start, e.g. by changing the line in the startup script:
java -Dscan.dir.repository=my_repository_path -jar scan-launcher.jar
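For readers curious how such a property is typically consumed, here is a minimal sketch. Only the property name and the default location come from this FAQ; the class name and the fallback logic are assumptions for illustration, not SCAN's actual code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class RepositoryPathSketch {
    // Property name as given in this FAQ.
    static final String PROPERTY = "scan.dir.repository";

    // Assumed behaviour: use the -D value if present, otherwise fall back
    // to ".scan/repository" under the user's home directory.
    static Path resolve() {
        String configured = System.getProperty(PROPERTY);
        if (configured != null && !configured.isEmpty()) {
            return Paths.get(configured);
        }
        return Paths.get(System.getProperty("user.home"), ".scan", "repository");
    }

    public static void main(String[] args) {
        System.out.println("Repository path: " + resolve());
    }
}
```

Running the sketch with `java -Dscan.dir.repository=my_repository_path RepositoryPathSketch` prints the configured path; without the option it prints the default under your home directory.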
Is it possible to have multiple repositories?
The simplest way to do so is to create multiple copies of the startup script, each with a different “scan.dir.repository” property value (as shown above). Each script will launch SCAN with a different repository.
What are tags?
A tag is a keyword or label assigned to a document to identify its topic and enable document classification. You can assign as many tags as you wish to each document in the repository. Tags may contain letters, numbers and punctuation characters, but not spaces (two terms delimited by a space are interpreted as two different tags) or single or double quotes. Tags are case-insensitive, that is, the terms “cats”, “Cats” and “CATS” refer to the same tag.
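The rules above can be sketched as a small parser. This is only an illustration of the stated rules (space-delimited, no quotes, case-insensitive); the class and method names are hypothetical and this is not SCAN's own tag-handling code. It checks ASCII quotes only.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class TagInputSketch {
    // Splits user input into tags: whitespace separates tags, quotes are
    // rejected, and case is folded so "cats", "Cats" and "CATS" collapse.
    static Set<String> parse(String input) {
        Set<String> tags = new LinkedHashSet<>();
        for (String token : input.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields a single empty token
            }
            if (token.indexOf('\'') >= 0 || token.indexOf('"') >= 0) {
                throw new IllegalArgumentException("Quotes are not allowed in tags: " + token);
            }
            tags.add(token.toLowerCase(Locale.ROOT));
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(parse("cats Cats CATS text-mining"));
    }
}
```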
What is autotagging?
Autotagging is the process of analyzing a document to pick the most relevant terms and assigning those terms as tags describing the document. You can apply autotagging either to a single document or to a group of selected documents. You can also have autotagging applied to each new or updated document found in a location.
Autotagging results are odd
1) If you notice a lot of tags that do not match the content of the documents, try increasing the “tags specificity” parameter in the Configuration dialog. This makes tag selection more accurate, but at the cost of less general tags and a larger taxonomy. Increasing specificity is recommended if your collection is relatively small (100-200 documents or fewer).
2) Autotagging works well only if your collection is reasonably large. It makes no sense to run autotagging on a collection of 10-20 documents: the results might be interesting, but not very useful.
A recommended strategy is to apply autotagging when your document collection reaches a relatively stable state, and to do a full re-tagging after every major collection change (when a large corpus of documents has been added or removed).
3) Finally, look at the document from a robot’s point of view. Maybe you’ll realise why those tags were selected.
What is “tags specificity”?
Specificity is an autotagging parameter that defines whether terms extracted directly from a document have higher priority than terms picked from the document’s context (a cluster of similar documents). High specificity leads to large, fine-grained taxonomies with small numbers of documents sharing the same tags, while low specificity increases the value of “general” terms, producing fewer tags with more documents per tag.
It is difficult to give a general rule of thumb for this parameter, as it depends on your tagging purposes and on the size and nature of the document collection. It is recommended to experiment with different values to find the best one for your case.
What is “tags novelty”?
“Tags novelty” controls the tendency of autotagging to invent new tags instead of re-using existing ones. When it is set to the maximum, existing tags take no priority over new candidates. Otherwise, all else being equal, existing tags are more likely to be selected for new documents. Minimal novelty means absolute priority for existing tags.
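One way to picture this behaviour is a toy scoring rule in which new candidates are discounted by the novelty setting. The formula, the 0..1 range and all names below are assumptions made purely for illustration; the FAQ does not describe SCAN's real scoring.

```java
final class NoveltySketch {
    // Hypothetical rule: existing tags keep their relevance score,
    // new candidates are scaled by novelty (0.0 = minimum, 1.0 = maximum).
    static double score(double relevance, boolean existingTag, double novelty) {
        if (novelty < 0.0 || novelty > 1.0) {
            throw new IllegalArgumentException("novelty must be within [0, 1]");
        }
        return existingTag ? relevance : relevance * novelty;
    }

    public static void main(String[] args) {
        // Equal relevance, medium novelty: the existing tag scores higher.
        System.out.println(score(0.6, true, 0.5) > score(0.6, false, 0.5));
    }
}
```

Under this toy rule, maximum novelty gives existing tags no advantage, while minimum novelty reduces every new candidate to zero, so any matching existing tag wins.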
How to make autotagging work with a controlled vocabulary of tags?
Some documents are not autotagged
This may happen if creation of new tags is disabled (see the question above). It simply means that no existing tags matching the document content were found. It may also result in fewer tags than the number set in the autotagging dialog.
What is “tag auto-population”?
This is the process of assigning a specific tag to the relevant documents. It finds all documents containing the given term and assigns the tag automatically if the term’s relevancy is reasonably high (above a threshold specified in the application preferences).
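The idea can be sketched with a deliberately simple relevancy measure. Here relevancy is assumed to be the share of a document's words that equal the term; SCAN's actual measure is not described in this FAQ, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class AutoPopulationSketch {
    // Assumed relevancy: occurrences of the term divided by total word count.
    static double relevancy(String term, String text) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int hits = 0;
        int total = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            total++;
            if (word.equals(wanted)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    // Returns the ids of documents whose relevancy exceeds the threshold.
    static List<String> documentsToTag(String term, Map<String, String> docs, double threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            if (relevancy(term, entry.getValue()) > threshold) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a", "coffee beans and coffee cups");
        docs.put("b", "tea with a little coffee on the side");
        docs.put("c", "green tea");
        System.out.println(documentsToTag("coffee", docs, 0.2));
    }
}
```

Document “b” mentions the term but falls below the threshold, so only “a” would receive the tag in this example.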
Why should I use SCAN and not a desktop search tool?
- There is no single silver bullet (like search) against the problems of information overflow. This is a complex issue that needs to be solved with complex integrated tools.
- SCAN provides many more search capabilities than the usual full-text search.
- SCAN search is not limited to files.
What is “associative search”? How does it work?
After every search request, SCAN analyzes the results to pick other terms you might also want to search for. For instance, if you are looking for “coffee”, SCAN would expect that you might also be interested in “latte”, “espresso”, “cup” or “bean”. You can then search for these suggested terms, or add them to the initial query to refine it.
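A rough sketch of the idea: count how many result documents each other term appears in, and suggest the most frequent ones. This co-occurrence counting is an assumed stand-in, not SCAN's actual analysis, and the names are hypothetical. A real implementation would also filter out stop words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class AssociativeSearchSketch {
    // Suggests up to 'limit' terms that occur in the most result documents,
    // excluding the query term itself. Ties are broken alphabetically.
    static List<String> suggest(String query, List<String> results, int limit) {
        final Map<String, Integer> documentFrequency = new HashMap<>();
        String queryTerm = query.toLowerCase(Locale.ROOT);
        for (String doc : results) {
            Set<String> terms = new HashSet<>(
                    Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\W+")));
            terms.remove("");
            terms.remove(queryTerm);
            for (String term : terms) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(documentFrequency.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(documentFrequency.get(b), documentFrequency.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList(
                "coffee latte espresso",
                "coffee espresso cup",
                "coffee espresso latte bean");
        System.out.println(suggest("coffee", results, 2));
    }
}
```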
How to disable an installed plugin temporarily?
Rename the plugin directory so that its name starts with two underscore characters (“__”). The plugin will not be loaded on the next SCAN start. Remove these characters from the directory name to activate the plugin again.
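The convention can be pictured as a simple name filter applied when the plugin directories are scanned. The sketch below is an assumption about how such a check could look; the directory names are invented examples, not real SCAN plugin names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PluginFilterSketch {
    // Prefix described in this FAQ for temporarily disabled plugins.
    static final String DISABLED_PREFIX = "__";

    static boolean isEnabled(String directoryName) {
        return !directoryName.startsWith(DISABLED_PREFIX);
    }

    // Keeps only the plugin directories that would be loaded.
    static List<String> enabled(List<String> directoryNames) {
        List<String> result = new ArrayList<>();
        for (String name : directoryNames) {
            if (isEnabled(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical directory names for illustration only.
        System.out.println(enabled(Arrays.asList("pdf-plugin", "__feed-plugin")));
    }
}
```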
Why are there no plugins for images or multimedia files?
Actually, it is not a problem to develop plugins that add those content types. However, SCAN’s primary focus is handling textual content, so everything else is postponed for the future. One day, maybe.
Does SCAN require a super-duper-mega-powerful computer?
No. SCAN was tested on a pretty average machine (P-IV 3 GHz, 1 GB RAM) managing really large document collections (tens of thousands of documents).
Browsing the repository is slow
Try reducing the number of documents per location. For instance, instead of adding a large directory hierarchy as a single location, it is better to add its subdirectories as separate locations. Note that SCAN performance depends more on the sizes of individual locations than on the total repository size.
SCAN hangs or crashes with “OutOfMemory exception” error
SCAN may run out of memory when parsing very large documents (the number of documents doesn’t seem to be a problem). If this happens, try increasing the Java heap size in the startup script (e.g. to 256 MB):
java -Xms256M -Xmx256M -jar scan-launcher.jar
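If you want to confirm that the heap setting took effect, a tiny standalone Java program can report the maximum heap the JVM will use. This is a generic check using the standard `Runtime` API, not part of SCAN; the class name is arbitrary.

```java
final class HeapCheckSketch {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it with the same options as in the startup script, e.g. `java -Xms256M -Xmx256M HeapCheckSketch`. The reported value may be slightly lower than the requested size, depending on the JVM.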
“Tagger: Cannot connect to database” error on start
First, check that SCAN is not already running with the same repository path. If you are sure that is not the case, something went wrong during the previous SCAN run and it was unable to shut down correctly. Try removing the “.lck” file mentioned in the error message (~/.scan/repository/db/.lck) manually and run SCAN again.
I want SCAN to write its log to a file
Open the ‘scan.conf’ file in the ‘.scan’ subdirectory of your home directory and add an entry:
Instead of ‘ALL’ you can specify the minimal level of messages to be written to the file (‘CONFIG’, ‘INFO’, ‘WARNING’ or ‘SEVERE’).
By default, the log file is named ‘scan.log’ and is located in the ‘.scan’ subdirectory of your home directory. You can change its name and location by adding a “scan.logging.file.path” entry, e.g.:
What does the “Index Maintenance” button do?
Sometimes, when a SCAN operation (e.g. removing a location) is aborted because of an unexpected failure, “dead” document and tag entries may appear in the repository. Index Maintenance checks the index and the tags database for such entries and removes them.
Running an upgraded SCAN version for the first time (with an existing repository), I got a bunch of error messages about missing plugins. After I installed all the required plugins and restarted SCAN, my repository was corrupted (some parts of it were lost).
It is always a good idea to back up the existing repository directory (see “Where is the SCAN repository located?”) before migrating to another version. If you have a backup copy, you can restore it to the original location after all the required plugins are installed (make sure that SCAN is not running while you restore).
Alternatively, you can copy the plugin directories from your old SCAN installation before running the new version for the first time, and then upgrade the plugins with the Plugins manager if needed.
Where and how has SCAN been tested?
The main testing platforms were Ubuntu Linux (Feisty) and Windows XP (SP2) running Sun Java SE JRE 1.5 and 1.6.
Please note: SCAN has not been tested on third-party (non-Sun) JVMs or on the JVM for Mac OS X. You can test it there and report compatibility issues using the bugtracker. SCAN will not work with Java 1.4 and older versions.
Testing was performed using two major content sources. The first is our corporate electronic library (CS books, papers, tech docs, etc.), which has been collected since 1996. It contains nearly 30,000 documents (~2 GB) in different formats (PDF, HTML, DocBook, OpenOffice, Word, and plain text). The second is the Reuters-21578 corpus (21,578 text files from the 1987 Reuters newswire).
How can I contribute to SCAN?
If you want to join the development of the main platform or the native plugins, subscribe to the scan-project mailing list and introduce yourself and your ideas. Prior participation in the project trackers and forums is a plus. At a minimum, you have to be a registered SF.net user and be fluent in Java programming.
What if I want to run an independent project to develop SCAN plugins?
Great - we encourage anyone to develop their own plugins for SCAN. Independent plugin developers can contact us to have their products linked from the project website.
Why don’t you provide the sourcecode in the releases?
We do not see any reason to overload the distribution and confuse end users, as long as there is open anonymous access to the live SVN repository. Please refer to the Development section to learn how to obtain the source code.
If you need the code of a specific version, look at the ‘tags’ area of the repository. It contains a snapshot for every released version.
What development tools do I need to work with the SCAN source code?
A minimal set is a JDK (Java Development Kit), Ant and a Subversion client.
You may also need the NetBeans IDE for SCAN GUI development. We strongly discourage committing manually edited UI code that was originally generated by the NetBeans UI editor.