21. Darwin Technology

By Juan Chamero




Darwin Technology
Intag, Intelligent Agents Internet Corp
By Dr. Juan Chamero, Principal Architect
April 2008

 


ABSTRACT


    Darwin Technology is a Knowledge Management multi-agent computing system oriented to “Knowledge Discovery”. It is especially suited to huge, unstructured data reservoirs such as the Web. It is based on a new knowledge ontology, named Darwin Ontology, that models the process of documenting knowledge. It states that, as language evolved, humans learned to document following a set of rules and principles that are not formally explicit but culturally agreed, and that these could be unveiled and formally described, where possible, by analyzing a large enough text “corpus”. Darwin Ontology rests on a set of Conjectures that are to be tested over this corpus.
    As the whole Web can be considered a large enough corpus, we designed a technology to test these conjectures, with the idea of unveiling its hidden intelligence in the form of a probabilistic order. Our initial idea was that the whole Web, like structured databases, has an order, but one that is hard to see and probabilistic in nature. To check this assumption we mapped two major disciplines, Computing and Art, through a multi-agent Darwin algorithm. All our conjectures proved to work, unveiling at high resolution the hidden intelligence of these two disciplines. In the Art prototype, from more than a billion documents (in English) dealing with or mentioning art, we unveiled the hidden art structure: 7,570 thematic nodes along a tree of 13 levels encompassing about 300,000 concepts. An important feature is that this tree was built starting from “zero ground”, with a very basic seed of fewer than 40 nodes. The next step will be to map the whole Web “knowledge wood” behind its corpus: 200 Major Subjects or Disciplines, 350,000 nodes (themes) and about 10 million concepts.
    With these maps it is possible to build YGWYN (You Get What You Need) search engines that answer in only one click, like retrieving from a perfectly ordered World Virtual Library where everything is semantically indexed. In terms of a generalized man-machine networking interaction, Darwin Ontology covers both sides: it first unveils and orders the machine side and then tries to unveil and infer what people think.

Note: This work started in 2000. About 150 people collaborated in its research, development and implementation, centered on an American R&D company, Intag, Intelligent Agents Internet Corp, and CAECE University in Argentina.

Web References:
Institutionally, Intag, Intelligent Agents Internet Corp
AI Website, Intag.org
Darwin Blog
Darwin Demo: for a more complete demo, please ask Juan Chamero for a Telecom presentation
Art Mapping Prototype Sample, see the three sheets!
Art Timeline Series, a PowerPoint presentation of a timeline series retrieved by an agent and adjusted by a human.

Darwin Justification

What is it?: It is a Knowledge Management Multi Agent technique based on a New Knowledge Management Ontology.
What for?: To unveil the hidden intelligence of Data Reservoirs.
Why?: Because Information Retrieval and some associated techniques such as Data Mining are insufficient to satisfy users’ current information needs.

How to cope with the Semantic Web

    IR, Information Retrieval, has so far been the science of searching, but generally restricted to more or less structured data realms. By structured we mean n-dimensional sets of arrays, vectors and tables where the pieces of content are located. All databases are somehow ordered, and all Search Engines have their own Web database, so why do we affirm that the Web is not ordered? Conventional Search Engines (SEs) provide at least what we define as a “zero ground” semantic order, where all Web pages are indexed by the words of their content but share the same semantic level of uncertainty. However, this indexing says almost nothing about the specific subjects these pages deal with. If we suppose that Human Knowledge has at least 10 levels of ordering, from general matters to very specific ones, this semantic “order” is what is still missing before we can talk of the “Semantic Web”, the Tim Berners-Lee utopia, as a reality.
    Darwin provides this order by unveiling the inherent order hidden in the Web text corpus.

Note: In fact the Web is a multilingual text corpus.

Towards “meaning” discovery

    In the Web space, millions of documents and messages are hosted, queried, edited, downloaded and interchanged daily, openly, freely and virtually unformatted. A data set may hold anything from zero to a great deal of knowledge, depending on the “meaning” of its content. We may also argue that this meaning was created by an intelligent being, individual or collective, and that it could eventually be “explained and documented”. What if we define this meaning, a little more than information, as the inherent intelligence of the data set? And what would happen if, in huge data sets, the universe of meanings were somehow related to specific content patterns? Darwin technology, like Data Mining, is a Knowledge Discovery tool, but with its “Choice Modeling” approach Darwin goes a little farther, pointing to the best possible meanings, namely “authoritative meanings”.
    Two-way network communication is possible when all connected members know each other’s languages and codes. Data Mining is insufficient to point precisely to specific meanings in the Web because it only unveils main tracks and patterns, and is unable to answer questions like:
Why are these patterns produced?
Who generated these patterns? and especially
How were these patterns generated?

Note: Data Mining cannot solve the problem of collinearity: its intrinsic weakness is that the critical data that may explain patterns is never observed.

Towards the best query
    Usually IR provides the best outcome for a given query but says nothing about “the best query”. This is logical, because one step is still missing for that: knowing as much as possible about the user’s needs, let’s say the “other” end of the man-machine communication. We may imagine a facilitator link between the user and the IR tool that helps the user build the best query in order to obtain what he or she is really looking for. This is not trivial, mainly because users may express what they need in a wide variety of forms, languages and codes. If this link were a human wise enough to master the cognitive offer hosted on the “machine” side and at the same time endowed with a vast culture about people’s information needs, we might imagine the task being carried out via a smart dialog. Darwin Ontology takes into account these two “realms”, man and machine, trying to replace the hypothetical human link with a smart “e-membrane” managed by agents.


Index of Darwin Blog
Recommended Sequence of Reading

1. What’s in a document? – Part I
How are documents “seen” and indexed by conventional Search Engines
2. What’s in a document? – Part II
Trying to “see” better along the Web text corpus: human “reading” discrimination between two types of semantic particles: Common Words and “keywords”.
3. What’s in a keyword?
We analyze in depth different types of keywords and introduce the more elaborate terms “semantic chain” and “concept”.
4. What a concept is
We provisionally accept a milestone Darwin Conjecture: that knowledge is structured as a tree, one tree per “discipline”, where each node corresponds to a “subject” identified by the semantic chain of n links that goes through n levels from the root of the tree to the subject node. Any subject is identified by a set of specific concepts (the Specificity Rule), and each concept is in turn identified by a semantic chain of n+1 links, the last being the conventional “keyword”.
5. Let’s play a little with Google
We play with Google, one of the best and most complete current Search Engines, to learn about the new concepts. By playing we are now in a position to appreciate the Web as_it_is.
6. How users search and how they “discover” their own keywords
We now enter the users’ realm to see how humans look for what they “need”. We introduce here the symmetric “user keyword” and the ideas of “People Knowledge” and “People’s concepts”. From the man-machine matchmaking we introduce the idea of “e-membranes”.
7. Toward e-Libraries
We post here how Darwin imagines the Semantic Web structured with knowledge trees, subjects and concepts. New terms are introduced like Web Thesauruses, “semantic fingerprint” and “authorities”. It’s an introduction to Knowledge Mapping.
8. The Web as a semantic hypercube
We are now ready to imagine how to either re-structure the whole Web by semantically indexing all documents properly, or just keep everything as it is now but implement “semantic glasses” to see everything as ordered! Semantic glasses behave like intelligent e-membranes matchmaking users with the Web. We introduce here a strong Conjecture, that “Authorities” tend to write well, and with it the idea of WDD’s, Well Written Documents.
9. How do e-membranes work to build Thesauruses?
We explain Darwin as an industrial process similar to oil distillation in refineries: how we start semantically from “zero ground” (as Conventional Search Engines are currently structured, with all themes scrambled) and proceed to build the semantic virtual hypercube, level by level, from a “seed” at root level. It explains how agents explore the node neighborhoods looking for semantic tree consistency. This process computes for each node its “semantic fingerprint”, in fact a coded meaning.
10. A Reflection stop
At this stage of our long semantic trip, under way since 2001, we have been encouraged by the outcomes of two prototypes that prima facie confirmed all our suppositions (see Darwin Conjectures). So I consider this a good time to document some basic reflections: how humans build concepts and why our knowledge structures itself as a tree!
11. Darwin Conjectures
We are now in a position to understand the meaning and scope of our Ten Conjectures.
1. About Generalized man-machine dialog.
2. About the two types of semantic particles in that dialog
3. About the two types of semantic particles in any document
4. The communication as virtually performed thru e-membranes
5. Knowledge structures itself as a set of trees -wood-
6. About this tree structure
7. About “authorities” and authoritativeness
8. About semantic fingerprints and Web Thesaurus
9. About People’s Thesaurus
10. About a generalized semantic man-machine dialog
12. Art Map Prototype – I
We describe here how we make the art tree grow from a basic semantic seed, going progressively deeper, and how the retrieved tree is harmonized as much as possible to facilitate its later use for information retrieval. In numbers: 7,570 nodes and 350,000 concepts distributed along a tree of 13 levels.
13. Art Map Prototype – II
It details how fingerprints and specific Virtual Libraries of Authorities are built. It envisages the possibility of implementing a system of semantic indexing by building document fingerprints.
14. Keywords detection
It details the process of detecting keywords in the Web text corpus. This step is similar to some data mining techniques used in linguistics.
15. Darwin Ontology, The Web as_it_is
Darwin Ontology imagines the Web man-machine interaction as one performed between the “Established Order” and the Multitude. It presents the Web as a social model at planetary scale, making full use of the powerful Internet technology that enables the open, free and spontaneous flow of information and intelligence between those two realms.
16. K-side dissection
A micro-sample dissection of the Web content as_it_is. This dissection was performed by humans, members of our staff. However, it could be performed by agents in the near future. For many, this dissection may look very different from what they expected.
17. K’- side dissection
The same applied to the people’s side as_it_is, seen as a Multitude by our staff. Take a look at New, Libertarian, Popular and Bizarre activities!
18. A little more about Darwin Ontology I
It is another reflection on the work performed by our staff over the last five years, how our two prototypes evolved, and the near future of applications. Initially we unveiled the Computing discipline starting from a given semantic skeleton of 5 levels. In our second prototype we unveiled the Art discipline from a very primitive seed, but a seed after all. Our third prototype will probably unveil an important discipline without a seed, equivalent to starting with zero information.
19. A little more about Darwin Ontology II
We present here the main differences between Darwin technology and Data Mining and envisage the next step: to model people’s reasoning as revealed by their free, open and spontaneous interaction with the Established side!
20. Darwin Applications scope
The scope of Darwin technology is presented here. In our view Darwin will make digital knowledge absolutely manageable, at least textually. Knowledge expressed in images, sounds, smells and tactile forms of information is another matter. We are studying how Darwin Ontology could be extended to deal with these forms.





20. Darwin Applications scope

By Juan Chamero





    Darwin algorithms may deal with a vast scope of applications, mainly those derived from IR, Information Retrieval, and IIR, the same IR empowered with AI, Artificial Intelligence, tools. Content Generation is a leading application today, as we are supposedly living in the Content Era. Its current state of the art is restricted to aids that facilitate Content Generation performed by humans, such as Specialized Editors and Compilers with access to Tutorials, Guides, Quotations, Citations, References, Similar, Main Thematic Trends, Thesauruses, Glossaries and Dictionaries.
    If you imagine a Control Panel from which you may access all these resources, and you are an expert writer as well, your intellectual production will be empowered many times over. However, in order to transfer this talent to agents it would be necessary to execute and save all possible sequences of actions and their corresponding outcomes, with every step clearly understood and explained in the deepest detail. Believe me, this is still a cyclopean task! As an example of the steps and micro-steps of this talent, please go back to our analysis in K-side dissection, where we present examples of querying Search Engines, one of the simplest tasks on the Control Panel mentioned above. As we will see, agents could perform any computable task as long as it is perfectly and clearly understood and explained.
    Darwin may behave in some instances like an efficient Differential Data Mining, meaning that once you have unveiled the inherent and often hidden intelligence of a huge data reservoir there is no more need for “mining”. Darwin agents only need to process the last shell of aggregated data! Darwin processes thus evolve, de facto, by themselves.
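
    The “last shell” idea can be illustrated with a minimal sketch. It assumes that each document carries an arrival timestamp and that the unveiled map can be updated incrementally; the class KnowledgeMap, its update method and the simple word counting that stands in for real analysis are all hypothetical names chosen for illustration, not Darwin's actual procedure.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class KnowledgeMap:
    """Hypothetical stand-in for an already unveiled map of a data reservoir."""
    term_counts: Counter = field(default_factory=Counter)
    last_seen: int = 0  # timestamp of the newest document already processed

    def update(self, documents):
        """Process only documents newer than last_seen (the 'last shell')."""
        shell = [d for d in documents if d[0] > self.last_seen]
        for timestamp, text in shell:
            # Word counting is only a placeholder for the real analysis.
            self.term_counts.update(text.lower().split())
            self.last_seen = max(self.last_seen, timestamp)
        return len(shell)


# Illustrative reservoir of (timestamp, text) pairs.
reservoir = [(1, "opera verdi"), (2, "opera puccini")]
kmap = KnowledgeMap()
first_pass = kmap.update(reservoir)    # both documents are new
reservoir.append((3, "verdi rigoletto"))
second_pass = kmap.update(reservoir)   # only the newly aggregated document
```

    The second call touches one document only, which is the point of the differential approach: the cost of an update depends on the size of the new shell, not on the size of the whole reservoir.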


 
  The other types of applications are those that can be performed by anthropic algorithms, where A, Agents, perform their tasks lightly supervised and sometimes complemented by H, Humans. These algorithms are designed so that once agents can perform a given human task efficiently they replace the humans. One Darwin example was keyword detection in our first prototype: it was initially performed by university students until the discrimination talent was defined as computable. The first 18,000 keywords were unveiled by students in three months, while the remaining 36,000 were unveiled by agents, with better quality, in a couple of hours. You may see a Darwin prototype of e-procurement on our Website www.procurebot.com
    Let’s analyze Darwin’s scope within both realms of its ontology: the Established Knowledge Realm K and the People’s Realm K’.

Established Knowledge Realm
 
   We may find examples of this side in all types of data reservoirs, either concentrated or distributed. In the first type we have Catalogs and in the second the Web. However, Search Engines provide databases to facilitate the search process. Darwin Ontology is well suited to work in any of these scenarios. Its acronym, which stands for Distributed Agents to Retrieve the Web Intelligence, tells us about its versatility.

Cataloging and Catalogs Harmonization
    Its first aim is to facilitate the semantic ordering of the established side by building natural hierarchic indexes. Given, for instance, a Catalog of Materials, it may check its consistency and harmony and suggest changes to improve both. Given a list of needed materials, Darwin agents may be commissioned to go to permitted databases and data reservoirs and extract the Catalog trees and sub-trees that match those needs, in order to build feasible Catalog models.

Information Offer versus Demand Optimization
    A well defined information “offer” database could be cloned into as many market front ends as exist, and the different Offer versus Demand matchmaking interfaces studied in order to: a) gain better knowledge of the true demand and b) adjust the global and distributed offer accordingly.

Potential Demand
 
   The potential demand could also be tracked in the absence of a particular offer. The Darwin algorithm could be set up and adjusted to study demand without offer. For example, in cyberspace almost one thousand million users from all over the world are continuously trying, every day, to learn, to teach and to broadcast their ideas and opinions. We may tune in to this brewing of information and focus on a small window of the demand spectrum in a “listening” attitude. We may then also browse the whole Web space to account for the offer available to satisfy this potential demand. Non Satisfied Demand could then be the next Darwin step.

SSSE
    The WWW, World Wide Web, is a network of routers, servers, and databases created to satisfy the World Information Demand. Search Engines facilitate this complex task by concentrating in their databases meaningful summaries of all documents dispersed across the Web. It is supposed that about 20 billion documents are located in this net, and the number grows at a terrific pace. Of this data asset only about 10 billion are actually “classified” by the most powerful search engines. This classification is rather primitive, mostly by “words”, with all documents sharing only one semantic level of thematic specificity. To make the search process easier and more efficient, the obvious solution would be to classify documents thematically. The Darwin algorithm may perform this task, raising the thematic diversity from “zero ground” to as many as thirteen levels. Darwin may use the same information search engines have, while providing users with a sort of “semantic glasses” ® to “see” the Web better, as ordered in up to thirteen levels instead of a fuzzy and noisy zero ground.
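
    The contrast between a “zero ground” word index and a thematic index can be sketched in a few lines. This is only an illustration under stated assumptions: the three documents and the thematic paths attached to them are invented, and the way a path is assigned to a document (the actual mapping work) is taken as already done.

```python
from collections import defaultdict

# Hypothetical documents: id -> (text, thematic path assigned by a mapping step).
documents = {
    "d1": ("rigoletto is an opera by verdi", ("art", "music", "opera")),
    "d2": ("paella is a rice dish", ("art", "gastronomy", "rice dishes")),
    "d3": ("the opera house serves paella", ("art", "architecture", "theaters")),
}

# "Zero ground": every document indexed by its words, on one flat level.
word_index = defaultdict(set)
for doc_id, (text, _path) in documents.items():
    for word in text.split():
        word_index[word].add(doc_id)

# Thematic index: every document indexed under each prefix of its path,
# so it can be reached at any level of specificity.
theme_index = defaultdict(set)
for doc_id, (_text, path) in documents.items():
    for depth in range(1, len(path) + 1):
        theme_index[path[:depth]].add(doc_id)

flat_hits = sorted(word_index["opera"])                       # word match only
themed_hits = sorted(theme_index[("art", "music", "opera")])  # subject match
```

    The word lookup returns every page that mentions “opera”, including one about a theater restaurant, while the thematic lookup returns only the page whose subject is opera.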

People's Knowledge Realm

People’s Behavior Patterns Inferences

 
   The simplest scenario is users trying to learn as much as possible from a given Information Offer, as is the case of people using search engines. In the near future search engines will make sound inferences about actual and potential Information Demand, as long as people interact with their services freely. With SSSEs users will be encouraged or induced to query via concepts (semantic chains in our approach), and as concepts match specific subjects it is relatively easy to make probabilistic inferences about what users are trying to find out.
   Users will query via “established concepts” or via their own “people’s concepts”. People’s concepts play a role similar to established concepts, and should match their corresponding people’s subjects. These subjects are strongly related to “people’s information needs”. So “people’s information needs” are to some extent equivalent to “established subjects”. People query established information sources trying to satisfy their information needs. These needs are also a matter of analysis within Darwin Ontology, which states as a strong Conjecture that people tend to specialize in Major Activities and presupposes that these major activities, similar to disciplines on the established side, structure themselves as trees. Querying chains, as semantic chains, tend to structure probabilistically onto these trees. The most frequent querying chains suggest significant activities, perhaps a Major Activity. It is supposed that people who belong to a given Major Activity will, in the long run, use a similar kit of querying chains when querying search engines.
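
   As a rough illustration of the kind of probabilistic inference described above, the sketch below estimates which subjects lie behind a session of queries. It assumes a lookup table from concepts (semantic chains) to subjects is already available; the table, the function name infer_subjects and the relative-frequency estimate are illustrative choices, not a description of Darwin's own inference method.

```python
from collections import Counter

# Hypothetical mapping from a concept (semantic chain) to its subject.
concept_to_subject = {
    ("music", "opera", "rigoletto"): "opera",
    ("music", "opera", "aria"): "opera",
    ("gastronomy", "rice dishes", "paella"): "rice dishes",
}


def infer_subjects(query_chains):
    """Estimate, by relative frequency, the subjects behind a query session."""
    hits = Counter(
        concept_to_subject[chain]
        for chain in query_chains
        if chain in concept_to_subject  # unknown chains are ignored
    )
    total = sum(hits.values())
    if total == 0:
        return {}
    return {subject: count / total for subject, count in hits.items()}


session = [
    ("music", "opera", "rigoletto"),
    ("music", "opera", "aria"),
    ("gastronomy", "rice dishes", "paella"),
    ("unknown", "chain"),
]
estimate = infer_subjects(session)
```

   Accumulated over many sessions, the same counting would show which querying chains are most frequent, which is the signal the text associates with a Major Activity.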
    Darwin algorithms are anthropic, that is, some steps are necessarily performed by humans until the retrieving talent is precisely defined and then transferred to agents. For example, the Darwin algorithm that works on the K-side has about 80 steps, most of them (75) performed by agents. Darwin algorithms that will work on the K’-side in the near future will initially have steps that must be performed by humans, because it is a realm less known than K.


Some Darwin byproducts



    Eventually a Darwin algorithm may be placed as the smart interface of an SDI, Selective Distribution of Information, system, as in the figure above. The main outcome, the Hot information, is securely and selectively distributed in time and form over the Organization hierarchy, creating de facto an Organization Strategic Infoduct.

    Darwin byproducts are many and fruitful. As in the oil industry, byproducts emerge from intermediate steps of the distilling process mentioned above. For example, to build the Art Map it was necessary to process about 500,000 documents, many of them authorities. The pure textual sequences needed to unveil keywords form a corpus, a huge and linguistically structured set of texts (in English in this case) that could be used for statistical linguistic analysis, tests of the validity of linguistic rules, concordance analysis, etc. To give an idea of its importance, consider the BYU Corpus of American English: it has more than 360 million words in nearly 150,000 texts, including 20 million words each year from 1990-2007. The “Art Corpus” that could be delivered by Darwin may have about 1,000 million words corresponding to 500,000 texts.
    Another important byproduct would be the K’-side versus K-side rotation. If we suppose that the K-side hosts a meaningful sample of current Human Knowledge, we may imagine current People Knowledge flowing on the K’-side and ask ourselves: how many times must the K’-side “rotate” to change the K-side content? More broadly, how can we know more about the life cycles of both realms? As an exercise, some speculations follow.

     People brewing in K’: 500,000 daily on average, with an average bandwidth of 100 KB each, which gives us a flow of 5x10**10 Bytes per day. As we do not have any metric yet, we may venture a figure for the number of days needed to change K significantly, for example 100,000 days. We do not know the relative “volatility” of the disciplines of human knowledge, some extremely volatile like news, fashion and computing, and others extremely stable and inert to change like religion. To take this inertia into account we have to estimate a redundancy factor, which stands for the cycling insistence needed to effectively generate a change. If we settle on a redundancy of 10, we would finally need an accumulated traffic of 5x10**16 Bytes in order to change K completely.
    Now let’s get a reasonable estimate of the K-side size for this problem. Out of all Web documents we should only take into consideration authorities, which we estimate at 1,000,000,000 with 2,000 words each on average; taking 5 characters per word and a byte per character, this gives a Web volume of 10**13 Bytes. Finally, we are going to need 5,000 rotations of K’ over K in order to change K completely. Of course this calculation is only a speculation to test orders of magnitude, because the real rotation has to be measured.
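
    The arithmetic of this speculation can be checked step by step. The figures below are the ones given in the text; the only added assumption is that 1 KB is taken as 1,000 bytes, which is what the stated daily flow implies.

```python
# Figures taken from the text; all of them are speculative estimates.
users_per_day = 500_000
bytes_per_user = 100_000      # 100 KB, assuming 1 KB = 1,000 bytes
daily_flow = users_per_day * bytes_per_user      # bytes per day in K'

days_to_change = 100_000      # guessed number of days to change K
redundancy = 10               # guessed redundancy (insistence) factor
traffic_needed = daily_flow * days_to_change * redundancy

authorities = 1_000_000_000   # estimated authoritative documents in K
words_per_authority = 2_000
bytes_per_word = 5            # 5 characters per word, 1 byte per character
k_size = authorities * words_per_authority * bytes_per_word

# Number of times the accumulated K' traffic "covers" the whole of K.
rotations = traffic_needed // k_size
```

    With these inputs the daily flow is 5x10**10 Bytes, the accumulated traffic 5x10**16 Bytes, the size of K 10**13 Bytes and the number of rotations 5,000, in agreement with the figures above.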







19. A little more about Darwin Ontology II

By Juan Chamero

Darwin Differences with Data Mining
Darwin Raw Data Distillation Analogy
Towards People Behavior understanding

 

Darwin versus Data Mining
    Data mining is a sorting process over huge amounts of data (the mine) that tries to extract significant “data patterns”. It presupposes no previous knowledge about the data. It belongs to the field of “Knowledge Discovery” and its procedures are deeply rooted in statistics. It also presupposes that it is being used as an analytical tool for data analysis in man-machine environments. Patterns may then lead to the discovery of “behaviors”, in this case users’ behaviors. Its main field of application is real-world data with unknown interrelations. This characteristic carries an unavoidable weakness: the critical data that allegedly built the detected patterns may never be observed. Because of this weakness, data mining always requires subsequent and costly “data dredging” studies.
    However, documents are human-generated data, and as such they are somehow intelligently structured and suited to a better approach, which presupposes that we know something about their inner structure; that is to say, we may choose a model of that structure. Choice Modeling is thus an adequate tool for making probabilistic predictions about how humans decide and document. In this sense Darwin technology may be considered part of the Choice Modeling realm.

The Darwin distilling analogy


 
    From Art raw data we may imagine a distillation process that proceeds along distillation paths (instead of distillation steps and towers). In this figure the product “Rigoletto” undergoes 11 distillation steps. In this case Rigoletto is a “leaf” of the tree and at the same time a subject and a keyword. Others, like “Fantasy novel” and “Paella”, are subjects that are not leaves but roots of sub-trees, such as types of Fantasy novels within literature and types of Paella by cooking style and region of Spain within gastronomy (these sub-trees are not represented here). Node keywords are not shown here, only nodes that are subjects; remember that all subjects are keywords, but not all keywords are or become subjects. For the Art map we find on average about 50 keywords per node.
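
    The structure just described (subjects as tree nodes, some of them leaves, each carrying keywords that are not themselves subjects) can be sketched as a small data structure. The tree below is hypothetical and far shorter than the real one: the actual path to Rigoletto has 11 steps and is not reproduced in the text, so the intermediate nodes and keywords shown here are invented for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A subject node; `keywords` holds keywords that are not subjects."""
    name: str
    children: list = field(default_factory=list)
    keywords: set = field(default_factory=set)

    def add(self, child):
        self.children.append(child)
        return child

    def is_leaf(self):
        return not self.children


def semantic_chain(root, target, path=()):
    """Return the chain of subject names from root to target, or None."""
    path = path + (root.name,)
    if root.name == target:
        return path
    for child in root.children:
        found = semantic_chain(child, target, path)
        if found:
            return found
    return None


# Hypothetical, much shortened tree.
art = Node("Art")
opera = art.add(Node("Music")).add(Node("Opera", keywords={"libretto", "aria"}))
rigoletto = opera.add(Node("Rigoletto"))
paella = art.add(Node("Gastronomy")).add(Node("Paella"))
paella.add(Node("Paella valenciana"))

chain = semantic_chain(art, "Rigoletto")
```

    Here Rigoletto is a leaf, while Paella is a subject that roots a sub-tree of its own; the length of the chain returned for a node plays the role of the number of distillation steps in the figure.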
    

    Darwin Ontology Conjectures intend to model: a) how new concepts are generated; b) how the knowledge hierarchy evolves; and c) the documentation process. With this model “in mind”, the Darwin anthropic algorithm proceeds to unveil, through a sort of industrial distilling process, the hidden semantic structure and authoritativeness behind the data. As in the oil industry, the raw data is like crude oil: thousands of trillions of words accumulated here and there in billions of semantic containers (documents) and generated by humans following established and well known literary routines (chemical processes).
    First we have to eliminate wastes and heavy by-products, leaving only distillable oil (Web pages are transformed into text strings). Then the Darwin algorithm has to process those text strings in order to separate two groups of semantic particles: Common Words and Expressions on one side and Specific Concepts on the other (and of course, as in any distillation process, wastes appear). Then all concepts are “purified” and reclassified within their specific clusters, which is the core of the industrial semantic distillation process.
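
    The first two distillation stages (page to text string, then text string to two groups of particles) can be sketched as follows. This is a deliberately naive illustration: the tiny COMMON_WORDS list, the TextExtractor class and the distill function are hypothetical, and a fixed word list stands in for whatever discrimination rules Darwin actually applies. Only Python's standard html.parser module is used.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and drops the markup (the 'waste')."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# Tiny illustrative list; a real list of Common Words would be far larger.
COMMON_WORDS = {"the", "is", "an", "a", "by", "of", "and", "in"}


def distill(html):
    """Split a page into common words and candidate specific terms."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    tokens = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    common = [t for t in tokens if t in COMMON_WORDS]
    candidates = [t for t in tokens if t not in COMMON_WORDS]
    return common, candidates


common, candidates = distill("<p>Rigoletto is an <b>opera</b> by Verdi</p>")
```

    The candidates list is only raw material: in the terms of the analogy it still has to be “purified” and reclassified into clusters before any term counts as a Specific Concept.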
    To complete the analogy it is important to emphasize the role of Chemical Analysis. We may distill crude oil because we know “in advance” its chemical analysis in the deepest detail. What we know in advance as TRUE, apart from our Darwin Conjectures, are the indexes of Conventional Search Engines. Some Search Engines, like Google, index all Web documents by word, no matter whether they are well or badly written! So we may know “in advance” crucial information about their semantic composition, something equivalent to the chemical analysis of crude oil.
    Summarizing, Darwin may upgrade Search Engine indexes so that content becomes directly retrievable in only one click. We may also apply a version of this algorithm, working within the same ontology, to properly index all documents at registration time, something equivalent to book classification. For each Web document we may automatically and autonomously compute its “semantic fingerprint”, something equivalent to its semantic spectrum, described in terms of the concepts it deals with. Next generation search engines will incorporate this feature as standard. From this moment onwards we may hold that “established knowledge is known” and may be properly used by all people.
We may also say from now onwards that the Web content has been tamed and that we may talk of it as the Semantic Web.
    Paraphrasing Tim Berners-Lee, the creator of the Web, this is equivalent to knowing the Web Thesaurus, a whole meaningful mapping of the Web; and, as in a huge Virtual Library, documents will be perfectly classified and directly retrieved at will. We may also argue that the Web version of the Established Human Knowledge is perfectly identified. However, this identification strongly depends on “authorities”, those entities that rule what the established knowledge is at a given moment. Why? Because via their authoritativeness authorities rule both meanings and names.
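
    The “semantic fingerprint” of a document, described above as its semantic spectrum in terms of the concepts it deals with, can be illustrated with a minimal sketch. It assumes a thesaurus that maps each concept to its subject is already available; the four-entry THESAURUS, the fingerprint function and the choice of normalized shares as the “spectrum” are illustrative assumptions, not the format Darwin actually uses.

```python
from collections import Counter

# Hypothetical thesaurus fragment: concept -> subject it belongs to.
THESAURUS = {
    "rigoletto": "opera",
    "aria": "opera",
    "libretto": "opera",
    "paella": "gastronomy",
}


def fingerprint(text):
    """Return the share of recognized concepts in `text` that fall in each subject."""
    subjects = Counter(
        THESAURUS[word] for word in text.lower().split() if word in THESAURUS
    )
    total = sum(subjects.values())
    if total == 0:
        return {}
    return {subject: count / total for subject, count in subjects.items()}


fp = fingerprint("Rigoletto aria and libretto notes served with paella")
```

    A fingerprint of this kind could be computed once, when a document is registered, and then used to file the document under its dominant subject, in the spirit of the book classification analogy.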

Towards the second Semantic Big Step

 
    We have worked with this image before: it depicts the K-K’ equilibrium. Initially the Web could be considered semantically “flat”, a huge reservoir of almost 12 billion documents not thematically indexed. The first step would be to build the Semantic Web, that is, the whole Web mapped as depicted in the lower part of the figure. Initially the intelligent interface between these two realms, if it exists (in yellow), would work at half of its possibilities, broadcasting information and intelligence (yellow arrow) one way from K to K’ but only receiving unanalyzed users’ queries (white arrow). Once K is mapped, both sides of the intelligent interface would work, enabling, in the long run, the mapping of K’.

    In Darwin Ontology the Web Thesaurus is only the first big semantic step in pursuit of the truth. To obtain the best approximation to it we need to perform the second big step: to know as much as possible about the People’s Truth and the People’s Thesaurus.
    We may talk of a sort of collective truth brewing on one side and trying to change, to some extent, the Established Truth existing on the other side, in a process that resembles thermodynamic equilibrium.
    We may also imagine this equilibrium as a man-machine game where:

The established side tries to broadcast its truth to the users’ side while trying to make inferences (to “learn”) about the users’ next moves; and

The users’ side tries to “learn” as much as possible about the established knowledge while its members try to broadcast, individually, their pieces of truth.




