Genealogies of knowledge - developing anthropological middleware to support fieldwork-based social science

An E-science middleware project

Directors: David Zeitlyn, Michael Fischer and Nick Ryan

Summary

The main aim of the project is to design, implement and deploy support for components of key research processes in fieldwork-based social science. In particular, we will address general support of bibliographic research (references and full text extracts), interactive collection and aggregation of data within the fieldwork segment, access to and aggregation of external data from the field site, and aspects of consolidation, analysis, modelling and dissemination from the institutional base. We will accomplish this by developing gridable XML resources for project description, logic to interpret these and applications based on the logic and data structures. In particular we are interested in extending the grid concept to all data sources to provide more flexible tools for accessing and utilising these data in a wide range of contexts.

Background

Research in many scientific disciplines is based on a cycle of activities including bibliographic research, fieldwork, consolidation of results, analysis, modelling and synthesis. This project aims to develop, deploy and evaluate grid-based middleware to support and coordinate the anthropological research process. Although the work will focus on a relatively small discipline, the results will be generalisable to any field-based research process, with anthropology situated as one of the more extreme cases of field-based activity in terms of isolation, limited access to local infrastructure, and often difficult environmental circumstances. Progress is being made in the conceptualisation and implementation of the Grid concept with respect to e-Science for the research process in natural and physical sciences.
Although there are elements of the research process in common, the social sciences have a different set of problems that must be addressed. Social scientists produce and use highly varied information about people, which must be available to different researchers at different levels of detail, even within the same project. When social scientists work together in teams they are markedly asynchronous. The data they use is difficult to fit into simple templates or formats, and typically requires a high degree of human interpretation to be useful for analysis. The relationships within the data are dense, and various levels of interpretation interact. The ability to explore putative relationships, and to back out of them, is critical. The analysis of a data set often represents the accretion of layer upon layer of interpretation. However much one might question the efficacy of some of these practices, they are the starting position for many social sciences, and for e-Science to make an impact on the social sciences we must meet them head-on.

Fieldwork places extra demands on the research process, since data collection is largely based on observation even when measurements are involved in the form of GPS readings, diagnostic instruments, censuses, or recordings such as photographs, video or sound. Linking observation to these measurements or recordings is a difficult task under the best of conditions, and fieldwork is rarely carried out under such conditions. Such data is difficult for anyone other than the active researcher to reuse.

Although such projects as the Globus Toolkit and the NCSA GPL make the implementation and use of a grid much more tractable at the system level, more abstract layers are required to accomplish the goals of the e-Science project. One outcome of this need has been a focus on XML as a middleware data transport, as in the approaches under development in [1] and [2].
In principle, if not yet in fact, XML-based applications can be plugged in to the grid in a manner largely transparent to the users of these applications. Technologies such as XPath provide means for low-, middle- and high-level access to data distributed across the grid. More importantly, an XML interface facilitates the application of the grid concept at a micro-level as well as a macro-level: the individual data resources within a single project can be implemented as grid nodes, which is very useful within the asynchronous, granular and multi-layered approaches used by social scientists. Perhaps more important still, this helps us to incorporate and manage data that may require a great deal of translation or layering to be incorporated into a particular analysis. This is similar to facilities provided by the Apache Cocoon project, whereby grid points deliver not the original data but the data tree as transformed by some logical resource (e.g. XSLT, FOP or a customised reader-writer for virtual XML). In this manner a particular data resource can be presented as text, a partial result that corresponds to the privileges of the user, a rendering into HTML or PDF, a summary table, an index with hyperlinks, results of a query, an alternative XML format, or even just a canonical copy.

At the same time we can keep track of this derivational history, the layering process, and use this information to better manage and contextualise the analytic process. The ancestral versions can be recovered (and other relatives consulted), and because we have documentation on the history of the transformational process we can easily reproduce this set of transformations on other (similar) data resources. Local interpretive and analytic transformations can also be applied, either on-grid (as a micro-grid point) or off-grid (e.g. an arbitrary analytic process). None of this precludes using local off-grid facilities at the researcher's option.
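The layering idea described above can be illustrated with a minimal Python sketch. It is not the project's middleware: the `View` class, the two transformation functions, the `replay` helper and the sample field notes are all hypothetical, and plain Python functions stand in for XSLT or Cocoon-style transformers so that the example needs only the standard library. Each derived view remembers its parent and the step that produced it, so the ancestry can be listed and the same steps reapplied to a similar resource.

```python
import copy
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List, Optional

Transform = Callable[[ET.Element], ET.Element]


@dataclass
class View:
    """One layer in a derivation history: an XML tree plus how it was made."""
    label: str
    root: ET.Element
    parent: Optional["View"] = None
    step: Optional[Transform] = None

    def derive(self, label: str, step: Transform) -> "View":
        # Transform a copy, so the ancestral version stays recoverable.
        return View(label, step(copy.deepcopy(self.root)), parent=self, step=step)

    def lineage(self) -> List[str]:
        labels, node = [], self
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return labels[::-1]

    def recipe(self) -> List[Transform]:
        steps, node = [], self
        while node is not None and node.step is not None:
            steps.append(node.step)
            node = node.parent
        return steps[::-1]


def public_only(root: ET.Element) -> ET.Element:
    """Hypothetical privilege filter: drop notes marked access='private'."""
    for note in root.findall("note[@access='private']"):
        root.remove(note)
    return root


def count_by_place(root: ET.Element) -> ET.Element:
    """Hypothetical summary table: number of notes per place."""
    counts = {}
    for note in root.findall("note"):
        place = note.get("place", "unknown")
        counts[place] = counts.get(place, 0) + 1
    summary = ET.Element("summary")
    for place, n in sorted(counts.items()):
        ET.SubElement(summary, "place", name=place, notes=str(n))
    return summary


def replay(recipe: List[Transform], label: str, root: ET.Element) -> View:
    """Apply a recorded sequence of steps to another, similar resource."""
    view = View(label, root)
    for step in recipe:
        view = view.derive(f"{view.label}/{step.__name__}", step)
    return view


# Made-up sample data, for illustration only.
FIELDNOTES = """<fieldnotes>
  <note place="SiteA" access="public">Market day observations.</note>
  <note place="SiteA" access="private">Interview with a named informant.</note>
  <note place="SiteB" access="public">Household census, first pass.</note>
</fieldnotes>"""

source = View("fieldnotes", ET.fromstring(FIELDNOTES))
public = source.derive("public", public_only)
table = public.derive("table", count_by_place)
print(table.lineage())  # ['fieldnotes', 'public', 'table']
print(ET.tostring(table.root, encoding="unicode"))
```

Calling `replay(table.recipe(), "other", other_root)` on a second document with the same shape would rebuild the same two layers for it. A fuller version would presumably store the steps as named, serialisable resources rather than in-memory functions.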
Using these concepts, in this project we will research, design and implement middleware that can, in principle, support the research process in social science, and in particular fieldwork-based social science. We will focus on components of three of these processes: the bibliographic process, the fieldwork process and the analytic process.

Programme and methodology

Main Aims and Objectives

The main aim is to design, implement and deploy support for components of key research processes in fieldwork-based social science. In particular, we will address general support of bibliographic research (references and full text extracts), interactive collection and aggregation of data within the fieldwork segment, access to and aggregation of external data from the field site, and aspects of consolidation, analysis, modelling and dissemination from the institutional base. Our main objectives are to:

1) develop resources to support bibliographic research. This will involve creating grid resources to interpret and consolidate existing data resources, specifically the Anthropological Index (AIO), JSTOR and the Human Relations Area Files (HRAF), as well as other on-line resources (e.g. web search engines and their reference endpoints). The deliverables will include the grid resources, an XML Schema to consolidate these into documents (which can themselves be grid resources), and a reference browser to render these.

2) develop resources to support fieldwork-based research.
This will involve facilities for creating custom editors (using technologies similar to XForms), creating local grid resources to link these to instruments such as GPS receivers and video, still image and sound recording devices, references/wrappers for specimen collections, grid resources to extend knowledge-based expertise and support of methods to researchers in the field, and grid resources to assist in appropriate field-based analysis (indicative rather than conclusive, to support the interactive data collection process). Of particular interest to anthropologists and other field-based social scientists is support for the creation and maintenance of, and reference to, field notes integrated with references to other resources. Another important resource is access to specific data resources from the home base, and development of suitable security protocols to deal with these resources is critical. The deliverables will include reference implementations of custom editor factories, XML schema to support a range of common field data types and references, gridable resources for shared access, XML schema for methods templates, instances of a number of methods templates, Java/Jini classes to implement gridable protocols and reference browser(s) to render these.

3) develop resources to support consolidation, analysis and dissemination of research results. This will include both customisable analyses deployed as web resources, and wrappers for procedures and reports involving existing software (such as SPSS, StatView or Excel), and means of controlling both in-project and post-project dissemination of data and results. The deliverables will include XML schema to support a range of common field data types and references, gridable resources for shared access, XML schema for analytic templates, instances of a number of analytic templates, Java/Jini classes to implement gridable protocols and reference browser(s) to render these.
4) within the context of 1-3 above, develop methods for tracking the layers that emerge from the application of different resources as these are abstracted and derived. The project contains within it a number of nodes that may depend on different document definitions. These are either retrieved over the network from one or more sources, more or less conforming directly to the grid model, or are virtual document types that result from transformations of other document types or virtual types. Deliverables will include a browser for examining nodes and their layering (or lineage) based on Ryan's jnet software, a tool for developing and analysing terminological algebras (based on Read and Fischer's Kinship Algebra Expert System v2) and development of analytic procedures for working with information in this form based on information theory and cladistics.

Methods

1) The grid framework will be based on JINI/Java, with awareness of requirements for bridging this to other grid frameworks.

2) Where possible all data, logic and application contexts will be transported in XML.

3) We have previously defined a number of independent XML schema for bibliographic references, full text, fieldnotes, other field media, field-based measurements, knowledge-based guidance, and analytic templates. We will use these as a starting point, identifying common segments, namespace candidates, and new capabilities required to support the grid model. We expect that substantial modification will be needed to make these work together in a more integrated form.

4) We will develop a pluggable XML schema framework into which arbitrary XML schema can be embedded.

5) We will do exploratory work on derivational layering, and the use of this layering in mining of resources, project analysis and project management. This will be based on prior work on genealogy and algebraic formulations of kinship terminologies, where the genealogical process is effectively a tree integrator, and the terminological process is a tree-reducer.
We will base the browser for this on Ryan's jnet software.

6) We will evaluate the effectiveness of different levels of research process integration within an existing multi-researcher ethnographic project. The latter project has been funded by the ESRC Research Methods programme for a three-year period beginning in November to develop new methods of ethnobotanical and environmental research. Fischer is the PI for this grant. Zeitlyn is a CI. Although this project will benefit that project, only complementary resources have been requested. This project will not fund additional research under the other project's aims and objectives, but will likely improve our ability to satisfy these aims and objectives. Funds requested for travel relating to the existing project are for the purpose of research relating to support, implementing support and evaluation of that support. One of the two projects will involve a researcher familiar with XML technologies, the second a researcher with little or no experience.

Field methods. Projects can include dynamic data sources that are updated asynchronously at different resolutions and levels, and new research applications defined entirely in XML, from data reports to simulations and interactive VR applications. Examples include field survey collection by different collectors, tracking networks of informants over time, creation of virtual walk-throughs in conjunction with GPS, direct publishing to the web, direct entry into analytic monitoring, expert system databases and data mining resources.

Special problems of fieldwork. Many applications become possible if Linux hand-held devices are used as grid devices.

Grid Framework. We will in the first instance base our approach on the JINI/Java model, where possible using XML as a transport for data, logic and applications, in conjunction with a model similar to that of the Apache Cocoon project for separating data from logic in transforming XML documents or document fragments.
The foundation is to use XPath as a first layer that can be easily interfaced to a JINI model, which itself can, in principle, be integrated into the more general Globus-style grid. This layer must be somewhat aware of the basic grid model as it evolves. The next layer of abstraction defines a project, which is a very broad category: a single application, or a complex multi-node project consisting of static data resources, dynamically interpreted data sources and interaction with one or more clients. The project layer should be defined as non-prescriptively as possible, while still being interpretable into the top XML schema layer.

Describing the research process. We propose to implement an interface for XML schemas for research project description, within which more specific schemas can be embedded to suit the needs of different aspects of the research process. One of our principal research threads over the past six years has been the dynamic representation of research data and the extraction of analytic views, including data-driven simulation models. We propose to extend this in particular to bibliographic research, including the base literature derived from conventional bibliographic sources and data mining, the integration of both data reference sources and interactive data collection while in the field, providing access to this data on an ongoing basis, the definition of analytic templates for extraction/aggregation of information, and provision for visualisation of dynamic data resources through modelling templates. Some work has been done in this area by Bagg and Fischer for historical documents (http://braudel.ukc.ac.uk/xml/), and Fischer and Zeitlyn have done basic work on translating XML-based simulation templates into Prolog and Java for execution (http://era.anthropology.ac.uk).

Fieldnotes.
We have designed, implemented, field tested and refined XML schema for fieldnotes and other field data (http://anthropology.ac.uk/Bhalot, http://lucy.ukc.ac.uk/Stirling), together with metadata for defining the context within which note fragments are embedded. This is currently deployed using a simple reference editor (http://csac.anthropology.ac.uk/XML/Tools) or a FrameMaker application. Fischer and Zeitlyn have also implemented some major retrospective fieldnote projects. Ryan has also done independent work integrating fieldnotes editors and mapping software deployed on PDAs, which are integrated with GPS receivers (http://www.cs.ukc.ac.uk/people/staff/nsr/mobicomp/Fieldwork/Software/index.html).

Bibliographic research. Here we will enable a researcher using the AIO as a conventional bibliographic database to be made aware of full text sources for the references listed (depending on different institutional access arrangements) and also to be made aware of related or cousin-like material through a combination of data mining and kinship algebraic operations (see later discussion). This will also enable users to choose among different kinship terminologies, which will result in different sets of related cousins and so on (that is, we are not restricting these definitions to pedigrees). JSTOR and the Human Relations Area Files will provide a concrete case in point, acting as a relatively large full-text database that is certain to contain material relevant to references in AIO. One of our Visiting Fellows, Professor Russ Bernard, is on the HRAF Management Board and so will be able to provide an informed liaison about future possibilities that this exercise will open up. Since we have already made the entire work of Prof. Paul Stirling available (including both his fieldwork and published works), we will consider that collection and other larger sub-sections of CSAC web sites as other candidates for linkage with AIO records.
We will discuss such developments with ILRT since there is a clear connection with the work of SOSIG (with which we are already involved) and with staff at the ESRC Data Archive. As part of the enabling work for this we will reprocess the AIO data, ensuring Unicode conformity and using subject headings as metadata. Some additional correspondence will be required with publishers of the journals indexed by AIO in order to establish likely means of access to full text.

Genealogies of Knowledge. Although system implementors can make good use of XML representations of complex metadata using such technologies as Xindice and even simple XSLT templates, it is more difficult to deliver these benefits to end users. We have done some recent work with partially specified XPath expressions, but far too much knowledge of the technical structuring of the document set is still required. Although for specific applications it is fairly easy to set up a one-off interface, this is not as useful in a research process because of the number of one-off solutions that are necessary, and the dynamic requirements of research, especially interactive research such as anthropological fieldwork-based research. We are thus proposing, as a more speculative project, to define another layer (modelled on genealogical lineage and kinship terminologies) over the conventional XPath representation, in which complex tree representations can be replaced by a terminology that defines a self-limiting but flexible representation. In this layer knowledge of the tree structure can be replaced by knowledge of the referential status of the terms themselves. We, and others, have had this idea before: the use of the genealogical model to describe relations between nodes. Indeed the fundamental descriptors used to describe trees in computer science in terms of siblings, children and parents date back to the 1950s.
However, research by Dwight Read (with some recent contributions by Fischer) establishes an algorithmic method for identifying an algebra associated with a specific terminology (Read and Fischer forthcoming) and demonstrates that we can describe terminological relationships between terms without reference to an underlying genealogical tree/graph (Read 1984). That is, we can describe the syntax and semantics of terminologies in terms of themselves without external reference. Through genealogical tracing, the result of a particular terminological reduction can be related to an external tree to identify actual paths between two candidate terms (which may result in many paths against an actual pedigree as an instance of a tree). Thus we can use such an algebraically structured terminology as a tree pruner or schematiser while retaining the properties of genealogical tree representations of data as integrators of multiple other trees into a single relative tree. It is at this level that we want to explore and implement the genealogies of knowledge, among other possible protocols. See summary points for partial explanation. In effect, we use genealogical models as semantic tree integrators, and the terminological model as a semantic tree limiter. The result can be navigated using a data/resource browser that is based first on traditional trees of related information, then on the tree integration properties of genealogical trees, and then, on top of that, on the use of terminologies as a means of problem-centring the semantics of a set of terms and rules for combination that can be instantiated in the resulting trees.
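As a rough illustration of a terminology acting as a tree limiter, the following Python sketch defines a small, closed set of relationship terms, composes them without consulting any tree, and only afterwards instantiates a term against an actual tree of project nodes. This is not the Read and Fischer algorithm: the nine-term vocabulary, the single cancellation rule (a child's parent is the starting node, which holds in a strict tree), the function names and the sample node names are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Each term is a canonical word over two generators: P (to parent), C (to child).
TERMS = {
    "": "self", "P": "parent", "C": "child", "PC": "sibling",
    "PP": "grandparent", "CC": "grandchild",
    "PPC": "aunt/uncle", "PCC": "niece/nephew", "PPCC": "cousin",
}
WORDS = {name: word for word, name in TERMS.items()}


def reduce_word(word: str) -> str:
    """Cancel every 'child then parent' step, leaving the form P...PC...C."""
    while "CP" in word:
        word = word.replace("CP", "", 1)
    return word


def product(first: str, second: str) -> Optional[str]:
    """Term for 'the <second> of my <first>', or None if outside the vocabulary."""
    return TERMS.get(reduce_word(WORDS[first] + WORDS[second]))


def relatives(parent_of: Dict[str, str], node: str, term: str) -> List[str]:
    """Genealogical tracing: the nodes standing in `term` relation to `node`."""
    word = WORDS[term]
    ups, downs = word.count("P"), word.count("C")
    came_from, current = None, node
    for _ in range(ups):
        if current not in parent_of:
            return []          # ran off the top of the tree
        came_from, current = current, parent_of[current]
    level = [current]
    for depth in range(downs):
        # On the first step down, skip the branch we climbed up through,
        # so that a 'sibling' is never the node itself.
        level = [child for child, parent in parent_of.items()
                 if parent in level and not (depth == 0 and child == came_from)]
    return sorted(level)


# Made-up project tree: child -> parent.
PARENT_OF = {
    "bibliography": "project", "fieldnotes": "project",
    "ref-1": "bibliography", "note-1": "fieldnotes", "note-2": "fieldnotes",
}

print(product("parent", "sibling"))               # aunt/uncle
print(product("cousin", "child"))                 # None
print(relatives(PARENT_OF, "note-1", "cousin"))   # ['ref-1']
```

The first two calls work purely on the terms, which is the sense in which the vocabulary is closed and self-limiting; only the last call needs the tree. A user who knows the term "cousin" can reach `ref-1` from `note-1` without knowing the path between them.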
Although one way of doing this would be to log each transaction, and we will explore this, many of the useful aspects can be accomplished by simply defining an algebraic set of relations over a set of terms to describe relationships. This provides a node-specific way of referencing immediately related nodes, and nodes that are immediately related to those, and so on, while still defining closure over a finite set. Additionally we will assign a digital-DNA signature to each node as it is created, based on its parental DNA. This will assist in identifying similar nodes that would not be easily or efficiently located conventionally, or, trivially, a partial lineage between two or more nodes in hand. It can also be the source representation for the specification of new complex documents capable of manipulating data from different sources, and with different views of the data by employing different levels of abstraction. Although this is speculative, we believe from our experience with layering in more restricted contexts that even partial success will provide a valuable tool. If we fully achieve this aim we can offer a whole new way to visualise and evaluate relations between data.

The jnet system is based around a graph editor with extensive facilities for tailoring graph visualisation to the specific needs and practices of the target application areas. Written in Java, it provides facilities for exploring and editing graphs across a wide range of computing platforms and environments. As well as standalone use on a single-user machine, whether desktop, laptop or hand-held, it is designed to support collaborative work in networked environments, including wireless. The origins lie both in earlier graph browsers (Ryan's gtree and gnet 1985-1995, and more recent work in the graph visualisation community) and in an application to test ideas about collaborative working in a mobile environment.
Servlet configurations display graphs using XML/SVG for 2D or VRML for 3D and provide full support for creation, editing and manipulation. Graphs are interactive and can act as an interface to the underlying data, which may exist in database tables or in XML format. A prototype was described and demonstrated at the Workshop 6 Archäologie und Computer, Vienna 2001.

URLs:
http://www.cs.ukc.ac.uk/people/staff/nsr/arch/jnet/
http://www.cs.ukc.ac.uk/people/staff/nsr/arch/gnet/

Case for Support

This research is timely because it permits us to build on past research in conjunction with an operating multi-researcher project in anthropology. Team-based social-science research is necessary to address the larger problems that an increasingly complex world is generating, and this will require a change in style from the more common lone-researcher model. It would be preferable (and much more likely to be achieved) if this process is evolutionary rather than revolutionary. Although this project involves a lot of technologies that are not as yet common in the social sciences, the processes these are applied to are relatively conventional. Given that these technologies will be easier to deploy in a few years' time than they presently are, establishing them in a model project such as the research funded by the ESRC Research Methods programme will give these technologies a more visible showcase than they would receive otherwise.

References
[1] Krause, Amy, Kira Smyllie, Rob Baxter. Grid Data Service Specification for XML Databases. EPCCGDS-WP3-GXDS1.0, OGSA-DAI GridServe. Edinburgh Parallel Computing Centre, Edinburgh. June 2002.

[2] Baxter, Rob, Stephen Booth, Neil Chue Hong, Matt Egbert, Amy Krause, Andy Murdoch, Charaka Palansuriya, Kira Smyllie, Martin Westhead. Presentation: XML Database Technologies for the Grid. Edinburgh Parallel Computing Centre, Edinburgh. June 2002.

Funded by the EPSRC and the ESRC as part of the E-science programme in collaboration with the Royal Anthropological Institute